Acoustic Environment Transfer for Distributed Systems
Chunheng Jiang
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
[email protected]
Jae-wook Ahn
IBM Research, Yorktown Heights, NY, USA
[email protected]
Nirmit Desai
IBM Research, Yorktown Heights, NY, USA
[email protected]
January 7, 2021

Abstract
Collecting a sufficient amount of data that can represent various acoustic environmental attributes is a critical problem for distributed acoustic machine learning. Several audio data augmentation techniques have been introduced to address this problem, but they tend to remain simple manipulations of existing data and are insufficient to cover the variability of the environments. We propose a method that extends a technique previously used for transferring acoustic style textures between audio data. The method transfers audio signatures between environments for distributed acoustic data augmentation. This paper devises metrics to evaluate the generated acoustic data, based on classification accuracy and content preservation. A series of experiments were conducted using the UrbanSound8K dataset, and the results show that the proposed method generates better audio data with transferred environmental features while preserving content features.

Keywords: Acoustic machine learning, data augmentation, distributed environment transfer
Introduction

In distributed computing, ensuring proper training data for machine learning is an important issue. Even for general machine learning, it is a well-known challenge to acquire a sufficient amount of data and to avoid overfitting by maximizing the generalizability of the data. Distributed environments make this problem even more difficult, as each individual environment may have a different data distribution to be included in the models, while they share common attributes at the same time.

Distributed acoustic machine learning models are no exception. Among the various problem domains in distributed computing, acoustic machine learning is useful in many applications, including classification of different kinds of sounds in city streets [28], detecting anomalies in industrial settings [1, 14], identifying health problems by observing body sounds [34], and tracing endangered species by tracking their sounds [17].

In addition to the variability of the problem domain, the training data for acoustic machine learning models are sensitive to environmental changes. The same object can create different acoustic signatures in different environments. For example, a motor can produce different sounds depending on where it is installed in a factory and when the sounds are captured during the day. Even in the same location, the direction or distance from the sound-capturing devices (i.e., microphones) can result in different sounds. If one wants to detect anomalies of motors by observing their sounds, all these possible variations resulting from different environments need to be considered. The most straightforward way may be to manually collect as much data as possible to represent all the possible situations. However, this is an unrealistic assumption in many problem domains where it is not possible to capture audio samples in many locations for a long period of time. In fact, when one needs to collect abnormal sounds, it is difficult to predict when the anomalies would happen, and their frequency is usually very low.

In order to address this problem, we need an efficient method to generate acoustic data adapted to environmental changes that can compensate for the lack of data in multiple environments. Traditionally, the machine learning community has devised various methods to face the overfitting issue, such as regularization [15], transfer learning [22, 25], and data augmentation [32]. Transfer learning has been acknowledged as an effective method to expand the amount of training data by reusing a pre-trained model and transferring knowledge learned from one environment to another as a starting point [8, 21]. Data augmentation is a suite of techniques that enhance the size and quality of training datasets such that better deep learning models can be built with them [32].

This paper attempts to use a style transfer based data augmentation technique to transfer environmental features and generate new acoustic data. Several data augmentation techniques for audio data have been introduced. They range from simple noise injection to more sophisticated room simulation approaches.
These simple data augmentation techniques may be suitable for specific use cases where simple variations of training data would be sufficient, but they are not able to meet our requirement: newly generated data should adapt to a new environment while keeping the nature of the acoustic features of the objects of interest.

Style transfer was originally developed for generating a new image [5] that resembles the texture of a "style" image while maintaining the structure of objects in a "content" image. A well-known example is to convert a photo (content) with the brush touch of Vincent van Gogh (style); the resulting image looks like a painting of the content object drawn with the brush and the technique used in the style image. This style transfer technique has since been applied to other problem domains such as text and audio. When applied to text data, original texts are transformed between different times [12], sentences are revised [20], or the sentiments appearing in texts are transferred [16]. Recently, this technique has been used for transferring the style of a source audio [10] and generating new audio that resembles the styles or textures of the original style audio. However, these attempts mostly aimed to generate new audio samples that copy the acoustic texture of the style audio and were limited to narrow utilities such as varying the timbre of musical instruments.

We believe that acoustic style transfer can be effectively used for data augmentation in distributed environments, by transferring an acoustic environment style to another environment, and thereby address the data sparsity problem. This paper presents an experiment that attempts to find the optimal way to perform style transfer based acoustic data augmentation. We propose a pair of evaluation criteria for the quality of style transfer from the perspective of data augmentation: one is based on prediction accuracy, the other on similarity or distance. We implement a wide random convolutional neural network as the neural style transfer model and carry out a series of experiments on UrbanSound8K. The style transfer model produces better style transfer than the baseline sound mixing approach, while the content is well preserved by our transfer model. We also examine several related hyper-parameters, and both the loss function and the architecture of the transfer model affect the transfer performance.

Related Work

With respect to the objective of this study, we surveyed audio data augmentation techniques, from those that focus on a single feature to those that consider more complicated aspects of sound generation. We review each of them in this section and introduce the style transfer techniques that are more directly related to our method.
Simpler acoustic data augmentation techniques that manipulate a specific dimension have been widely used for neural network based speech recognition [18, 23, 24, 30]. The following is a list of the techniques with brief descriptions (a minimal code sketch of these manipulations is given below).

1. Noise Injection – Add random noise to the original sounds.
2. Shifting Time – Shift the audio to the left or right by a random number of seconds; silence is added to the shifted space.
3. Changing Pitch – Change the pitch randomly.
4. Changing Speed – Stretch the time series by a fixed rate.

Salamon et al. [27] reviewed these techniques and discussed their shortcomings despite their wide availability. Moreover, simple audio manipulation is not able to separate content and background audio features and cannot vary the audio data in ways that reflect how a sound would differ under various settings.

Another approach that fits our motivation better is room simulation. It can simulate recordings of arbitrary microphone arrays within an echoic room, supports research on developing and experimenting with multichannel microphone arrays and higher order ambisonic playback, and models both specular and diffuse reflections in a shoebox-type environment [29, 36]. Even though room simulation is an interesting approach and takes multiple microphones into consideration, it still assumes a single environment (a room) and considers a limited number of ad-hoc features, e.g., only the distance between the sound source and the microphones.
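To make the four manipulations above concrete, here is a minimal sketch using librosa and NumPy. The specific parameter values (noise amplitude, shift range, pitch steps, stretch rate) are illustrative assumptions rather than settings reported in the surveyed papers.

```python
import numpy as np
import librosa

def simple_augmentations(y, sr):
    """Return four simple variants of a waveform y sampled at sr Hz."""
    # 1. Noise injection: add low-amplitude Gaussian noise.
    noisy = y + 0.005 * np.random.randn(len(y))

    # 2. Time shifting: roll the signal by up to one second and
    #    fill the vacated region with silence.
    shift = np.random.randint(1, sr)
    shifted = np.roll(y, shift)
    shifted[:shift] = 0.0

    # 3. Pitch change: shift the pitch up by two semitones.
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

    # 4. Speed change: stretch the time axis by a fixed rate.
    stretched = librosa.effects.time_stretch(y, rate=1.2)

    return noisy, shifted, pitched, stretched
```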
Figure 1: Environment transfer for audio framework. The environment audio and the content audio are converted to spectrograms, and each feature is extracted using a CNN. The environment feature is transferred to the content audio so that the resulting audio can represent the change of the content audio in a different environment. The new strategy (the new convolutional filter configuration) to enhance the method for environment audio takes place in the lower row of the diagram.
Image style transfer [4–6] generates new images by transferring style features from one image (called the style image) to another (called the content image). Visual styles such as brush strokes and surface texture are extracted from the style image and applied to the content image using a neural network. The texture extraction is done using convolution filters and a Gram matrix, and the transfer is completed using a neural network that minimizes an error function estimating the difference between the original and the transferred content. This technique has been able to convert photos that contain real-world objects (e.g., a building) into a picture in which the objects appear painted with the materials or techniques (e.g., Vincent van Gogh's painting style) used in the style image, and it can create images that look quite realistic in terms of the visual styles.

Acoustic style transfer is motivated by the original image style transfer. Based on the fact that acoustic signals can easily be converted to visual images such as spectrograms, and that spectrograms can represent aural features visually, prior work applied the image style transfer techniques to the acoustic spectrogram [10]. In a spectrogram, the audio signals are converted and visualized as heat-maps where the horizontal axis represents frames (or time), the vertical axis represents frequencies, and the color or intensity of the pixels represents the strength of the signal at the given time and frequency. The existing acoustic style transfer approaches use a similar method to extract the style features but adapt it to audio by defining vertically narrow convolutional feature maps or adopting an auto-encoder based approach [26]. With such maps, the transfer model can capture acoustic signatures across multiple frequencies within a short amount of time. The style features are obtained by feeding the style sound to pre-defined VGG models [33] or random noise layers [36], and then calculating a Gram matrix of the activations. The use of random noise as filters has been shown to work for image style transfer, as textures share visual elements with simple random noise.
Acoustic Environment Transfer

We implemented an acoustic environment transfer system based on the common style transfer method [6, 10]. The main difference between the previous work (style transfer) and the current work (environment transfer) is what is transferred from the source to the target sound. Style transfer focuses on features such as visual textures or audio timbre that change the feel of the generated data. In contrast, environment transfer focuses on transferring the acoustic environment in which a content object generates sound. If the environment is different, the sound that the object makes will be different. This acoustic environment includes not only simple background sound but also specific environmental features that can affect the content sound, such as echo or reverb. We hypothesize that these environmental features can be captured as acoustic styles. Therefore, while we adopt the core idea from acoustic style transfer, we need to adapt the feature extraction procedure so that it can represent different environments.

Figure 1 shows the overall architecture of acoustic environment transfer. In order to extract two kinds of features, (1) environment features and (2) content features, the source audio files are converted to spectrograms. A spectrogram is a 2D visual representation of the frequencies of a signal over time, where the color represents the magnitude or amplitude (see Figure 2). Spectrograms contain rich information about the audio signal, including frequency, time, and amplitude. They are therefore often used as a visualization tool for audio signals such as music and speech, and are widely used for audio classification [7, 11] and style transfer [10, 26].

The spectrograms are converted to low-level features by the convolution layer. There have been discussions on how to define the convolution layer: using a pre-defined model such as VGG-16 or random noise. VGG-16 has been successfully used for image style transfer [6, 33], but [10] argued that it was not as robust when applied to acoustic style transfer, because it was trained with images while spectrograms exhibit different visual elements such as abstract-looking waves and shades, and even reported that random noise produces similar results. This claim also holds for transferring specific images where the styles are mostly material textures that are visually similar to random noise [36].

In order to implement acoustic style transfer better adapted to environment transfer, we propose variable convolutional filter configurations. The previous work defined filters that capture features within a short frame. From our experience with use cases such as the one in Section 4, we learned that it is critical to find the optimal window size (frame size) for sampling sounds and training models, and we believe this remains true for capturing environmental sound features. If the window size is too large, the data can include redundant or repeating features that lead to performance degradation; if it is too small, critical features are lost, including temporal patterns that span beyond the predefined window. Therefore, in this study we vary the configuration of the convolution filters with respect to the frame size and attempt to find the best set-up for environment sound transfer.

After the low-level features are collected through the convolutional network, they are converted to Gram matrices and used for the transfer stage.
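Before the formal loss is given, the following is a minimal sketch of the spectrogram front end and the Griffin-Lim inverse step used at the end of the pipeline, assuming librosa. The FFT size, hop length, sampling rate, and log scaling are assumptions, since the paper does not spell out its STFT parameters.

```python
import numpy as np
import librosa

N_FFT, HOP = 1024, 256  # assumed STFT settings

def to_spectrogram(path, sr=22050, duration=4.0):
    """Load an audio clip and return its log-magnitude spectrogram (freq x frames)."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    magnitude = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))
    return np.log1p(magnitude)

def to_waveform(log_magnitude):
    """Recover a waveform from a phase-less log-magnitude spectrogram via Griffin-Lim."""
    magnitude = np.expm1(log_magnitude)
    return librosa.griffinlim(magnitude, n_fft=N_FFT, hop_length=HOP)
```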
Given the representation $x$ of an input audio signal (waveform or spectrogram), a convolutional neural network architecture is used to extract statistics that characterize stationary sound textures. Let $F_\ell = [f_{\ell,k}]_{k=1}^{N_\ell}$ be the activation vector on layer $\ell$ with $N_\ell$ nodes. Following the practice in [6], we used the Gram matrix $G_\ell = F_\ell^T F_\ell$ as the style statistics, and minimized the two-fold loss function

$$\mathcal{L}(x, x_c, x_s) = \alpha\, \mathcal{L}_c(x, x_c) + \mathcal{L}_s(x, x_s) \qquad (1)$$

to transfer the target style, where $x_c$, $x_s$ and $x$ are the content, the style, and the generated signals, respectively; $\mathcal{L}_c(x, x_c) = \sum_{\ell \in \mathcal{C}} \| F_\ell(x) - F_\ell(x_c) \|^2 / N_\ell$ and $\mathcal{L}_s(x, x_s) = \sum_{\ell \in \mathcal{S}} \| G_\ell(x) - G_\ell(x_s) \|_F^2 / N_\ell$ are the content and style losses, respectively; $\alpha > 0$ controls how much penalty is exercised over the deviation of the generated audio from the content; and $\mathcal{C}$ and $\mathcal{S}$ are the index sets for the content and style layers. The size of $\mathcal{S}$ varies between neural network architectures: for example, $|\mathcal{S}| = 5$ in VGG-16, $|\mathcal{S}| = 8$ in SoundNet [2], and $|\mathcal{S}| = 1$ in the wide-shallow-random network by Ulyanov and Lebedev [35]. In our simulation, we have $|\mathcal{S}| = 1$.

The transfer model generates a new spectrogram image, from which we reconstruct the missing phase and generate a waveform audio with the Griffin-Lim algorithm [9].

Figure 2 shows example spectrograms of a style audio, a content audio, a style-transferred audio (from the style to the content), and a sound-mixed audio of the style and content (Section 5.3.1). The spectrogram generated by our proposed method is clearly distinguished from the others. It captures the features from both the original style and the content audio (left column) while not being dominated by either of them. Meanwhile, the audio generated by audio synthesis is just an addition of the two separate audios. We will discuss the difference further in later sections.

Our system is built upon the training of three convolutional neural networks: a style transfer model to transfer the source style into the target, an audio classifier to predict the class of the sound clips, and an AutoEncoder for embedding features. Both the classifier and the AutoEncoder are trained to evaluate the quality of our style transfer model.
The style transfer model is a shallow, wide, random single-layer neural network with 4096 convolutional filters.
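The sketch below illustrates how such a single-layer random network and the loss of Eq. (1) can be assembled, assuming PyTorch. The filter width, learning rate, number of iterations, and the choice to initialize the generated spectrogram from the content are illustrative assumptions; with a single layer, |C| = |S| = 1 as in the paper.

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

def random_filterbank(n_freq, n_filters=4096, width=11):
    """Untrained (random) 1-D convolution filters; frequency bins act as input channels."""
    return 0.01 * torch.randn(n_filters, n_freq, width)

def features(spec, weights):
    """spec: (n_freq, n_frames) log-magnitude spectrogram -> (n_filters, n_frames')."""
    return F.relu(F.conv1d(spec.unsqueeze(0), weights))[0]

def gram(feat):
    """Gram matrix of the layer activations, normalized by their count."""
    return feat @ feat.t() / feat.numel()

def transfer(content, style, weights, alpha=0.5, steps=500, lr=0.05):
    """Optimize a spectrogram x to minimize alpha * content loss + style loss (Eq. 1)."""
    x = content.clone().requires_grad_(True)        # start from the content spectrogram
    f_c = features(content, weights).detach()       # fixed content activations
    g_s = gram(features(style, weights)).detach()   # fixed style Gram matrix
    opt = optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        f_x = features(x, weights)
        loss = alpha * F.mse_loss(f_x, f_c) + F.mse_loss(gram(f_x), g_s)
        loss.backward()
        opt.step()
    return x.detach()   # the new spectrogram; Griffin-Lim turns it back into audio
```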
Figure 2: Comparison between style transfer and audio mixing. Given a content audio (dog barking) and a style audio (street music) (left column), example spectrograms generated by our neural style transfer model (upper right) and by simple sound mixing (lower right) are shown. The transferred sound shows the six acoustic signals of the content in a column format (six dog barks) together with attributes from the style audio (street music playing repeatedly in the background). The mixed audio shows the content signal simply added on top of the background.
The audio classifier contains four convolutional layers (with ReLU activation functions and 32, 32, 64, and 64 channels, respectively) and two pooling layers, followed by a flatten layer of size 6656 and two fully connected layers. Three dropout layers are applied after the pooling layers and the first dense layer, with drop rates of 0.15, 0.2, and 0.5, respectively.

The AutoEncoder contains an encoder and a decoder: the encoder maps the input to a latent space, and the decoder maps the latent representation back to the input space. The encoder has two convolutional layers (with ReLU activation functions and 16 and 8 channels, respectively), each followed by a pooling layer. The decoder contains three convolutional layers (with two ReLU activation functions and a softmax; 8, 16, and 2 channels, respectively). A compact sketch of both evaluation networks is given after the use-case discussion below.

Use Case

Among the various cases that require training a model with a limited amount of training data, we present an example use case scenario of acoustic environment transfer to aid the reader's understanding. In an industrial manufacturing site, an accident can happen at any time, and it is challenging to manually monitor the situation 24 hours a day, depending on the nature of the manufacturing. Some sites require only a limited number of staff during and after work hours. Moreover, visual inspection can be inherently impossible if the accident happens inside a facility. Automated acoustic monitoring and inspection can be effective in those sites, and acoustic anomaly detectors or audio classifiers need to be trained. However, it is not trivial to collect the training data, especially anomaly data, because an anomaly by definition does not happen frequently. The performance of the anomaly detector will increase if we can generate new data based on the limited training data by allowing for variability of the anomaly state within the problem space. Conventional data augmentation techniques can help by simply manipulating the pitch or length of the sounds, but we still need to incorporate another variable: the acoustic environment. Environmental sound is even more important in the aforementioned industrial sites, where background noise is present almost all the time, can change continually over time, and can change abruptly depending on manufacturing process changes. These background sound variations can greatly affect the performance of the acoustic models, especially when the training data is limited. We could improve the models if we can transfer the various environment sounds (which are relatively easy to collect) to the limited number of anomaly samples, so that the resulting sounds cover anomalies happening in multiple situations.
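The following compact sketch of the two evaluation networks assumes PyTorch. The channel counts, dropout rates, and class count follow the description above; the kernel sizes, hidden width, and the decoder's upsampling steps are assumptions, since the extracted text does not preserve them.

```python
import torch.nn as nn

# Audio classifier over spectrogram patches.
classifier = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.15),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.2),
    nn.Flatten(),
    nn.LazyLinear(128), nn.ReLU(), nn.Dropout(0.5),   # hidden width is an assumption
    nn.LazyLinear(10),                                # ten UrbanSound8K classes
)

# Convolutional AutoEncoder used to embed clips for content-preservation checks.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
decoder = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.Upsample(scale_factor=2),
    nn.Conv2d(16, 2, 3, padding=1), nn.Softmax(dim=1),
)
```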
Study Design
We ask two questions in this study. First, we hypothesize that the acoustic style transfer method can be successfully used to transfer acoustic environment features between distributed environments and to augment data for application areas such as sound classification and anomaly detection. We have discussed simpler data augmentation methods, but they merely manipulate the source data with respect to several basic features, and there is no attribute that could be used to represent different environments. Instead of those data augmentation methods, we adopted an audio mixing method that combines the content and environment audio as a baseline; it is a common way to combine two audios and is discussed in more detail in Section 5.3.1.

The second question is to discover which environment transfer configuration is better. We configured our system either (1) with the same settings as the default style transfer system (Section 5.3.2) or (2) with variable convolutional filter sizes, in order to better capture the sound signatures of various environments (Section 5.3.3). We would like to verify whether the latter configuration leads to better results.

1. RQ1 – Is the acoustic environment transfer method better than the simple baseline audio mixing for adapting content sound to distributed environments?
2. RQ2 – Is an acoustic environment transfer method optimized for environment transfer better than the existing acoustic style transfer, which is generic to audio data manipulation rather than specific to environment transfer?
Dataset

UrbanSound8K [28] is used in our experiments. It is comprised of 8732 sound clips of up to 4 seconds in duration taken from field recordings. The clips span ten environmental sound classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Each sound clip is also assigned a subjective salience label indicating whether the sound is in the foreground (salience = 1) or the background (salience = 2). This is a particularly interesting attribute for environmental sound analysis, since common audio class labels do not deliver enough information to determine whether a sound is foreground or background. For example, a siren could be a background sound (e.g., an ambulance passing at a distance while a dog is barking) or a foreground sound (e.g., a clip of an ambulance recorded directly). We found that the background sounds are good candidates for environmental sounds (e.g., a background air conditioner or background street music), while the foreground sounds are good content sounds (e.g., drilling or gun shot sounds in the foreground) to which the environment sounds are transferred. Related statistics are summarized in Table 1.

Additionally, each sound clip is tagged with its recording file name and annotated with the start and end time in the original recording. Because of the variation in clip duration, we concatenate the short clips according to the time annotations, producing 933 foreground recordings and 404 background recordings. From these, we pair 1000 4-second sound clips from different classes, with the foreground clip as the target content and the background clip as the target style (a sketch of the pairing and split is given after Table 1). These 1000 pairs are used to evaluate our style transfer model. Our audio classifier is trained on 80% randomly selected sound clips and tested on the remaining 20%. Model selection is based on a 10% validation set within the training set to avoid overfitting. Similarly, we train the AutoEncoder network with 80% randomly selected clips and test on the remaining 20%.

Table 1: Statistics of UrbanSound8K by class and salience.

Class              Foreground  Background  Total
Air Conditioner         569         431     1000
Car Horn                153         276      429
Children Playing        588         412     1000
Dog Bark                645         355     1000
Drilling                902          98     1000
Engine Idling           916          84     1000
Gun Shot                304          70      374
Jackhammer              731         269     1000
Siren                   269         660      929
Street Music            625         375     1000
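The sketch below illustrates the pairing and splitting procedure using the UrbanSound8K metadata file, assuming pandas. It skips the clip-concatenation step described above and pairs metadata rows directly; the file path, random seeds, and column usage are assumptions based on the standard UrbanSound8K layout.

```python
import pandas as pd

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

foreground = meta[meta.salience == 1]   # candidate content sounds
background = meta[meta.salience == 2]   # candidate environment (style) sounds

# Pair clips from different classes: the foreground clip is the target content,
# the background clip supplies the environment style (1000 pairs in the paper).
pairs = []
for i, (_, fg) in enumerate(foreground.sample(n=1000, random_state=0).iterrows()):
    candidates = background[background["class"] != fg["class"]]
    bg = candidates.sample(n=1, random_state=i).iloc[0]
    pairs.append((fg.slice_file_name, bg.slice_file_name))

# 80/20 train/test split for the classifier and the AutoEncoder,
# with 10% of the training clips held out for validation.
shuffled = meta.sample(frac=1.0, random_state=0)
n_train = int(0.8 * len(shuffled))
train, test = shuffled[:n_train], shuffled[n_train:]
n_valid = int(0.1 * len(train))
valid, train = train[:n_valid], train[n_valid:]
```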
Experiment Conditions

We define three experiment conditions: sound mixing without style transfer, default style transfer without convolutional filter variation, and style transfer with filter variation.
Sound Mixing (Baseline)

This method overlays one sound on top of another. It may be the first method one would intuitively consider when trying to adapt content to a new environment. However, simple mixing does not distinguish the environment from the content, and the limitation clearly appears in the resulting sound. In addition to this subjective observation, we explore it quantitatively in the experiments. A minimal sketch of this baseline is shown below.
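The sketch assumes librosa; the relative gain of the environment audio and the peak normalization are illustrative assumptions.

```python
import numpy as np
import librosa

def mix(content_path, style_path, sr=22050, duration=4.0, style_gain=0.5):
    """Overlay the environment (style) audio on top of the content audio."""
    c, _ = librosa.load(content_path, sr=sr, duration=duration)
    s, _ = librosa.load(style_path, sr=sr, duration=duration)
    n = min(len(c), len(s))                           # align lengths before adding
    z = c[:n] + style_gain * s[:n]
    return z / max(1e-8, float(np.max(np.abs(z))))    # peak-normalize to avoid clipping
```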
Default Style Transfer

This is the default style transfer method described in Section 3. The same set-up as in [10] is used, which was built for generic sound style transfer rather than being specific to environmental sounds.
Style Transfer with Filter Variation

We compare different filter sizes, expecting that they can capture features specific to environmental sounds rather than the acoustic textures that define simple timbre variations. A small snippet illustrating the variation is given below.
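In terms of the random-filter sketch given earlier, this condition simply instantiates the filter bank with different temporal widths. The widths below are the ones swept in Figure 3; the 4096 filters and 513 frequency bins follow the earlier assumptions (1024-point STFT).

```python
import torch

# One random filter bank per candidate width (in spectrogram frames).
filter_banks = {w: 0.01 * torch.randn(4096, 513, w) for w in (4, 6, 8, 10)}
```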
Evaluation Criteria

It is hard to define the notion of content and style in audio style transfer, and no universal agreement on the definition of content and style has been reached, even in well-developed visual style transfer. However, there are some widely accepted principles: the visual style refers mostly to the space-invariant intra-patch statistics, i.e., to the texture at several spatial scales and to the distribution of colors (the color palette), while the visual content represents the broad structure of the scene, that is, its semantic and geometric layout [10]. For acoustic scenarios, the distinction usually depends strongly on the context.

It is therefore not surprising that previous studies relied on qualitative and subjective assessments [10]. However, the goal of environmental audio transfer in this study is specifically data augmentation across environments: generating new acoustic data adapted from one environment to another. To make the evaluation easier, we develop a set of evaluation criteria for generic environmental acoustic style transfer from the perspective of data augmentation.

The purpose of data augmentation is to generate rare data that are not observed in the given dataset. Conventional data augmentation approaches are based on random perturbations of existing samples. The neural style transfer based framework is proposed as an alternative: it generates new data from pairs of target audio clips, where one contributes the content and the other supplies the style.

There is very little previous work on the evaluation of style transfer. Fu et al. [3] defined two evaluation metrics for style transfer in text: the transfer strength indicates how accurately the newly generated data can be classified as the target [3, 31], and the content preservation rate measures the similarity between the source embedding and the target embedding. Text style transfer is similar to style transfer in speech [13], where the preserved content should be meaningful and recognizable and therefore requires high classification accuracy.

Our goal is to ensure that the semantic content is well preserved while the target style is transferred in the newly generated data. For classification applications, these transferred data should be able to fool a state-of-the-art classifier F_0 trained on the dataset without these instances. With the transferred data included, we can retrain the model to obtain a new classifier F_1, which should be more robust and generalize better. Therefore, the value of the style transfer model can be estimated by the increase in prediction accuracy when the transferred data are included: let P_0 and P_1 be the accuracy of F_0 and F_1 on the same testing set; then P_1 - P_0 is the value of the style transfer model. Equivalently, we can evaluate the performance of the same classifier F_0 on the data generated by our transfer model and on the data synthesized by simple sound mixing. Given the same pairs of content and style audios, the transfer model generates D_t and the simple mixing model produces D_m; suppose F_0 gives prediction accuracies P_t and P_m, respectively. The lower the prediction accuracy P_t relative to P_m, the better the style transfer model. This idea forms a contrast to the transfer strength proposed for text style transfer [3]: text style transfer requires the generated data to remain recognizable and is very sensitive to style noise, but our purpose is different, as we are concentrating on data augmentation.
If we can generate "high quality" data that beat F_0, we succeed. To be of high quality, we impose additional constraints on the transferred data. Similar to the content preservation proposed in [3], we require the content to be well preserved in the transferred data. To this end, we obtain representative features from the raw audio using an AutoEncoder. With these features, we can calculate the similarity between the generated data and the source audios: the more similar, the more the content is preserved, and the better. A sketch of both criteria follows.
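In the sketch below, `clf` (a trained classifier with a predict method), `encode` (the AutoEncoder's encoder returning a flat embedding), and the arrays of clips and labels are hypothetical interfaces standing in for the models described above.

```python
import numpy as np

def transfer_value(clf, transferred, mixed, labels):
    """Accuracy-based criterion: apply the same classifier (trained without the
    generated data) to transferred and mixed clips; lower accuracy on the
    transferred data (P_t < P_m) indicates a stronger environment transfer."""
    p_t = float(np.mean(clf.predict(transferred) == labels))
    p_m = float(np.mean(clf.predict(mixed) == labels))
    return p_t, p_m

def preservation_ratios(encode, x, z, x_c, x_s):
    """Distance-based criterion: embed the generated clip x, the mixed baseline z,
    and the content/style inputs, then compare Euclidean distances; ratios below
    one favour the transfer model over simple mixing."""
    e = {name: encode(clip) for name, clip in
         {"x": x, "z": z, "c": x_c, "s": x_s}.items()}
    d = lambda a, b: float(np.linalg.norm(e[a] - e[b]))
    return d("x", "c") / d("z", "c"), d("x", "s") / d("z", "s")
```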
Results

We evaluate the performance of our style transfer model on 1000 pairs of foreground and background sound clips of 4-second length from different classes in UrbanSound8K.

To answer RQ1, we select simple sound mixing as the baseline and compare its performance with that of the style transfer model, based on our proposed evaluation criteria.

There are many hyper-parameters in the framework, and some of them may greatly affect the performance of our style transfer model and therefore impact the comparison results. Hence, we study the impacts from two aspects: the objective function and the network architecture. The scalar factor α in the joint loss function and the filter size of the convolutional layer are selected as representative parameters. The tuning of these parameters can therefore answer RQ2 for an optimal style transfer.

According to Figure 3, as the factor α increases from 0 to 0.9, the ability of the trained classifier to recognize the content is increasingly affected by the transferred style; in other words, the generated data loses some structural information. Meanwhile, the target style shows a clear increasing trend in the generated data. At an intermediate value of α we achieve the same prediction accuracy on content and style, which indicates that the prediction uncertainty is high and we succeed in beating the classifier. At the same time, the accuracy on the transferred data with respect to the content and style is lower than that of the synthesized data from the baseline. The difference in accuracy between the baseline and the data generated by the style transfer model indicates the value of the transfer model.

The impact of the filter size on the prediction accuracy is very stable: it is either similar to the baseline on content prediction or lower than the baseline on style prediction. Again, the lower the prediction accuracy, the better the model, and the style transfer model wins out.

Content preservation is measured with the embedding features of the two inputs and the generated data. With the AutoEncoder, we extract an embedding feature for each input audio, and then calculate the distance d(x, x_c) between the generated data and the content audio and the distance d(x, x_s) between the generated data and the target style audio, where x, x_c, and x_s are the embedding features of the generated data, the content audio, and the style audio, respectively. Simple audio mixing is the baseline approach, and the corresponding distances d(z, x_c) and d(z, x_s) are compared against those for the generated data. As illustrated in Figure 4, the generated data are most similar to both the content and the target style audios when α takes an intermediate value or when the filter size is large. The distance ratios of the generated audios to the baseline are less than one, which indicates that the content preservation is better than with the simple mixing approach.
Once our modelcan accurately recognize these transferred data, it will be more robust and have a better generalization than the onetrained from the content or style only audios. Another objective is maximize content preservation. The input semanticcontent, if well-preserved in the generated data, will keep complete in structure. For speech and music style transfer,the content must be strictly preserved, otherwise the neural transfer model probably generates some utter meaninglesssounds [19].Moreover, we examined two crucial hyper-parameters regarding their impacts on the quality of the style transfer basedon our proposed criterion. The penalty factor α over content deviation affects how much content is preserved in thetransferred data. The higher the value, the more content preserved. Meanwhile, the amount of style transferred will benegatively affected. The trade-off is accurately depicted by the prediction accuracy change on style and content in thetransferred data. To some extend, it confirms a dependency relation between the content and style. It indicates that theacoustic content and style are not as separable as in the visual applications [6].8he neural style transfer model itself may exhibit another cause of the trade-off. Therefore, we also examined the sizeof the convolutional filter. Relatively, the filter size does not affect the transfer as much as α .Our style transfer experiments are built on a single layer, wide random CNN. Therefore, the layer index set C forcontent equals to the index set S for style. It causes some dependency between the style and content representationsfor sure. To have a deep understanding of the relation between acoustic content and style, we need a comprehensivestudy of the hyper-parameters in the neural style transfer model, other parameters, e.g., an architecture different fromCNN, the number of layers, may also greatly influence the transfer quality. References [1] J.-w. Ahn, K. Grueneberg, B. J. Ko, W.-H. Lee, E. Morales, S. Wang, X. Wang, and D. Wood. Acoustic anomalydetection system: Demo abstract. In
[1] J.-w. Ahn, K. Grueneberg, B. J. Ko, W.-H. Lee, E. Morales, S. Wang, X. Wang, and D. Wood. Acoustic anomaly detection system: Demo abstract. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems, SenSys '19, pages 378–379, New York, NY, USA, 2019. Association for Computing Machinery.
[2] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.
[3] Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan. Style transfer in text: Exploration and evaluation. 2018.
[4] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 262–270. Curran Associates, Inc., 2015.
[5] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[6] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[7] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, pages 776–780. IEEE, 2017.
[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[9] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[10] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez. Audio style transfer. In ICASSP, pages 586–590, April 2018.
[11] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In ICASSP, pages 131–135. IEEE, 2017.
[12] H. Jhamtani, V. Gangal, E. Hovy, and E. Nyberg. Shakespearizing modern language using copy-enriched sequence-to-sequence models, 2017.
[13] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pages 4480–4490, 2018.
[14] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada. Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring, 2020.
[15] J. Kukačka, V. Golkov, and D. Cremers. Regularization for deep learning: A taxonomy, 2017.
[16] J. Li, R. Jia, H. He, and P. Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer, 2018.
[17] M. A. McDonald, S. L. Mesnick, and J. A. Hildebrand. Biogeographic characterization of blue whale song worldwide: using song to identify populations. Journal of Cetacean Research and Management, 8(1):55–65, 2006.
[18] B. McFee, E. J. Humphrey, and J. P. Bello. A software framework for musical data augmentation. In ISMIR, volume 2015, pages 248–254, 2015.
[19] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
[20] J. Mueller, D. Gifford, and T. Jaakkola. Sequence to better sequence: Continuous revision of combinatorial structures. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2536–2544, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[21] E. S. Olivas, J. D. M. Guerrero, M. Martinez-Sober, J. R. Magdalena-Benedito, L. Serrano, et al. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2009.
[22] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[23] G. Parascandolo, H. Huttunen, and T. Virtanen. Recurrent neural networks for polyphonic sound event detection in real life recordings. In ICASSP, pages 6440–6444, 2016.
[24] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019, Sep 2019.
[25] L. Y. Pratt. Discriminability-based transfer between neural networks. In Advances in Neural Information Processing Systems, pages 204–211, 1993.
[26] D. Ramani, S. Karmakar, A. Panda, A. Ahmed, and P. Tangri. Autoencoder based architecture for fast & real time audio style transfer. CoRR, abs/1812.07159, 2018.
[27] J. Salamon and J. P. Bello. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3):279–283, 2017.
[28] J. Salamon, C. Jacoby, and J. P. Bello. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 1041–1044, New York, NY, USA, 2014. Association for Computing Machinery.
[29] S. M. Schimmel, M. F. Muller, and N. Dillier. A fast and accurate "shoebox" room acoustics simulator. Pages 241–244, 2009.
[30] J. Schlüter and T. Grill. Exploring data augmentation for improved singing voice detection with neural networks. In ISMIR, pages 121–126, 2015.
[31] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment, 2017.
[32] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] Z. Syed, D. Leeds, D. Curtis, F. Nesta, R. A. Levine, and J. Guttag. A framework for the analysis of acoustical cardiac signals. IEEE Transactions on Biomedical Engineering, 54(4):651–662, 2007.
[35] D. Ulyanov and V. Lebedev. Audio texture synthesis and style transfer, 2016. Accessed: 2020-06-18.
[36] I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge. Texture synthesis using shallow convolutional networks with random filters. CoRR, abs/1606.00021, 2016.
Figure 3: Impact of α (top) and of the filter width (bottom) on the transferred data, measured by the classifier's prediction performance.

Figure 4: Impact of α (top) and of the filter width (bottom) on the distances of the transferred data to the target content and the target style. Here, x_c and x_s are the embedding features of the content audio and the style audio, respectively; z is the feature map of the synthesis from the simple sound mixing approach; x is the representative feature vector of the transferred audio; and d(·, ·) is the Euclidean distance. The denominators d(z, x_c) and d(z, x_s) normalize the distances against the sound mixing baseline.