Efficient Neural Networks for Real-time Analog Audio Effect Modeling
Christian J. Steinmetz and Joshua D. Reiss
Centre for Digital Music, Queen Mary University of London, London, UK
[email protected]
Abstract—Deep learning approaches have demonstrated success in the task of modeling analog audio effects such as distortion and overdrive. Nevertheless, challenges remain in modeling more complex effects, such as dynamic range compressors, along with their variable parameters. Previous methods are computationally complex and noncausal, prohibiting real-time operation, which is critical for use in audio production contexts. They additionally utilize large training datasets, which are time-intensive to generate. In this work, we demonstrate that shallower temporal convolutional networks (TCNs) that exploit very large dilation factors for significant receptive field can achieve state-of-the-art performance while remaining efficient. Not only are these models found to be perceptually similar to the original effect, they achieve a 4x speedup, enabling real-time operation on CPU, and can be trained using only 1% of the data from previous methods.
I. INTRODUCTION
Audio effects provide the ability to adjust perceptual attributes of audio signals such as loudness, timbre, pitch, spatialization, or rhythm, and form a core component of the tools used by audio engineers [1]. While a significant amount of processing in audio production is performed digitally, there is a rich history of analog equipment that remains in high demand for its unique sonic signature. As a result, there has been significant interest in virtual analog modeling [2]–[10], the task of constructing digital models to emulate analog devices.

Digital models of analog equipment have a number of advantages. These include the ability to run multiple instances of the effect concurrently, a reduction in physical space compared to their analog counterparts, and most notably, a reduction in the cost of production compared to analog hardware [11]. Therefore, not only does virtual analog modeling help preserve the heritage of vintage audio hardware, it also enables a larger audience to have access to such tools. Beyond modeling analog effects, neural audio effects also have applications in automatic multitrack mixing [12], where more efficient implementations could enable advancements.

Due to the complex nonlinear behavior of many of these audio signal processing devices, traditional system identification techniques for linear systems, such as the impulse response, are inadequate, and instead advanced modeling approaches are required. Over the past decade, a number of approaches have been investigated, such as wave digital filters (WDF) [2] and Volterra series [13] for modeling vacuum-tube amplifiers, Wiener-Hammerstein models for modeling distortion circuits [4], and state-space models for dynamic range compressors [14]. While these methods have achieved some success, they may require knowledge of the underlying analog circuit, be computationally complex, or otherwise be unable to capture some types of nonlinearities. For these reasons, there is growing interest in the application of deep learning to overcome these limitations in a data-driven manner.

Thus far, applications of neural networks have focused mostly on modeling vacuum-tube amplifiers [15]–[18] and distortion circuits [7], [8], [10], [19], with some demonstrating the ability to run in real-time on CPU. In contrast, dynamic range compressors [20] pose a greater challenge in the modeling task due to their time-dependent nonlinearities, and have so far seen less attention. While Martínez Ramírez et al. [9] briefly address the 1176N compressor, they do not consider modeling the parameters of the compressor and only utilize electric guitar and bass signals. Hawley et al. [21], [22] model the LA-2A compressor and its parameters using diverse content, but while they capture the overall characteristics of the device, their model exhibits artifacts, is noncausal, and with over 4 M parameters, is not capable of real-time operation. Recently, temporal convolutional networks (TCNs) have been shown to be successful in audio effect modeling [23]. While they demonstrate promising results in modeling the LA-2A compressor, these models are also noncausal, computationally expensive, and utilize a significant amount of training data.

We aim to address these limitations with a more efficient formulation of the TCN that employs causal convolutions. We observe that while computation across the temporal dimension can be parallelized in the TCN, computations through the depth of the network are sequential. Therefore, shallower networks provide greater efficiency, but often lack sufficient receptive field.
We find that by using rapidly growing dilation factors, larger than previously used, we can construct shallow networks that achieve comparable receptive field with greater efficiency and minimal impact on performance. Our contributions are summarized as follows:

• We employ causal convolutions with rapidly growing dilation factors to construct shallower networks for modeling an analog dynamic range compressor.
• These models run in real-time on both GPU and CPU, and we characterize the trade-off between feedforward and recurrent models for real-time audio processing.
• We demonstrate that only 1% of the data from previous approaches is sufficient for training these models.
• In a listening test, our real-time models produce results perceptually similar to the original effect.
Fig. 1. Block diagram of the audio effect modeling task where a neural network g is used to emulate a parametric audio effect f.
II. AUDIO EFFECT MODELING
In this task, we consider an audio effect f, a function that takes as input an audio signal x ∈ ℝ^N of N samples, and a set of P parameter values φ ∈ ℝ^P, e.g. the value of the knobs of the device shown in Fig. 1. This function then produces a corresponding processed version of the signal y ∈ ℝ^N. Our aim is then to construct a model g that takes as input x and φ, and produces a signal ŷ that is perceptually indistinguishable from y. We formalize this goal across the space of device parameters and input signals as shown in Eq. 1. We intend to construct g as a neural network and employ modern deep learning methods to learn to emulate the original audio effect f, using pairs of input-output (x, y) recordings from the device.

g(φ, x) = f(φ, x), ∀ φ, x.    (1)

While there have been a number of works that apply neural networks to the audio effect modeling task, many of these works consider only a single configuration of the device, i.e. they optimize g without conditioning at only a single value of φ [15], [17], [24]. Other approaches consider multiple parameterizations, but during training use only a subset of input signal types x, e.g. only electric guitar and bass signals [7], [8], [16]. While such an approach may be applicable in the case of modeling guitar amplifiers and distortion effects, for general audio processing devices, like the dynamic range compressor, we cannot make such strong assumptions. Instead, we consider modeling the device across all configurations φ and all realistic audio signals x. This complicates the modeling process and likely requires that we collect more data, build larger models with greater expressivity, and train for a longer period. All of these details present unique challenges, especially when we aim to deploy these models in a real-time audio application on CPU.
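For concreteness, the resulting data-driven training loop can be sketched in PyTorch as follows. This is a minimal illustration of the objective in Eq. 1, not the exact training code; the module interface and tensor shapes are assumptions.

```python
import torch

def training_step(model, optimizer, loss_fn, x, phi, y):
    """One optimization step on a paired recording from the device.

    x   : (batch, 1, samples)  input audio
    phi : (batch, 2)           normalized device controls (limit, peak reduction)
    y   : (batch, 1, samples)  audio processed by the analog device
    """
    optimizer.zero_grad()
    y_hat = model(x, phi)      # g(phi, x): the neural emulation
    loss = loss_fn(y_hat, y)   # distance to f(phi, x)
    loss.backward()
    optimizer.step()
    return loss.item()
```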
III. RELATED WORK

A collection of architectures have been proposed for the audio effect modeling task. These can be divided into three categories: recurrent neural networks (RNNs) and their variants (LSTM, GRU, vanilla RNN); temporal convolutional networks (TCNs), also known as the feedforward WaveNet; and architectures that combine both elements. Simple RNNs have been shown to be effective in modeling nonlinear effects like those produced by vacuum-tube amplifiers and guitar distortion effects, often within perceptual tolerances [7], [10], [15]–[17]. These formulations process the signal on a sample-by-sample basis.
Fig. 2. General LSTM with conditioning based on Wright et al. [7]. The input at each time step is a vector of 3 elements: the current input sample, along with the conditioning parameters for the limit and peak reduction controls.

While many approaches consider only a single configuration of the device parameters, Wright et al. [7] demonstrated that device parameters can be modeled by appending them directly to the input sequence as conditioning, as shown in Fig. 2. While these models are easy to implement and are capable of running in real-time in optimized C++ implementations, they have well known disadvantages, namely the inability to parallelize computations across the temporal dimension, overall slower convergence, and empirical limits to their memory [25]. So far these issues have not restricted the application of RNNs for modeling distortion-based effects. However, they may cause issues when scaling these methods to more complex nonlinear modeling tasks, like that of the dynamic range compressor, which we will examine in Sec. VI.

In contrast to RNN-based approaches, Damskägg et al. [8] proposed the application of TCNs to the same task of modeling guitar distortion effects. The TCN is a generalization of convolutional networks applied to sequence modeling (dilated 1-dimensional convolution + nonlinearity). Interestingly, yet perhaps somewhat unsurprisingly, these models resemble Wiener-Hammerstein models [26], a traditional statistical approach to modeling nonlinear systems. TCNs were popularized in the autoregressive WaveNet [27] for speech synthesis, but have now been adapted for many tasks, including both causal and noncausal [28] feedforward formulations.

The general architecture for the TCN in the context of audio effect modeling is shown in Fig. 3, based upon the implementation from [12]. It consists of residual blocks, composed of 1-dimensional convolutions with increasing dilation factors, followed by batch normalization, a conditional affine transformation (FiLM) [29], and a PReLU [30] nonlinearity. A common approach to obtaining a large receptive field is to stack multiple blocks with a dilation factor that grows as a power of 2 as the depth of the network increases, such that the dilation of the l-th layer is given by d_l = 2^{l−1}.

Since no padding is applied other than to the input sequence, the output of each convolution will be smaller than the input. This requires that we crop the residual connections. The residual connections are implemented with 1x1 grouped convolutions, which enables only scaling of the features. Care must be taken to perform cropping of the residual connections correctly depending on the causality of the model.

Fig. 3. General TCN architecture featuring a global conditioning module (3-layer MLP) that generates embeddings for each FiLM operation in the TCN processing pipeline as a function of the limit and peak reduction controls. The contents of the TCN block are shown in the dashed block on the right.
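A minimal PyTorch sketch of one such residual block follows. The structure mirrors Fig. 3, while layer sizes and module names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Conditional affine transform (Eq. 2): gamma_c * x_c + beta_c."""
    def forward(self, x, gamma, beta):
        # x: (batch, channels, time); gamma, beta: (batch, channels)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

class TCNBlock(nn.Module):
    """Residual block: dilated Conv1d -> BatchNorm -> FiLM -> PReLU."""
    def __init__(self, ch_in, ch_out, kernel_size, dilation):
        super().__init__()
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size,
                              dilation=dilation)        # no padding
        self.bn = nn.BatchNorm1d(ch_out, affine=False)  # Eq. 3, no affine
        self.film = FiLM()
        self.act = nn.PReLU(ch_out)
        # 1x1 grouped convolution on the residual path (scaling only)
        self.res = nn.Conv1d(ch_in, ch_out, 1, groups=ch_in, bias=False)

    def forward(self, x, gamma, beta):
        y = self.act(self.film(self.bn(self.conv(x)), gamma, beta))
        # crop the residual to match; taking the last samples is the causal case
        return y + self.res(x)[..., -y.shape[-1]:]
```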
In the noncausal case, we employ a central crop across the temporal dimension, while in the causal case, this crop selects the last N samples, where N is the number of time-steps at the output of the convolution.

The traditional gated convolution [31] for conditioning is replaced with FiLM, which performs a conditional affine transformation of each channel activation x_c, with scaling γ_c and bias β_c parameters. The output of this operation at each layer is given by

F(x_c, γ_c, β_c) = γ_c x_c + β_c,    (2)

where c is the channel index. Batch normalization (as shown in Eq. 3, without its own affine transformation) is applied before this operation, not only to stabilize training, but to ensure the conditioning information is applied optimally, since the activations will be normalized first.

y = (x − E[x]) / √(Var[x] + ε)    (3)

In order to generate these scaling and bias parameters, a 3-layer multilayer perceptron (MLP) projects the device control parameters (limit and peak reduction controls in the case of the LA-2A compressor) into a 32-dimensional embedding, shown in Fig. 3 (left). This embedding is then passed to a linear layer at each block, which uniquely adapts the conditioning. Note that the γ_c and β_c parameters will adapt at inference based on the device control parameters, extending the expressivity of the model. This form of conditioning is not only simpler, but has demonstrated superior performance as a conditioning mechanism in convolutional networks in many domains, including audio [32]–[34].
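The conditioning path can likewise be sketched as follows. The 32-dimensional embedding and 3-layer MLP follow the text above, while the hidden sizes and two-parameter input are assumptions matching the LA-2A controls.

```python
import torch.nn as nn

class Conditioner(nn.Module):
    """3-layer MLP producing a global embedding, plus one linear
    adaptor per TCN block that emits per-channel FiLM gamma/beta."""
    def __init__(self, n_params=2, embed_dim=32, n_blocks=4, n_channels=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.adaptors = nn.ModuleList(
            nn.Linear(embed_dim, 2 * n_channels) for _ in range(n_blocks))

    def forward(self, phi):
        # phi: (batch, n_params) normalized limit / peak reduction controls
        e = self.mlp(phi)
        # one (gamma, beta) pair per block, each of shape (batch, n_channels)
        return [adaptor(e).chunk(2, dim=-1) for adaptor in self.adaptors]
```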
IV. EFFICIENT TCNS

In the design of a TCN for real-time operation, we first consider the requirement for noncausality, which imparts a lower bound on the latency our system can achieve. In the case of the noncausal TCN-324-N model [35], the input receptive field is split evenly between the past and present samples, such that a delay of ≈ 150 ms is required for adequate “look-ahead” in a real-time context. Even so, this additionally assumes the model can operate in real-time on the provided hardware.

Fig. 4. Illustration of padding approaches for an input signal of N samples with a convolutional kernel of size 3. a) Noncausal TCN formulation where no padding is applied [12], b) Standard “same” padding, which is also noncausal, c) Truly causal padding, as we employ.

While noncausality may aid in the modeling task, we should be able to accurately model this device with a causal model, since the analog LA-2A is itself causal. We propose to do so by adopting causal convolutions, which are a common feature of TCNs [25]. Causal convolutions are achieved by padding the input on the left with r − 1 samples, where r is the receptive field of the model. In the case of the TCN, the receptive field at layer l is given by r_l = r_{l−1} + (k − 1) · d_l, where k is the kernel size and d_l is the dilation factor. This padding is illustrated in Fig. 4c for a convolution with a kernel size of 3. In order to achieve causality, the output must be a function only of current and previous time-step input values. Notice that the “same” padding employed in deep learning frameworks will not be causal (Fig. 4b), nor will no padding (Fig. 4a).

Nevertheless, causality does not necessarily produce a model capable of real-time operation. The model must additionally be able to process a buffer of N samples in less than N / f_s seconds, where f_s is the sample rate. To reduce the computational complexity of the TCN, we acknowledge that while computation across the temporal dimension can be parallelized, computation through the depth of the network cannot. This imposes a run-time constraint that is a function of the model depth, as well as other factors like the number of convolutional channels, the hardware platform (CPU/GPU), and the convolution implementation. Therefore, one straightforward route to decreasing the run-time involves simply constructing shallower networks. Unfortunately, this comes at the cost of a smaller receptive field, which is critical for the modeling task.

In order to rectify this, we propose a simple adjustment. We can achieve a comparable receptive field with fewer layers by using dilation factors that grow more rapidly than the base-2 convention, d_l = 2^{l−1}. While less common, some recent speech synthesis models employ more aggressive dilation patterns, such as d_l = 3^{l−1} [36], [37]. Though this will achieve a larger receptive field, it will not produce a significant difference when using only a few layers, as we desire for this efficient application. Therefore, we propose to use even larger dilation growth, d_l = 10^{l−1}, which enables only four layers to achieve the same receptive field as previous methods. To our knowledge, there has been no investigation of models that utilize dilation factor growth at this rate.
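In code, per-layer causal padding amounts to left-padding each dilated convolution by (k − 1) · d_l zeros, which is equivalent to left-padding the network input by r − 1 samples. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, dilation):
    """Dilated 1-D convolution made causal by left-padding (Fig. 4c):
    each output sample depends only on current and past inputs, unlike
    "same" padding (Fig. 4b), which pads both sides."""
    k = weight.shape[-1]
    x = F.pad(x, ((k - 1) * dilation, 0))  # pad past samples only
    return F.conv1d(x, weight, dilation=dilation)

# Example: kernel size 3, dilation 10, as in the rapid dilation growth scheme.
x = torch.randn(1, 1, 1024)
w = torch.randn(1, 1, 3)
y = causal_conv1d(x, w, dilation=10)
assert y.shape[-1] == x.shape[-1]  # aligned with the input, no look-ahead
```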
V. LOSS FUNCTIONS

Audio effect modeling tasks generally consider a loss in the time-domain between the target output y and the prediction from the model ŷ. Commonly, this is the mean absolute error (MAE) computed on the time-domain signal (Eq. 4). Related variants such as the error-to-signal ratio [7], [18] or the log-cosh [21] have also been employed.

ℓ_MAE(ŷ, y) = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|    (4)

These losses are convenient, yet enforce strict adherence to the time-domain target and its absolute phase. Nevertheless, the goal is to produce an audio signal that is perceptually similar to the original effect. Therefore, exactly following the time-domain output is not required, and such losses may place a stronger constraint than necessary during training [38].

This motivates the use of a spectral magnitude loss, which is largely agnostic to small phase shifts. The short-time Fourier transform (STFT) loss proposed by Arık et al. [39] is now commonplace in audio tasks, and addresses this need. It is composed of two terms, the spectral convergence ℓ_SC (Eq. 5) and spectral log-magnitude ℓ_SM (Eq. 6), where ‖·‖_F is the Frobenius norm, ‖·‖_1 is the L1 norm, and N is the number of STFT frames. The overall STFT loss ℓ_STFT is defined as the sum of these two terms, as shown in Eq. 7.

ℓ_SC(ŷ, y) = ‖ |STFT(y)| − |STFT(ŷ)| ‖_F / ‖ |STFT(y)| ‖_F    (5)

ℓ_SM(ŷ, y) = (1/N) ‖ log(|STFT(y)|) − log(|STFT(ŷ)|) ‖_1    (6)

ℓ_STFT(ŷ, y) = ℓ_SC(ŷ, y) + ℓ_SM(ŷ, y)    (7)

Previous investigation into loss functions for the audio effect modeling task found that optimizing a time-domain loss produced better results when evaluated with time-domain metrics, and slightly worse results with frequency-domain metrics, with the opposite being true when optimizing a frequency-domain loss [35]. Therefore, to balance this, we employ a loss that combines both the MAE and the STFT loss terms, as shown in Eq. 8.

ℓ_MAE+STFT(ŷ, y) = ℓ_MAE(ŷ, y) + ℓ_SC(ŷ, y) + ℓ_SM(ŷ, y)    (8)

This approach is not unique, and a similar approach was taken in SignalTrain, which combined the time-domain log-cosh loss with a logarithmically weighted frequency error [21]. Additionally, the use of a pre-emphasis filter has been proposed to encourage audio effect models to better capture high frequency content [40]. In our case, we found the application of the STFT loss sufficient to address this when modeling the LA-2A. This is likely due to the fact that this loss function places a larger weight on the high frequency content due to the linear spacing of the STFT frequency bins.
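The combined objective of Eq. 8 can be sketched as below. This is a single-resolution sketch with assumed FFT parameters; a full implementation is available in auraloss [35].

```python
import torch
import torch.nn.functional as F

def stft_loss(y_hat, y, fft_size=1024, hop=256, eps=1e-8):
    """Eq. 7: spectral convergence (Eq. 5) + spectral log-magnitude (Eq. 6).
    y_hat, y: (batch, samples)."""
    window = torch.hann_window(fft_size, device=y.device)
    Y = torch.stft(y, fft_size, hop, window=window, return_complex=True).abs()
    Y_hat = torch.stft(y_hat, fft_size, hop, window=window, return_complex=True).abs()
    sc = torch.norm(Y - Y_hat, p="fro") / (torch.norm(Y, p="fro") + eps)  # Eq. 5
    sm = F.l1_loss(torch.log(Y_hat + eps), torch.log(Y + eps))            # Eq. 6
    return sc + sm

def mae_plus_stft(y_hat, y):
    """Combined time- and frequency-domain objective (Eq. 8)."""
    return F.l1_loss(y_hat, y) + stft_loss(y_hat, y)
```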
VI. EXPERIMENTS

A. Dataset
We consider the SignalTrain dataset¹ introduced by Hawley et al. [21]. This dataset provides approximately 20 hours of input-output recordings at f_s = 44.1 kHz from an analog LA-2A dynamic range compressor. It covers a diverse range of audio content including individual instruments, loops, and complete musical pieces, in addition to tones and noise bursts. This particular compressor features two control parameters that are varied in the dataset: a binary switch that places the device in either “compress” or “limit” mode, as well as a continuous peak reduction parameter that controls the amount of compression as a function of the input level, known as the “threshold” on other compressors. They provide audio processed by the compressor at 40 different parameter configurations. We utilize the train/val/test split from the original dataset, which ensures that each subset contains unseen audio.
B. Models

We re-implement the TCN from [35], which we denote TCN-324-N. This model has 10 layers with a dilation pattern given by d_l = 2^{l−1}, where each layer includes 32 channels. This model is noncausal and achieves a receptive field of 324 ms at f_s = 44.1 kHz. Our proposed efficient TCNs instead use a dilation pattern given by d_l = 10^{l−1}, enabling the use of fewer layers.

In order to observe the impact of the receptive field, we train variants of the efficient TCNs with receptive fields of 101 ms (TCN-100), 302 ms (TCN-300), and 1008 ms (TCN-1000). To observe the need for noncausality, we train each model in both causal and noncausal formulations. Models ending in “-N” are noncausal, while those ending in “-C” are causal. We also investigate the amount of training data required. We train the TCN-300-C model with subsets of the dataset that contain only 10% and 1% of the training data by splitting the training set by the parameter configurations, and randomly sampling an equal amount of audio from each of these configurations.
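The receptive field of each variant follows directly from the recursion given in Sec. IV; the short calculation below reproduces the values above for the configurations listed in Table I.

```python
def receptive_field(k, layers, growth):
    """r_l = r_{l-1} + (k - 1) * d_l with d_l = growth**(l - 1), r_0 = 1."""
    return 1 + (k - 1) * sum(growth ** (l - 1) for l in range(1, layers + 1))

fs = 44100  # Hz
for name, (k, layers, growth) in {
    "TCN-324-N": (15, 10, 2),   # baseline: deep, base-2 dilation growth
    "TCN-100-N": (5, 4, 10),    # efficient: shallow, base-10 dilation growth
    "TCN-300-N": (13, 4, 10),
}.items():
    r = receptive_field(k, layers, growth)
    print(f"{name}: {r} samples ({1000 * r / fs:.1f} ms)")
# TCN-324-N: 14323 samples (324.8 ms)
# TCN-100-N: 4445 samples (100.8 ms)
# TCN-300-N: 13333 samples (302.3 ms)
```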
C. Training

All models are trained with a batch size of 32 and inputs of 65536 samples (≈ 1.5 s). We employ Adam, decreasing the learning rate by a factor of 10 after the validation loss has not improved for 10 epochs. Additionally, we use automatic mixed precision to decrease training time and memory consumption. We make the code for these experiments available².

¹ (Version 1.1) https://zenodo.org/record/3824876
² https://github.com/csteinmetz1/micro-tcn
TABLE I
MODEL PERFORMANCE ON THE SIGNALTRAIN TEST SET. MODELS ENDING WITH -N ARE NONCAUSAL, AND THOSE ENDING IN -C ARE CAUSAL. k IS THE KERNEL SIZE, l IS THE NUMBER OF LAYERS, d IS THE DILATION GROWTH FACTOR, AND c IS THE NUMBER OF CONVOLUTIONAL CHANNELS. REAL-TIME FACTOR (RT) IS REPORTED ON BOTH CPU AND GPU WITH A FRAME SIZE OF … SAMPLES.

Model          | k  | l  | d  | c  | Params | R.f.   | RT (CPU/GPU) | MAE     | STFT  | LUFS (dB)
TCN-324-N [12] | 15 | 10 | 2  | 32 | 162 k  | 324 ms | 0.5x / 17.1x | 1.70e-2 | 0.587 | 0.520
TCN-100-N      | 5  | 4  | 10 | 32 | 26 k   | 101 ms | 4.2x / 37.1x | 1.58e-2 | 0.768 | 1.155
TCN-300-N      | 13 | 4  | 10 | 32 | 51 k   | 302 ms | 1.8x / 37.3x | …       | …     | …

TABLE II
VARIANTS OF THE TCN-324 MODEL USING FEWER CONVOLUTIONAL CHANNELS. *THIS MODEL DIVERGED DURING TRAINING.

Model       | c  | Params | RT (CPU/GPU) | MAE     | STFT  | dB LUFS
TCN-324-N   | 32 | 162 k  | 0.5x / 17.1x | 1.70e-2 | 0.587 | 0.520
TCN-324-N   | 16 | 47 k   | 1.3x / 17.1x | 4.38e-2 | 0.796 | 1.305
TCN-324-N*  | 8  | 16 k   | 2.2x / 17.1x | 5.29e-2 | 1.143 | 1.315
TCN-300-C   | 32 | 51 k   | 2.2x / 33.4x | …       | …     | …
VII. EVALUATION
We consider three different metrics for the objective evaluation of the models. The first two are components of the training objective: the MAE of the time-domain signal (Eq. 4) and the STFT loss (Eq. 7). As a perceptually informed metric, we define the loudness error as the absolute error between the loudness of the prediction and target signals computed using the ITU-R BS.1770 recommendation [41], given by

ℓ_LUFS(ŷ, y) = |G(ŷ) − G(y)|,    (9)

where G(·) is the ITU-R BS.1770 loudness algorithm. With this metric, we can measure to what degree the perceived loudness was captured by the model, which is correlated with the application of the correct gain reduction.

Results comparing our causal and efficient TCNs to previous approaches are shown in Table I. The model hyperparameters (kernel size k, number of layers l, and dilation growth factor d) are reported, along with the number of model parameters and the receptive field in milliseconds achieved at f_s = 44.1 kHz.
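The loudness metric of Eq. 9 can be computed with an off-the-shelf BS.1770 meter; the sketch below assumes the pyloudnorm package and 1-D numpy inputs.

```python
import pyloudnorm as pyln  # ITU-R BS.1770 loudness meter

def loudness_error(y_hat, y, fs=44100):
    """Loudness error of Eq. 9 in dB LUFS; y_hat, y are 1-D numpy arrays."""
    meter = pyln.Meter(fs)  # plays the role of G(.) in Eq. 9
    return abs(meter.integrated_loudness(y_hat) - meter.integrated_loudness(y))
```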
TABLE III
TCN-300-C MODEL WITH VARYING AMOUNT OF TRAINING DATA.

Model      | Data | Per config. | Total   | MAE     | STFT  | dB LUFS
TCN-324-N  | 100% | 30 min      | 19.5 hr | 1.70e-2 | 0.587 | 0.520
TCN-300-C  | 100% | 30 min      | 19.5 hr | 1.44e-2 | 0.603 | 0.761
TCN-300-C  | 10%  | 3.0 min     | 1.9 hr  | …       | …     | …
TCN-300-C  | 1%   | 18 s        | 11 min  | …       | …     | …

The most significant finding is that our efficient TCNs, which employ very large dilation growth factors and are shallower than the TCN-324-N model, have comparable receptive field and performance while using a third of the parameters and providing up to four times faster run-time on CPU.

In order to investigate the efficacy of larger dilation factors, and to demonstrate that merely scaling down the width of the TCN-324-N model does not provide comparable accuracy and efficiency, we train narrower variants of the TCN-324-N model with fewer convolutional channels, as shown in Table II. We find that while scaling down the width of these models does increase the real-time factor, it comes at the cost of performance, with the TCN-300-C significantly outperforming these variants.

Notably, the LSTM-32 model achieves the best performance across both the STFT and LUFS metrics, but is an order of magnitude worse with respect to the time-domain performance (MAE). This demonstrates the major advantage of recurrent models, namely that they are able to achieve an adaptive receptive field in a parameter-efficient manner. In this case, the LSTM-32 uses 32x fewer parameters than the TCN-324 model. Nevertheless, while this class of models is parameter efficient, processing across the temporal dimension cannot be parallelized, leading to significantly slower training and inference times. In this case, the LSTM-32 model is not capable of real-time operation on CPU in the PyTorch implementation, and additionally required over 8 times longer to train (108 hr) compared to the TCN-300-C model (13 hr).
A. Data efficiency
While the SignalTrain dataset provides 20 hours of recordings from the LA-2A, we investigate the requirement for such a large dataset in this task. We split the original training dataset into random subsets, with a balanced number of examples for each parameter configuration. In the 10% subset there is a total of 1.9 hours of audio, with 3 minutes of audio per configuration of the compressor. Furthermore, the 1% subset results in a total of just 11 minutes of audio, with only 18 seconds per configuration. Results for the TCN-300-C model trained with these subsets are compared against the version trained with the complete dataset in Table III.

Fig. 5. RT of models on GPU (solid) and CPU (dashed) at different frame sizes. RT greater than 1 is required for real-time operation.

We find that reducing the size of the training dataset does not harm performance. Surprisingly, there is an improvement in performance using the smaller training subsets. We hypothesize this could be due to some special characteristics of the random subset that was selected, for example, selecting more samples with tones and noise bursts, which could be more informative. Nevertheless, these results are promising, indicating that modeling related analog audio effects could be achieved with significantly smaller datasets, which greatly lowers the burden of creating such datasets. This agrees with the findings from previous work on modeling much simpler effects, such as distortion [7] and guitar amplifiers [16].
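A sketch of the balanced subsampling described above follows; the (params, clip) pair representation is an assumption about the data layout.

```python
import random
from collections import defaultdict

def balanced_subset(examples, fraction, seed=0):
    """Sample the same fraction of clips from each parameter configuration.
    `examples` is assumed to be a list of (params, clip) pairs, where
    `params` is a hashable configuration, e.g. (limit, peak_reduction)."""
    by_config = defaultdict(list)
    for params, clip in examples:
        by_config[params].append(clip)
    rng = random.Random(seed)
    subset = []
    for params, clips in by_config.items():
        n = max(1, int(len(clips) * fraction))  # at least one clip per config
        subset += [(params, clip) for clip in rng.sample(clips, n)]
    return subset
```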
B. Compute efficiency
We further investigate the run-time of these models in a block-based implementation that aims to mimic a standard audio effect. The real-time factor (RT) is defined as

RT := N / (f_s · t),    (10)

where N is the number of samples processed at a sampling rate of f_s, and t is the time taken to process those N samples. For GPU, measurements are performed on an RTX 3090, and for CPU, measurements are performed on a 2018 MacBook Pro. Results are shown in Fig. 5, with solid lines representing run-time on GPU, and dashed lines representing run-time on CPU. In this block-based formulation, the TCN models require a buffer of past samples such that we pass an input of N + r − 1 samples, where N is the number of desired output samples and r is the receptive field of the model in samples.
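A sketch of this measurement (Eq. 10) follows; the model interface and warm-up policy are assumptions, and timing on GPU would additionally require torch.cuda.synchronize().

```python
import time
import torch

def real_time_factor(model, frame_size, receptive_field, fs=44100, n_iters=10):
    """RT = N / (f_s * t): the model consumes N + r - 1 input samples
    (frame plus past context) and emits N output samples."""
    x = torch.randn(1, 1, frame_size + receptive_field - 1)
    phi = torch.rand(1, 2)
    with torch.inference_mode():
        model(x, phi)  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x, phi)
        t = (time.perf_counter() - start) / n_iters
    return frame_size / (fs * t)
```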
Fig. 6. Ratings of the five passages from the MUSHRA-style listening study with 18 participants after post-screening. The TCN-300-C model used in the evaluation was trained with only 1% of the training dataset.
For the LSTM, the real-time factor on both GPU and CPU is constant with respect to the frame size, which is due to the inability to parallelize computations across the temporal dimension. In our PyTorch implementation, we found the LSTM was close, but not able to achieve real-time operation. On the other hand, we find the real-time factor for the TCN model is proportional to the frame size, with larger frame sizes producing greater real-time factors as a result of greater parallelization, both on CPU and GPU. This enables real-time operation on CPU at frame sizes down to 1024 samples, which we found also to be the case in our implementation of the model in a JUCE plugin. These results indicate that on GPU, the TCN-300-C model achieves real-time operation with frame sizes down to 64 samples, and can take advantage of a 1000x speedup when operating on longer sequences, unlike the LSTM.

This understanding of recurrent and convolutional models can help guide the model design process for modeling effects. In cases where very low latency is required using small frame sizes (≤ 64 samples), assuming a recurrent model of sufficient size can run in real-time on the target platform, these models provide a good option. On the other hand, convolutional models demonstrate a clear advantage in that larger frame sizes provide significant speedup, useful in offline use cases, e.g. rendering a mixdown. Note also that these results represent something of a worst-case scenario, since optimized C++ implementations may achieve a speedup compared to the PyTorch implementations used in our analysis [7], [8], [10].
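A block-based processing loop of this kind can be sketched as follows, maintaining a rolling context of the previous r − 1 samples (model interface assumed as before).

```python
import torch

def process_stream(model, frames, phi, frame_size, receptive_field):
    """Yield N output samples per frame; each frame is prepended with
    r - 1 past samples so its full receptive field is available."""
    context = torch.zeros(1, 1, receptive_field - 1)
    for frame in frames:  # frame: (1, 1, frame_size)
        x = torch.cat([context, frame], dim=-1)
        with torch.inference_mode():
            y = model(x, phi)[..., -frame_size:]
        context = x[..., -(receptive_field - 1):]  # roll the buffer forward
        yield y
```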
C. Listening test
While objective metrics enable comparison of relative performance, they provide little insight into the degree to which a model emulates the device from a perceptual standpoint. To further evaluate their performance, we carried out a multistimulus listening test, similar to MUSHRA [42]. Five passages from the test set are used, each around 12 seconds in duration. We processed these stimuli using SignalTrain, the LSTM-32 model, and our proposed causal TCN-300-C model trained with 1% of the dataset. These listening examples are made available online³. We did not include a low quality anchor as there is no clear choice in the case of dynamic range compression [43], [44]. We used the webMUSHRA interface [45], which enabled the listening study to be performed remotely, with participants using their own playback system.

³ https://csteinmetz1.github.io/tcn-audio-effects/

We enlisted 19 participants, all of whom reported experience with audio engineering and were familiar with the LA-2A compressor. We performed a post-screening analysis to assess the quality of the participants, and removed ratings from one participant who assigned the reference a score of less than 50 in 4 of the 5 passages.

Results from the remaining 18 participants are presented in Fig. 6. Both the LSTM-32 and TCN-300-C perform similarly across the passages, yet appear slightly below the reference. In contrast, it is clear that participants noticed the strong noise artifacts produced by the SignalTrain model. Interestingly, it appears some participants struggled to differentiate between the reference and the LSTM-32 and TCN-300-C models, as they rated the reference lower than these models in some cases. This is evident from the wide range of the ratings for the reference.

To formalize these observations, we perform the Kruskal-Wallis H-test, which indicates a difference in the median rating of the models (F = 186.…, p = 3.… × 10^{−…}). A post hoc analysis using Conover's test of multiple comparisons reveals there is a significant difference in the ratings for the reference and the LSTM-32 (p_adj = 3.… × 10^{−…}) and TCN-300-C (p_adj = 6.… × 10^{−…}) models. This indicates that, while challenging, listeners likely could perceive a small difference among the models in comparison to the reference. Nevertheless, while our objective metrics indicated the LSTM-32 model was superior, there appears to be no significant difference in the median ratings between the LSTM-32 and TCN-300-C (p_adj = 0.…). These results appear to agree with comments from participants, where both the LSTM-32 and TCN-300-C models very closely capture the character of the LA-2A without imparting artifacts, but differ in cases of strong gain reduction, letting some transients pass through more so than the analog LA-2A.
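This analysis can be reproduced with standard statistical tooling; the sketch below assumes SciPy and the scikit-posthocs package, and the p-value adjustment method is an assumption, as it is not specified above.

```python
import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp

# Placeholder ratings: 18 listeners x 5 passages per model (illustrative only).
rng = np.random.default_rng(0)
ratings = {name: rng.uniform(0, 100, size=18 * 5)
           for name in ["Reference", "LSTM-32", "TCN-300-C", "SignalTrain"]}

# Kruskal-Wallis H-test for a difference in median ratings across models.
H, p = kruskal(*ratings.values())

# Conover's post hoc test with adjusted p-values for pairwise comparisons.
p_adj = sp.posthoc_conover(list(ratings.values()), p_adjust="holm")
```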
VIII. DISCUSSION

TCNs are well-suited for audio effect modeling since no downsampling in time occurs throughout the network. This contrasts with autoencoding approaches, like SignalTrain, that must accurately reconstruct high frequency information in the decoder, which can be challenging. While sample-based LSTMs also perform no downsampling, and provide comparable accuracy, they are an order of magnitude slower. Additionally, they do not provide significant speedup on GPU, unlike the parallelizable TCNs. The most significant caveat of the TCN is that it must employ a sufficiently large receptive field. This may not be feasible for some effects with low frequency oscillators, such as phaser or chorus effects [46], [47].

Our investigations serve only to demonstrate that it is possible for this formulation of the TCN to achieve sufficient accuracy in the modeling task while also achieving real-time operation. Further investigation is needed to determine which kinds of audio signals are most informative, and the required density of the parameter sampling to achieve sufficient performance. Future work could consider more advanced techniques for designing efficient neural networks, such as quantization, distillation, and pruning, along with optimized implementations for target hardware, i.e. general purpose CPUs.
IX. CONCLUSION
We demonstrated that TCNs employing causal convolutions with rapidly growing dilation factors enable shallow networks to achieve sufficient receptive field in a compute-efficient manner. This causal and efficient TCN formulation was effective in modeling an analog dynamic range compressor, ultimately enabling real-time operation on CPU. Furthermore, we found that only 1% of the examples from the previously proposed dataset was necessary for adequate performance. We conducted a listening study to evaluate our efficient model and found it achieves a high level of perceptual similarity to the original effect, outperforming previous methods. We additionally provided a C++ plugin of the pre-trained model and demonstrated its ability for real-time operation on CPU. Future work involves optimizations in platform-specific implementations for further efficiency in real-time operation, as well as investigating how TCNs with rapidly growing dilation factors generalize to other audio effects and related audio signal processing tasks.
ACKNOWLEDGEMENT
This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (EP/S022694/1).
REFERENCES

[1] T. Wilmering, D. Moffat, A. Milo, and M. B. Sandler, “A history of audio effects,” Applied Sciences, vol. 10, no. 3, 2020.
[2] M. Karjalainen and J. Pakarinen, “Wave digital simulation of a vacuum-tube amplifier,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[3] D. T. Yeh, J. S. Abel, and J. O. Smith, “Automated physical modeling of nonlinear audio circuits for real-time audio effects—Part I: Theoretical development,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 4, pp. 728–737, 2009.
[4] F. Eichas, S. Möller, and U. Zölzer, “Block-oriented modeling of distortion audio effects using iterative minimization,” in Int. Conf. on Digital Audio Effects (DAFx), 2015.
[5] F. Eichas, E. Gerat, and U. Zölzer, “Virtual analog modeling of dynamic range compression systems,” 2017.
[6] E. Gerat, F. Eichas, and U. Zölzer, “Virtual analog modeling of a UREI 1176LN dynamic range control system,” 2017.
[7] A. Wright, E.-P. Damskägg, V. Välimäki et al., “Real-time black-box modelling with recurrent neural networks,” in Int. Conf. on Digital Audio Effects (DAFx), 2019.
[8] E.-P. Damskägg, L. Juvela, V. Välimäki et al., “Real-time modeling of audio distortion circuits with deep learning,” in Sound and Music Computing Conf. (SMC), 2019.
[9] M. A. Martínez Ramírez, E. Benetos, and J. D. Reiss, “Deep learning for black-box modeling of audio effects,” Applied Sciences, vol. 10, no. 2, p. 638, 2020.
[10] J. Chowdhury, “A comparison of virtual analog modelling techniques for desktop and embedded implementations,” arXiv:2009.02833, 2020.
[11] V. Valimaki, F. Fontana, J. O. Smith, and U. Zolzer, “Introduction to the special issue on virtual analog audio effects and musical instruments,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 4, pp. 713–714, 2010.
[12] C. J. Steinmetz, J. Pons, S. Pascual, and J. Serrà, “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” arXiv:2010.10291, 2020.
[13] S. Orcioni, A. Terenzi, S. Cecchi, F. Piazza, and A. Carini, “Identification of Volterra models of tube audio devices using multiple-variance method,” Journal of the Audio Engineering Society (JAES), vol. 66, no. 10, pp. 823–838, 2018.
[14] O. Kröning, K. Dempwolf, and U. Zölzer, “Analysis and simulation of an analog guitar compressor,” in Int. Conf. on Digital Audio Effects (DAFx), 2011.
[15] J. Covert and D. L. Livingston, “A vacuum-tube guitar amplifier model using a recurrent neural network,” in Proc. of IEEE SoutheastCon, 2013.
[16] T. Schmitz and J.-J. Embrechts, “Nonlinear real-time emulation of a tube amplifier with a long short time memory neural-network,” 2018.
[17] Z. Zhang, E. Olbrych, J. Bruchalski, T. J. McCormick, and D. L. Livingston, “A vacuum-tube guitar amplifier model using long/short-term memory networks,” in Proc. of IEEE SoutheastCon, 2018, pp. 1–5.
[18] E.-P. Damskägg, L. Juvela, E. Thuillier, and V. Välimäki, “Deep learning for tube amplifier emulation,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 471–475.
[19] M. A. M. Ramírez and J. D. Reiss, “Modeling nonlinear audio effects with end-to-end deep neural networks,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 171–175.
[20] D. Giannoulis, M. Massberg, and J. D. Reiss, “Digital dynamic range compressor design—A tutorial and analysis,” Journal of the Audio Engineering Society, vol. 60, no. 6, pp. 399–408, 2012.
[21] S. Hawley, B. Colburn, and S. I. Mimilakis, “Profiling audio compressors with deep neural networks,” 2019.
[22] W. Mitchell and S. H. Hawley, “Exploring quality and generalizability in parameterized neural audio effects,” 2020.
[23] C. J. Steinmetz, “Learning to mix with neural audio effects in the waveform domain,” Master's thesis, Universitat Pompeu Fabra, 2020, https://doi.org/10.5281/zenodo.4091203.
[24] M. A. M. Ramírez and J. D. Reiss, “End-to-end equalization with convolutional neural networks,” in Int. Conf. on Digital Audio Effects (DAFx), 2018.
[25] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv:1803.01271, 2018.
[26] S. A. Billings and S. Fakhouri, “Identification of systems containing linear dynamic and static nonlinear elements,” Automatica, vol. 18, no. 1, pp. 15–26, 1982.
[27] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
[28] D. Rethage, J. Pons, and X. Serra, “A WaveNet for speech denoising,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5069–5073.
[29] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in Proc. of the AAAI Conf. on Artificial Intelligence, 2018.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 1026–1034.
[31] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” in Proc. of the Int. Conf. on Neural Information Processing Systems (NeurIPS), 2016, pp. 4797–4805.
[32] G. Meseguer-Brocal and G. Peeters, “Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations,” in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2019.
[33] N. Zeghidour and D. Grangier, “Wavesplit: End-to-end speech separation by speaker clustering,” arXiv:2002.08933, 2020.
[34] D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez Gutiérrez, “Deep learning based source separation applied to choir ensembles,” in Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), 2020.
[35] C. J. Steinmetz and J. D. Reiss, “auraloss: Audio focused loss functions in PyTorch,” in Digital Music Research Network One-day Workshop (DMRN+15), 2020.
[36] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” arXiv:2005.05106, 2020.
[37] Q. Tian, Y. Chen, Z. Zhang, H. Lu, L. Chen, L. Xie, and S. Liu, “TFGAN: Time and frequency domain based generative adversarial network for high-fidelity speech synthesis,” arXiv:2011.12206, 2020.
[38] S.-W. Fu, C.-F. Liao, and Y. Tsao, “Learning with learned loss function: Speech enhancement with Quality-Net to improve perceptual evaluation of speech quality,” IEEE Signal Processing Letters, vol. 27, pp. 26–30, 2019.
[39] S. Ö. Arık, H. Jun, and G. Diamos, “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Processing Letters, vol. 26, no. 1, pp. 94–98, 2018.
[40] A. Wright and V. Välimäki, “Perceptual loss function for neural modeling of audio systems,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 251–255.
[41] “Algorithms to measure audio programme loudness and true-peak audio level,” International Telecommunications Union, Recommendation ITU-R BS.1770, October 2015.
[42] “Method for the subjective assessment of intermediate quality level of audio systems,” International Telecommunications Union, Recommendation ITU-R BS.1534, 2015.
[43] J. A. Maddams, S. Finn, and J. D. Reiss, “An autonomous method for multi-track dynamic range compression,” in Proc. of the Int. Conf. on Digital Audio Effects (DAFx), 2012.
[44] Z. Ma, B. De Man, P. D. Pestana, D. A. Black, and J. D. Reiss, “Intelligent multitrack dynamic range compression,” Journal of the Audio Engineering Society, vol. 63, no. 6, pp. 412–426, 2015.
[45] M. Schoeffler, S. Bartoschek, F.-R. Stöter, M. Roess, S. Westphal, B. Edler, and J. Herre, “webMUSHRA—A comprehensive framework for web-based listening tests,” Journal of Open Research Software, vol. 6, no. 1, 2018.
[46] A. Wright and V. Välimäki, “Neural modelling of LFO modulated time varying effects,” in Int. Conf. on Digital Audio Effects (DAFx), 2020.
[47] M. A. M. Ramírez, E. Benetos, and J. D. Reiss, “A general-purpose deep learning approach to model time-varying audio effects,” in Int. Conf. on Digital Audio Effects (DAFx), 2019.