[PDF] Learning to Denoise Historical Music

Abstract

We propose an audio-to-audio neural network model that learns to denoise old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting complex spectrogram using a convolutional neural network. The network is trained with both reconstruction and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the quality and details of the original music.

Full PDF

LLEARNING TO DENOISE HISTORICAL MUSIC

Yunpeng Li Beat Gfeller Marco Tagliasacchi Dominik Roblek

Google {yunpeng,beatg,mtagliasacchi,droblek}@google.com

ABSTRACT

We propose an audio-to-audio neural network model thatlearns to denoise old music recordings. Our model inter-nally converts its input into a time-frequency representa-tion by means of a short-time Fourier transform (STFT),and processes the resulting complex spectrogram using aconvolutional neural network. The network is trained withboth reconstruction and adversarial objectives on a syn-thetic noisy music dataset, which is created by mixingclean music with real noise samples extracted from quietsegments of old recordings. We evaluate our method quan-titatively on held-out test examples of the synthetic dataset,and qualitatively by human rating on samples of actualhistorical recordings. Our results show that the proposedmethod is effective in removing noise, while preservingthe quality and details of the original music.

1. INTRODUCTION

Archives of historical music recordings are an impor-tant means for preserving cultural heritage. Most suchrecords, however, were created with outdated equipment,and stored on analog media such as phonograph recordsand wax cylinders. The technological limitation of therecording process and the subsequent deterioration of thestorage media inevitably left their marks, manifested bythe characteristic crackling, clicking, and hissing noisesthat are typical in old records. While “remastering” em-ployed by the recording industry can substantially improvethe sound quality, it is a time-consuming process of man-ual labor. The focus of this paper is an automated methodthat learns from data to remove noise and restore music.Audio denoising has a long history in signal process-ing [1]. Traditional methods typically use a simpliﬁedstatistical model of the noise, whose parameters are es-timated from the noisy audio. Examples of these tech-niques are spectral noise subtraction [2, 3], spectral mask-ing [4, 5], statistical methods based on Wiener ﬁltering [6]and Bayesian estimators [7, 8]. Many of these approaches,however, focus on speech. Moreover, they often makesimplifying assumptions about the structure of the noise, c (cid:13) Y. Li, B. Gfeller, M. Tagliasacchi, and D. Roblek. Li-censed under a Creative Commons Attribution 4.0 International License(CC BY 4.0).

Attribution:

Y. Li, B. Gfeller, M. Tagliasacchi, and D.Roblek, “Learning to Denoise Historical Music”, in

Proc. of the 21st Int.Society for Music Information Retrieval Conf.,

Montréal, Canada, 2020. which makes them less effective on non-stationary real-world noise.Recent advances in deep learning saw the emergenceof data-driven methods that do not make such a priori as-sumptions about noise. Instead they learn an implicit noisemodel from training examples, which typically consist ofpairs of clean and noisy versions of the same audio in a su-pervised setup. Crucial challenges facing the adoption ofthe deep learning paradigm for our task are: i) can we de-sign a model powerful enough for the complexity of music,yet simple and fast enough to be practical, and ii) how canwe train such a model, given that we have no clean groundtruth for historical recordings? In this paper, we addressthese issues and show that it is indeed feasible to build aneffective and efﬁcient model for music denoising.

Sparse linear regression with structured priors is used in [9]to denoise music from synthetically added white Gaussiannoise, obtaining large SNR improvements on a “glock-enspiel” excerpt, and on an Indian polyphonic song. [10]considers the problem of removing artifacts of perceptualcoding audio compression with low bit-rates. That work,which uses LSTMs, is the ﬁrst successful application ofdeep learning for this type of music audio restoration. Notethat in contrast to our work, aligned pairs of original andcompressed audio samples are readily available. Statisticalmethods are applied in [11] to denoise Greek Folk musicrecorded in outdoor festivities. In [12], the author appliesstructured sparsity models to two speciﬁc audio recordingsthat were digitized from wax cylinders, and describes theresults qualitatively. In [13], the authors describe how toﬁll in gaps (at known positions) of several seconds in mu-sic audio, using self-similar parts from the recording itself.Our method is also related to audio super-resolution,also known as bandwidth extension. This is the process ofextending audio from low to higher sample rates, which re-quires restoring the high frequency content. In [14,15] twoapproaches which work for music are described. On pianomusic, for example, [15] obtains an SNR of 19.3 when up-sampling a low-pass ﬁltered audio from 4kHz to 16kHz.Many existing denoising approaches focus on speechinstead of music [16–19]. Given that these two domainshave very different properties, it is not clear a priori howwell such methods transfer to the music domain. Never-theless, our work is inspired by recent approaches that usegenerative adversarial networks (GANs) to improve thequality of audio [18, 20, 21]. For example, [21] obtains a r X i v : . [ ee ss . A S ] A ug igniﬁcant improvements denoising speech and applausesounds that have been decoded at a low bit-rate, using awave-to-wave convolutional architecture.In this paper, we present a method to remove noise fromhistorical music recordings, using two sources of audio: i)a collection of historical music recordings to be restored,for which no clean reference is available, and ii) a separatecollection of music of the same genre that contains high-quality recordings. We focus on classical music, for whichboth public domain historical recordings as well as mod-ern digital recordings are available. This paper makes thefollowing contributions: • We provide a fully automated approach that suc-ceeds in removing noise from historical recordings,while preserving the musical content in high quality.Quality is measured in terms of SNR and subjectivescores inspired by MUSHRA [22], and examples onreal historical recordings are provided . • Our approach employs a new architecture that trans-forms audio in the time domain, using a multi-scaleapproach, combined with STFT and inverse STFT.As this architecture is able to output high-qualitymusic, it may be a useful architecture for other tasksthat involve the transformation of music audio. • We provide an efﬁcient and fully automated methodto extract noise segments (without music) from acollection of historical music recordings. This is akey ingredient of our approach, as it allows us to cre-ate synthetic pairs of audio samples.The rest of this paper is organized as follows. Our ap-proach is described in detail in Section 2, and experimentalresults are given in Section 3. We conclude in Section 4.

2. METHOD

Our model is an audio-to-audio generator learned frompaired examples with both reconstruction and adversarialobjectives.

For training, we use time-aligned pairs of examples, where clean music is used as targets, and noisymusic as inputs to the generator. We take a data-driven ap-proach to generate noisy audio from clean references. Wesynthesize noisy samples by simulating the degradationprocess affecting the historical recordings, namely apply-ing band-pass ﬁltering, followed by additive mixing withnoise samples extracted from “quasi-silence” segments ofhistorical recordings.Speciﬁcally, we scan the noisy historical recordingslooking for low-energy segments in the time domain,which corresponds to pauses in the musical scores. Tothis end, we compute the rolling standard deviation fromthe raw audio samples with a window size equal to 100ms. Then, we estimate an adaptive threshold τ based on the q -th quantile of the standard deviations and keep the seg-ments that satisfy the following two conditions: i) thelocal standard deviation is below τ , and ii) the segmenthas a minimum duration of T . Intuitively, the value of q is selected based on a trade-off between the number ofextracted segments and the need of extracting noise-onlysegments. In our experiments, we set q = 0 . % and T = 100 ms. In this way, from 801 different recordings,we are able to extract around 8900 noise samples.From each of these short noise segments, we need togenerate noise samples having the same length as the cleanaudio references. We do this by replicating the noise seg-ment in time, using overlap-and-add (OLA) with an over-lap equal to 20% of the segment length. Given the short du-ration of most noise segments, this operation alone wouldlead to periodic noise patterns which differ from the noisecharacteristics found in historical recordings. Therefore,we alter each noise segment replica before the OLA syn-thesis step in two ways: i) applying a random perturbationto the phase of the noise segment (adding Gaussian noise ∼ N (0 , . to the phase of the STFT); ii) applying a ran-dom shift in time (with wraparound). We found that thesesimple operations produce longer noise samples with audi-tory characteristics similar to the ones encountered in thehistorical recordings, avoiding artiﬁcial periodic patterns.Finally, we create time-aligned pairs of examples by: i) applying band-pass ﬁltering with cut-off frequencies randomly sampled in [50Hz, 150Hz] and[5kHz, 10kHz], respectively; ii) mixing a randomly se-lected noise sample with a gain in the range [10dB, 30dB]. The generator processes the audio in the time-frequencydomain. It ﬁrst computes the STFT of the input, the realand imaginary components of which are then fed as a 2-channel image to a 2D convolutional U-Net [23] followedby an inverse STFT back to the time domain. Finally theoutput is added back to the input, making the model aresidual generator.The U-Net in our generator is a symmetric encoder-decoder network with skip-connections, where the archi-tecture of the decoder layers mirrors that of the encoderand the skip-connections run between each encoder blockand its mirrored decoder block. Each encoder block isa 3 × × × × × × -block (N=64, S=(1, 2))STFTInputConv2D (k=(7, 7), n=32)E-block (N=64, S=(1, 2))E-block (N=128, S=(2, 2))E-block (N=128, S=(1, 2))E-block (N=256, S=(2, 2))E-block (N=512, S=(2, 2))Conv2D (k=(3, 3), n=128) Conv2D (k=(3, 3), n=512)D-block (N=256, S=(2, 2))Maybe down-sample D-block (N=128, S=(2, 2))D-block (N=128, S=(1, 2))D-block (N=64, S=(2, 2))D-block (N=64, S=(1, 2))D-block (N=32, S=(1, 2))Conv2D (k=(7, 7), n=2)Inverse STFTUp-sampleOuput D-block (N, S=(S , S )) Conv2D (k=(3, 3), n=N/2)TransposedConv2D(k=(S +2, S +2),n=N/2, stride=S) Proj.Upsamp. E-block (N, S=(S , S )) Conv2D (k=(3, 3), n=N)Conv2D (k=(S +2, S +2),n=N, stride=S) Figure 1 . Generator architecture. Dashed-line com-ponents are included on a need-to-have basis: Up/down-sampling of the input/output audio is needed for process-ing at coarser resolutions in a multi-scale setup; The lin-ear projection (by 1x1 convolution) in the decoder block ispresent only when the output of the block has a differentnumber of channels from its input.is followed by a linear projection using 1x1 convolutionwhen the output has a different number of channels fromthe input. We do not include a shortcut in the encoderblock, since it already shares the same input with a U-Netskip connection and therefore only needs to produce theresidual complementary to the skip path. The architectureof the generator is shown in Figure 1.We use two discriminators for the adversarial objective,one in the waveform domain and one in the STFT domain.The STFT discriminator has the same architecture as theencoder module of the generator. For the waveform dis-criminator, we use the same architecture as MelGAN [25]except that we only double (instead of quadruple) the num-ber of channels in the down-sampling layers. We foundthis light-weight version to be sufﬁcient in our setup, andthat using the full version had no additional beneﬁt. Bothdiscriminators are fully convolutional. Hence the wave-form discriminator produces a 1D output spanning the timedomain, and the STFT discriminator has a 2D output span-ning the time-frequency domain.We use weight normalization [26] and ELU activa-tion [27] in the generator, while layer normalization [28]and Leaky ReLU activation [29] with α = 0 . are used inthe discriminator. In the generator, the STFT is represented by a 2-channelimage, where the channels are the real and imaginary com-ponents. We also explored a polar representation, wherethe channels are the modulus and the phase; additionallywe experimented with processing only the modulus chan-nel and reusing the original phase, as is done in [30]. Nev-ertheless, we found the real/imaginary representation toperform better in our experiments. Furthermore, we tried aligning the phase so that thephase in each frame is coherent with a global reference(e.g., the ﬁrst frame) rather than its local STFT window.Again, we observed no advantage in doing so, which sug-gests that the neural network is capable of internally han-dling the phase offsets. Unlike [30], we do not convertSTFT to logarithmic scale as we found it be detrimental toperformance (even with various smoothing and normaliza-tion schemes).

We can further stack multiple copies of the generator de-scribed above, each with its own separate parameters, in acoarse-to-ﬁne fashion: The generators at earlier stages pro-cess the audio at reduced temporal resolutions, whereas thelater-stage generators focus on restoring ﬁner details. Thisis equivalent to halving the sampling rate in each scale.This type of multi-scale generation scheme is routinelyused in computer vision and graphics to produce high-resolution images (e.g., [31]).Let K be the total number of scales, then generator G k at scale k ( k ∈ { , . . . , K − } ) down-samples its input bya factor of k before computing the STFT and up-samplesthe output residual (after computing the inverse STFT) bythe same factor to match the resolution of the input. Theoverall generator G is the composite of G ◦ · · · ◦ G K − .Compared with simply stacking U-Nets all at the orig-inal input resolution, as done in [32], the beneﬁt of themulti-scale approach is two-fold: i) the asymptotic compu-tational complexity is constant with respect to the numberof scales, as opposed to linear in [32], due to exponentiallydecreasing input sizes at coarser levels; ii) the intermediateoutputs of the generator correspond to the input audio pro-cessed at lower resolutions, which allows us to meaning-fully impose multi-scale losses on the intermediate outputsin addition to the ﬁnal output. We will describe how thiscan be accomplished in the next section. The generator can be trained using the reconstruction lossbetween the denoised output and the clean target. This canbe further complemented with an adversarial loss, given bydiscriminators trained simultaneously with the generator, apractice often used in audio enhancement (e.g., [18,20,30],among others). In the case of our multi-scale generator, weuse the same number of waveform and STFT discrimina-tors as generator scales. This way, there is one discrimina-tor of both types for each of the (down-sampled) interme-diate outputs and ﬁnal output in each domain. For the ad-versarial loss, we use the hinge loss averaged over multiplescales. Since the discriminators are convolutional, this lossis further averaged over time for the waveform discrimina-tor and over time-frequency bins for the STFT discrimina-tor. Similarly, the reconstruction loss is also imposed onthe outputs at each scale.More formally, let ( x, y ) denote a training example,where x is the noisy input and y is the clean target, and k ∈ { , . . . , K − } denote the scale index. Hence y k is thelean audio down-sampled to scale k , and ˆ y k represents theintermediate output of the generator G k ◦ · · · ◦ G K − ( x ) down-sampled to the same scale. Note that for the ﬁnestscale k = 0 at full resolution, y = y is simply the origi-nal clean audio and ˆ y (cid:44) ˆ y = G ( x ) is the ﬁnal output ofthe generator. Thus the L reconstruction loss in the STFTdomain can be written as L rec G = E ( x,y ) (cid:34)(cid:88) k (cid:107) ω k − ˆ ω k (cid:107) S STFT k (cid:35) , (1)where 2D complex tensors ω k and ˆ ω k denote the STFT ofdown-sampled clean audio y k and generator output ˆ y k forscale k , respectively, and S STFT k is the total number of time-frequency bins in ω k and ˆ ω k . We ﬁnd this STFT-basedreconstruction loss to perform better than either imposingper-sample losses directly in the waveform domain or us-ing losses computed from the internal “feature” layers ofdiscriminators (e.g. [25]).For the adversarial loss, let t denote the temporal indexover all T k logits of the waveform discriminator at scale k (recalling that the discriminators are fully convolutional)and let s denote the index over all S k logits of the STFTdiscriminator. Then discriminator losses in the wave andSTFT domains can be written as, respectively, L wave D = E y (cid:88) k,t T k max(0 , − D wave k,t ( y k ))  + E x (cid:88) k,t T k max(0 , D wave k,t (ˆ y k ))  (2) L STFT D = E y (cid:88) k,s S k max(0 , − D STFT k,s ( y k ))  + E x (cid:88) k,s S k max(0 , D STFT k,s (ˆ y k ))  , (3)and the corresponding adversarial loss for the generator isgiven by L adv G = L adv, wave G + L adv, STFT G = E x (cid:88) k,t T k max(0 , − D wave k,t (ˆ y k )) + (cid:88) k,s S k max(0 , − D STFT k,s (ˆ y k ))  . (4)The overall generator loss is a weighted sum of the ad-versarial loss and the reconstruction loss, i.e., L G = L rec G + λ · L adv G . (5)We set the weight of the adversarial loss λ to . inall our experiments, except those where we do not use dis-criminators (which corresponds λ =0). We train the modelwith TensorFlow for 400,000 steps using the ADAM [33] optimizer, with a batch size of 16 and a constant learningrate of 0.0001 with β = 0 . and β = 0 . . For the STFT,we use a window size of 2048 and a hop size of 512 whenthere is only a single scale. For each added scale we halvethe STFT window size and hop size everywhere . This waythe STFT window at the coarsest scale has a receptive ﬁeldof 2048 samples at the original resolution, whereas ﬁnerlevels have smaller receptive ﬁelds and hence focus moreon higher frequencies.Our model has around 9 million parameters per scale inthe generator. At inference-time, it takes less than half asecond for every second of input audio on a modern CPUand more than an order of magnitude faster on GPUs.

3. EXPERIMENTS

We evaluate our model on a dataset of synthetically gener-ated noisy-clean pairs, using both objective and subjectivemetrics. In addition, we also provide a subjective evalua-tion on samples from real historical recordings, for whichthe clean references are not available.

Our data is derived from two sources: i) digitized histori-cal music recordings from the Public Domain Project [34],and ii) a collection of classical music recordings of CD-quality. The historical recordings are used in two ways: i)to extract realistic noise from relatively silent portions ofthe audio, as described in Section 2.1; and ii) to evaluatedifferent methods based on the human-perceived subjec-tive quality of their outputs. The modern recordings areused for mixing with the extracted noise samples to cre-ate synthetic noisy music, as well as serving as the cleanground truth. We additionally ﬁlter our data to retain onlyclassical music, as it is by far the most represented genrein historical recordings. The resulting dataset consists ofpairs of clean and noisy audio clips, both monophonic and5 seconds long, sampled at 44.1kHz. The total duration ofthe clean clips is 460h.

We quantitatively evaluate the performance of differentmethods on a held-out test set of 1296 examples from thesynthetic noisy music dataset. For the neural network mod-els, whose training is stochastic, we repeat the training pro-cess 10 times for each model and report the mean for eachmetric and its standard error.

Evaluation metrics:

Objective metrics such as thesignal-to-noise ratio (SNR) faithfully measure the differ-ence between two waveforms on a per-sample basis, butthey often do not correlate well with human-perceived re-construction quality. Therefore, we additionally measurethe

VGG distance between the ground truth and the de-noised output, which is deﬁned as the L distance be-tween their respective embeddings computed by a VGGishnetwork [35]. The embedding network is pre-trainedfor multi-label classiﬁcation tasks on the YouTube-100Mdataset, in which labels are assigned automatically based SNR (dB) - ∆ VGG1 scale ± ± ± ± ± ± Table 1 . Performance of our model with different num-bers of scales K in terms of SNR gain ( ∆ SNR) and VGGdistance reduction (- ∆ VGG). Higher is better.on a combination of metadata (title, description, com-ments, etc.), context, and image content for each video.Hence we expect the VGG distance to focus more onhigher-level features of the audio and less on per-samplealignment. Note that the same embedding used by Frechétaudio distance (FAD) [36], which measures the distancebetween two distributions . However, FAD does not com-pare the content of individual audio samples, and is hencenot applicable to denoising.We report the SNR gain ( ∆ SNR) and VGG distancereduction (- ∆ VGG) of the denoised output relative to thenoisy input, averaged over the test set. For reference, thenoisy input has an average SNR of 14.4dB and VGG dis-tance of 2.09. Table 1 shows the performance of our modelwith different numbers of scales. We use K = 2 scales forthe rest of our experiments. We evaluate variants of ourproposed model in an ablation study and compare with al-ternative approaches and well-established signal process-ing baselines: • Ours, λ =0 : Our model trained with only reconstruc-tion loss. • Ours, λ =0.01 : Our model trained with both adver-sarial and reconstruction losses. • Ours, bypass phase : Same as above, except thatthe phase of the noisy input is reused and only themodulus of the STFT is processed by the U-Net (asa single-channel image). This is similar to the ap-proach of [30], but trained and evaluated for musicdenoising instead of speech. • MelGAN-UNet : A 1D-convolutional waveform-domain generator inspired by MelGAN [25], wherethe decoder is the same as the generator of MelGANand the encoder mirrors the decoder. • DeepFeature generator : The 1D-convolutionalwaveform-domain generator of [17], which does notuse U-Net but rather a series of 1D convolutions withexponentially increasing dilation sizes. Unlike U-Net, the temporal resolution and number of channelsremain unchanged in all layers of this network. • log-MMSE : A short-time spectral amplitude esti-mator for speech signals which minimizes the mean-square error of the log-spectra [37]. In our imple-mentation, the estimation of the noise spectrum isbased on low-energy frames across the whole clip,rather than considering the frames at the start of the audio clip. We use this deviation from the standardimplementation as it gives better SNR results. • Wiener : A linear time-invariant ﬁlter that minimizesthe mean-square error. We adopted the SciPy [38]implementation and used default parameters, as dif-ferent parameter settings did not improve the results.For waveform-domain generators, we tried waveform-domain losses – including reconstruction losses in the “fea-ture space” of discriminator internal layers [17, 25] – aswell as STFT-domain losses, and found the former to workbetter with the DeepFeature generator while the latter gavebetter results for the MelGAN-UNet generator. The resultsshown for these generators are those obtained with the bet-ter loss variant. We also divide the test set into three sub-sets, each containing the same number of examples, withlow noise (avg. 19.8dB SNR), medium noise (avg. 14.2dBSNR), and high noise (avg. 9.4dB SNR), and compute thesame metrics on each subset as well as on the full test set.The results in Table 2 show that, for all noise levels,our model consistently outperforms the signal processingbaselines and the waveform-domain neural network mod-els, which have proven highly successful in speech en-hancement but are not adequate for the complexity of mu-sic signals. The signal-processing baselines (log-MMSEand Wiener ﬁltering) are hardly able to improve upon thenoisy input at all. This is not too surprising given the non-Gaussian, non-white nature of the real-world noise in theevaluation data. Comparing the results among the variantsof our model, we further make the following observations: • Using adversarial losses does not help in terms ofSNR, as is evident from the top two rows of Table 2.The SNR decrease is small but signiﬁcant. The ad-versarially trained variant, however, scores better onthe high-level feature oriented VGG distance metric,which is in line with past observations [18, 25] • It is advantageous to take both the modulus and thephase into account when processing the STFT spec-trogram, as the “bypass-phase” variant which reusesthe input phase produces consistently worse resultsacross all noise levels. This shows that the proposedmodel is able to reconstruct the ﬁne-grained phasecomponent of the original clean music.

In the previous section we compared results by means ofobjective quality metrics, which can be quantitatively com-puted from pairs of noisy-clean examples. These metricscan be conveniently used to systematically run an evalua-tion over a large number of samples. However, it is dif-ﬁcult to come up with an objective metric that correlateswith quality as perceived by human listeners. Indeed, theSNR and VGG distance metrics do not agree in our quan-titative evaluation – the proposed model is better in termsof VGG distance, but worse in terms of SNR compared toits counterpart without discriminator. We now describe our

SNR (dB) - ∆ VGGnoise level noise levellow medium high all low medium high allOurs, λ =0 ± ± ± ± ± ± ± ± Ours, λ =0.01 2.2 ± ± ± ± ± ± ± ± Ours, bypass phase 2.1 ± ± ± ± ± ± ± ± MelGAN-UNet 1.7 ± ± ± ± ± ± ± ± DeepFeature generator -0.7 ± ± ± ± ± ± ± ± log-MMSE -1.4 -0.2 0.1 -0.5 -0.15 -0.04 0.01 -0.07Wiener 0.1 0.1 0.1 0.1 0.01 0.02 0.01 0.01 Table 2 . Performance of different variants of our model and alternative approaches, evaluated on subsets of examples withdifferent noise levels as well as on the full test set. s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 all sample s c o r e d i ff e r e n c e logMMSE=0.01=0 Figure 2 . Average score differences for the historicalrecordings dataset, relative to the original noisy sample.subjective evaluation which we ran in order to identify themethod that performs best when judged by humans.Following recent work on low-bitrate audio improve-ment [21], we use a score inspired by MUSHRA [22] forour subjective evaluation. Each rater assigned a score be-tween 0 and 100 to each sample. The main difference toactual MUSHRA scores is that since no clean reference ex-ists for historical recordings, we do not include an explicitreference in the rated samples (although we do include theclean sample in the synthetic dataset evaluation).We perform our evaluation on 10 samples of historicalrecordings, and separately on 10 samples from the syn-thetic dataset, using 11 human raters. As in the objec-tive evaluation, each sample is 5 seconds long. We eval-uate the following four versions for each sample: (i) Orig-inal historic audio example, (ii) denoised example usingour model with λ =0.01, (iii) denoised example using ourmodel with λ =0, (iv) denoised example using log-MMSE.For the synthetic dataset, we use the four versionsabove, but instead of the historic audio we use the syn-thetically noisiﬁed one. We do not include Wiener ﬁlter-ing as a competing baseline here since we noticed that itproduces outputs that are consistently near-identical to thenoisy input, and hence including it in the subjective eval-uation would provide little value. We use the originalnoisy audio as the reference from which to compute scoredifferences for the historical recordings, and the synthet- s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 all sample s c o r e d i ff e r e n c e logMMSE=0.01=0clean Figure 3 . Average score differences for the syntheticdataset, relative to the noisy sample.ically noisiﬁed sample as the reference for the syntheticdata. The results are shown in Figure 2 for the historicalrecordings, and in Figure 3 for the synthetic dataset. Errorbars are 95% conﬁdence intervals, assuming a Gaussiandistribution of the mean. Both of our methods signiﬁcantlyimprove the historical recordings, by around 50 points onaverage. In comparison, the logMMSE baseline only im-proves by an average of 16 points. We also performed aWilcoxon signed-rank test between our λ =0.01 and λ =0models, to ﬁnd that the difference is statistically signiﬁcant(p-value < . × − ). On the synthetic data, again the λ =0 model outperforms the λ =0.01 variant, with a p-value < . × − . On the other hand, there is no signiﬁcantdifference between the mean score differences of the λ =0model and the clean sample (p-value = . ).

4. CONCLUSION

We presented a learning-based method for automated de-noising and applied it to restoration of noisy historical mu-sic recordings, matching a high quality bar: Judged by hu-man listeners on actual historical records, our method im-proves audio quality by a large margin and strongly out-performs existing approaches on a MUSHRA-like qualitymetric. On artiﬁcially noisiﬁed music, it even attains aquality level that listeners found to be statistically indis-tinguishable from the ground truth. . REFERENCES [1] S. H. Godsill and P. J. Rayner,

Digital Audio Restora-tion: A Statistical Model Based Approach , 1st ed.Berlin, Heidelberg: Springer-Verlag, 1998.[2] M. Berouti, R. Schwartz, and J. Makhoul, “Enhance-ment of speech corrupted by acoustic noise,” in

IEEEInternational Conference on Acoustics, Speech, andSignal Processing (ICASSP) , vol. 4, 1979, pp. 208–211.[3] S. Kamath and P. Loizou, “A multi-band spectral sub-traction method for enhancing speech corrupted bycolored noise,” in

IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP) ,vol. 4, 05 2002.[4] A. M. Reddy and B. Raj, “Soft mask methods forsingle-channel speaker separation,”

IEEE Transactionson Audio, Speech, and Language Processing , vol. 15,no. 6, pp. 1766–1776, 2007.[5] E. M. Grais and H. Erdogan, “Single channel speechmusic separation using nonnegative matrix factoriza-tion and spectral masks,” in

International Conferenceon Digital Signal Processing (DSP) , 2011, pp. 1–6.[6] P. Scalart and J. V. Filho, “Speech enhancement basedon a priori signal to noise estimation,” in , vol. 2, 1996,pp. 629–632.[7] H. Attias, J. C. Platt, A. Acero, and L. Deng, “Speechdenoising and dereverberation using probabilistic mod-els,” in

Advances in Neural Information ProcessingSystems 13 , 2001, pp. 758–764.[8] P. C. Loizou, “Speech enhancement based on percep-tually motivated bayesian estimators of the magnitudespectrum,”

IEEE Transactions on Speech and AudioProcessing , vol. 13, no. 5, pp. 857–869, 2005.[9] C. Fevotte, B. Torresani, L. Daudet, and S. J. Godsill,“Sparse linear regression with structured priors and ap-plication to denoising of musical audio,”

IEEE Trans-actions on Audio, Speech, and Language Processing ,vol. 16, no. 1, pp. 174–185, 2008.[10] J. Deng, B. W. Schuller, F. Eyben, D. Schuller,Z. Zhang, H. Francois, and E. Oh, “Exploit-ing time-frequency patterns with LSTM-RNNsfor low-bitrate audio restoration,”

Neural Com-puting and Applications , vol. 32, no. 4, pp.1095–1107, 2020. [Online]. Available: https://doi.org/10.1007/s00521-019-04158-0[11] N. Bassiou, C. Kotropoulos, and I. Pitas, “Greek folkmusic denoising under a symmetric α -stable noise as-sumption,” in , 2014, pp. 18–23. [12] V. Mach, “Denoising phonogram cylinders recordingsusing structured sparsity,” in , 2015, pp.314–319.[13] N. Perraudin, N. Holighaus, P. Majdak, and P. Bal-azs, “Inpainting of long audio segments with similaritygraphs,” IEEE/ACM Transactions on Audio, Speech,and Language Processing , vol. PP, pp. 1–1, 02 2018.[14] V. Kuleshov, S. Z. Enam, and S. Ermon, “Audio superresolution using neural networks,” in , 2017.[15] S. Birnbaum, V. Kuleshov, Z. Enam, P. Koh, and S. Er-mon, “Temporal ﬁlm: Capturing long-range sequencedependencies with feature-wise modulations,” in

Proc.33rd Annual Conference on Neural Information Pro-cessing Systems (NeurIPS 2019) , 2019.[16] M. Michelashvili and L. Wolf, “Audio denoising withdeep network priors,”

CoRR , vol. abs/1904.07612,2019. [Online]. Available: http://arxiv.org/abs/1904.07612[17] F. G. Germain, Q. Chen, and V. Koltun, “Speech de-noising with deep feature losses,” 2018.[18] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN:Speech enhancement generative adversarial network,”in

Interspeech 2017, 18th Annual Conference of theInternational Speech Communication Association , 082017, pp. 3642–3646.[19] D. Rethage, J. Pons, and X. Serra, “A wavenet forspeech denoising,” in

IEEE International Conferenceon Acoustics, Speech and Signal Processing (ICASSP) ,2018, pp. 5069–5073.[20] C. Donahue, B. Li, and R. Prabhavalkar, “Ex-ploring speech enhancement with generative ad-versarial networks for robust speech recognition,”

CoRR , vol. abs/1711.05747, 2017. [Online]. Available:http://arxiv.org/abs/1711.05747[21] A. Biswas and D. Jia, “Audio codec enhancementwith generative adversarial networks,” in

IEEE Inter-national Conference on Acoustics, Speech, and SignalProcessing (ICASSP)

Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 , N. Navab,J. Hornegger, W. M. Wells, and A. F. Frangi, Eds.Cham: Springer International Publishing, 2015, pp.234–241.24] A. Odena, V. Dumoulin, and C. Olah, “De-convolution and checkerboard artifacts,”

Distill ,2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard[25] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z.Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, andA. Courville, “MelGAN: Generative adversarial net-works for conditional waveform synthesis,” 2019.[26] T. Salimans and D. P. Kingma, “Weight normalization:A simple reparameterization to accelerate training ofdeep neural networks,” in

Advances in Neural Informa-tion Processing Systems 29 , D. D. Lee, M. Sugiyama,U. V. Luxburg, I. Guyon, and R. Garnett, Eds. CurranAssociates, Inc., 2016, pp. 901–909.[27] S. H. Djork-Arné Clevert, Thomas Unterthiner, “Fastand accurate deep network learning by exponential lin-ear units (elus),” in

ICLR , 2016.[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-ization,” arXiv preprint arXiv:1607.06450 , 2016.[29] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectiﬁernonlinearities improve neural network acoustic mod-els,” in

ICML Workshop on Deep Learning for Audio,Speech and Language Processing , 2013.[30] S. Abdulatif, K. Armanious, K. Guirguis, J. T. Sajeev,and B. Yang, “Aegan: Time-frequency speech denois-ing via generative adversarial networks,” 2019.[31] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progres-sive growing of gans for improved quality, stability,and variation,” in . OpenReview.net, 2018. [Online]. Avail-able: https://openreview.net/forum?id=Hk99zCeAb[32] K. Armanious, C. Yang, M. Fischer, T. Küstner,K. Nikolaou, S. Gatidis, and B. Yang, “Medgan:Medical image translation using GANs,”

CoRR , vol.abs/1806.06397, 2018. [Online]. Available: http://arxiv.org/abs/1806.06397[33] D. Kingma and J. Ba, “Adam: A method for stochasticoptimization,”

International Conference on LearningRepresentations , 12 2014.[34] “Public domain project,” http://pool.publicdomainproject.org, [Online; accessed February-2020]. [Online]. Available: http://pool.publicdomainproject.org[35] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F.Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt,R. A. Saurous, B. Seybold, M. Slaney, R. Weiss,and K. Wilson, “CNN architectures for large-scaleaudio classiﬁcation,” in

International Conference onAcoustics, Speech and Signal Processing (ICASSP) ,2017. [Online]. Available: https://arxiv.org/abs/1609.09430 [36] K. Kilgour, M. Zuluaga, D. Roblek, and M. Shariﬁ,“Fréchet audio distance: A reference-free metricfor evaluating music enhancement algorithms,” in

Interspeech 2019, 20th Annual Conference of theInternational Speech Communication Association,Graz, Austria, 15-19 September 2019 , G. Kubinand Z. Kacic, Eds. ISCA, 2019, pp. 2350–2354. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2219[37] Y. Ephraim and D. Malah, “Speech enhancement usinga minimum mean-square error log-spectral amplitudeestimator,”

IEEE Transactions on Acoustics, Speech,and Signal Processing , vol. 33, no. 2, pp. 443–445,1985.[38] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haber-land, T. Reddy, D. Cournapeau, E. Burovski, P. Pe-terson, W. Weckesser, J. Bright, S. J. van der Walt,M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov,A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey,˙I. Polat, Y. Feng, E. W. Moore, J. Vand erPlas, D. Lax-alde, J. Perktold, R. Cimrman, I. Henriksen, E. A.Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro,F. Pedregosa, P. van Mulbregt, and S. . . Contribu-tors, “SciPy 1.0: Fundamental Algorithms for Scien-tiﬁc Computing in Python,”