CDPAM: Contrastive learning for perceptual audio similarity
Pranay Manocha (Princeton University, USA), Zeyu Jin (Adobe Research, USA), Richard Zhang (Adobe Research, USA), Adam Finkelstein (Princeton University, USA)
ABSTRACT
Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. [1] learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
Index Terms — perceptual similarity, audio quality, deep metric, speech enhancement, speech synthesis
1. INTRODUCTION
Humans can easily compare and recognize differences between audio recordings, but automatic methods tend to focus on particular qualities and fail to generalize across the range of audio perception. Nevertheless, objective metrics are necessary for many automatic processing tasks that cannot incorporate a human in the loop. Traditional objective metrics like PESQ [2] and VISQOL [3] are automatic and interpretable, as they rely on complex handcrafted rule-based systems. Unfortunately, they can be unstable even for small perturbations, and hence correlate poorly with human judgments.

Recent research has focused on learning speech quality from data [1, 4, 5, 6, 7, 8]. One approach trains the learner on subjective quality ratings, e.g., MOS scores on an absolute (1 to 5) scale [4, 5]. Such "no-reference" metrics support many applications but do not apply in cases that require comparison to ground truth, for example for learned speech synthesis or enhancement [9, 10]. Manocha et al. [1] propose a "full-reference" deep perceptual audio similarity metric (DPAM) trained on human judgments. As their dataset focuses on just-noticeable differences (JND), the learned model correlates well with human perception, even for small perturbations. However, the DPAM model suffers from a natural tension between the cost of data acquisition and generalization beyond that data. It requires a large set of human judgments to span the space of perturbations in which it can robustly compare audio clips. Moreover, inasmuch as its dataset is limited, the metric may generalize poorly to unseen speakers or content. Finally, because the data is focused near JNDs, it is likely to be less robust to large audio differences.

To ameliorate these limitations, this paper introduces CDPAM: a contrastive learning-based multi-dimensional deep perceptual audio similarity metric. The proposed metric builds on DPAM using three key ideas: (1) contrastive learning, (2) multi-dimensional representation learning, and (3) triplet learning. Contrastive learning is a form of self-supervised learning that augments a limited set of human annotations with synthetic (hence unlimited) data: perturbations to the data that should be perceived as different (or not). We use multi-dimensional representation learning to separately model content similarity (e.g. among speakers or utterances) and acoustic similarity (e.g. among recording environments). The combination of contrastive learning and multi-dimensional representation learning allows CDPAM to better generalize across content differences (e.g. unseen speakers) with limited human annotation [11, 12]. Finally, to further improve robustness to large perturbations (well beyond JND), we collect a dataset of judgments based on triplet comparisons, asking subjects: "Is A or B closer to reference C?"

We show that CDPAM correlates better than DPAM with MOS and triplet comparison tests across nine publicly available datasets. We observe that CDPAM is better able to capture differences due to subtle artifacts recognizable to humans but not by traditional metrics. Finally, we show that adding CDPAM to the loss function yields improvements to existing state-of-the-art models for speech synthesis [9] and enhancement [10]. The dataset, code, and resulting metric, as well as listening examples, are available here: https://pixl.cs.princeton.edu/pubs/Manocha_2021_CCL/
2. RELATED WORK

2.1. Perceptual audio quality metrics

PESQ [2] and VISQOL [3] were some of the earliest models to approximate the human perception of audio quality. Although useful, these had certain drawbacks: (i) sensitivity to perceptually invariant transformations [13]; (ii) narrow focus (e.g. telephony); and (iii) non-differentiability, which makes them impossible to optimize directly as an objective in deep learning. To overcome the last concern, researchers train differentiable models to approximate PESQ [7, 14]. The approach of Fu et al. [7] uses GANs to model PESQ, whereas the approach of Zhang et al. [14] uses gradient approximations of PESQ for training. Unfortunately, these approaches are not always optimizable, and may also fail to generalize to unseen perturbations.

Instead of using conventional metrics (e.g. PESQ) as a proxy, Manocha et al. [1] proposed DPAM, which was trained directly on a new dataset of human JND judgments. DPAM correlates well with human judgment for small perturbations, but requires a large set of annotated judgments to generalize well across unseen perturbations. Recently, Serrà et al. [8] proposed SESQA, trained using the same JND dataset [1] along with other datasets and objectives like PESQ. Our work is concurrent with theirs, but direct comparison is not yet available.
Fig. 1: Training architecture: (a) We first train an audio encoder using contrastive learning, then (b) train the loss-net on JND data [1]. (c) Finally, we fine-tune the loss-net on the newly collected dataset of triplet comparisons.
2.2. Audio representation learning

The ability to represent high-dimensional audio using compact representations has proven useful for many speech applications (e.g. speech recognition [15], audio retrieval [16]).
Multi-dimensional learning: Learning representations that identify the underlying explanatory factors makes it easier to extract useful information for downstream tasks [17]. Recently in audio, Lee et al. [18] adapted Conditional Similarity Networks (CSN) for music similarity. The objective was to disentangle music into genre, mood, instrument, and tempo. Similarly, Hung et al. [19] explored disentangled representations for the timbre and pitch of musical sounds, useful for music editing. Chou et al. [20] explored the disentanglement of speaker characteristics from linguistic content in speech signals for voice conversion. Chen et al. [21] explored the idea of disentangling phonetic and speaker information for the task of audio representation.
Contrastive learning: Most of the work in audio has focused on audio representation learning [22, 23], speech recognition [24], and phoneme segmentation [25]. To the best of our knowledge, no prior method has used contrastive learning together with multi-dimensional representation learning to learn audio similarity.
3. THE CDPAM METRIC
This section describes how we train the metric in three stages, depicted in Fig. 1: (a) pre-train the audio encoder using contrastive learning; (b) train the loss-net on the perceptual JND data; and (c) fine-tune the loss-net on the new perceptual triplet data.
Contrastive learning: We borrow ideas from the SimCLR framework [26] for contrastive learning. SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. An audio clip is taken and transformations are applied to it to get a pair of augmented audio waveforms x_i and x_j. Each waveform in that pair is passed through an encoder to get representations. These representations are further passed through a projection network to get final representations z. The task is to maximize the similarity between these two representations z_i and z_j for the same audio, in contrast to the representation z_k of an unrelated piece of audio x_k.

In order to combine multi-dimensional representation learning with contrastive learning, we force our audio encoder to output two sets of embeddings: acoustic and content. We learn these separately using contrastive learning. To learn the acoustic embedding, we consider data augmentation that takes the same acoustic perturbation parameters but different audio content, whereas to learn the content embedding, we take different acoustic perturbation parameters but the same audio content. The pre-training dataset consists of roughly 100K examples.
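To make this pairing strategy concrete, here is a minimal Python sketch of how such positive pairs could be assembled. The helper apply_perturbation and the way clips and parameter settings are sampled are illustrative assumptions, not the paper's actual augmentation code.

```python
import random

def make_acoustic_pair(clips, perturb_params, apply_perturbation):
    """Positive pair for the *acoustic* embedding: the same perturbation
    parameters applied to two different audio clips (content differs)."""
    params = random.choice(perturb_params)
    clip_a, clip_b = random.sample(clips, 2)               # different content
    return apply_perturbation(clip_a, params), apply_perturbation(clip_b, params)

def make_content_pair(clips, perturb_params, apply_perturbation):
    """Positive pair for the *content* embedding: the same underlying clip,
    two different perturbation parameter settings (acoustics differ)."""
    clip = random.choice(clips)
    params_a, params_b = random.sample(perturb_params, 2)  # different acoustics
    return apply_perturbation(clip, params_a), apply_perturbation(clip, params_b)
```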
JND dataset: We use the same dataset of crowd-sourced human perceptual judgments proposed by Manocha et al. [1]. In short, the dataset consists of around 55K pairs of human subjective judgments, each pair coming from the same utterance, with annotations of whether the two recordings are exactly the same or different. Perturbations consist of additive linear background noise, reverberation, coding/compression, equalization, and various miscellaneous noises like pops and dropouts.
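As one illustration of this perturbation family, the sketch below mixes background noise into a clean clip at a chosen SNR; the function name and the example SNR value are assumptions made for illustration, not the authors' exact settings.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix background noise into a clean waveform at a target SNR (in dB)."""
    noise = np.resize(noise, clean.shape)                  # loop/trim noise to length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise

# e.g. a noisy version of a recording at 20 dB SNR (value chosen for illustration):
# noisy = add_noise_at_snr(clean_waveform, babble_noise, snr_db=20)
```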
Fine-tuning dataset: To make the metric robust to large (beyond JND) perturbations, we create a triplet comparison dataset. This improves the generalization performance of the metric to include a broader range of perturbations and also enhances the ordering of the learned space. We follow the same framework and perturbations as Manocha et al. [1]. This dataset consists of around 30K paired examples of crowd-sourced human judgments.
Fig. 1(a): The audio encoder consists of a 16-layer CNN with × kernels that is downsampled by half every fourth layer. We use global average pooling at the output to get a 1024-dimensional embedding, equally split into acoustic and content components. The projection network is a small fully-connected network that takes in a feature embedding of size 512 and outputs an embedding of 256 dimensions. The contrastive loss is taken over this projection. We use Adam [30] with a batch size of 16, a learning rate of −, and train for 250 epochs.

We use the NT-Xent loss (normalized temperature-scaled cross-entropy loss) proposed by Chen et al. [26]. The key is to map the two augmented versions of an audio clip (positive pair) to high similarity and all other examples in the batch (negative pairs) to low similarity. For measuring similarity, we use cosine distance. For more information, refer to SimCLR [26].
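For reference, a minimal PyTorch-style sketch of the NT-Xent objective over a batch of paired projections is shown below; the temperature value and tensor layout are illustrative assumptions, and the SimCLR paper [26] remains the authoritative formulation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.1):
    """NT-Xent loss over a batch. z_i, z_j: (N, d) projections of two
    augmented views of the same N audio clips."""
    n = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2N, d), unit norm
    sim = (z @ z.t()) / temperature                        # cosine similarities
    sim.fill_diagonal_(float('-inf'))                      # exclude self-pairs
    # For row k, the positive is the other augmented view of the same clip.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```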
Loss-network (Fig. 1(b-c)): Our loss-net is a small 4-layer fully connected network that takes in the output of the audio encoder (the acoustic embedding) and outputs a distance (using the aggregated sum of the L1 distance of deep features). This network also has a small classification network at the end that maps this distance to a predicted human judgment. Our loss-net is trained using binary cross-entropy between the predicted value and the ground-truth human judgment. We use Adam with a learning rate of −, and train for 250 epochs. For fine-tuning on triplet data, we use MarginRankingLoss with a margin of . , using Adam with a learning rate of − for 100 epochs.

As part of online data augmentation to make the model invariant to small delays, we randomly decide whether to add 0.25 s of silence at the beginning or the end of the audio before presenting it to the network. This gives the model a shift-invariance property, so it can recognize that time-shifted audio is in fact similar. To also encourage amplitude invariance, we randomly apply a small gain (-20 dB to 0 dB) to the training data.
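The following is a rough sketch of these two loss-net stages, assuming an encoder that exposes the acoustic embedding; the layer sizes, the per-example distance computation, and the margin value are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LossNet(nn.Module):
    """Small fully connected head that maps acoustic embeddings to a distance,
    plus a tiny classifier that maps the distance to a same/different judgment."""
    def __init__(self, emb_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(1, 1)   # distance -> logit of "different"

    def distance(self, emb_ref, emb_test):
        # L1 distance between deep features of the two inputs (a simplification
        # of the paper's aggregated deep-feature distance)
        return (self.mlp(emb_ref) - self.mlp(emb_test)).abs().sum(-1, keepdim=True)

    def forward(self, emb_ref, emb_test):
        d = self.distance(emb_ref, emb_test)
        return d, self.classifier(d)

# Stage (b): binary cross-entropy against JND same/different labels.
bce = nn.BCEWithLogitsLoss()
# d, logit = loss_net(encode(ref), encode(test)); loss = bce(logit, label)

# Stage (c): fine-tune on triplets ("is A or B closer to reference C?").
rank_loss = nn.MarginRankingLoss(margin=0.1)   # margin value is an assumption
# d_far  = loss_net.distance(encode(c), encode(farther_clip))
# d_near = loss_net.distance(encode(c), encode(closer_clip))
# loss = rank_loss(d_far, d_near, torch.ones_like(d_far))  # want d_far > d_near
```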
Table 1: Spearman correlation of models with various MOS tests: VoCo [27], FFTnet [28], BWE [29], Dereverb, HiFi-GAN, PEASS, VC, and Noizeus. Models include the conventional metrics MSE, PESQ, and VISQOL, the JND metric DPAM, and ours (CDPAM, default). Higher is better.
4. EXPERIMENTS

4.1. Subjective Validation
We use previously published, diverse third-party studies to verify that our trained metric correlates well with their task. We show the results of our model and compare it with DPAM as well as more conventional objective metrics such as MSE, PESQ [2], and VISQOL [3]. We compute the correlation between the model's predicted distance and the publicly available MOS, using Spearman's rank order correlation (SC). These correlation scores are evaluated per speaker, where we average scores for each speaker for each condition. As an extension, we also check 2AFC accuracy, where we present one reference recording and two test recordings and ask subjects which one sounds more similar to the reference. Each triplet is evaluated by roughly 10 listeners. 2AFC checks for the exact ordering of similarity on a per-sample basis, whereas MOS checks for aggregated ordering, scale, and consistency.
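As a sketch of this protocol, the snippet below averages metric distances and MOS per speaker and condition before computing Spearman's correlation; the record format is a hypothetical one chosen for illustration.

```python
from collections import defaultdict
import numpy as np
from scipy.stats import spearmanr

def per_condition_spearman(records):
    """records: iterable of (speaker, condition, metric_distance, mos) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for speaker, condition, dist, mos in records:
        d, m = grouped[(speaker, condition)]
        d.append(dist)
        m.append(mos)
    mean_dists = [np.mean(d) for d, _ in grouped.values()]
    mean_mos = [np.mean(m) for _, m in grouped.values()]
    rho, _ = spearmanr(mean_dists, mean_mos)
    return rho   # distance vs. MOS: a good metric yields a strong (negative) rho
```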
In addition to all evaluation datasets considered by Manocha et al. [1], we consider additional datasets:

1. Dereverberation [31]: consists of MOS tests to assess the performance of 5 deep learning-based speech enhancement methods.
2. HiFi-GAN [32]: consists of MOS and 2AFC scores to assess improvement across 10 deep learning-based speech enhancement models (denoising and dereverberation).
3. PEASS [33]: consists of MOS scores to assess audio source separation performance across 4 metrics: global quality, preservation of target source, suppression of other sources, and absence of additional artifacts. Here, we only look at global quality.
4. Voice Conversion (VC) [34]: consists of tasks to judge the performance of various voice conversion systems trained using parallel (HUB) and non-parallel data (SPO). Here we only consider HUB.
5. Noizeus [35]: consists of a large-scale MOS study of non-deep-learning-based speech enhancement systems across 3 metrics: SIG (speech signal alone), BAK (background noise), and OVRL (overall quality). Here, we only look at OVRL.

Results are displayed in Tables 1 and 2, in which our proposed metric has the best performance overall. Next, we summarize with a few observations:
Table 2: 2AFC accuracy of various models, including the conventional metrics MSE, PESQ, and VISQOL, the JND metric DPAM, and ours (CDPAM, default), on FFTnet, BWE, HiFiGAN, and Simulated. Higher is better.

• Similar to findings by Manocha et al. [1], conventional metrics like PESQ and VISQOL perform better on measuring large distances (e.g. Dereverb, HiFi-GAN) than subtle differences (e.g. BWE), suggesting that these metrics do not correlate well with human perception when measuring subtle differences.
• We observe a natural compromise between generalizing to large audio differences well beyond JND (e.g. FFTnet, VoCo, etc.) and focusing only on small differences (e.g. BWE). As we see, CDPAM is able to correlate well across a wide variety of datasets, whereas DPAM correlates best where the audio differences are near JND. CDPAM scores higher than DPAM on BWE in MOS correlation, but has a lower 2AFC score, suggesting that DPAM might be better at ordering individual pairwise judgments closer to JND.
• Compared to DPAM, CDPAM performs better across a wide variety of tasks and perturbations, showing higher generalizability across perturbations and downstream tasks.
4.2. Ablation Studies

We perform ablation studies to better understand the influence of different components of our metric, summarized in Table 3. We compare our trained metric at various stages: (i) after self-supervised training; (ii) after JND training; and (iii) after triplet finetuning (default). To further compare amongst self-supervised approaches, we also show results of self-supervised metric learning using a triplet loss. To show improvements due to learning multi-dimensional representations, we also show results of a model trained using contrastive learning without the content dimension. The metrics are compared on (i) robustness to content variations; (ii) monotonic behavior with increasing perturbation levels; and (iii) correlation with subjective ratings from a subset of existing datasets.

Table 3: Ablation studies. Sec. 4.2 describes common area (ComArea, lower is better), monotonicity (Mono, higher is better), and Spearman correlations (higher is better) for 3 datasets (VoCo, FFTnet, BWE). Compared models: DPAM [1], self-supervised (triplet metric learning), contrastive without multi-dimensional representations, self-supervised (contrastive), +JND, and our default model.
Robust to content variations: To evaluate robustness to content variations, we create a test dataset of two groups: one consisting of pairs of recordings that have the same acoustic perturbation levels but varying audio content; the other consisting of pairs of recordings having different perturbation levels and audio content. We calculate the common area between these normalized distributions. Our final metric has the lowest common area, suggesting that it is more robust to changing audio content. Decreasing common area also corresponds with increasing MOS correlations across downstream tasks, suggesting that the task of separating these two distribution groups may be a factor when learning acoustic audio similarity.
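A minimal sketch of one way to compute this common area, via histogram intersection of the two groups' distance distributions, is shown below; the bin count and normalization choices are assumptions.

```python
import numpy as np

def common_area(dists_same_acoustics, dists_diff_acoustics, bins=50):
    """Overlap (in [0, 1]) between the normalized distance distributions of the
    two groups; a lower overlap means the metric separates them more cleanly."""
    lo = min(np.min(dists_same_acoustics), np.min(dists_diff_acoustics))
    hi = max(np.max(dists_same_acoustics), np.max(dists_diff_acoustics))
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(dists_same_acoustics, bins=edges)
    q, _ = np.histogram(dists_diff_acoustics, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())   # histogram intersection
```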
Clustered Acoustic Space: To further quantify this learned space, we also calculate the precision of top-K retrievals, which measures the quality of the top K items in the ranked list. Given 10 different acoustic perturbation groups, each group consisting of 100 files having the same acoustic perturbation levels but different audio content, we take randomly selected queries and calculate the number of correct class instances in the top K retrievals. We report the mean of this metric over all queries (MP_k). CDPAM gets MP_{k=10} = 0.92 and MP_{k=20} = 0.87, suggesting that these acoustic perturbation groups all cluster together in the learned space.
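The sketch below shows one way to compute this mean precision of top-K retrievals over the learned embeddings; the use of Euclidean distance and the data layout are illustrative assumptions.

```python
import numpy as np

def mean_precision_at_k(embeddings, group_labels, k=10):
    """embeddings: (N, d) array; group_labels: (N,) perturbation-group ids.
    For each query, count how many of its k nearest neighbours (excluding
    itself) share its perturbation group, and average this precision."""
    emb = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(group_labels)
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T    # squared L2 distances
    np.fill_diagonal(d2, np.inf)                          # exclude the query itself
    precisions = []
    for i in range(len(emb)):
        topk = np.argsort(d2[i])[:k]
        precisions.append(np.mean(labels[topk] == labels[i]))
    return float(np.mean(precisions))

# e.g. MP_k over the 10 perturbation groups:
# mp10 = mean_precision_at_k(emb, groups, k=10)
# mp20 = mean_precision_at_k(emb, groups, k=20)
```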
Monotonicity: To show our metric's monotonicity with increasing levels of noise, we create a test dataset of recordings with different audio content and increasing noise perturbation levels (both individual and combined perturbations). We calculate SC between the distance from the metric and the perturbation levels. Both DPAM and CDPAM behave equally monotonically with increasing levels of noise.
MOS Correlations: Each of the key components of CDPAM, namely contrastive learning, multi-dimensional representation learning, and triplet learning, has a significant impact on generalization across downstream tasks. Surprisingly, even our self-supervised model has a non-trivial positive correlation with increasing noise, as well as MOS correlation across datasets. This suggests that a self-supervised model not trained on any classification or perceptual task is still able to learn useful perceptual knowledge. This is true across datasets ranging from subtle to large differences, suggesting that contrastive learning can be a useful pre-training strategy.

4.3. Speech Synthesis

We show the utility of our trained metric as a loss function for the task of waveform generation. We use the current state-of-the-art MelGAN [9] vocoder. We train two models: i) a single-speaker model trained on LJ [36], and ii) a cross-speaker model trained on a subset of VCTK [37]. Both models were trained for around iterations until convergence. We take the existing model and simply add CDPAM as an additional loss in the training objective.
We randomly select 500 unseen audio recordings to evaluate results. For the single-speaker model, we use the LJ dataset, whereas for the cross-speaker model, we use the 20-speaker DAPS [38] dataset. We perform A/B preference tests on Amazon Mechanical Turk (AMT), consisting of Ours vs. baseline pairwise comparisons. Each pair is rated by 6 different subjects and then majority-voted to see which method performs better per utterance. As shown in Fig. 2(a), our models outperform the baseline in both categories. All results are statistically significant with p < − . Our model is strongly preferred over the baseline, and the maximum improvement is observed in the cross-speaker scenario, where our model performs best. Specifically, we observe that MelGAN enhanced with CDPAM detects and follows the reference pitch better than the baseline.
Table 4: Evaluation of denoising models (Noisy input, DEMUCS [10], DEMUCS+CDPAM, Finetune CDPAM) using the VCTK [37] test set, with objective measures (PESQ, STOI, CSIG, CBAK, COVL) and one subjective measure (MOS).
Fig. 2: Subjective tests: (a) In pairwise tests, ours is typically preferred over MelGAN for single-speaker and cross-speaker synthesis. (b) MOS tests show denoising methods are improved by CDPAM.
4.4. Speech Enhancement

To further demonstrate the effectiveness of our metric, we use the current state-of-the-art DEMUCS architecture-based speech denoiser [10] and supplement it with CDPAM in two ways: (i) DEMUCS+CDPAM: train from scratch using a combination of L1, multi-resolution STFT, and CDPAM losses; and (ii) Finetune CDPAM: pre-train on the L1 and multi-resolution STFT losses and finetune on CDPAM. The training dataset consists of the VCTK [37] and DNS [39] datasets. For a fair comparison, we only compare real-time (causal) models.

We randomly select 500 audio clips from the VCTK test set and evaluate scores on that dataset. We evaluate the quality of enhanced speech using both objective and subjective measures. For the objective measures, we use: (i) PESQ (from 0.5 to 4.5); (ii) Short-Time Objective Intelligibility (STOI) (from 0 to 100); (iii) CSIG: MOS prediction of the signal distortion attending only to the speech signal (from 1 to 5); (iv) CBAK: MOS prediction of the intrusiveness of background noise (from 1 to 5); and (v) COVL: MOS prediction of the overall effect (from 1 to 5). We compare the baseline model with both our models. Results are shown in Table 4.

For subjective studies, we conducted a MOS listening study on AMT, where each subject is asked to rate the sound quality of an audio snippet on a scale of 1 to 5. In total, we collect around 1200 ratings for each method. We provide studio-quality audio as a reference for high quality, and the input noisy audio as a low anchor. As shown in Fig. 2(b), both our models perform better than the baseline approach. We observe that our Finetune CDPAM model scores the highest MOS. This highlights the usefulness of CDPAM in audio similarity tasks. Specifically, CDPAM can identify and eliminate minor human-perceptible artifacts that are not captured by traditional losses. We also note that higher objective scores do not guarantee higher MOS, further motivating the need for better objective metrics.
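To illustrate how a differentiable metric such as CDPAM can be folded into such training objectives, here is a minimal sketch; the loss weights and the cdpam_distance and multires_stft_loss callables are placeholders standing in for the released metric and the actual DEMUCS/MelGAN training code.

```python
import torch.nn.functional as F

def enhancement_loss(denoised, clean, cdpam_distance, multires_stft_loss,
                     w_l1=1.0, w_stft=1.0, w_cdpam=1.0):
    """Combined objective: waveform L1 + multi-resolution STFT + CDPAM distance.
    The relative weights are illustrative, not the paper's tuned values."""
    l1 = F.l1_loss(denoised, clean)
    stft = multires_stft_loss(denoised, clean)
    perceptual = cdpam_distance(clean, denoised).mean()
    return w_l1 * l1 + w_stft * stft + w_cdpam * perceptual

# "Finetune CDPAM" variant: first train with w_cdpam = 0, then continue
# training with the CDPAM term switched on (a lower learning rate for the
# fine-tuning phase is a reasonable, but assumed, choice).
```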
5. CONCLUSION AND FUTURE WORK
In this paper, we present CDPAM, a contrastive learning-based deep perceptual audio metric that correlates well with human subjective ratings across tasks. The approach relies on multi-dimensional and self-supervised learning to augment limited human-labeled data. We show the utility of the learned metric as an optimization objective for speech synthesis and enhancement, but it could be applied in many other applications. We would like to extend this metric to include content similarity as well, in general going beyond acoustic similarity for applications like music similarity. Though we showed two applications of the metric, future work could also explore other applications like audio retrieval and speech recognition.

6. REFERENCES

[1] P. Manocha, A. Finkelstein, et al., "A differentiable perceptual audio metric learned from just noticeable differences," in Interspeech, 2020.
[2] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment," in ICASSP, 2001.
[3] A. Hines, J. Skoglund, A. C. Kokaram, et al., "ViSQOL: an objective speech quality model," EURASIP Journal on Audio, Speech, and Music Processing, 2015.
[4] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "MOSNet: Deep learning based objective assessment for voice conversion," Interspeech, 2019.
[5] B. Patton, Y. Agiomyrgiannakis, M. Terry, K. Wilson, R. A. Saurous, and D. Sculley, "AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech," in NIPS Workshop, 2016.
[6] S.-W. Fu, C.-F. Liao, and Y. Tsao, "Learning with learned loss function: Speech enhancement with quality-net," SPS, vol. 27, pp. 26–30, 2019.
[7] S.-W. Fu, C.-F. Liao, Y. Tsao, and S. D. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," in ICML, 2019.
[8] J. Serrà, J. Pons, et al., "SESQA: semi-supervised learning for speech quality assessment," arXiv:2010.00368, 2020.
[9] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in NeurIPS, 2019.
[10] A. Defossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Interspeech, 2020.
[11] S. V. Steenkiste, F. Locatello, J. Schmidhuber, and O. Bachem, "Are disentangled representations helpful for abstract visual reasoning?," in NeurIPS, 2019.
[12] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner, "Towards a definition of disentangled representations," arXiv:1812.02230, 2018.
[13] A. Hines, J. Skoglund, A. Kokaram, and N. Harte, "Robustness of speech quality metrics: Comparing ViSQOL, PESQ and POLQA," in ICASSP, 2013.
[14] H. Zhang, X. Zhang, and G. Gao, "Training supervised speech separation system to improve STOI and PESQ directly," in ICASSP, 2018.
[15] A. Conneau, A. Baevski, et al., "Unsupervised cross-lingual representation learning for speech recognition," arXiv:2006.13979, 2020.
[16] P. Manocha, R. Badlani, A. Kumar, A. Shah, B. Elizalde, and B. Raj, "Content-based representations of audio using siamese neural networks," in ICASSP, 2018.
[17] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," PAMI, 2013.
[18] J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, "Disentangled multidimensional metric learning for music similarity," in ICASSP, 2020.
[19] Y. N. Hung, Y. A. Chen, and Y. H. Yang, "Learning disentangled representations for timbre and pitch in music audio," arXiv:1811.03271, 2018.
[20] J. Chou, C. Yeh, H. Lee, and L. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv:1804.02812, 2018.
[21] Y. C. Chen, S. F. Huang, H. Y. Lee, Y. H. Wang, and C. H. Shen, "Audio word2vec: Sequence-to-sequence autoencoding for unsupervised learning of audio segmentation and representation," ASLP, 2019.
[22] A. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv:1807.03748, 2018.
[23] P. Chi, P. Chung, T. Wu, et al., "Audio ALBERT: A lite BERT for self-supervised learning of audio representation," arXiv:2005.08575, 2020.
[24] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Interspeech, 2019.
[25] F. Kreuk, J. Keshet, and Y. Adi, "Self-supervised contrastive learning for unsupervised phoneme segmentation," in Interspeech, 2020.
[26] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in ICML, 2020.
[27] Z. Jin, G. J. Mysore, S. Diverdi, J. Lu, and A. Finkelstein, "VoCo: Text-based insertion and replacement in audio narration," TOG, 2017.
[28] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in ICASSP, 2018.
[29] B. Feng, Z. Jin, J. Su, and A. Finkelstein, "Learning bandwidth expansion using perceptually-motivated loss," in ICASSP, 2019.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[31] J. Su, A. Finkelstein, and Z. Jin, "Perceptually-motivated environment-specific speech enhancement," in ICASSP, 2019.
[32] J. Su, Z. Jin, et al., "HiFi-GAN: High-fidelity denoising and dereverberation," Interspeech, 2020.
[33] V. Emiya, E. Vincent, N. Harlander, et al., "Subjective and objective quality assessment of audio source separation," ASLP, 2011.
[34] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018," arXiv:1804.04262, 2018.
[35] Y. Hu and P. C. Loizou, "Subjective comparison and evaluation of speech enhancement algorithms," Speech Communication, 2007.
[36] K. Ito et al., "The LJ Speech dataset," 2017.
[37] C. Valentini-Botinhao et al., "Noisy speech database for training speech enhancement algorithms and TTS models," 2017.
[38] G. J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges," SPS, 2014.
[39] C. K. Reddy, E. Beyrami, H. Dubey, V. Gopal, et al., "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," in Interspeech, 2020.