WARP-Q: Quality Prediction For Generative Neural Speech Codecs
Wissam A. Jassim, Jan Skoglund, Michael Chinen, Andrew Hines
School of Computer Science, University College Dublin, Dublin, Ireland
Chrome Media, Google, San Francisco, CA, USA
[email protected], [email protected], [email protected], [email protected]
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ABSTRACT
Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high quality wideband speech with bit streams less than 3 kb/s. These codecs use a DNN with parametric input to synthesise high quality speech outputs. Existing objective speech quality models (e.g., POLQA, ViSQOL) do not accurately predict the quality of coded speech from these generative models, underestimating quality due to signal differences not highlighted in subjective listening tests. We present WARP-Q, a full-reference objective speech quality metric that uses dynamic time warping cost for MFCC speech representations. It is robust to small perceptual signal changes. Evaluation using waveform matching, parametric, and generative neural vocoder based codecs as well as channel and environmental noise shows that WARP-Q has better correlation and codec quality ranking for novel codecs compared to traditional metrics, in addition to versatility for general quality assessment scenarios.
Index Terms — Dynamic time warping, low bit rate speech coding, LPCNet, WaveNet, speech quality
1. INTRODUCTION
Estimation of speech quality is important for monitoring, evaluating, and developing speech transmission and communication services. Usually, assessing speech quality can be done subjectively with listening tests, providing accurate results with small confidence intervals [1]. However, subjective methods are expensive and time consuming. Objective speech quality estimation models can provide a practical and efficient alternative for evaluating and predicting speech quality.

Different kinds of objective models exist depending on the speech applications and services. Models such as POLQA [2], PESQ [3], and ViSQOL [4, 5] have been shown to work well for a wide variety of coding, channel, and environmental degradations to the speech signal. They are full-reference (FR) metrics that compare a clean reference to a test signal that has been degraded. They do this by aligning and comparing the signals and mapping an estimate of the differences between the signals to a perceptual mean opinion score (MOS) scale.

Recently, data driven algorithms based on deep neural networks (DNNs) have created a new generation of generative speech synthesis models [6–9], often with text-to-speech as the application. Of these, the auto-regressive teacher-forced architecture in WaveNet [6], WaveRNN [10], and SampleRNN [11] has been used as the basis in new generative codecs [12–14]. These codecs are wideband, are designed to operate at low bit rates, and have shown very promising results. The reconstructed audio waveforms are generated by a neural network conditioned by traditional low bit rate parametric vocoder parameters, i.e., the speech signal is represented and transmitted as a sequence of parameters extracted at the encoder [12].

The DNN-based codecs are generative, meaning that while the original and coded speech signals may both sound good, they have structural differences in them. This is because the decoded signal is generated from the parameters of the coded signal and from the model, so it may not fully align spectro-temporally with the original signal. These alignment differences cause problems for full reference quality models. Although they deal well with macro misalignments (delays, etc.), micro-alignments across time or frequency components of speech cause quality prediction issues.

Although common quality metrics such as POLQA and ViSQOL provide accurate quality scores for speech signals processed by telephony and voice over IP (VoIP) transmission systems, they fail to provide acceptable results when speech signals are distorted by the effects of low bit rate DNN-based codecs. This work sought to develop a new FR model for speech quality prediction that works for generative speech codecs. The aim was to have a general model that would work for speech from both generative and traditional codecs.
2. BACKGROUND
Speech codec algorithms are designed to compress speech signals at a low bit rate while retaining high speech quality. The processes include analysing and converting spoken sounds into digital codes and vice versa. Over decades, different types of codecs have been introduced [15]. Different aspects such as bit rate, language of spoken words, channel errors, coding delays, and memory and computational cost determine the need for and performance of any speech codec algorithm.

Speech coders are traditionally classified into two types of algorithms: parametric codecs, such as MELP [16], and waveform (matching) codecs, e.g., Speex [17] and Opus [18]. Traditional parametric codecs are generative in that they extract speech features controlling a generative synthesis, focusing on sounding similar to the input and disregarding the actual output waveform. While this can cause fidelity issues, it does not significantly impact time alignment. The WaveNet [6, 12] and LPCNet [13, 19] codecs are examples of the new class of generative neural codecs, where the synthesis is driven by a neural network. These neural-based generative codecs can produce high fidelity output but pose alignment challenges to objective quality metrics.

We previously compared the performance of these two types of codecs in terms of several quality aspects, such as the accuracy of pitch period estimation, the word error rates for automatic speech recognition (ASR), and the influence of speaker gender and coding delays [20]. It was observed that these factors should be taken into account in order to design a new and robust FR metric that is workable for different codecs. We analysed why existing speech quality models underrated the quality of generative codec outputs and considered the micro-alignment differences as a potential cause.

Generative codecs rely on a combination of parametrically coded information and a neural model (e.g., WaveNet) to synthesise the output. Although the resulting codec speech is rated as high quality, there are small differences between the original and codec speech. Some of these are temporal micro-alignments and others manifest as slight pitch shifts. While these are potentially perceptible differences, a human listener may not be able to distinguish a quality difference. On the other hand, traditional codecs keep the original and coded signals temporally aligned and may be penalised in subjective ratings due to the spectro-temporal differences that manifest as noise or corrupt speech in the coded output.

Standard speech quality models rely on evaluating the similarity between reference and test signals as a salient feature for assessing quality. They pre-align the signals in order to account for quality issues resulting from delay and signal corruption. For example, the ViSQOL metric [4, 5] uses the neurogram similarity index measure (NSIM) to estimate the similarity between a pre-aligned reference patch and a degraded spectrogram patch frame by frame.

In this study, we propose WARP-Q, based on dynamic time warping (DTW), which calculates an optimal match between two given sequences. DTW has been successfully adopted for a range of speech processing applications. In [21], the global alignment distance based on the original DTW is employed for comparing test and received speech. It showed results comparable to that of the PESQ metric for perceived speech quality measurement in VoIP and global system for mobile communications (GSM) networks.

WARP-Q takes a different approach to traditional speech quality models, handling time-alignment and signal similarity in a combined manner. We use a special type of DTW algorithm, known as subsequence dynamic time warping (SDTW) [22], to measure the distance between speech signals. Unlike the original DTW algorithm, which aims to find an optimal global alignment between two given sequences, SDTW finds a subsequence within the longer sequence that optimally fits the shorter sequence. It has been successfully employed in audio matching scenarios and content-based audio retrieval applications [23]. We refer to Figs. 3.10 and 7.13 from Müller [22] for more details about the difference between the original DTW and SDTW.

We show that this simple concept allows the perceptual quality impact of micro-alignment and signal corruption to be captured and quantified together. The proposed SDTW-based metric predicts speech quality correctly for generative codecs while also performing competitively with standard metrics for a wide range of standard coding algorithms and distortion effects.
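To make the global DTW versus subsequence DTW distinction concrete, the following sketch contrasts the two with librosa. The feature matrices are synthetic toy data (not taken from the paper), and the indexing mirrors the a*/b* notation introduced in the next section; this illustrates the concept only and is not the WARP-Q implementation.

```python
import numpy as np
import librosa

rng = np.random.default_rng(0)

# Toy 12-dimensional feature sequences: X is a short query, Y is a long
# sequence that contains a noisy copy of X somewhere in its middle.
X = rng.standard_normal((12, 100))
Y = np.concatenate([rng.standard_normal((12, 200)),
                    X + 0.05 * rng.standard_normal((12, 100)),
                    rng.standard_normal((12, 200))], axis=1)

# Global DTW: forces all of X to align against all of Y.
D_glob, _ = librosa.sequence.dtw(X, Y, metric='euclidean')

# Subsequence DTW: lets the alignment start and end anywhere inside Y.
D_sub, wp = librosa.sequence.dtw(X, Y, metric='euclidean', subseq=True)

b_star = D_sub[-1, :].argmin()   # optimal end index b* (minimum of the last row)
a_star = wp[:, 1].min()          # optimal start index a* (smallest Y index on the path)

print('global cost:', D_glob[-1, -1])
print('subsequence cost:', D_sub[-1, b_star], 'matched region:', (a_star, b_star))
```

On this toy example the subsequence cost is far lower than the global cost because only the matching region of Y is aligned against the query, which is exactly the behaviour WARP-Q exploits.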
3. PROPOSED ALGORITHM
Fig. 1 illustrates the four processing stages of the proposed algorithm: pre-processing; feature extraction; similarity comparison; and subsequence score aggregation. Python source code for the WARP-Q model is available for download in [24].
Fig. 1: Block diagram of the proposed WARP-Q metric.
The reference and degraded input signals are set to the same sampling frequency, f_s = 16 kHz. Silent non-speech segments are removed from the reference and degraded signals using a voice activity detection (VAD) algorithm. Our implementation used a WebRTC-based VAD with default parameters [25].

Mel frequency cepstral coefficient (MFCC) representations of the reference and degraded signals are generated using 12 critical bands up to 5 kHz for each frame [26]. A Hann window with a length of 32 ms and 80 % overlap was used for framing. Spectral coefficients were extracted using the discrete cosine transform (DCT) type-2 with orthonormal bases and a cepstral filtering of 3 for liftering. The MFCC signal representations are normalised so that they have the same segmental statistics (zero mean and unit variance). The spectral coefficients of each feature vector were normalised using the cepstral mean and variance normalisation (CMVN) algorithm [27].

Let X = (x_1, x_2, ..., x_N) and Y = (y_1, y_2, ..., y_M) be two feature sequences over a feature space. The length M is assumed to be much larger than the length N. For the two given sequences, the SDTW algorithm considers all possible subsequences of Y to find the optimal one that minimises the DTW distance to X. The optimal subsequence of Y is determined by two optimal indices a* and b*, where a*, b* ∈ [1 : M] with a* ≤ b*, such that the subsequence Y(a* : b*) has the minimum DTW distance to X over other subsequences. To reveal the optimal index b*, the algorithm computes the N × M accumulated cost matrix, denoted by D, using dynamic programming. The index that minimises the cost values in the last row of D represents the optimal index b*. To reveal the optimal index a*, the algorithm derives the optimal warping path P* (a list of index pairs) between X and Y(a* : b*) using backtracking, which starts with q_1 = (N, b*) and stops as soon as the first row of D is reached by some element q_r = (1, m), m ∈ [1 : M]. Refer to Exercise 7.6 from Müller, 2015 [22] for more details about the SDTW algorithm. We use the Librosa Python library [26] to implement the SDTW algorithm with a step size condition of Σ = {(1, 1), (3, 2), (1, 3)} and a Euclidean cost function.

The reference and degraded MFCC representations can be treated as 2-dimensional matrices for processing. The reference MFCC matrix, Y, has a size of K × M, where K = 12 represents the 12 frequency bands and M is the total number of signal frames. The MFCC matrix of the degraded signal is divided into a number, L, of overlapping patches.
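As a rough illustration of the feature-extraction stage described above, the sketch below computes the MFCC representation with librosa using the parameter values quoted in the text (12 coefficients up to 5 kHz, 32 ms Hann window, 80 % overlap, DCT-II with orthonormal bases, liftering of 3) and applies per-coefficient mean and variance normalisation. It assumes the VAD stage has already removed silent segments; the released WARP-Q code [24] may differ in implementation detail.

```python
import librosa

def normalised_mfcc(signal, fs=16000):
    """MFCC representation with per-coefficient mean/variance normalisation (CMVN).

    A sketch of the feature-extraction stage: parameter values follow the text;
    `signal` is assumed to be a mono waveform with silence already removed.
    """
    win = int(0.032 * fs)            # 32 ms Hann window (512 samples at 16 kHz)
    hop = int(round(0.2 * win))      # 80 % overlap -> 20 % hop
    mfcc = librosa.feature.mfcc(y=signal, sr=fs,
                                n_mfcc=12,                 # 12 coefficients
                                fmax=5000,                 # bands up to 5 kHz
                                n_fft=win, win_length=win, hop_length=hop,
                                window='hann',
                                dct_type=2, norm='ortho',  # DCT-II, orthonormal bases
                                lifter=3)                  # cepstral liftering
    # CMVN: zero mean and unit variance for each coefficient across frames.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
```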
[Fig. 2 graphic, three panels: (a) Signals in time space; (b) MFCC feature space; (c) Sub-signal DTW space.]
Fig. 2: SDTW-based accumulated cost and optimal path between two signals. (a) Plots of a reference signal and its corresponding coded version from a WaveNet coder at 6 kb/s (obtained from the VAD stage). (b) Normalised MFCC matrices of the two signals. (c) Plots of the SDTW-based accumulated alignment cost matrix D(X, Y) and its optimal path P* between the MFCC matrix Y of the reference signal and a patch X extracted from the MFCC matrix of the degraded signal. The optimal indices (a* and b*) are also shown. X corresponds to a short segment (2 s long) from the WaveNet signal (highlighted in green color).

Each patch, X_i (i = 1, 2, ..., L), is of size K × N, where N = 100 corresponds to a 100-frame (400 ms) patch length from the degraded speech. L is equal to the total number of degraded signal frames divided by N. For each degraded patch X_i, the SDTW algorithm described above is used to compute the accumulated cost matrix D(X_i, Y), the optimal warping path P*, and the optimal indices a* and b* between X_i and the reference MFCC matrix Y. The accumulated cost at index b* is adopted as the salient feature for quality score estimation. The quality score for each degraded patch is computed as follows:

C_i = (1/N) D(X_i, Y)[N, b*],   i = 1, 2, ..., L.   (1)

Note that in Eq. 1, the accumulated cost is divided by N (the length of the degraded patch) to suppress the dynamic range of the predicted scores. Eq. 1 provides a vector of L elements corresponding to the total number of patches in the degraded signal. Finally, the aggregate quality score (QS) for the degraded signal is computed as the median of the cost per patch:

QS = Median([C_1, C_2, ..., C_L]).   (2)

An illustration of the process is presented in Fig. 2. A reference signal taken from the Genspeech database [20] and its corresponding coded version from a WaveNet coder at 6 kb/s are fed to the VAD algorithm to remove non-speech segments from them. The plots of the two processed signals are shown in Fig. 2a. Their normalised MFCC representations are shown in Fig. 2b. A patch X is extracted from the MFCC array of the degraded signal. In this example, for better visualisation, we used a wider patch of 2 s length (N = 500, which corresponds to 500 frames). The extracted patch is highlighted in green color in Figs. 2a and 2b. Fig. 2c displays D(X, Y) with its corresponding P* computed by the SDTW algorithm. The optimal index located in the top row of D(X, Y) is b* = 649, which corresponds to 649 × 4 ms/frame = 2.596 s in time, i.e., b*_t = 2.596 s. Furthermore, the optimal index located in the bottom row of D(X, Y) is a* = 239, which corresponds to a 0.956 s time index, i.e., a*_t = 0.956 s. This indicates that the subsequence of Y that has the minimum alignment cost distance to patch X is Y(a* = 239 : b* = 649), i.e., Y(a*_t = 0.956 : b*_t = 2.596) s in time.
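Putting Eqs. (1) and (2) together, the sketch below shows how the per-patch costs and their median could be computed with librosa's subsequence DTW. The step-size set and per-patch normalisation follow the description above; for brevity the patches here are non-overlapping, whereas the paper uses overlapping patches, and the released code [24] may differ in detail.

```python
import numpy as np
import librosa

def warpq_score(mfcc_deg, mfcc_ref, patch_len=100):
    """Median of normalised per-patch SDTW costs (Eqs. 1 and 2).

    `mfcc_deg` and `mfcc_ref` are normalised MFCC matrices of shape
    (12, n_frames), e.g., as produced by `normalised_mfcc` above.
    """
    # Step size condition Sigma and Euclidean cost, as quoted in the text.
    sigma = np.array([[1, 1], [3, 2], [1, 3]])
    costs = []
    for start in range(0, mfcc_deg.shape[1] - patch_len + 1, patch_len):
        X_i = mfcc_deg[:, start:start + patch_len]
        # Accumulated cost matrix D(X_i, Y) via subsequence DTW.
        D = librosa.sequence.dtw(X=X_i, Y=mfcc_ref, metric='euclidean',
                                 step_sizes_sigma=sigma,
                                 weights_mul=np.ones(len(sigma)),
                                 subseq=True, backtrack=False)
        b_star = D[-1, :].argmin()                # optimal end index b*
        costs.append(D[-1, b_star] / patch_len)   # Eq. (1): cost normalised by N
    return float(np.median(costs))                # Eq. (2): median over patches
```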
Table 1: The Genspeech dataset. Further details in [20].

Codec          Bit rate   Description
LPCNetUnquant  —          LPCNet operating on Opus unquantized features
Opus9.0        9 kb/s     Wideband vocoder (SILK mode)
WaveNet6.0     6 kb/s     WaveNet operating on Opus quantized features
LPCNet1.6      1.6 kb/s   WaveRNN + linear prediction
LPCNet6.0      6 kb/s     LPCNet operating on Opus quantized features
MELP2.4        2.4 kb/s   Narrowband vocoder
Opus6.0        6 kb/s     Narrowband vocoder (SILK mode)
Speex4.0       4 kb/s     Wideband vocoder (wideband quality 0)
4. EXPERIMENTAL EVALUATION
Data from subjective experiments on a variety of parametric and generative codecs (the Genspeech database, with data from [19, 28]) are used to evaluate WARP-Q and benchmark the performance against existing models. Table 1 summarises the codecs (further details available in [20]). Furthermore, the capability of WARP-Q to predict speech quality for quality issues beyond low bit rate coding is evaluated using other datasets: the TCD-VoIP database [29], which contains speech signals under a range of common VoIP degradations with channel and environmental issues, and the ITU-T P. Supplement 23 (P.Sup23) [EXP1 and EXP3] database [30], which contains speech samples under a range of traditional coding and some environmental degradations. Note that the original MUSHRA scores from the Genspeech database were linearly rescaled to be in the same range as the MOS of the other databases.

Models are compared using Pearson's correlation coefficient and Spearman's rank-order correlation coefficient. Fig. 3 compares the per-condition (i.e., grouped by codec) WARP-Q scores to those of existing metrics for the Genspeech dataset. The proposed metric provided scores that are better ranked and more consistent than other metrics for all codecs.

Fig. 4a presents a scatter plot of WARP-Q scores against subjective quality ratings for samples from the four datasets at a standardised sampling frequency (f_s = 16 kHz). A consistent inverse correlation between WARP-Q scores and MOS is apparent and the range of predicted scores is good between datasets. This highlights the robustness of the proposed algorithm to different degradation scenarios, as the predicted quality scores remain bounded in a similar range.
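For reference, per-condition agreement with subjective scores can be computed with SciPy as sketched below. The arrays are made-up placeholders, not results from the paper; because WARP-Q is a cost (lower is better), a well-behaved metric shows a strong negative correlation with MOS.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-condition mean WARP-Q scores and subjective MOS values.
warpq = np.array([2.1, 1.4, 1.9, 2.8, 1.7, 3.0, 2.5, 2.3])
mos   = np.array([4.2, 4.8, 4.4, 3.1, 4.5, 2.7, 3.4, 3.6])

# Expect negative correlations, since lower WARP-Q cost means higher quality.
print("Pearson r:    %.3f" % pearsonr(warpq, mos)[0])
print("Spearman rho: %.3f" % spearmanr(warpq, mos)[0])
```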