WARP-Q: Quality Prediction For Generative Neural Speech Codecs
Wissam A. Jassim, Jan Skoglund, Michael Chinen, Andrew Hines
School of Computer Science, University College Dublin, Dublin, Ireland
Chrome Media, Google, San Francisco, CA, USA
[email protected], [email protected], [email protected], [email protected]
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ABSTRACT
Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high quality wideband speech with bit streams less than 3 kb/s. These codecs use a DNN with parametric input to synthesise high quality speech outputs. Existing objective speech quality models (e.g., POLQA, ViSQOL) do not accurately predict the quality of coded speech from these generative models, underestimating quality due to signal differences not highlighted in subjective listening tests. We present WARP-Q, a full-reference objective speech quality metric that uses dynamic time warping cost for MFCC speech representations. It is robust to small perceptual signal changes. Evaluation using waveform matching, parametric, and generative neural vocoder based codecs as well as channel and environmental noise shows that WARP-Q has better correlation and codec quality ranking for novel codecs compared to traditional metrics, in addition to versatility for general quality assessment scenarios.
Index Terms — Dynamic time warping, low bit rate speech coding, LPCNet, WaveNet, speech quality
1. INTRODUCTION
Estimation of speech quality is important for monitoring, evaluating, and developing speech transmission and communication services. Usually, assessing speech quality can be done subjectively with listening tests, providing accurate results with small confidence intervals [1]. However, subjective methods are expensive and time consuming. Objective speech quality estimation models can provide a practical and efficient alternative for evaluating and predicting speech quality.

Different kinds of objective models exist depending on the speech applications and services. Models such as POLQA [2], PESQ [3], and ViSQOL [4, 5] have been shown to work well for a wide variety of coding, channel, and environmental degradations to the speech signal. They are full-reference (FR) metrics that compare a clean reference to a test signal that has been degraded. They do this by aligning and comparing the signals and mapping an estimate of the differences between the signals to a perceptual mean opinion score (MOS) scale.

Recently, data driven algorithms based on deep neural networks (DNNs) have created a new generation of generative speech synthesis models [6–9], often with text-to-speech as the application. Of these, the auto-regressive teacher-forced architecture in WaveNet [6], WaveRNN [10], and SampleRNN [11] has been used as the basis in new generative codecs [12–14]. These codecs are wideband, are designed to operate at low bit rates, and have shown very promising results. The reconstructed audio waveforms are generated by a neural network conditioned by traditional low bit rate parametric vocoder parameters, i.e., the speech signal is represented and transmitted as a sequence of parameters extracted at the encoder [12].

The DNN-based codecs are generative, meaning that while the original and coded speech signals may both sound good, they have structural differences in them. This is because the decoded signal is generated from the parameters of the coded signal and from the model, so it may not fully align spectro-temporally with the original signal. These alignment differences cause problems for full reference quality models. Although they deal well with macro misalignments (delays, etc.), micro-alignments across time or frequency components of speech cause quality prediction issues.

Although common quality metrics such as POLQA and ViSQOL provide accurate quality scores for speech signals processed by telephony and voice over IP (VoIP) transmission systems, they fail to provide acceptable results when speech signals are distorted by the effects of low bit rate DNN-based codecs. This work sought to develop a new FR model for speech quality prediction that works for generative speech codecs. The aim was to have a general model that would work for speech from both generative and traditional codecs.
2. BACKGROUND
Speech codec algorithms are designed to compress speech signals at a low bit rate while retaining high speech quality. The processes include analysing and converting spoken sounds into digital codes and vice versa. Over decades, different types of codecs have been introduced [15]. Different aspects such as bit rate, language of spoken words, channel errors, coding delays, and memory and computational cost determine the need for and performance of any speech codec algorithm.

Speech coders are traditionally classified into two types of algorithms: parametric codecs, such as MELP [16], and waveform (matching) codecs, e.g., Speex [17] and Opus [18]. Traditional parametric codecs are generative in that they extract speech features controlling a generative synthesis, focusing on sounding similar to the input and disregarding the actual output waveform. While this can cause fidelity issues, it does not significantly impact time alignment. The WaveNet [6, 12] and LPCNet [13, 19] codecs are examples of the new class of generative neural codecs, where the synthesis is driven by a neural network. These neural-based generative codecs can produce high fidelity output but pose alignment challenges to objective quality metrics.

We previously compared the performance of these two types of codecs in terms of several quality aspects, such as the accuracy of pitch period estimation, the word error rates for automatic speech recognition (ASR), and the influence of speaker gender and coding delays [20]. It was observed that these factors should be taken into account in order to design a new and robust FR metric that is workable for different codecs. We analysed why existing speech quality models underrated the quality of generative codec outputs and considered the micro-alignment differences as a potential cause.

Generative codecs rely on a combination of parametrically coded information and a neural model (e.g., WaveNet) to synthesise the output. Although the resulting codec speech is rated as high quality, there are small differences between the original and codec speech. Some of these are temporal micro-alignments and others manifest as slight pitch shifts. While these are potentially perceptible differences, a human listener may not be able to distinguish a quality difference. On the other hand, traditional codecs keep the original and coded signals temporally aligned and may be penalised in subjective ratings due to the spectro-temporal differences that manifest as noise or corrupt speech in the coded output.

Standard speech quality models rely on evaluating the similarity between reference and test signals as a salient feature for assessing quality. They pre-align the signals in order to account for quality issues resulting from delay and signal corruption. For example, the ViSQOL metric [4, 5] uses the neurogram similarity index measure (NSIM) to estimate the similarity between a pre-aligned reference patch and a degraded spectrogram patch frame by frame.

In this study, we propose WARP-Q, based on dynamic time warping (DTW), which calculates an optimal match between two given sequences. DTW has been successfully adopted for a range of speech processing applications. In [21], the global alignment distance based on the original DTW is employed for comparing test and received speech. It showed results comparable to that of the PESQ metric for perceived speech quality measurement in VoIP and global system for mobile communications (GSM) networks.

WARP-Q takes a different approach to traditional speech quality models, handling time-alignment and signal similarity in a combined manner. We use a special type of DTW algorithm, known as subsequence dynamic time warping (SDTW) [22], to measure the distance between speech signals. Unlike the original DTW algorithm, which aims to find an optimal global alignment between two given sequences, SDTW finds a subsequence within the longer sequence that optimally fits the shorter sequence. It has been successfully employed in audio matching scenarios and content-based audio retrieval applications [23]. We refer to Figs. 3.10 and 7.13 from Müller [22] for more details about the difference between the original DTW and SDTW.

We show that this simple concept allows the perceptual quality impact of micro-alignment and signal corruption to be captured and quantified together. The proposed SDTW-based metric predicts speech quality correctly for generative codecs while also performing competitively with standard metrics for a wide range of standard coding algorithms and distortion effects.
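To make the global DTW versus subsequence DTW distinction concrete, the following sketch contrasts the two with librosa. The feature matrices are synthetic toy data (not taken from the paper), and the indexing mirrors the a*/b* notation introduced in the next section; this illustrates the concept only and is not the WARP-Q implementation.

```python
import numpy as np
import librosa

rng = np.random.default_rng(0)

# Toy 12-dimensional feature sequences: X is a short query, Y is a long
# sequence that contains a noisy copy of X somewhere in its middle.
X = rng.standard_normal((12, 100))
Y = np.concatenate([rng.standard_normal((12, 200)),
                    X + 0.05 * rng.standard_normal((12, 100)),
                    rng.standard_normal((12, 200))], axis=1)

# Global DTW: forces all of X to align against all of Y.
D_glob, _ = librosa.sequence.dtw(X, Y, metric='euclidean')

# Subsequence DTW: lets the alignment start and end anywhere inside Y.
D_sub, wp = librosa.sequence.dtw(X, Y, metric='euclidean', subseq=True)

b_star = D_sub[-1, :].argmin()   # optimal end index b* (minimum of the last row)
a_star = wp[:, 1].min()          # optimal start index a* (smallest Y index on the path)

print('global cost:', D_glob[-1, -1])
print('subsequence cost:', D_sub[-1, b_star], 'matched region:', (a_star, b_star))
```

On this toy example the subsequence cost is far lower than the global cost because only the matching region of Y is aligned against the query, which is exactly the behaviour WARP-Q exploits.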
3. PROPOSED ALGORITHM
Fig. 1 illustrates the four processing stages of the proposed algorithm: pre-processing; feature extraction; similarity comparison; and subsequence score aggregation. Python source code for the WARP-Q model is available for download in [24].
Fig. 1: Block diagram of the proposed WARP-Q metric.
The reference and degraded input signals are set to the same sampling frequency, f_s = 16 kHz. Silent non-speech segments are removed from the reference and degraded signals using a voice activity detection (VAD) algorithm. Our implementation used a WebRTC-based VAD with default parameters [25].

Mel frequency cepstral coefficient (MFCC) representations of the reference and degraded signals are generated using 12 critical bands up to 5 kHz for each frame [26]. A Hann window with a length of 32 ms and 80 % overlap was used for framing. Spectral coefficients were extracted using the discrete cosine transform (DCT) type-2 with orthonormal bases and a cepstral filtering of 3 for liftering. The MFCC signal representations are normalised so that they have the same segmental statistics (zero mean and unit variance). The spectral coefficients of each feature vector were normalised using the cepstral mean and variance normalisation (CMVN) algorithm [27].

Let X = (x_1, x_2, ..., x_N) and Y = (y_1, y_2, ..., y_M) be two feature sequences over a feature space. The length M is assumed to be much larger than the length N. For the two given sequences, the SDTW algorithm considers all possible subsequences of Y to find the optimal one that minimises the DTW distance to X. The optimal subsequence of Y is determined by two optimal indices a* and b*, where a*, b* ∈ [1 : M] with a* ≤ b*, such that the subsequence Y(a* : b*) has the minimum DTW distance to X over other subsequences. To reveal the optimal index b*, the algorithm computes the N × M accumulated cost matrix, denoted by D, using dynamic programming. The index that minimises the cost values in the last row of D represents the optimal index b*. To reveal the optimal index a*, the algorithm derives the optimal warping path P* (a list of index pairs) between X and Y(a* : b*) using backtracking, which starts with q_1 = (N, b*) and stops as soon as the first row of D is reached by some element q_r = (1, m), m ∈ [1 : M]. Refer to Exercise 7.6 from Müller, 2015 [22] for more details about the SDTW algorithm. We use the Librosa Python library [26] to implement the SDTW algorithm with a step size condition of Σ = {(1, 1), (3, 2), (1, 3)} and a Euclidean cost function.

The reference and degraded MFCC representations can be treated as 2-dimensional matrices for processing. The reference MFCC matrix, Y, has a size of K × M, where K = 12 represents the 12 frequency bands and M is the total number of signal frames. The MFCC matrix of the degraded signal is divided into a number, L, of overlapping patches.
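As a rough illustration of the feature-extraction stage described above, the sketch below computes the MFCC representation with librosa using the parameter values quoted in the text (12 coefficients up to 5 kHz, 32 ms Hann window, 80 % overlap, DCT-II with orthonormal bases, liftering of 3) and applies per-coefficient mean and variance normalisation. It assumes the VAD stage has already removed silent segments; the released WARP-Q code [24] may differ in implementation detail.

```python
import librosa

def normalised_mfcc(signal, fs=16000):
    """MFCC representation with per-coefficient mean/variance normalisation (CMVN).

    A sketch of the feature-extraction stage: parameter values follow the text;
    `signal` is assumed to be a mono waveform with silence already removed.
    """
    win = int(0.032 * fs)            # 32 ms Hann window (512 samples at 16 kHz)
    hop = int(round(0.2 * win))      # 80 % overlap -> 20 % hop
    mfcc = librosa.feature.mfcc(y=signal, sr=fs,
                                n_mfcc=12,                 # 12 coefficients
                                fmax=5000,                 # bands up to 5 kHz
                                n_fft=win, win_length=win, hop_length=hop,
                                window='hann',
                                dct_type=2, norm='ortho',  # DCT-II, orthonormal bases
                                lifter=3)                  # cepstral liftering
    # CMVN: zero mean and unit variance for each coefficient across frames.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
```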
[Fig. 2 graphic, three panels: (a) Signals in time space; (b) MFCC feature space; (c) Sub-signal DTW space.]
Fig. 2: SDTW-based accumulated cost and optimal path between two signals. (a) Plots of a reference signal and its corresponding coded version from a WaveNet coder at 6 kb/s (obtained from the VAD stage). (b) Normalised MFCC matrices of the two signals. (c) Plots of the SDTW-based accumulated alignment cost matrix D(X, Y) and its optimal path P* between the MFCC matrix Y of the reference signal and a patch X extracted from the MFCC matrix of the degraded signal. The optimal indices (a* and b*) are also shown. X corresponds to a short segment (2 s long) from the WaveNet signal (highlighted in green color).

Each patch, X_i (i = 1, 2, ..., L), is of size K × N, where N = 100 corresponds to a 100-frame (400 ms) patch length from the degraded speech. L is equal to the total number of degraded signal frames divided by N. For each degraded patch X_i, the SDTW algorithm described above is used to compute the accumulated cost matrix D(X_i, Y), the optimal warping path P*, and the optimal indices a* and b* between X_i and the reference MFCC matrix Y. The accumulated cost at index b* is adopted as the salient feature for quality score estimation. The quality score for each degraded patch is computed as follows:

C_i = (1/N) D(X_i, Y)[N, b*],   i = 1, 2, ..., L.   (1)

Note that in Eq. 1, the accumulated cost is divided by N (the length of the degraded patch) to suppress the dynamic range of the predicted scores. Eq. 1 provides a vector of L elements corresponding to the total number of patches in the degraded signal. Finally, the aggregate quality score (QS) for the degraded signal is computed as the median of the cost per patch:

QS = Median([C_1, C_2, ..., C_L]).   (2)

An illustration of the process is presented in Fig. 2. A reference signal taken from the Genspeech database [20] and its corresponding coded version from a WaveNet coder at 6 kb/s are fed to the VAD algorithm to remove non-speech segments from them. The plots of the two processed signals are shown in Fig. 2a. Their normalised MFCC representations are shown in Fig. 2b. A patch X is extracted from the MFCC array of the degraded signal. In this example, for better visualisation, we used a wider patch of 2 s length (N = 500, which corresponds to 500 frames). The extracted patch is highlighted in green color in Figs. 2a and 2b. Fig. 2c displays D(X, Y) with its corresponding P* computed by the SDTW algorithm. The optimal index located in the top row of D(X, Y) is b* = 649, which corresponds to 649 × 4 ms/frame = 2.596 s in time, i.e., b*_t = 2.596 s. Furthermore, the optimal index located in the bottom row of D(X, Y) is a* = 239, which corresponds to a 0.956 s time index, i.e., a*_t = 0.956 s. This indicates that the subsequence of Y that has the minimum alignment cost distance to patch X is Y(a* = 239 : b* = 649), i.e., Y(a*_t = 0.956 : b*_t = 2.596) s in time.
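Putting Eqs. (1) and (2) together, the sketch below shows how the per-patch costs and their median could be computed with librosa's subsequence DTW. The step-size set and per-patch normalisation follow the description above; for brevity the patches here are non-overlapping, whereas the paper uses overlapping patches, and the released code [24] may differ in detail.

```python
import numpy as np
import librosa

def warpq_score(mfcc_deg, mfcc_ref, patch_len=100):
    """Median of normalised per-patch SDTW costs (Eqs. 1 and 2).

    `mfcc_deg` and `mfcc_ref` are normalised MFCC matrices of shape
    (12, n_frames), e.g., as produced by `normalised_mfcc` above.
    """
    # Step size condition Sigma and Euclidean cost, as quoted in the text.
    sigma = np.array([[1, 1], [3, 2], [1, 3]])
    costs = []
    for start in range(0, mfcc_deg.shape[1] - patch_len + 1, patch_len):
        X_i = mfcc_deg[:, start:start + patch_len]
        # Accumulated cost matrix D(X_i, Y) via subsequence DTW.
        D = librosa.sequence.dtw(X=X_i, Y=mfcc_ref, metric='euclidean',
                                 step_sizes_sigma=sigma,
                                 weights_mul=np.ones(len(sigma)),
                                 subseq=True, backtrack=False)
        b_star = D[-1, :].argmin()                # optimal end index b*
        costs.append(D[-1, b_star] / patch_len)   # Eq. (1): cost normalised by N
    return float(np.median(costs))                # Eq. (2): median over patches
```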
Table 1: The Genspeech dataset. Further details in [20].

Codec          Bit rate   Description
LPCNetUnquant  —          LPCNet operating on Opus unquantized features
Opus9.0        9 kb/s     Wideband vocoder (SILK mode)
WaveNet6.0     6 kb/s     WaveNet operating on Opus quantized features
LPCNet1.6      1.6 kb/s   WaveRNN + linear prediction
LPCNet6.0      6 kb/s     LPCNet operating on Opus quantized features
MELP2.4        2.4 kb/s   Narrowband vocoder
Opus6.0        6 kb/s     Narrowband vocoder (SILK mode)
Speex4.0       4 kb/s     Wideband vocoder (wideband quality 0)
4. EXPERIMENTAL EVALUATION
Data from subjective experiments on a variety of parametric and generative codecs (the Genspeech database, with data from [19, 28]) are used to evaluate WARP-Q and benchmark the performance against existing models. Table 1 summarises the codecs (further details available in [20]). Furthermore, the capability of WARP-Q to predict speech quality for quality issues beyond low bit rate coding is evaluated using other datasets: the TCD-VoIP database [29], which contains speech signals under a range of common VoIP degradations with channel and environmental issues, and the ITU-T P. Supplement 23 (P.Sup23) [EXP1 and EXP3] database [30], which contains speech samples under a range of traditional coding and some environmental degradations. Note that the original MUSHRA scores from the Genspeech database were linearly rescaled to be in the same range as the MOS of the other databases.

Models are compared using Pearson's correlation coefficient and Spearman's rank-order correlation coefficient. Fig. 3 compares the per-condition (i.e., grouped by codec) WARP-Q scores to those of existing metrics for the Genspeech dataset. The proposed metric provided scores that are better ranked and more consistent than other metrics for all codecs.

Fig. 4a presents a scatter plot of WARP-Q scores against subjective quality ratings for samples from the four datasets at a standardised sampling frequency (f_s = 16 kHz). A consistent inverse correlation between WARP-Q scores and MOS is apparent and the range of predicted scores is good between datasets. This highlights the robustness of the proposed algorithm to different degradation scenarios, as the predicted quality scores remain bounded in a similar range.
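For reference, per-condition agreement with subjective scores can be computed with SciPy as sketched below. The arrays are made-up placeholders, not results from the paper; because WARP-Q is a cost (lower is better), a well-behaved metric shows a strong negative correlation with MOS.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-condition mean WARP-Q scores and subjective MOS values.
warpq = np.array([2.1, 1.4, 1.9, 2.8, 1.7, 3.0, 2.5, 2.3])
mos   = np.array([4.2, 4.8, 4.4, 3.1, 4.5, 2.7, 3.4, 3.6])

# Expect negative correlations, since lower WARP-Q cost means higher quality.
print("Pearson r:    %.3f" % pearsonr(warpq, mos)[0])
print("Spearman rho: %.3f" % spearmanr(warpq, mos)[0])
```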