Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network
Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda
JOURNAL OF LATEX CLASS FILES, VOL. 0, NO. 0, JULY 2020
Abstract—In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency (F0) features are outside the F0 range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary F0 features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependences of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary F0 features and the effectiveness of the cascaded structure for speech generation.

Index Terms—WaveNet, pitch-dependent dilated convolution, quasi-periodic structure, vocoder, pitch controllability.
I. INTRODUCTION

Raw waveform generation of audio signals like speech and music is a core technique of many applications such as text-to-speech (TTS), voice conversion (VC), and music synthesis. However, because of the extremely high temporal resolution (sampling rates are usually higher than 16 kHz) and the very long-term dependence of audio signals, directly modeling the raw waveform signals is challenging. To overcome these difficulties, in conventional synthesis techniques, audio signals are usually encoded into low-temporal-resolution acoustic features, and then audio waveforms are decoded on the basis of these acoustic features. The
This work was supported in part by the Japan Science and Technology Agency (JST), Precursory Research for Embryonic Science and Technology (PRESTO) under Grant JPMJPR1657, in part by the JST, CREST under Grant JPMJCR19A3, and in part by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) under Grant 17H06101.
Y.-C. Wu and P. L. Tobing are with the Graduate School of Informatics, Nagoya University, Aichi, Japan (e-mail: {yichiao.wu, patrick.lumbantobing}@g.sp.m.is.nagoya-u.ac.jp).
T. Hayashi is with the Graduate School of Information Science, Nagoya University, Aichi, Japan (e-mail: [email protected]).
K. Kobayashi is with the Information Technology Center, Nagoya University, Aichi, Japan (e-mail: [email protected]).
T. Toda is with the Information Technology Center, Nagoya University, Aichi, Japan (e-mail: [email protected]).

analysis-synthesis (encoding-decoding) technique is called the vocoder [1], [2], which is often built on a source-filter [3] speech production model including source excitations and vocal tracts. However, because of the oversimplified assumptions of the speech generation mechanism, the lost temporal details and phase information lead to serious quality degradation in conventional vocoders such as STRAIGHT [4] and WORLD [5].

Owing to the recent development of deep learning, many neural audio generation models [6]–[17] have been proposed to generate raw audio waveforms without the various assumptions imposed on conventional vocoders. That is, advanced deep network architectures directly model the long-term dependence of high-temporal-resolution audio waveforms. In this paper, we focus on WaveNet (WN) [6], which is one of the state-of-the-art audio generation models and has been applied to a variety of applications such as music generation [18], text-to-speech (TTS) [19], [20], speech coding [21], speech enhancement [22], [23], and voice conversion (VC) [24]–[28].
The main core of WN is an autoregressive (AR) network modeling the probability distribution of each audio sample conditioned on auxiliary features and a specific number of previous samples called a receptive field. To handle the very long-term dependence of audio signals, a stacked dilated convolution network (DCNN) [29] structure is utilized to efficiently extend the receptive field. Furthermore, the WN vocoder [30]–[33], which conditions WN on the acoustic features extracted by conventional vocoders to recover the lost information, achieves significant speech quality improvements by replacing the synthesis process of traditional vocoders.

Although WN attains excellent performance in high-fidelity speech generation, its fixed architecture is inefficient, and the lack of prior audio-related knowledge limits the pitch controllability of the WN vocoder. Specifically, because of the quasi-periodicity of speech, each sample may have a specific dependent field related to its periodicity instead of a fixed receptive field that presumably includes many redundant previous samples. The requirement of a long receptive field for modeling speech dependency leads to a huge network and high demands for computation power. The data-driven architecture without prior speech knowledge only implicitly models the relationship between the periodicity of waveform signals and the auxiliary fundamental frequency (F0) features, and it may not generate speech with the precise pitch corresponding to the auxiliary F0 values, especially in an unseen F0 case. However, pitch controllability is an essential feature in the definition of a vocoder.

To address these problems, inspired by the source-filter model [3] and the code-excited linear prediction (CELP) codec [34], [35], we propose Quasi-Periodic WaveNet (QPNet) [36], [37] with a pitch-dependent dilated convolution neural network (PDCNN).
Specifically, the generation process of periodic signals can be modeled as the generation of a single pitch cycle signal (short-term correlation) followed by the extension of this single-cycle signal to form the whole periodic sequence on the basis of pitches (long-term correlation). As a result, we develop QPNet including two cascaded WNs with different DCNNs. A vanilla WN with fixed DCNNs is the first stage, which is used to model the relationship between the current sample and a specific segment of the nearest previous samples, and the second stage utilizes PDCNNs to link the correlations of the relevant segments in the current and previous cycles. The pitch-adaptive architecture allows each sample to have an exclusive receptive field length corresponding to the auxiliary F0 features and improves the pitch controllability by introducing the periodicity information into the network. The proposed QPNet with the improved pitch controllability is more in line with the definition of a vocoder. Furthermore, a more compact network size with acceptable quality similar to that of vanilla WN is feasible for QPNet because of the more efficient way the receptive field is extended, which is highly related to the modeling capability.

The paper is organized as follows. In Section II, we review recent neural speech generation models. In Section III, a brief introduction to WN is presented. In Section IV, we describe the concepts and details of QPNet. In Sections V and VI, we report objective and subjective experimental results to evaluate the effectiveness of QPNet for generating high-temporal-resolution periodic sinusoid signals and quasi-periodic speech, respectively. Finally, the conclusion is given in Section VII.

II. RELATED WORK
Recent mainstream speech generation techniques use AR models such as WN [6] and SampleRNN [7] to model the very long-term dependence of speech signals with high temporal resolution. In contrast to conditioning on linguistic and F0 features to generate speech like vanilla WN, taking an AR model as a vocoder is a more efficient way to train the AR model and make it generate the desired speech conditioned on handcrafted acoustic features. Many acoustic features have been applied to these AR vocoders, such as Mel-cepstral coefficients (mcep) with band aperiodicity (ap) and F0 features, extracted by WORLD [30]–[32] or STRAIGHT [38], and Mel-spectrograms with F0 features [33]. Furthermore, to achieve acceptable speech quality, the basic AR vocoders usually require a huge network for the long receptive field. However, although the speech quality of these basic AR vocoders is significantly higher than that of the traditional vocoders, the AR mechanism and the complicated network structure make it difficult for these AR vocoders to generate speech in real time [6], [7]. To tackle this issue, the authors of FFTNet [8] and WaveRNN [9] proposed more compact AR vocoders with specific network structures based on speech-related knowledge and efficient computation mechanisms. Moreover, AR models generating glottal excitation [39], [40] and linear predictive coding (LPC) residual [10] signals have been proposed to ease the burden of modeling speaker identity and spectral information. Because of the speaker-independent characteristic of these source signals, the requirements for the network capacity and speaker adaptation of these glottal vocoders and LPCNet are greatly reduced.

In addition, flow-based [41], [42] non-AR vocoders have been proposed for efficient parallel generation.
For example, Parallel WaveNet [11] and ClariNet [12] with inverse autoregressive flow (IAF) [43], and WaveGlow [13] and FloWaveNet [14] with Glow [44], model an invertible transformation between a simple probability distribution of noise signals and a target distribution of speech signals for generating waveforms from a known noise sequence.

Non-AR vocoders with mixed sine-based excitation inputs produced on the basis of F0 and Gaussian noise [16], [17], or with periodic sinusoid signals and aperiodic Gaussian noise inputs [15], have also been proposed to generate whole waveforms simultaneously while attaining pitch controllability via the manipulation of the periodic inputs. However, to synchronize the phases of the generated and ground truth waveforms during training, these models need a handcrafted design of the input signal or a GAN [45] structure, which increases the complexity of the models. Moreover, directly applying these models to related applications such as music generation is not straightforward because of the tailored architectures.

Instead of carefully designed inputs and specific networks, we propose a simple module, the PDCNN, which can be easily applied to any DCNN-based generative model to improve its audio signal modeling capability by introducing pitch information into the network. We applied PDCNNs to WN to develop a pitch-dependent adaptive network, QPNet [36], [37], for speech generation with arbitrary F0 values. In this paper, we further evaluate the periodicity modeling capability of QPNet with PDCNNs for nonspeech sinusoid signal generation and comprehensively explore the effectiveness of the QPNet model with different cascade orders, network structures, and adaptive dilation sizes.

III. WAVENET FOR SPEECH GENERATION
A. WaveNet
Because an audio waveform is a sequential signal with a strong long-term dependency, WN [6] is used to model audio signals in an AR manner that predicts the distribution of each waveform sample on the basis of its previous samples. The conditional probability function can be formulated as

P(\boldsymbol{x}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}),   (1)

where t is the sample index, x_t is the current audio sample, and r is a specific length of the previous samples called a receptive field. Instead of the general recurrent structure for AR modeling, WN applies stacked convolution neural networks (CNNs) with a dilated mechanism and a causal structure to model the very long-term dependence and causality of audio signals. Since the modeling capability of WN is highly related to the number of previous samples taken into consideration for predicting the current sample, the dilated mechanism improves the efficiency of extending the receptive field length. Moreover, a categorical distribution is applied to model the conditional probability, whereas audio signals are encoded into 8 bits by using the µ-law algorithm. The categorical distribution is flexible enough to model an arbitrary distribution of target speech.

Taken together, the data flow of WN is as follows: previous audio samples pass through a causal layer and several residual blocks with DCNNs, gated structures, and residual and skip connections. Specifically, the gated structure for enhancing the modeling capability of the network is formulated as

z^{(o)} = \tanh\left(V_{f,k} * z^{(i)}\right) \odot \sigma\left(V_{g,k} * z^{(i)}\right),   (2)

where z^{(i)} and z^{(o)} are the input and output feature maps of the gated structure, respectively, V is a trainable convolution filter, * is the convolution operator, \odot is an element-wise multiplication operator, \sigma is a sigmoid function, k is the layer index, and f and g denote the filter and the gate, respectively.
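The gated activation of Eq. (2) is straightforward to sketch. Below is a minimal NumPy illustration in which the dilated convolutions are simplified to 1×1 (plain matrix) products; the shapes and weight names are illustrative, not the paper's implementation.

```python
import numpy as np

def gated_activation(z_in, V_f, V_g):
    """Gated structure of Eq. (2): tanh(filter branch) * sigmoid(gate branch).
    The convolutions are reduced to 1x1 products here; z_in has shape
    (channels, time), V_f and V_g have shape (channels, channels)."""
    filter_branch = np.tanh(V_f @ z_in)
    gate_branch = 1.0 / (1.0 + np.exp(-(V_g @ z_in)))  # sigmoid
    return filter_branch * gate_branch  # element-wise product

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 10))    # 4 channels, 10 time steps
V_f = rng.standard_normal((4, 4))   # "filter" weights
V_g = rng.standard_normal((4, 4))   # "gate" weights
out = gated_activation(z, V_f, V_g)
print(out.shape)  # (4, 10)
```

Because tanh lies in (−1, 1) and the sigmoid in (0, 1), every output magnitude is below one, which lets the gate branch act as a soft, data-dependent attenuator of the filter branch.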
Finally, the summation of all skip connections is processed by two ReLU [46] activations with 1×1 convolutions and one softmax layer to output the predicted distribution of the current audio sample.

Furthermore, to guide the WN model to generate the desired contents, the vanilla WN is conditioned not only on previous samples but also on linguistic and F0 features. The conditional probability is modified as

P(\boldsymbol{x} \mid \boldsymbol{h}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}, \boldsymbol{h}),   (3)

where \boldsymbol{h} is the vector of the auxiliary features (linguistic and F0 features), and the gated activation with auxiliary features becomes

z^{(o)} = \tanh\left(V^{(1)}_{f,k} * z^{(i)} + V^{(2)}_{f,k} * h'\right) \odot \sigma\left(V^{(1)}_{g,k} * z^{(i)} + V^{(2)}_{g,k} * h'\right),   (4)

where V^{(1)} and V^{(2)} are trainable convolution filters, and h' is the temporally extended auxiliary features, whose temporal resolution matches that of the speech samples.

B. WaveNet Vocoder
Many conventional vocoders [4], [5] are built on the basis of a source-filter architecture [3], which models the speech generation process as a spectral filter driven by a source excitation signal. However, oversimplified assumptions, such as analysis windows with a fixed length, time-invariant linear filters, and stationary Gaussian processing, make these vocoders lose some essential information of speech such as phase and temporal details, which causes marked quality degradation. To address this problem, the authors of [30], [31] proposed the WN vocoder, which conditions WN on the auxiliary acoustic features extracted by a conventional
Fig. 1. Quasi-Periodic WaveNet vocoder architecture.

vocoder to generate raw speech waveforms. That is, the WN vocoder replaces the synthesis part of conventional vocoders to synthesize high-fidelity speech on the basis of the prosodic and spectral acoustic features extracted by conventional vocoders. Furthermore, conditioning WN on the acoustic features greatly reduces the required amount of training data, which makes WN more tractable.
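Since the acoustic features are extracted at the frame level while WN operates at the sample level, they must be temporally extended to the waveform resolution (the h' in Eq. (4)). One common choice, sketched below with illustrative names and values, is nearest-neighbor repetition by the frame shift; the paper does not prescribe this exact scheme.

```python
import numpy as np

def upsample_features(h, hop_size):
    """Repeat each frame-level feature vector hop_size times so that the
    feature sequence matches the waveform sampling rate.
    h: (num_frames, feature_dim); hop_size: frame shift in samples."""
    return np.repeat(h, hop_size, axis=0)

h = np.array([[100.0], [110.0], [120.0]])   # e.g. frame-wise F0, 3 frames
h_prime = upsample_features(h, hop_size=4)  # pretend 4 samples per frame
print(h_prime.shape)  # (12, 1)
```

Each sample of the waveform then sees the feature vector of the frame it falls in, which is all Eq. (4) requires of h'.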
C. Problems in Using WaveNet as a Vocoder
As a vocoder, WN achieves high speech quality, but it lacks pitch controllability, which is an essential feature of conventional vocoders. Specifically, the WN vocoder has difficulty generating speech with precise pitch when conditioned on F0 values that are not observed in the F0 range of the training data [36]. Even when the F0 and spectral features are within the observed range, an unseen combination of the auxiliary features still markedly degrades the generation performance of the WN vocoder [24]–[28]. The possible reasons for this problem are that WN lacks prior speech knowledge and does not explicitly model the relationship between the auxiliary F0 feature and pitch. This defect makes the WN vocoder inconsistent with the definition of a vocoder. Moreover, since the fixed WN architecture assumes each sample has the same receptive field length, the inefficient receptive field extension may lead to the costly requirements of a huge network and a lot of computation power.

IV. QUASI-PERIODIC WAVENET

To improve the efficiency of extending the receptive field and the pitch controllability, QPNet introduces prior pitch information into WN by dynamically changing the network structure according to the auxiliary F0 features. Specifically, as shown in Fig. 1, the main differences between WN and QPNet are the pitch-dependent dilated convolution mechanism handling the periodicity of audio signals and the cascaded structure simultaneously modeling the long- and short-term correlations. The pitch filtering in CELP, which is the basis of the PDCNN, and the details of QPNet are described as follows.
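The cost of the fixed receptive field can be made concrete. For stacked causal dilated convolutions with kernel size two (the standard WaveNet setting, assumed here), the receptive field length is one plus the sum of all dilation sizes. The helper below is an illustrative sketch; the two configurations match the full-size WN and the 4-chunk stack used later in Section V.

```python
def receptive_field(chunks, layers_per_chunk, kernel_size=2):
    """Receptive field (in samples) of stacked causal dilated convolutions.
    Within each chunk the dilation doubles per layer, starting from 1."""
    dilations = [2 ** i for i in range(layers_per_chunk)] * chunks
    return 1 + (kernel_size - 1) * sum(dilations)

# Full-size WN in Sec. V: 3 chunks x 10 layers, dilations 1, 2, ..., 512.
print(receptive_field(3, 10))  # 3070
# 4 chunks x 4 layers, as in the Sec. V QPNet setting -> 61 samples.
print(receptive_field(4, 4))   # 61
```

At a 22,050 Hz sampling rate, 3070 samples span roughly 11 cycles of an 80 Hz tone but about 56 cycles of a 400 Hz tone, which is exactly the pitch-dependent redundancy that QPNet targets.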
Fig. 2. Code-excited linear prediction system.
Fig. 3. Fixed and pitch-dependent dilated convolution.
A. Pitch Filtering in CELP
Fig. 2 shows a flowchart of the CELP system [35], which includes an innovation signal codebook and two cascaded time-varying linear recursive filters. First, each innovation signal in the codebook is scaled and passed to the pitch filter (long delay) to generate the pitch periodicity of the speech, and then the linear-prediction filter (short delay) restores the spectral envelope to obtain the synthesized speech. Second, the mean-square errors between the original and synthesized speech signals are weighted by a linear filter to attenuate/amplify frequency components that are less/more perceptually important. Finally, the optimum innovation signal and the scale factor are determined by minimizing the weighted mean-square error. To be more specific, the pitch-filtering process can be formulated as

c^{(o)}_t = g \times c^{(i)}_t + b \times c^{(o)}_{t - t_d},   (5)

where c^{(i)} is the input, c^{(o)} is the output, t_d is the pitch delay, g is the gain, and b is the pitch filter coefficient. This periodic feedback structure handling the periodicity of signals is the basis of the proposed PDCNN, and the cascaded recursive structure modeling the hierarchical correlations is also applied to QPNet.

B. Pitch-dependent Dilated Convolution
The main idea of the PDCNN is that since audio signals have the quasi-periodic property, the network architecture can be dynamically adapted using the prior pitch information. Specifically, the dilated convolution can be formulated as

y^{(o)}_t = W^{(c)} * y^{(i)}_t + W^{(p)} * y^{(i)}_{t-d},   (6)

where y^{(i)} and y^{(o)} are the input and output of the DCNN layer, the trainable 1×1 convolution filters W^{(c)} and W^{(p)} are respectively for the current and previous samples, and * is the convolution operator. The dilation size d is constant for the vanilla DCNN but time-variant for the PDCNN.

Fig. 4. Effective receptive fields with different F0 values.

To extend the receptive field length, the vanilla WN utilizes stacked chunks including DCNN layers with different dilation sizes. Specifically, each chunk contains a specific number of DCNN layers, and each layer (except the first) doubles the dilation size of the previous one. The dilation sizes of the first layers of the chunks are set to one, so the dilation size in each chunk increases exponentially with base two. As shown in Fig. 3, the dilation sizes of the PDCNN layers in the stacked adaptive chunks of QPNet follow the same extension rule but are multiplied by an extra dilated factor to match the instantaneous pitch of the current sample. The pitch-dependent dilated factor E_t is derived from

E_t = F_s / (F_{0,t} \times a),   (7)

where F_s is the utterance-wise constant sampling rate, F_{0,t} is the fundamental frequency at speech sample index t, and a is a hyperparameter called the dense factor, which indicates the number of samples in one cycle taken into consideration when predicting the current sample. Specifically, the grid sampling locations of each DCNN are controlled by the dilation size d, and the dilation size d' of each PDCNN is controlled by the dilated factor E_t as

d' = E_t \times d.   (8)

By setting the F0 values and the dense factor a, the network can control the sparsity of the CNN sampling grids to attain the desired effective receptive field length. As shown in Fig. 4, since the sinusoids in Figs. 4 (a) and (b) have the same dense factors and sampling rates, even though their frequencies are different, the numbers of cycles in their effective receptive fields are still the same. The difference is the temporal sparsity of the effective receptive field.
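Eqs. (7) and (8) can be sketched as follows. The integer rounding of E_t and the default dense factor are illustrative choices made for the sketch; the text does not fix them here.

```python
def pitch_dependent_dilation(d, f0, fs=22050, dense_factor=4):
    """Pitch-dependent dilation d' = E_t * d of Eq. (8), with the dilated
    factor E_t = fs / (f0 * a) of Eq. (7) rounded to the nearest integer
    (the rounding scheme is an assumption of this sketch)."""
    e_t = round(fs / (f0 * dense_factor))
    return e_t * d

# At F0 = 220.5 Hz and fs = 22050 Hz, one cycle spans 100 samples; with a
# dense factor of 4, E_t = 25, so a base dilation of 2 becomes 50.
print(pitch_dependent_dilation(2, f0=220.5))  # 50
```

Lowering F0 enlarges E_t, so the same layer reaches further into the past for low-pitched samples, producing the pitch-variant sparsity described above.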
That is, fixing the number of sampling grids in each cycle by the dense factor and changing the gaps between the grid sampling locations by the instantaneous F0 values lead to pitch-dependent and time-variant effective receptive field lengths.

In summary, the dilated factor E_t is the expansion ratio of the effective receptive field length to the receptive field length, and the ratio of the receptive field length to the dense factor a is the number of past cycles in the effective receptive field. With the pitch-dependent structure, each sample has an exclusive effective receptive field length, which is efficiently extended according to the auxiliary F0 values. In addition, since speech has voiced and unvoiced segments, we have tried setting E_t for the unvoiced segments to either one or the value calculated by interpolating the F0 values of the adjacent voiced segments, and the results in Section VI show that QPNet with the continuous E_t from interpolated F0 values achieves higher speech quality.

TABLE I
ARCHITECTURE OF SINUSOIDAL GENERATIVE MODELS

                      WNf    WNc    (r)QPNet    pQPNet
Fixed chunk            3      4        3           -
Fixed block           10      4        4           -
Adaptive chunk         -      -        1           4
Adaptive block         -      -        4           4
CNN channels          128 (causal and dilated CNNs)
                      128 (CNNs in residual blocks)
                       64 (CNNs in output layers)
Size (×10^6)          2.4    1.5      1.5         1.5

C. Cascaded Autoregressive Network
Most audio signals are sequential and quasi-periodic, so audio generative models usually simultaneously model the long-term (periodicity) and short-term (aperiodicity) correlations of audio samples. As shown in Fig. 1, the proposed QPNet utilizes a cascaded architecture that contains a fixed and an adaptive (pitch-dependent) macroblock. The fixed macroblock models the sequential relationship between the current sample and a segment of the most recent samples. The adaptive macroblock models the periodic correlations of the current and related past segments in the successive cycles. Specifically, the fixed macroblock (macroblock 0 in Fig. 1) of QPNet is composed of several fixed chunks. Each fixed chunk consists of several stacked residual blocks with DCNNs (fixed blocks), conditional auxiliary features, gated activations, and residual and skip connections, similarly to the vanilla WN. The adaptive macroblock (macroblock 1 in Fig. 1) also contains several adaptive chunks, which have similar stacked residual blocks but with PDCNNs (adaptive blocks). In summary, the cascaded structure of QPNet presumably mimics the generative procedure of CELP for quasi-periodic audio signal generation.

V. PERIODIC SIGNAL GENERATION EVALUATION
To evaluate the pitch controllability of the proposed QPNet with PDCNNs, we first evaluated the generation quality of simple periodic but high-temporal-resolution signals. That is, the training data of QPNet were sine waves within a specific frequency range and the corresponding F0 values. In the test phase, QPNet was conditioned on an F0 value and a small piece of the related sine wave for the initial receptive field to generate sinusoid waveforms.

A. Model Architecture
In this section, to evaluate the effectiveness of the PDCNN, we compared three types of QPNet with two types of WN in terms of sine wave generation. Specifically, in addition to the basic QPNet, because a sinusoid is a simple periodic signal that can be modeled well by a pitch-dependent structure, a QPNet model with only adaptive residual blocks (pQPNet) was taken into account. A QPNet model with the reverse order of the fixed and adaptive macroblocks (rQPNet) was also considered. Moreover, a compact-size WN (WNc) and a full-size WN (WNf) were evaluated as references.

The details of the network architectures are shown in Table I. Since the numbers of CNN channels were the same for all models, the model sizes were proportional to the numbers of chunks and residual blocks. For instance, WNf contained 3 chunks and each chunk included 10 residual blocks, so the model size of WNf was larger than that of WNc, which only had 4 chunks with 4 residual blocks in each chunk. The learning rate was fixed (no decay), the minibatch size was one, the batch length was 22,050 samples, the number of training epochs was two, and the optimizer was Adam [47] for all models.

B. Evaluation Setting
Because the pitch range of most speech is around 80–400 Hz, the training sine waves were set to be in the same range with a step size of 20 Hz (i.e., 80, 100, 120, ..., 400 Hz). Each model had a related one-dimensional F0 value as its auxiliary feature. Since single-tone generation was evaluated, the auxiliary features of all samples in one utterance were the same. To prevent the networks from suboptimal training and a lack of generality for sinusoid generation with unseen F0 values, both the sinusoid and auxiliary signals were mixed with white noise. The signal-to-noise ratio (SNR) of the sine waves was around 20 dB, and the noise of the auxiliary feature was a random sequence between −1 and 1. Random initial phases were also applied to the sinusoid signals. The number of training utterances was 4000, and each utterance was one second long. The ground truths were clean sinusoid signals, so each model was trained as a denoising network. The test data included 20 different F0 values, namely 10–80 Hz with a step size of 10 Hz, 100–400 Hz with a step size of 100 Hz, and 450–800 Hz with a step size of 50 Hz, and each F0 value contained 10 test utterances with different phase shifts. Both training and test data were encoded into 8 bits using the µ-law, and the sampling rate was 22,050 Hz.

In the test stage, the initial receptive field of each network was fed with the noisy test sine wave, and the length of the generated sinusoid was set to 1 s. The quality of each generated waveform was evaluated on the basis of the SNR and the root-mean-square error (RMSE) of the log F0 value measured from the peak of the power spectral density (PSD). Moreover, the test data were divided into 10–40 Hz (under F0L/2), 50–80 Hz
TABLE IISNR (dB) OF S INUSOID G ENERATION WITH D IFFERENT D ENSE F ACTORS
Dense a Under / L / L / U / U -8.1 -0.8 -0.3 2.7 TABLE IIIL OG F RMSE OF S INUSOID G ENERATION WITH D IFFERENT D ENSE F ACTORS
Dense a Under / L / L / U / U (above / L ), 100–400 Hz (inside), 450600 Hz (under / U ),and 650800 (above / U ) subsets. L is the lower bound and U is the upper bound of the inside F range, which was the F range of the training data. As a result, the under / L andabove / L F ranges are the lower outside F range, and theunder / U and above / U F ranges are the higher outside F range. C. Dense Factor
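As background for this exploration, the dense factor a scales the pitch-dependent dilated factor of the PDCNN layers, E_t = F_s / (F_0 × a), as introduced in Section III. The following sketch illustrates the relation; the function name and the policy of rounding E_t to the nearest integer (floored at 1) are our assumptions:

```python
FS = 22050  # sampling rate (Hz), as in the experiments

def pitch_dependent_dilation(f0, dense_factor, layer_index):
    """Dilation of the i-th PDCNN layer: d = E_t * 2**i, where
    E_t = Fs / (f0 * dense_factor) is the pitch-dependent dilated
    factor. Rounding E_t to the nearest integer (>= 1) is an
    assumption of this sketch."""
    e_t = max(1, round(FS / (f0 * dense_factor)))
    return e_t * 2 ** layer_index

# With dense factor a = 64, E_t collapses to 1 once F0 exceeds
# Fs / 64 (about 345 Hz), i.e., the PDCNN degenerates to a plain DCNN.
print(pitch_dependent_dilation(400.0, 64, 0))  # -> 1
print(pitch_dependent_dilation(100.0, 4, 2))   # -> 220 (E_t = 55)
```

Because E_t shrinks as F0 grows, a large dense factor leaves no pitch adaptivity at high F0, which is the degeneration discussed below.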
To explore an efficient dense factor value for the PDCNNs, the sinusoid generative quality of pQPNet models with different dense factors was evaluated. Since the chunk and block numbers of pQPNet were both set to four, the length of its receptive field was 61 samples. That is, the receptive field covered from 61 past cycles down to less than one cycle as the dense factor increased from 1 to 64. Moreover, in contrast to covering a fixed number of past cycles for sinusoids of arbitrary pitch, the receptive field of WNf contained 11 past cycles for 80 Hz sinusoids and 56 past cycles for 400 Hz sinusoids when the sampling rate was 22,050 Hz.

The number of training epochs was two for most dense factors, but for the smallest dense factors pQPNet required at least 10 training epochs to attain stable results. As shown in Tables II and III, the network with a dense factor of 1 was very unstable even when trained with 10 epochs. The results indicate that although a small dense factor gives the network a long effective receptive field, the overbrief information about each past cycle might make it difficult to model signals well. For the inside and lower outside F0 ranges, the networks with all but the smallest dense factors achieved high SNR values. However, the performance of the network with a dense factor of 64 markedly degraded when the auxiliary F0 values were in the higher outside F0 range. The possible reason is that the PDCNNs of this network degenerated to DCNNs because E_t became one when the dense factor was 64 and the F0 values were higher than 350 Hz. Moreover, the log F0 RMSE results show a tendency similar to the SNR results: the networks with the extreme dense factors achieved the lowest pitch accuracies, while those with intermediate dense factors achieved the highest.

In conclusion, the PDCNN with an appropriate dense factor was found to be robust against conditions in the outside F0 range, especially the lower outside F0 range. For the higher outside F0 range, the networks still had acceptable quality until the F0 value exceeded 600 Hz. Therefore, we set the dense factor to 4 for the models in the following evaluations because of the balance between the generative performance and the number of past cycles covered in the receptive field.

TABLE IV: SNR (dB) of Sinusoid Generation with Different Models

          WNc    WNf   pQPNet  QPNet  rQPNet
Average   3.7    19.8   19.4    7.2    17.0

TABLE V: Log F0 RMSE of Sinusoid Generation with Different Models

D. Network Comparison
As shown in Tables IV and V, the PDCNNs significantly improved pitch controllability. The PDCNNs made the QP-series networks achieve much higher SNR and lower log F0 RMSE values than the same-size WNc network in both the higher and lower outside F0 ranges, which shows the effectiveness of the PDCNNs in extending the effective receptive field length. Although the full-size WNf attained SNRs similar to those of pQPNet, the log F0 RMSE of WNf was much higher in the outside F0 ranges. This indicates that WNf tended to generate signals in the inside F0 range instead of being consistent with the auxiliary F0 feature, so the generated waveform of WNf might still be a perfect sinusoid but with an incorrect pitch. The results also imply that the PDCNNs improved the periodicity-modeling capability by using prior pitch knowledge.

Fig. 5. Waveform and PSD of a 500 Hz sinusoid generated by pQPNets with different dense factors ((a), (b)), ((c), (d)), and ((e), (f)).

In addition, because of the simple periodic signal generation scenario, pQPNet, with the longest effective receptive field and the pure PDCNN structure, attained the best generative performance among all QP-series networks. QPNet and rQPNet showed some quality degradation when the auxiliary F0 values were far from the inside F0 range, but they still outperformed WNc in both measurements and WNf in terms of log F0 RMSE.
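The two objective measures reported in Tables II–V can be sketched as below. This is a minimal illustration with our own function names; the paper specifies only that F0 is measured from the PSD peak, so the periodogram estimator here is an assumption:

```python
import numpy as np

FS = 22050  # sampling rate (Hz), as in the experiments

def psd_peak_f0(x, fs=FS):
    """Estimate F0 as the frequency of the periodogram (PSD) peak."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(psd)]

def snr_db(reference, generated):
    """SNR (dB) of a generated waveform against the clean reference."""
    noise = generated - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_f0_rmse(f0_true, f0_measured):
    """RMSE between log-scale F0 values."""
    err = np.log(np.asarray(f0_measured, dtype=float)) - np.log(np.asarray(f0_true, dtype=float))
    return float(np.sqrt(np.mean(err ** 2)))

# A 1 s clean 200 Hz sinusoid recovers its own F0 from the PSD peak
# (1 Hz bin resolution for a 1 s signal).
t = np.arange(FS) / FS
clean = np.sin(2 * np.pi * 200.0 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(FS)  # roughly 17 dB SNR
print(round(psd_peak_f0(clean), 1))  # -> 200.0
print(round(snr_db(clean, noisy), 1))
```

A clean sinusoid with a shifted PSD peak thus yields a high SNR but a high log F0 RMSE, which is exactly the WNf failure mode described above.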
E. Discussion
In this section, several sinusoid generation examples are presented to look into the physical phenomena behind the objective results. As shown in Figs. 5 (a) and (b), the pQPNet with a well-chosen dense factor generated clear sine waves with an SNR of 23.7 dB when conditioned on an outside auxiliary value of 500 Hz (under 3/2 F_U). The PSD of this generated signal has a peak at 502 Hz, which is very close to the ground truth, and the log F0 error is less than 0.01. However, the results in Figs. 5 (c) and (d) show that the sine wave generated by the pQPNet with a second dense factor setting includes much harmonic noise, which results in a low SNR. Even though the generated sine wave still resembles a periodic signal, the wrong peak value, taken from the second harmonic component of the PSD, also causes a high log F0 error. Moreover, the results in Figs. 5 (e) and (f) show that the pQPNet with a third dense factor setting generated a very noisy signal, which has a very low SNR and a wrong peak value.

In addition, as shown in Figs. 6 (a) and (b), the pQPNet still generated a clear sine wave with an SNR of 23.3 dB and a correct peak value of its PSD when conditioned on an outside 20 Hz (under 1/2 F_L) auxiliary value. However, the same-size WNc could not generate any meaningful signal, and the SNR of its generated signal is very low, as shown in Figs. 6 (c) and (d). By contrast, WNf still generated a clear sine wave with an SNR of 33 dB, but its frequency is incorrect, as shown in Figs. 6 (e) and (f).

Fig. 6. Waveform and PSD of a 20 Hz sinusoid generated by pQPNet ((a), (b)), WNc ((c), (d)), and WNf ((e), (f)).
Specifically, the PSD peak value is 120 Hz, which implies that WNf tends to generate seen signals even when conditioned on an unseen auxiliary feature.

In conclusion, the SNR reflects the noisiness of the generated signals, indicating whether they are clean sinusoids or not. Since this was a single-tone sinusoid generation test, a high log F0 RMSE implies that the generated signal includes much harmonic noise or that its frequency is incorrect. As a result, a generated signal with a high SNR and a high RMSE is a clean sinusoid with an inaccurate frequency, like the signal shown in Fig. 6 (e). A generated signal with a low SNR and a high RMSE is a noisy sinusoid with much harmonic noise, like the signal shown in Fig. 5 (c). A generated signal with a very low SNR is a noise-like signal, as shown in Figs. 5 (e) and 6 (c).

TABLE VI: Architecture of Speech Generative Models

                                      WNf   WNc   (r)QPNet   Full-size (r)QPNet
Fixed chunk                            3     4       3             3
Fixed block                           10     4       4            10
Adaptive chunk                         -     -       1             1
Adaptive block                         -     -       4             4
CNN channel (causal and dilated CNN)        512
CNN channel (CNN in residual block)         512
CNN channel (CNN in output layer)           256
Size (×10^6)                          44    24      24            50

VI. SPEECH GENERATION EVALUATIONS
In this section, we evaluate the effectiveness of the PDCNNs for speech generation. The appropriate proportion of adaptive to fixed residual blocks, the continuous pitch-dependent dilated factor, and the order of the macroblocks are explored.
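The continuous and discrete variants of the pitch-dependent dilated factor E_t compared in this section differ only in how unvoiced frames (F0 = 0) are handled. A minimal sketch, assuming linear interpolation through unvoiced regions, nearest-integer rounding, and a dense factor of 4 (the interpolation and rounding details are our assumptions, not the released code):

```python
import numpy as np

FS = 22050          # sampling rate (Hz)
DENSE_FACTOR = 4    # assumed dense factor for the speech models

def dilated_factors(f0, continuous=True):
    """Frame-level pitch-dependent dilated factors E_t.
    Unvoiced frames have f0 == 0. The discrete variant sets E_t = 1
    there; the continuous variant first interpolates F0 linearly
    through the unvoiced regions."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if continuous and voiced.any():
        idx = np.arange(len(f0))
        f0 = np.interp(idx, idx[voiced], f0[voiced])  # fill unvoiced gaps
    e_t = np.ones_like(f0)
    e_t[f0 > 0] = np.maximum(1, np.round(FS / (f0[f0 > 0] * DENSE_FACTOR)))
    return e_t

# A toy F0 track with unvoiced frames at both ends and in the middle.
f0_track = [0, 0, 150, 160, 0, 0, 170, 0]
print(dilated_factors(f0_track, continuous=False))
print(dilated_factors(f0_track, continuous=True))
```

The discrete variant jumps between E_t = 1 and large voiced values at every voicing boundary, whereas the continuous variant changes smoothly, which is the property credited below for its better MCDs.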
A. Model Architecture
The quality of speech generation was evaluated on the basis of 11 vocoders covering three types: QPNet, WN, and WORLD. Specifically, to explore the efficient receptive field extension by the PDCNNs, the compact-size QPNet vocoders were compared with the same-size WNc and double-size WNf vocoders. Furthermore, the evaluations included several variants of QPNet, such as models with different types of pitch-dependent dilated factor E_t and different orders of the fixed and adaptive macroblocks. Specifically, the QPNet and rQPNet vocoders with continuous and discrete E_t sequences were evaluated. For the unvoiced frames, the discrete E_t sequence was set to ones, and the continuous E_t sequence was calculated using interpolated F0 values as mentioned in Section IV. In addition, the full-size QPNet and rQPNet vocoders, which were full-size WN vocoders cascaded with four extra adaptive residual blocks, were also taken into consideration to explore the effect of the ratio of adaptive to fixed residual blocks. The network architectures and model sizes are shown in Table VI. The learning rate was 1 × 10^-4 without decay, the minibatch size was one, the batch length was 20,000 samples, and the optimizer was Adam [47] for all models. Since even the compact-size WNc had tens of millions of parameters, which was the same order of magnitude as that of WNf, the training iterations were empirically set to 200,000 for all models. Note that we did not evaluate speech generation using the pQPNet model because it failed to model the short-term correlations of speech according to our internal experiments.

B. Evaluation Setting
All models were trained in a multispeaker manner. The training corpus of these multispeaker NN-based vocoders consisted of the training sets of the "bdl" and "slt" speakers of CMU-ARCTIC [48] and all speakers of VCC2018 [49]. The total number of training utterances was around 3000, and the total training data length was around four hours. The evaluation corpus was composed of the SPOKE set of VCC2018, which included two female and two male speakers, each with 35 test utterances. All speech data were set to a sampling rate of 22,050 Hz and a 16-bit resolution. The waveform signals for the categorical output of the NN-based vocoders were further encoded into 8 bits using the µ-law. The 513-dimensional spectral (sp) and aperiodicity (ap) features and the one-dimensional F0 feature were extracted using WORLD. The sp feature was further parameterized into 34-dimensional mel-cepstral coefficients (mcep), ap was coded into two-dimensional components, and F0 was converted into a continuous F0 and a voiced/unvoiced (U/V) binary code for the auxiliary features [30]. The F0 range of the SPOKE set was around 40–330 Hz, and the F0 mean was around 150 Hz. The unseen outside auxiliary features were simulated by replacing the original F0 values of the acoustic features with scaled F0 values, and the scaling ratios were 1/2, 3/4, 5/4, 3/2, and 2. A demo and an open-source QPNet implementation can be found in [50].

TABLE VII: Comparison of QPNet with Different Dense Factors
MCD (dB): 4.05 / ...
F0 RMSE:  0.23 / 0.17 / 0.15
U/V (%):  21.8 / 16.0 / 14.2

TABLE VIII: Comparison of Effective Receptive Field Length (samples) for Different Dense Factors

C. Objective Evaluation
For the objective evaluations, the ground truth acoustic features were extracted from natural speech utterances using WORLD, and the extraction error of WORLD was neglected. A speaker-dependent F0 range was applied to the feature extraction of each speaker to improve the extraction accuracy, and the F0 range was set following the process in [51]. Since WORLD was developed to extract F0-independent spectral features [5], the WORLD-extracted sp feature was assumed to be independent of the F0 feature in this paper. Therefore, the ground truth acoustic features for the scaled-F0 scenarios were the same natural spectral features with the F0 feature scaled by the assigned ratio. The auxiliary features of the evaluated vocoders were the ground truth acoustic features.

Mel-cepstral distortion (MCD) was applied to measure the spectral reconstruction capability of the vocoders, and the MCD was calculated between the auxiliary mcep and the mcep extracted by WORLD from the generated speech. The pitch accuracy of the generated speech was evaluated using the RMSE between the auxiliary F0 and the F0 extracted by WORLD from the generated speech in the logarithmic domain. The unvoiced/voiced (U/V) decision error, i.e., the percentage of frames in each utterance whose unvoiced/voiced decisions differ, was also taken into account in the evaluation of the prosodic prediction capability.

TABLE IX: MCD (dB) with Frame-based 95% Confidence Interval (CI) of Different Generation Models for Speech Generation
(models: WORLD, WNc, WNf, QPNet, full-size QPNet, rQPNet, and full-size rQPNet, the (r)QPNet variants with continuous or discrete E_t; conditions: 1, 1/2, 3/4, 5/4, 3/2, and 2 × F0, and the average)

Since speech generation is more complicated than sine wave generation, we first conducted an objective evaluation of QPNet models with different dense factors for speech generation to check the consistency of the efficient dense factor value. As shown in Table VII, the tendency of the objective evaluation is similar to the results of the sinusoid generation evaluation. That is, the QPNets with the smaller dense factors achieved similar generative performance, while the speech quality and pitch accuracy of the QPNets with the two largest dense factors markedly degraded because of their much shorter effective receptive field lengths. Specifically, as shown in Table VIII, the average effective receptive field lengths of the QPNets with the largest dense factors are much shorter than the others, and the lengths were too short to cover even one cycle of a 150 Hz signal, which was the F0 mean of the SPOKE set. Furthermore, although the QPNet with a dense factor of 1 had the longest average effective receptive field length and achieved an acceptable MCD, its higher log F0 RMSE and U/V error indicate its instability, which was also observed in the sinusoid generation evaluation. In conclusion, the dense factor of the following QPNet-series models was set to 4 because of the lowest log F0 RMSE and U/V error with an acceptable MCD.
The internal subjective evaluation results also show a preference for the utterances generated by the QPNet with the selected dense factor.

As shown in Table IX, in terms of spectral prediction capability, the compact-size (r)QPNet vocoders with the proposed PDCNNs significantly outperformed the same-size WNc vocoder. The results confirm the effectiveness of the QP structure, which skips redundant samples using prior pitch knowledge for a more efficient receptive field extension. However, the MCDs of the double-size WNf vocoder are lower than those of the compact-size (r)QPNet vocoders, and the full-size (r)QPNet vocoders with the largest network size also outperformed the WNf vocoder in terms of MCD. The results indicate that the MCD values are highly related to the network sizes, so a deeper network attains a more powerful spectral modeling capability. Furthermore, the systems with continuous pitch-dependent dilated factors achieved better MCDs than those with discrete ones, and this result is consistent with our internal subjective evaluation of speech quality. However, the MCD differences between the rQPNet and QPNet vocoders were not reflected in the perceptual quality, and they had similar speech quality according to the internal evaluation.

The log F0 RMSE results in Table X also show that both the compact-size QPNet and rQPNet vocoders attained markedly higher pitch accuracy than the same-size WNc vocoder, particularly when conditioned on unseen F0 with a large shift. The compact-size QPNet vocoder even achieved higher pitch accuracy than the WNf vocoder. The results indicate that the PDCNNs with prior pitch knowledge improved the pitch controllability of these vocoders against unseen F0. However, the pitch accuracies of the full-size QPNet and rQPNet vocoders are lower than those of the compact-size (r)QPNet vocoders. The possible reason is that the unbalanced proportion of adaptive and fixed residual blocks impaired the pitch controllability.
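The MCD measure used above can be written out in a few lines. This is a simplified sketch, not the authors' exact tooling; excluding the 0th (energy) coefficient is a common convention and an assumption here:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_gen):
    """Average frame-wise mel-cepstral distortion (dB):
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    summed over cepstral dimensions d >= 1 and averaged over frames."""
    diff = np.asarray(mcep_ref, dtype=float)[:, 1:] - np.asarray(mcep_gen, dtype=float)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical mcep sequences give 0 dB; a unit offset on one coefficient
# gives (10/ln 10) * sqrt(2), about 6.14 dB.
ref = np.zeros((10, 35))   # 10 frames, energy + 34 mcep dimensions
gen = ref.copy()
gen[:, 1] += 1.0
print(mel_cepstral_distortion(ref, ref))            # -> 0.0
print(round(mel_cepstral_distortion(ref, gen), 3))  # -> 6.142
```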
That is, for the full-size (r)QPNet vocoders, the number of fixed blocks is markedly larger than the number of adaptive blocks. Therefore, the network might be dominated by the fixed blocks, which degrades the influence of the adaptive blocks. Specifically, for the compact-size (r)QPNet vocoders, the receptive field length of the fixed blocks is 46 samples (the details of the receptive field lengths can be found in the Discussion), and the average effective receptive field length of the adaptive blocks is 384 samples, as shown in Table VIII. However, for the full-size (r)QPNet vocoders, the receptive field length of the fixed blocks is 3070 samples, which is much longer than the 384 samples of the extra four adaptive blocks. Therefore, the influence of the adaptive blocks might be very limited.

TABLE X: Log F0 RMSE with Utterance-based 95% CI of Different Generation Models for Speech Generation

TABLE XI: U/V Decision Error Rate (%) with Utterance-based 95% CI of Different Generation Models for Speech Generation

As shown in Table XI, the compact-size QPNet vocoder attained the lowest U/V decision error among all NN-based vocoders, which indicates a higher capability to capture
U/V information. In conclusion, the compact-size QPNet vocoder with the proposed PDCNNs and continuous pitch-dependent dilated factors attained the highest accuracy of pitch and U/V information among the evaluated NN-based vocoders. Although the compact-size QPNet vocoder did not achieve the same spectral prediction capability as the WNf vocoder according to the MCD results, it is difficult to measure a perceptual quality difference only on the basis of MCD. As a result, we subjectively evaluated the compact-size QPNet (with continuous pitch-dependent dilated factors), WNc, and WNf vocoders in the next section. Moreover, although the WORLD vocoder had the best objective evaluation results, WORLD-generated speech usually lacks naturalness and contains buzzy noise, which may not be reflected in the objective measurements. Therefore, we also included the WORLD vocoder in our subjective evaluations.
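The U/V decision error reported in Table XI is simply the frame-level disagreement rate between the reference and generated voicing decisions; as a sketch (function name is ours):

```python
import numpy as np

def uv_decision_error(uv_ref, uv_gen):
    """U/V decision error: percentage of frames whose voiced/unvoiced
    decisions differ between reference and generated speech."""
    uv_ref = np.asarray(uv_ref, dtype=bool)
    uv_gen = np.asarray(uv_gen, dtype=bool)
    return 100.0 * float(np.mean(uv_ref != uv_gen))

# Two of five frames disagree -> 40% error.
print(uv_decision_error([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # -> 40.0
```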
D. Subjective Evaluation
The subjective evaluations included a Mean Opinion Score (MOS) test for speech quality and an ABX preference test for perceptual pitch accuracy. Specifically, in the MOS test, the naturalness of each utterance in the evaluation set was evaluated by several listeners, who assigned a score of 1–5 to each utterance; the higher the score, the greater the naturalness. The MOS evaluation set was composed of randomly selected utterances generated by the WORLD, WNf, WNc, and QPNet vocoders using the auxiliary features with 1/2 F0, 3/2 F0, and unchanged F0. The compact-size QPNet vocoder with continuous dilated factors was adopted and is abbreviated as QPNet in the subjective evaluations. We randomly selected 20 utterances from the 35 test utterances of each condition and each speaker to form the MOS evaluation set, so the number of utterances in the set was 960. The MOS evaluation set was divided into five subsets, and each subset was evaluated by two listeners, so the total number of listeners was 10. All listeners took the test using the same devices in the same quiet room. Although the listeners were not native speakers, they had worked on speech or audio generation research.

In the ABX preference test, the listeners compared two test utterances (A and B) with one reference utterance (X) to evaluate which test utterance had a pitch contour more consistent with that of the reference. Because natural speech with the desired scaled F0 does not actually exist, and conventional vocoders usually have high pitch accuracy, we took the WORLD-generated speech as the reference. The ABX evaluation set consisted of the same generated utterances of the WNf, QPNet, and WORLD vocoders as in the MOS evaluation set. The number of ABX utterance pairs was 240, and each pair was evaluated by two of the same 10 listeners as in the MOS test.

Fig. 7. Sound quality MOS evaluation of female speakers with 95% CI (conditions: unchanged F0, 1/2 F0, and 3/2 F0; vocoders: WORLD, WNf, WNc, and QPNet).

As shown in Fig. 7, for the female speaker set, the QPNet vocoder significantly outperformed the same-size WNc vocoder in all cases. Although the QPNet vocoder achieved slightly lower naturalness than the WNf vocoder in the unchanged F0 (inside) case, it still attained markedly better naturalness than the WNf vocoder in the 1/2 F0 (outside) case. The results indicate that halving the network size markedly degraded the speech modeling capability of the WN vocoder, whereas the proposed PDCNNs significantly compensated for this degradation, especially in the 1/2 F0 case, which gives QPNet a long effective receptive field length.
On the other hand, owing to the small dilated factors caused by the high F0 values, many of the PDCNNs might degenerate to DCNNs in the 3/2 F0 case. Specifically, when the dilated factors are less than or equal to one because of the high F0 values, the dilation sizes of the PDCNNs are less than or equal to those of the corresponding DCNNs. As a result, when conditioned on the auxiliary features with 3/2 F0, although the QPNet vocoder still outperformed the WNc vocoder, the speech quality of the WNf and WORLD vocoders was higher than that of the QPNet vocoder.
Fig. 8. Sound quality MOS evaluation of male speakers with 95% CI (conditions: unchanged F0, 1/2 F0, and 3/2 F0; vocoders: WORLD, WNf, WNc, and QPNet).
In addition, as shown by the results of the male speaker set in Fig. 8, the QPNet vocoder achieved naturalness comparable to that of the WNf vocoder in all F0 cases, which is significantly better than that of the WNc vocoder. Specifically, most of the 3/2 F0 values of the male speakers are still within the range of normal female F0, so the effective receptive field lengths of the QPNet vocoder are apparently longer than the receptive field length of the WNc vocoder in all the subjective evaluations of the male speaker set. On the other hand, the WORLD vocoder shows almost the same tendency in the evaluations of both the female and male speaker sets. That is, it shows lower naturalness than the WNf vocoder in the unchanged F0 case and much lower speech quality than both the WNf and QPNet vocoders in the 1/2 F0 case, whereas its naturalness only slightly degrades in the 3/2 F0 case.

As shown in Figs. 9 and 10, the QPNet vocoder significantly outperformed the WNf vocoder in terms of pitch accuracy in all F0 cases for both the female and male sets, except in the unchanged F0 case of the female set, which may be caused by the naturalness degradation. The results confirm the pitch controllability improvement of the QPNet vocoder with the PDCNNs. In summary, the QPNet vocoder with a more compact network size achieved speech quality comparable to that of the WNf vocoder under most conditions except for the female set with 3/2 F0, because the higher F0 values might make the PDCNNs degenerate to DCNNs. The QPNet vocoder conditioned on unseen F0 also attains markedly higher pitch accuracy than the WNf vocoder. Moreover, the QPNet vocoder achieved higher or comparable speech quality compared with the WORLD vocoder under most conditions except when conditioned on the unseen 3/2 female F0.

E. Discussion
As shown in Fig. 11, the length of the receptive field of WNf is 3070 samples (the receptive field length of the 10 blocks in each chunk is 1 + 2 + 4 + ··· + 512 = 1023, so the total length is 3 × 1023 with an extra one sample from the causal layer), that of WNc is 61 samples (each chunk contributes 1 + 2 + 4 + 8 = 15, so the total receptive field length is 4 × 15 + 1), and that of QPNet is on the order of hundreds of samples (the receptive field length of the fixed blocks and the causal layer is 3 × 15 + 1 = 46, and that of the adaptive blocks is 15 × E_t, where E_t is the pitch-dependent dilated factor given the selected dense factor). The receptive field lengths of WNf and WNc are constant because of their fixed network structures, whereas the receptive field length of QPNet is time-variant and pitch-dependent because of the QP structure.

Fig. 9. Pitch accuracy ABX evaluation of female speakers with 95% CI (preference scores for unchanged, 1/2, and 3/2 F0: WNf 59%, 20%, and 16%; QPNet 41%, 80%, and 84%).

Fig. 10. Pitch accuracy ABX evaluation of male speakers with 95% CI (preference scores for unchanged, 1/2, and 3/2 F0: WNf 32%, 29%, and 39%; QPNet 68%, 71%, and 61%).

Fig. 11 also shows the effective receptive field length distributions of the female and male speakers of the SPOKE set. We find that the effective receptive field lengths of both the male and female speakers of the SPOKE set are apparently longer than the receptive field length of WNc, which concurs with the evaluation results showing that QPNet significantly outperforms WNc. Furthermore, most of the effective receptive field lengths of the female set are shorter than those of the male set, which is caused by the higher F0 values of the female speakers. The distribution results also imply that the effective receptive field length of QPNet is close to the receptive field length of WNc when conditioned on the female 3/2 F0 because most PDCNNs degenerate to DCNNs. In conclusion, the performance of AR models is highly related to the length of the receptive field.

However, the length of the receptive field may be more strongly correlated with the quality of the generated speech, whereas a balanced proportion of the adaptive and fixed modules may be an essential factor for pitch accuracy. Specifically, although the full-size QPNet has the longest effective receptive field length and achieves the lowest MCD, its pitch accuracy is still lower than that of the compact-size QPNet.

Fig. 11. Distributions of receptive field lengths of different vocoders (x-axis: receptive field length; y-axis: probability; curves: WNc, WNf, QPNet-M, and QPNet-F).
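The receptive-field arithmetic quoted in the parentheses above can be reproduced directly (a sketch; the example E_t value of 26 is an illustrative assumption, not a number from the paper):

```python
# Dilations double within a chunk (1, 2, 4, ..., 2**(blocks-1)); adaptive
# blocks multiply every dilation by the pitch-dependent factor E_t.
def receptive_field(chunks, blocks, e_t=1):
    """Receptive field (samples) of stacked dilated-conv chunks."""
    per_chunk = sum(e_t * 2 ** i for i in range(blocks))
    return chunks * per_chunk

# WNf: 3 chunks x 10 blocks, plus 1 sample from the causal layer.
print(receptive_field(3, 10) + 1)   # -> 3070
# WNc: 4 chunks x 4 blocks, plus 1.
print(receptive_field(4, 4) + 1)    # -> 61
# QPNet: fixed 3 chunks x 4 blocks (plus 1) cascaded with 1 adaptive
# chunk x 4 blocks scaled by E_t; e_t = 26 is just an example value.
e_t = 26
print(receptive_field(3, 4) + 1 + receptive_field(1, 4, e_t))  # -> 436
```

The fixed part contributes the constant 46 samples, while the adaptive part contributes 15 × E_t, which is why the QPNet receptive field tracks the conditioned pitch.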
The possible reason is that the full-size QPNet is dominated by the fixed blocks, because its number of fixed blocks is much larger than its number of adaptive blocks, whereas the numbers of fixed and adaptive blocks of the compact-size QPNet are more balanced.

Furthermore, as shown in Tables I and VI, the number of trainable parameters of the compact-size QPNet model is around half of that of the WNf model, so only about 75% of the training time and 40% of the generation time were required. However, because of the very long effective receptive field, the memory usage of QPNet in the training stage was almost the same as that of WNf. The huge memory requirement in the training process limits the possible ratio of fixed to adaptive modules, which leads to the unbalanced proportion problem. Therefore, improving the efficiency of memory usage will be one of the main tasks of future QPNet research.

VII. CONCLUSION
In this paper, we proposed a WaveNet-like audio waveform generation model named QPNet, which models quasi-periodic, high-temporal-resolution audio signals on the basis of an NN-based AR model with a novel PDCNN component and a cascaded AR structure. Specifically, the PDCNN component is a variant of the DCNN that dynamically changes its dilation size according to the conditioned F0 for modeling the long-term correlations of audio samples. The sinusoid generation evaluation results show that the PDCNNs significantly improve the periodicity-modeling capability of the generation network by using the introduced prior frequency information. Furthermore, the QPNet model as a vocoder models the short- and long-term correlations of speech samples on the basis of the cascaded fixed and adaptive macroblocks, respectively. The speech generation evaluation results indicate that the proposed QPNet vocoder attains much higher pitch accuracy than, and comparable speech quality to, the WN vocoder, especially when conditioned on unseen auxiliary F0 values. The network size and generation time of the QPNet vocoder are only around half of those of the WN vocoder. In conclusion, the proposed QPNet model with the novel PDCNN component and compact cascaded network architecture significantly improves the pitch controllability of the vanilla WN model, making the QPNet vocoder more in line with the definition of a vocoder. In our future work, we will explore improvements in memory usage and optimize the proportion of adaptive to fixed blocks.

REFERENCES

[1] H. Dudley, "The vocoder,"
Bell Labs Record , vol. 18, no. 4, pp. 122–126,1939.[2] M. R. Schroeder, “Vocoders: Analysis and synthesis of speech,”
Proc.IEEE , vol. 54, no. 5, pp. 720–734, 1966.[3] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on asinusoidal representation,”
IEEE Transactions on Acoustics, Speech, andSignal Processing , vol. 34, no. 4, pp. 744–754, 1986.[4] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuringspeech representations using a pitch-adaptive time–frequency smoothingand an instantaneous-frequency-based f0 extraction: Possible role of arepetitive structure in sounds,”
Speech Communication , vol. 27, no. 3-4,pp. 187–207, 1999.[5] M. Morise, F. Yokomori, and K. Ozawa, “World: a vocoder-basedhigh-quality speech synthesis system for real-time applications,”
IEICETransactions on Information and Systems , vol. 99, no. 7, pp. 1877–1884,2016.[6] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet:A generative model for raw audio,” in
Proc. SSW9 , Sept. 2016, p. 125.[7] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” in
Proc. ICLR , Apr. 2017.[8] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-timespeaker-dependent neural vocoder,” in
Proc. ICASSP , Apr. 2018, pp.2251–2255.[9] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande,E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, andK. Kavukcuoglu, “Efficient neural audio synthesis,” in
Proc. ICML , July2018, pp. 2415–2424.[10] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthe-sis through linear prediction,” in
Proc. ICASSP , May 2019, pp. 5891–5895.[11] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo,F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, andD. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in
Proc. ICML, July 2018, pp. 3915–3923.
[12] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, May 2019.
[13] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, May 2019, pp. 3617–3621.
[14] S. Kim, S.-G. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet: A generative flow for raw audio,” in Proc. ICML, June 2019, pp. 3370–3378.
[15] K. Oura, K. Nakamura, K. Hashimoto, Y. Nankaku, and K. Tokuda, “Deep neural network based real-time speech vocoder with periodic and aperiodic inputs,” in Proc. SSW10, Sept. 2019, pp. 13–18.
[16] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in Proc. ICASSP, May 2019, pp. 5916–5920.
[17] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform models for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2020.
[18] S. Dieleman, A. van den Oord, and K. Simonyan, “The challenge of realistic music generation: Modelling raw audio at scale,” in Advances in Neural Information Processing Systems, 2018, pp. 7989–7999.
[19] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.
[20] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, Apr. 2018, pp. 4779–4783.
[21] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet based low rate speech coding,” in Proc. ICASSP, Apr. 2018, pp. 676–680.
[22] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, “Speech enhancement using Bayesian WaveNet,” in
Proc. Interspeech, 2017, pp. 2013–2017.
[23] D. Rethage, J. Pons, and X. Serra, “A WaveNet for speech denoising,” in Proc. ICASSP, Apr. 2018, pp. 5069–5073.
[24] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with WaveNet-based waveform generation,” in Proc. Interspeech, 2017, pp. 1138–1142.
[25] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “NU voice conversion system for the voice conversion challenge 2018,” in Proc. Odyssey, 2018, pp. 219–226.
[26] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, “The NU non-parallel voice conversion system for the voice conversion challenge 2018,” in Proc. Odyssey, 2018, pp. 211–218.
[27] Y.-C. Wu, K. Kobayashi, T. Hayashi, P. L. Tobing, and T. Toda, “Collapsed speech segment detection and suppression for WaveNet vocoder,” in Proc. Interspeech, Sept. 2018, pp. 1988–1992.
[28] Y.-C. Wu, P. L. Tobing, K. Kobayashi, T. Hayashi, and T. Toda, “Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression,” IEEE Access, vol. 8, pp. 62094–62106, 2020.
[29] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proc. ICLR, May 2016.
[30] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, Aug. 2017, pp. 1118–1122.
[31] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for WaveNet vocoder,” in Proc. ASRU, Dec. 2017, pp. 712–718.
[32] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation,” in Proc. ICASSP, Apr. 2018, pp. 5664–5668.
[33] N. Adiga, V. Tsiaras, and Y. Stylianou, “On the use of WaveNet as a statistical vocoder,” in Proc. ICASSP, Apr. 2018, pp. 5674–5678.
[34] S. Singhal and B. Atal, “Improving performance of multi-pulse LPC coders at low bit rates,” in
Proc. ICASSP, vol. 9, 1984, pp. 9–12.
[35] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Proc. ICASSP, vol. 10, Apr. 1985, pp. 937–940.
[36] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation,” in Proc. Interspeech, Sept. 2019, pp. 196–200.
[37] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, “Statistical voice conversion with quasi-periodic WaveNet vocoder,” in Proc. SSW10, Sept. 2019.
[38] Y. Ai, H.-C. Wu, and Z.-H. Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in Proc. ICASSP, Apr. 2018, pp. 5659–5663.
[39] Y. Cui, X. Wang, L. He, and F. K. Soong, “A new glottal neural vocoder for speech synthesis,” in Proc. Interspeech, 2018, pp. 2017–2021.
[40] L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku, “Speaker-independent raw waveform model for glottal excitation,” in Proc. Interspeech, 2018.
[41] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
[42] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
[43] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[44] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, Dec. 2014, pp. 2672–2680.
[46] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[47] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, May 2015.
[48] J. Kominek and A. W. Black, “The CMU ARCTIC speech databases for speech synthesis research,” Tech. Rep. CMU-LTI-03-177, 2003.
[49] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey, June 2018, pp. 195–202.
[50] Y.-C. Wu, QPNet demo, Accessed: 2019. [Online]. Available: https://bigpon.github.io/QuasiPeriodicWaveNet_demo/
[51] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in