Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network
Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda
JOURNAL OF LATEX CLASS FILES, VOL. 0, NO. 0, JULY 2020
Abstract—In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency (F0) features are outside the F0 range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary F0 features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependences of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary F0 features and the effectiveness of the cascaded structure for speech generation.

Index Terms—WaveNet, pitch-dependent dilated convolution, quasi-periodic structure, vocoder, pitch controllability.
I. INTRODUCTION

Raw waveform generation of audio signals like speech and music is a core technique of many applications such as text-to-speech (TTS), voice conversion (VC), and music synthesis. However, because of the extremely high temporal resolution (sampling rates are usually higher than 16 kHz) and the very long-term dependence of audio signals, directly modeling the raw waveform signals is challenging. To overcome these difficulties, in conventional synthesis techniques, audio signals are usually encoded into low-temporal-resolution acoustic features, and then audio waveforms are decoded on the basis of these acoustic features. The
This work was supported in part by the Japan Science and Technology Agency (JST), Precursory Research for Embryonic Science and Technology (PRESTO) under Grant JPMJPR1657, in part by the JST, CREST under Grant JPMJCR19A3, and in part by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) under Grant 17H06101.
Y.-C. Wu and P. L. Tobing are with the Graduate School of Informatics, Nagoya University, Aichi, Japan (e-mail: {yichiao.wu, patrick.lumbantobing}@g.sp.m.is.nagoya-u.ac.jp).
T. Hayashi is with the Graduate School of Information Science, Nagoya University, Aichi, Japan (e-mail: [email protected]).
K. Kobayashi is with the Information Technology Center, Nagoya University, Aichi, Japan (e-mail: [email protected]).
T. Toda is with the Information Technology Center, Nagoya University, Aichi, Japan (e-mail: [email protected]).

analysis-synthesis (encoding-decoding) technique is called the vocoder [1], [2], which is often built on a source-filter [3] speech production model including source excitations and vocal tracts. However, because of the oversimplified assumptions of the speech generation mechanism, the lost temporal details and phase information lead to serious quality degradation in conventional vocoders such as STRAIGHT [4] and WORLD [5].

Owing to the recent development of deep learning, many neural audio generation models [6]–[17] have been proposed to generate raw audio waveforms without the various assumptions imposed on conventional vocoders. That is, advanced deep network architectures directly model the long-term dependence of high-temporal-resolution audio waveforms. In this paper, we focus on WaveNet (WN) [6], which is one of the state-of-the-art audio generation models and has been applied to a variety of applications such as music generation [18], text-to-speech (TTS) [19], [20], speech coding [21], speech enhancement [22], [23], and voice conversion (VC) [24]–[28].
The main core of WN is an autoregressive (AR) network modeling the probability distribution of each audio sample conditioned on auxiliary features and a specific number of previous samples called a receptive field. To handle the very long-term dependence of audio signals, a stacked dilated convolution network (DCNN) [29] structure is utilized to efficiently extend the receptive field. Furthermore, the WN vocoder [30]–[33], which conditions WN on the acoustic features extracted by conventional vocoders to recover the lost information, achieves significant speech quality improvements by replacing the synthesis process of traditional vocoders.

Although WN attains excellent performance in high-fidelity speech generation, its fixed architecture is inefficient, and the lack of prior audio-related knowledge limits the pitch controllability of the WN vocoder. Specifically, because of the quasi-periodicity of speech, each sample may have a specific dependent field related to its periodicity instead of a fixed receptive field that presumably includes many redundant previous samples. The requirement of a long receptive field for modeling speech dependency leads to a huge network and high demands for computation power. The data-driven architecture without prior speech knowledge only implicitly models the relationship between the periodicity of waveform signals and the auxiliary fundamental frequency (F0) features, and it may not generate speech with the precise pitch corresponding to the auxiliary F0 values, especially in an unseen F0 case. However, pitch controllability is an essential feature in the definition of a vocoder.

To address these problems, inspired by the source-filter model [3] and the code-excited linear prediction (CELP) codec [34], [35], we propose Quasi-Periodic WaveNet (QPNet) [36], [37] with a pitch-dependent dilated convolution neural network (PDCNN).
Specifically, the generation process of periodic signals can be modeled as the generation of a single pitch cycle signal (short-term correlation) followed by the extension of this single-cycle signal to form the whole periodic sequence on the basis of pitches (long-term correlation). As a result, we develop QPNet including two cascaded WNs with different DCNNs. A vanilla WN with fixed DCNNs is the first stage, which is used to model the relationship between the current sample and a specific segment of the nearest previous samples, and the second stage utilizes PDCNNs to link the correlations of the relevant segments in the current and previous cycles. The pitch-adaptive architecture allows each sample to have an exclusive receptive field length corresponding to the auxiliary F0 features and improves the pitch controllability by introducing the periodicity information into the network. The proposed QPNet with the improved pitch controllability is more in line with the definition of a vocoder. Furthermore, a more compact network size with acceptable quality similar to that of vanilla WN is feasible for QPNet because of the more efficient way the receptive field is extended, which is highly related to the modeling capability.

The paper is organized as follows. In Section II, we review recent neural speech generation models. In Section III, a brief introduction to WN is presented. In Section IV, we describe the concepts and details of QPNet. In Sections V and VI, we report objective and subjective experimental results to evaluate the effectiveness of QPNet for generating high-temporal-resolution periodic sinusoid signals and quasi-periodic speech, respectively. Finally, the conclusion is given in Section VII.

II. RELATED WORK
Recent mainstream speech generation techniques use AR models such as WN [6] and SampleRNN [7] to model the very long-term dependence of speech signals with high temporal resolution. In contrast to conditioning on linguistic and F0 features to generate speech like vanilla WN, taking an AR model as a vocoder is a more efficient way to train the AR model and make it generate the desired speech conditioned on handcrafted acoustic features. Many acoustic features have been applied to these AR vocoders, such as Mel-cepstral coefficients (mcep) with band aperiodicity (ap) and F0 features, extracted by WORLD [30]–[32] or STRAIGHT [38], and Mel-spectrograms with F0 features [33]. Furthermore, to achieve acceptable speech quality, the basic AR vocoders usually require a huge network for the long receptive field. However, although the speech quality of these basic AR vocoders is significantly higher than that of the traditional vocoders, the AR mechanism and the complicated network structure make it difficult for these AR vocoders to generate speech in real time [6], [7]. To tackle this issue, the authors of FFTNet [8] and WaveRNN [9] proposed more compact AR vocoders with specific network structures based on speech-related knowledge and efficient computation mechanisms. Moreover, AR models generating glottal excitation [39], [40] and linear predictive coding (LPC) residual [10] signals have been proposed to ease the burden of modeling speaker identity and spectral information. Because of the speaker-independent characteristic of these source signals, the requirements for the network capacity and speaker adaptation of these glottal vocoders and LPCNet are greatly reduced.

In addition, flow-based [41], [42] non-AR vocoders have been proposed for efficient parallel generation.
For example, Parallel WaveNet [11] and ClariNet [12] with inverse autoregressive flow (IAF) [43], and WaveGlow [13] and FloWaveNet [14] with Glow [44], model an invertible transformation between a simple probability distribution of noise signals and a target distribution of speech signals for generating waveforms from a known noise sequence.

Non-AR vocoders with mixed sine-based excitation inputs produced on the basis of F0 and Gaussian noise [16], [17], or with periodic sinusoid signals and aperiodic Gaussian noise inputs [15], have also been proposed to generate whole waveforms simultaneously while attaining pitch controllability via the manipulation of the periodic inputs. However, to synchronize the phases of the generated and ground truth waveforms during training, these models need a handcrafted design of the input signal or a GAN [45] structure, which increases the complexity of the models. Moreover, directly applying these models to related applications such as music generation is not straightforward because of the tailored architectures.

Instead of carefully designed inputs and specific networks, we propose a simple module, the PDCNN, which can be easily applied to any DCNN-based generative model to improve its audio signal modeling capability by introducing pitch information into the network. We applied PDCNNs to WN to develop a pitch-dependent adaptive network, QPNet [36], [37], for speech generation with arbitrary F0 values. In this paper, we further evaluate the periodicity modeling capability of QPNet with PDCNNs for nonspeech sinusoid signal generation and comprehensively explore the effectiveness of the QPNet model with different cascade orders, network structures, and adaptive dilation sizes.

III. WAVENET FOR SPEECH GENERATION
A. WaveNet
Because an audio waveform is a sequential signal with a strong long-term dependency, WN [6] is used to model audio signals in an AR manner that predicts the distribution of each waveform sample on the basis of its previous samples. The conditional probability function can be formulated as

P(\boldsymbol{x}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}),   (1)

where t is the sample index, x_t is the current audio sample, and r is a specific length of the previous samples called a receptive field. Instead of the general recurrent structure for AR modeling, WN applies stacked convolution neural networks (CNNs) with a dilated mechanism and a causal structure to model the very long-term dependence and causality of audio signals. Since the modeling capability of WN is highly related to the number of previous samples taken into consideration for predicting the current sample, the dilated mechanism improves the efficiency of extending the receptive field length. Moreover, a categorical distribution is applied to model the conditional probability, whereas audio signals are encoded into 8 bits by using the µ-law algorithm. The categorical distribution is flexible enough to model an arbitrary distribution of target speech.

Taken together, the data flow of WN is as follows: previous audio samples pass through a causal layer and several residual blocks with DCNNs, gated structures, and residual and skip connections. Specifically, the gated structure for enhancing the modeling capability of the network is formulated as

z^{(o)} = \tanh\left(V_{f,k} * z^{(i)}\right) \odot \sigma\left(V_{g,k} * z^{(i)}\right),   (2)

where z^{(i)} and z^{(o)} are the input and output feature maps of the gated structure, respectively, V is a trainable convolution filter, * is the convolution operator, \odot is an element-wise multiplication operator, \sigma is a sigmoid function, k is the layer index, and f and g denote the filter and the gate, respectively.
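The gated activation of Eq. (2) is straightforward to sketch. Below is a minimal NumPy illustration in which the dilated convolutions are simplified to 1×1 (plain matrix) products; the shapes and weight names are illustrative, not the paper's implementation.

```python
import numpy as np

def gated_activation(z_in, V_f, V_g):
    """Gated structure of Eq. (2): tanh(filter branch) * sigmoid(gate branch).
    The convolutions are reduced to 1x1 products here; z_in has shape
    (channels, time), V_f and V_g have shape (channels, channels)."""
    filter_branch = np.tanh(V_f @ z_in)
    gate_branch = 1.0 / (1.0 + np.exp(-(V_g @ z_in)))  # sigmoid
    return filter_branch * gate_branch  # element-wise product

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 10))    # 4 channels, 10 time steps
V_f = rng.standard_normal((4, 4))   # "filter" weights
V_g = rng.standard_normal((4, 4))   # "gate" weights
out = gated_activation(z, V_f, V_g)
print(out.shape)  # (4, 10)
```

Because tanh lies in (−1, 1) and the sigmoid in (0, 1), every output magnitude is below one, which lets the gate branch act as a soft, data-dependent attenuator of the filter branch.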
Finally, the summation of all skip connections is processed by two ReLU [46] activations with 1×1 convolutions and one softmax layer to output the predicted distribution of the current audio sample.

Furthermore, to guide the WN model to generate the desired contents, the vanilla WN is conditioned not only on previous samples but also on linguistic and F0 features. The conditional probability is modified as

P(\boldsymbol{x} \mid \boldsymbol{h}) = \prod_{t=1}^{T} P(x_t \mid x_{t-r}, \ldots, x_{t-1}, \boldsymbol{h}),   (3)

where \boldsymbol{h} is the vector of the auxiliary features (linguistic and F0 features), and the gated activation with auxiliary features becomes

z^{(o)} = \tanh\left(V^{(1)}_{f,k} * z^{(i)} + V^{(2)}_{f,k} * h'\right) \odot \sigma\left(V^{(1)}_{g,k} * z^{(i)} + V^{(2)}_{g,k} * h'\right),   (4)

where V^{(1)} and V^{(2)} are trainable convolution filters, and h' is the temporally extended auxiliary features, whose temporal resolution matches that of the speech samples.

B. WaveNet Vocoder
Many conventional vocoders [4], [5] are built on the basis of a source-filter architecture [3], which models the speech generation process as a spectral filter driven by a source excitation signal. However, oversimplified assumptions, such as analysis windows with a fixed length, time-invariant linear filters, and stationary Gaussian processing, make these vocoders lose some essential information of speech such as phase and temporal details, which causes marked quality degradation. To address this problem, the authors of [30], [31] proposed the WN vocoder, which conditions WN on the auxiliary acoustic features extracted by a conventional
Fig. 1. Quasi-Periodic WaveNet vocoder architecture.

vocoder to generate raw speech waveforms. That is, the WN vocoder replaces the synthesis part of conventional vocoders to synthesize high-fidelity speech on the basis of the prosodic and spectral acoustic features extracted by conventional vocoders. Furthermore, conditioning WN on the acoustic features greatly reduces the required amount of training data, which makes WN more tractable.
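Since the acoustic features are extracted at the frame level while WN operates at the sample level, they must be temporally extended to the waveform resolution (the h' in Eq. (4)). One common choice, sketched below with illustrative names and values, is nearest-neighbor repetition by the frame shift; the paper does not prescribe this exact scheme.

```python
import numpy as np

def upsample_features(h, hop_size):
    """Repeat each frame-level feature vector hop_size times so that the
    feature sequence matches the waveform sampling rate.
    h: (num_frames, feature_dim); hop_size: frame shift in samples."""
    return np.repeat(h, hop_size, axis=0)

h = np.array([[100.0], [110.0], [120.0]])   # e.g. frame-wise F0, 3 frames
h_prime = upsample_features(h, hop_size=4)  # pretend 4 samples per frame
print(h_prime.shape)  # (12, 1)
```

Each sample of the waveform then sees the feature vector of the frame it falls in, which is all Eq. (4) requires of h'.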
C. Problems in Using WaveNet as a Vocoder
As a vocoder, WN achieves high speech quality, but it lacks pitch controllability, which is an essential feature of conventional vocoders. Specifically, the WN vocoder has difficulty generating speech with precise pitch when conditioned on F0 values that are not observed in the F0 range of the training data [36]. Even when the F0 and spectral features are within the observed range, an unseen combination of the auxiliary features still markedly degrades the generation performance of the WN vocoder [24]–[28]. The possible reasons for this problem are that WN lacks prior speech knowledge and does not explicitly model the relationship between the auxiliary F0 feature and pitch. This defect makes the WN vocoder inconsistent with the definition of a vocoder. Moreover, since the fixed WN architecture assumes each sample has the same receptive field length, the inefficient receptive field extension may lead to the costly requirements of a huge network and a lot of computation power.

IV. QUASI-PERIODIC WAVENET

To improve the efficiency of extending the receptive field and the pitch controllability, QPNet introduces prior pitch information into WN by dynamically changing the network structure according to the auxiliary F0 features. Specifically, as shown in Fig. 1, the main differences between WN and QPNet are the pitch-dependent dilated convolution mechanism handling the periodicity of audio signals and the cascaded structure simultaneously modeling the long- and short-term correlations. The pitch filtering in CELP, which is the basis of the PDCNN, and the details of QPNet are described as follows.
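The cost of the fixed receptive field can be made concrete. For stacked causal dilated convolutions with kernel size two (the standard WaveNet setting, assumed here), the receptive field length is one plus the sum of all dilation sizes. The helper below is an illustrative sketch; the two configurations match the full-size WN and the 4-chunk stack used later in Section V.

```python
def receptive_field(chunks, layers_per_chunk, kernel_size=2):
    """Receptive field (in samples) of stacked causal dilated convolutions.
    Within each chunk the dilation doubles per layer, starting from 1."""
    dilations = [2 ** i for i in range(layers_per_chunk)] * chunks
    return 1 + (kernel_size - 1) * sum(dilations)

# Full-size WN in Sec. V: 3 chunks x 10 layers, dilations 1, 2, ..., 512.
print(receptive_field(3, 10))  # 3070
# 4 chunks x 4 layers, as in the Sec. V QPNet setting -> 61 samples.
print(receptive_field(4, 4))   # 61
```

At a 22,050 Hz sampling rate, 3070 samples span roughly 11 cycles of an 80 Hz tone but about 56 cycles of a 400 Hz tone, which is exactly the pitch-dependent redundancy that QPNet targets.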
Fig. 2. Code-excited linear prediction system.
Fig. 3. Fixed and pitch-dependent dilated convolution.
A. Pitch Filtering in CELP
Fig. 2 shows a flowchart of the CELP system [35], which includes an innovation signal codebook and two cascaded time-varying linear recursive filters. First, each innovation signal in the codebook is scaled and passed to the pitch filter (long delay) to generate the pitch periodicity of the speech, and then the linear-prediction filter (short delay) restores the spectral envelope to obtain the synthesized speech. Second, the mean-square errors between the original and synthesized speech signals are weighted by a linear filter to attenuate/amplify frequency components that are less/more perceptually important. Finally, the optimum innovation signal and the scale factor are determined by minimizing the weighted mean-square error. To be more specific, the pitch-filtering process can be formulated as

c^{(o)}_t = g \times c^{(i)}_t + b \times c^{(o)}_{t - t_d},   (5)

where c^{(i)} is the input, c^{(o)} is the output, t_d is the pitch delay, g is the gain, and b is the pitch filter coefficient. This periodic feedback structure handling the periodicity of signals is the basis of the proposed PDCNN, and the cascaded recursive structure modeling the hierarchical correlations is also applied to QPNet.

B. Pitch-dependent Dilated Convolution
The main idea of the PDCNN is that since audio signals have the quasi-periodic property, the network architecture can be dynamically adapted using the prior pitch information. Specifically, the dilated convolution can be formulated as

y^{(o)}_t = W^{(c)} * y^{(i)}_t + W^{(p)} * y^{(i)}_{t-d},   (6)

where y^{(i)} and y^{(o)} are the input and output of the DCNN layer, the trainable 1×1 convolution filters W^{(c)} and W^{(p)} are respectively for the current and previous samples, and * is the convolution operator. The dilation size d is constant for the vanilla DCNN but time-variant for the PDCNN.

Fig. 4. Effective receptive fields with different F0 values.

To extend the receptive field length, the vanilla WN utilizes stacked chunks including DCNN layers with different dilation sizes. Specifically, each chunk contains a specific number of DCNN layers, and each layer (except the first) doubles the dilation size of the previous one. The dilation sizes of the first layers of the chunks are set to one, so the dilation size in each chunk increases exponentially with base two. As shown in Fig. 3, the dilation sizes of the PDCNN layers in the stacked adaptive chunks of QPNet follow the same extension rule but are multiplied by an extra dilated factor to match the instantaneous pitch of the current sample. The pitch-dependent dilated factor E_t is derived from

E_t = F_s / (F_{0,t} \times a),   (7)

where F_s is the utterance-wise constant sampling rate, F_{0,t} is the fundamental frequency at speech sample index t, and a is a hyperparameter called the dense factor, which indicates the number of samples in one cycle taken into consideration when predicting the current sample. Specifically, the grid sampling locations of each DCNN are controlled by the dilation size d, and the dilation size d' of each PDCNN is controlled by the dilated factor E_t as

d' = E_t \times d.   (8)

By setting the F0 values and the dense factor a, the network can control the sparsity of the CNN sampling grids to attain the desired effective receptive field length. As shown in Fig. 4, since the sinusoids in Figs. 4 (a) and (b) have the same dense factors and sampling rates, even though their frequencies are different, the numbers of cycles in their effective receptive fields are still the same. The difference is the temporal sparsity of the effective receptive field.
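Eqs. (7) and (8) can be sketched as follows. The integer rounding of E_t and the default dense factor are illustrative choices made for the sketch; the text does not fix them here.

```python
def pitch_dependent_dilation(d, f0, fs=22050, dense_factor=4):
    """Pitch-dependent dilation d' = E_t * d of Eq. (8), with the dilated
    factor E_t = fs / (f0 * a) of Eq. (7) rounded to the nearest integer
    (the rounding scheme is an assumption of this sketch)."""
    e_t = round(fs / (f0 * dense_factor))
    return e_t * d

# At F0 = 220.5 Hz and fs = 22050 Hz, one cycle spans 100 samples; with a
# dense factor of 4, E_t = 25, so a base dilation of 2 becomes 50.
print(pitch_dependent_dilation(2, f0=220.5))  # 50
```

Lowering F0 enlarges E_t, so the same layer reaches further into the past for low-pitched samples, producing the pitch-variant sparsity described above.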
That is, fixing the number of sampling grids in each cycle by the dense factor and changing the gaps between the grid sampling locations by the instantaneous F0 values lead to pitch-dependent and time-variant effective receptive field lengths.

In summary, the dilated factor E_t is the expansion ratio of the effective receptive field length to the receptive field length, and the ratio of the receptive field length to the dense factor a is the number of past cycles in the effective receptive field. With the pitch-dependent structure, each sample has an exclusive effective receptive field length, which is efficiently extended according to the auxiliary F0 values. In addition, since speech has voiced and unvoiced segments, we have tried setting E_t for the unvoiced segments to either one or the value calculated by interpolating the F0 values of the adjacent voiced segments, and the results in Section VI show that QPNet with the continuous E_t from interpolated F0 values achieves higher speech quality.

TABLE I
ARCHITECTURE OF SINUSOIDAL GENERATIVE MODELS

                      WNf    WNc    (r)QPNet    pQPNet
Fixed chunk            3      4        3           -
Fixed block           10      4        4           -
Adaptive chunk         -      -        1           4
Adaptive block         -      -        4           4
CNN channels          128 (causal and dilated CNNs)
                      128 (CNNs in residual blocks)
                       64 (CNNs in output layers)
Size (×10^6)          2.4    1.5      1.5         1.5

C. Cascaded Autoregressive Network
Most audio signals are sequential and quasi-periodic, so audio generative models usually simultaneously model the long-term (periodicity) and short-term (aperiodicity) correlations of audio samples. As shown in Fig. 1, the proposed QPNet utilizes a cascaded architecture that contains a fixed and an adaptive (pitch-dependent) macroblock. The fixed macroblock models the sequential relationship between the current sample and a segment of the most recent samples. The adaptive macroblock models the periodic correlations of the current and related past segments in the successive cycles. Specifically, the fixed macroblock (macroblock 0 in Fig. 1) of QPNet is composed of several fixed chunks. Each fixed chunk consists of several stacked residual blocks with DCNNs (fixed blocks), conditional auxiliary features, gated activations, and residual and skip connections, similarly to the vanilla WN. The adaptive macroblock (macroblock 1 in Fig. 1) also contains several adaptive chunks, which have similar stacked residual blocks but with PDCNNs (adaptive blocks). In summary, the cascaded structure of QPNet presumably mimics the generative procedure of CELP for quasi-periodic audio signal generation.

V. PERIODIC SIGNAL GENERATION EVALUATION
To evaluate the pitch controllability of the proposed QPNet with PDCNNs, we first evaluated the generation quality of simple periodic but high-temporal-resolution signals. That is, the training data of QPNet were sine waves within a specific frequency range and the corresponding F0 values. In the test phase, QPNet was conditioned on an F0 value and a small piece of the related sine wave for the initial receptive field to generate sinusoid waveforms.

A. Model Architecture
In this section, to evaluate the effectiveness of the PDCNN, we compared three types of QPNet with two types of WN in terms of sine wave generation. Specifically, in addition to the basic QPNet, because a sinusoid is a simple periodic signal that can be modeled well by a pitch-dependent structure, a QPNet model with only adaptive residual blocks (pQPNet) was taken into account. A QPNet model with the reverse order of the fixed and adaptive macroblocks (rQPNet) was also considered. Moreover, a compact-size WN (WNc) and a full-size WN (WNf) were evaluated as references.

The details of the network architectures are shown in Table I. Since the numbers of CNN channels were the same for all models, the model sizes were proportional to the numbers of chunks and residual blocks. For instance, WNf contained 3 chunks and each chunk included 10 residual blocks, so the model size of WNf was larger than that of WNc, which only had 4 chunks with 4 residual blocks in each chunk. The learning rate was fixed (no decay), the minibatch size was one, the batch length was 22,050 samples, the number of training epochs was two, and the optimizer was Adam [47] for all models.

B. Evaluation Setting
Because the pitch range of most speech is around 80–400 Hz, the training sine waves were set to be in the same range with a step size of 20 Hz (i.e., 80, 100, 120, ..., 400 Hz). Each model had a related one-dimensional F0 value as its auxiliary feature. Since single-tone generation was evaluated, the auxiliary features of all samples in one utterance were the same. To prevent the networks from suboptimal training and a lack of generality for sinusoid generation with unseen F0 values, both the sinusoid and auxiliary signals were mixed with white noise. The signal-to-noise ratio (SNR) of the sine waves was around 20 dB, and the noise of the auxiliary feature was a random sequence between −1 and 1. Random initial phases were also applied to the sinusoid signals. The number of training utterances was 4000, and each utterance was one second long. The ground truths were clean sinusoid signals, so each model was trained as a denoising network. The test data included 20 different F0 values, namely 10–80 Hz with a step size of 10 Hz, 100–400 Hz with a step size of 100 Hz, and 450–800 Hz with a step size of 50 Hz, and each F0 value contained 10 test utterances with different phase shifts. Both training and test data were encoded into 8 bits using the µ-law, and the sampling rate was 22,050 Hz.

In the test stage, the initial receptive field of each network was fed with the noisy test sine wave, and the length of the generated sinusoid was set to 1 s. The quality of each generated waveform was evaluated on the basis of the SNR and the root-mean-square error (RMSE) of the log F0 value measured from the peak of the power spectral density (PSD). Moreover, the test data were divided into 10–40 Hz (under F0L/2), 50–80 Hz
TABLE IISNR (dB) OF S INUSOID G ENERATION WITH D IFFERENT D ENSE F ACTORS
Dense a Under / L / L / U / U -8.1 -0.8 -0.3 2.7 TABLE IIIL OG F RMSE OF S INUSOID G ENERATION WITH D IFFERENT D ENSE F ACTORS
Dense a Under / L / L / U / U (above / L ), 100–400 Hz (inside), 450600 Hz (under / U ),and 650800 (above / U ) subsets. L is the lower bound and U is the upper bound of the inside F range, which was the F range of the training data. As a result, the under / L andabove / L F ranges are the lower outside F range, and theunder / U and above / U F ranges are the higher outside F range. C. Dense Factor
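As background for this exploration, the dense factor a scales the pitch-dependent dilated factor of the PDCNN layers, E_t = F_s / (F_0 × a), as introduced in Section III. The following sketch illustrates the relation; the function name and the policy of rounding E_t to the nearest integer (floored at 1) are our assumptions:

```python
FS = 22050  # sampling rate (Hz), as in the experiments

def pitch_dependent_dilation(f0, dense_factor, layer_index):
    """Dilation of the i-th PDCNN layer: d = E_t * 2**i, where
    E_t = Fs / (f0 * dense_factor) is the pitch-dependent dilated
    factor. Rounding E_t to the nearest integer (>= 1) is an
    assumption of this sketch."""
    e_t = max(1, round(FS / (f0 * dense_factor)))
    return e_t * 2 ** layer_index

# With dense factor a = 64, E_t collapses to 1 once F0 exceeds
# Fs / 64 (about 345 Hz), i.e., the PDCNN degenerates to a plain DCNN.
print(pitch_dependent_dilation(400.0, 64, 0))  # -> 1
print(pitch_dependent_dilation(100.0, 4, 2))   # -> 220 (E_t = 55)
```

Because E_t shrinks as F0 grows, a large dense factor leaves no pitch adaptivity at high F0, which is the degeneration discussed below.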
To explore an efficient dense factor value for the PDCNNs, the sinusoid generative quality of pQPNet models with different dense factors was evaluated. Since the chunk and block numbers of pQPNet were both set to four, the length of its receptive field was 61 samples. That is, the receptive field covered from 61 past cycles down to less than one cycle as the dense factor increased from 1 to 64. Moreover, in contrast to covering a fixed number of past cycles for sinusoids of arbitrary pitch, the receptive field of WNf contained 11 past cycles for 80 Hz sinusoids and 56 past cycles for 400 Hz sinusoids when the sampling rate was 22,050 Hz.

The number of training epochs was two for most dense factors, but for the smallest dense factors pQPNet required at least 10 training epochs to attain stable results. As shown in Tables II and III, the network with a dense factor of 1 was very unstable even when trained with 10 epochs. The results indicate that although a small dense factor gives the network a long effective receptive field, the overbrief information about each past cycle might make it difficult to model signals well. For the inside and lower outside F0 ranges, the networks with all but the smallest dense factors achieved high SNR values. However, the performance of the network with a dense factor of 64 markedly degraded when the auxiliary F0 values were in the higher outside F0 range. The possible reason is that the PDCNNs of this network degenerated to DCNNs because E_t became one when the dense factor was 64 and the F0 values were higher than 350 Hz. Moreover, the log F0 RMSE results show a tendency similar to the SNR results: the networks with the extreme dense factors achieved the lowest pitch accuracies, while those with intermediate dense factors achieved the highest.

In conclusion, the PDCNN with an appropriate dense factor was found to be robust against conditions in the outside F0 range, especially the lower outside F0 range. For the higher outside F0 range, the networks still had acceptable quality until the F0 value exceeded 600 Hz. Therefore, we set the dense factor to 4 for the models in the following evaluations because of the balance between the generative performance and the number of past cycles covered in the receptive field.

TABLE IV: SNR (dB) of Sinusoid Generation with Different Models

          WNc    WNf   pQPNet  QPNet  rQPNet
Average   3.7    19.8   19.4    7.2    17.0

TABLE V: Log F0 RMSE of Sinusoid Generation with Different Models

D. Network Comparison
As shown in Tables IV and V, the PDCNNs significantly improved pitch controllability. The PDCNNs made the QP-series networks achieve much higher SNR and lower log F0 RMSE values than the same-size WNc network in both the higher and lower outside F0 ranges, which shows the effectiveness of the PDCNNs in extending the effective receptive field length. Although the full-size WNf attained SNRs similar to those of pQPNet, the log F0 RMSE of WNf was much higher in the outside F0 ranges. This indicates that WNf tended to generate signals in the inside F0 range instead of being consistent with the auxiliary F0 feature, so the generated waveform of WNf might still be a perfect sinusoid but with an incorrect pitch. The results also imply that the PDCNNs improved the periodicity-modeling capability by using prior pitch knowledge.

Fig. 5. Waveform and PSD of a 500 Hz sinusoid generated by pQPNets with different dense factors ((a), (b)), ((c), (d)), and ((e), (f)).

In addition, because of the simple periodic signal generation scenario, pQPNet, with the longest effective receptive field and the pure PDCNN structure, attained the best generative performance among all QP-series networks. QPNet and rQPNet showed some quality degradation when the auxiliary F0 values were far from the inside F0 range, but they still outperformed WNc in both measurements and WNf in terms of log F0 RMSE.
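The two objective measures reported in Tables II–V can be sketched as below. This is a minimal illustration with our own function names; the paper specifies only that F0 is measured from the PSD peak, so the periodogram estimator here is an assumption:

```python
import numpy as np

FS = 22050  # sampling rate (Hz), as in the experiments

def psd_peak_f0(x, fs=FS):
    """Estimate F0 as the frequency of the periodogram (PSD) peak."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(psd)]

def snr_db(reference, generated):
    """SNR (dB) of a generated waveform against the clean reference."""
    noise = generated - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_f0_rmse(f0_true, f0_measured):
    """RMSE between log-scale F0 values."""
    err = np.log(np.asarray(f0_measured, dtype=float)) - np.log(np.asarray(f0_true, dtype=float))
    return float(np.sqrt(np.mean(err ** 2)))

# A 1 s clean 200 Hz sinusoid recovers its own F0 from the PSD peak
# (1 Hz bin resolution for a 1 s signal).
t = np.arange(FS) / FS
clean = np.sin(2 * np.pi * 200.0 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(FS)  # roughly 17 dB SNR
print(round(psd_peak_f0(clean), 1))  # -> 200.0
print(round(snr_db(clean, noisy), 1))
```

A clean sinusoid with a shifted PSD peak thus yields a high SNR but a high log F0 RMSE, which is exactly the WNf failure mode described above.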
E. Discussion
In this section, several sinusoid generation examples are presented to look into the physical phenomena behind the objective results. As shown in Figs. 5 (a) and (b), the pQPNet with a well-chosen dense factor generated clear sine waves with an SNR of 23.7 dB when conditioned on an outside auxiliary value of 500 Hz (under 3/2 F_U). The PSD of this generated signal has a peak at 502 Hz, which is very close to the ground truth, and the log F0 error is less than 0.01. However, the results in Figs. 5 (c) and (d) show that the sine wave generated by the pQPNet with a second dense factor setting includes much harmonic noise, which results in a low SNR. Even though the generated sine wave still resembles a periodic signal, the wrong peak value, taken from the second harmonic component of the PSD, also causes a high log F0 error. Moreover, the results in Figs. 5 (e) and (f) show that the pQPNet with a third dense factor setting generated a very noisy signal, which has a very low SNR and a wrong peak value.

In addition, as shown in Figs. 6 (a) and (b), the pQPNet still generated a clear sine wave with an SNR of 23.3 dB and a correct peak value of its PSD when conditioned on an outside 20 Hz (under 1/2 F_L) auxiliary value. However, the same-size WNc could not generate any meaningful signal, and the SNR of its generated signal is very low, as shown in Figs. 6 (c) and (d). By contrast, WNf still generated a clear sine wave with an SNR of 33 dB, but its frequency is incorrect, as shown in Figs. 6 (e) and (f).

Fig. 6. Waveform and PSD of a 20 Hz sinusoid generated by pQPNet ((a), (b)), WNc ((c), (d)), and WNf ((e), (f)).
Specifically, the PSD peak value is 120 Hz, which implies that WNf tends to generate seen signals even when conditioned on an unseen auxiliary feature.

In conclusion, the SNR reflects the noisiness of the generated signals, indicating whether they are clean sinusoids or not. Since this was a single-tone sinusoid generation test, a high log F0 RMSE implies that the generated signal includes much harmonic noise or that its frequency is incorrect. As a result, a generated signal with a high SNR and a high RMSE is a clean sinusoid with an inaccurate frequency, like the signal shown in Fig. 6 (e). A generated signal with a low SNR and a high RMSE is a noisy sinusoid with much harmonic noise, like the signal shown in Fig. 5 (c). A generated signal with a very low SNR is a noise-like signal, as shown in Figs. 5 (e) and 6 (c).

TABLE VI: Architecture of Speech Generative Models

                                      WNf   WNc   (r)QPNet   Full-size (r)QPNet
Fixed chunk                            3     4       3             3
Fixed block                           10     4       4            10
Adaptive chunk                         -     -       1             1
Adaptive block                         -     -       4             4
CNN channel (causal and dilated CNN)        512
CNN channel (CNN in residual block)         512
CNN channel (CNN in output layer)           256
Size (×10^6)                          44    24      24            50

VI. SPEECH GENERATION EVALUATIONS
In this section, we evaluate the effectiveness of the PDCNNs for speech generation. The appropriate proportion of adaptive to fixed residual blocks, the continuous pitch-dependent dilated factor, and the order of the macroblocks are explored.
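The continuous and discrete variants of the pitch-dependent dilated factor E_t compared in this section differ only in how unvoiced frames (F0 = 0) are handled. A minimal sketch, assuming linear interpolation through unvoiced regions, nearest-integer rounding, and a dense factor of 4 (the interpolation and rounding details are our assumptions, not the released code):

```python
import numpy as np

FS = 22050          # sampling rate (Hz)
DENSE_FACTOR = 4    # assumed dense factor for the speech models

def dilated_factors(f0, continuous=True):
    """Frame-level pitch-dependent dilated factors E_t.
    Unvoiced frames have f0 == 0. The discrete variant sets E_t = 1
    there; the continuous variant first interpolates F0 linearly
    through the unvoiced regions."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if continuous and voiced.any():
        idx = np.arange(len(f0))
        f0 = np.interp(idx, idx[voiced], f0[voiced])  # fill unvoiced gaps
    e_t = np.ones_like(f0)
    e_t[f0 > 0] = np.maximum(1, np.round(FS / (f0[f0 > 0] * DENSE_FACTOR)))
    return e_t

# A toy F0 track with unvoiced frames at both ends and in the middle.
f0_track = [0, 0, 150, 160, 0, 0, 170, 0]
print(dilated_factors(f0_track, continuous=False))
print(dilated_factors(f0_track, continuous=True))
```

The discrete variant jumps between E_t = 1 and large voiced values at every voicing boundary, whereas the continuous variant changes smoothly, which is the property credited below for its better MCDs.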
A. Model Architecture
The quality of speech generation was evaluated on the basis of 11 vocoders covering three types: QPNet, WN, and WORLD. Specifically, to explore the efficient receptive field extension by the PDCNNs, the compact-size QPNet vocoders were compared with the same-size WNc and double-size WNf vocoders. Furthermore, the evaluations included several variants of QPNet, such as models with different types of pitch-dependent dilated factor E_t and different orders of the fixed and adaptive macroblocks. Specifically, the QPNet and rQPNet vocoders with continuous and discrete E_t sequences were evaluated. For the unvoiced frames, the discrete E_t sequence was set to ones, and the continuous E_t sequence was calculated using interpolated F0 values as mentioned in Section IV. In addition, the full-size QPNet and rQPNet vocoders, which were full-size WN vocoders cascaded with four extra adaptive residual blocks, were also taken into consideration to explore the effect of the ratio of adaptive to fixed residual blocks. The network architectures and model sizes are shown in Table VI. The learning rate was 1 × 10^-4 without decay, the minibatch size was one, the batch length was 20,000 samples, and the optimizer was Adam [47] for all models. Since even the compact-size WNc had tens of millions of parameters, which was the same order of magnitude as that of WNf, the training iterations were empirically set to 200,000 for all models. Note that we did not evaluate speech generation using the pQPNet model because it failed to model the short-term correlations of speech according to our internal experiments.

B. Evaluation Setting
All models were trained in a multispeaker manner. The training corpus of these multispeaker NN-based vocoders consisted of the training sets of the "bdl" and "slt" speakers of CMU-ARCTIC [48] and all speakers of VCC2018 [49]. The total number of training utterances was around 3000, and the total training data length was around four hours. The evaluation corpus was composed of the SPOKE set of VCC2018, which included two female and two male speakers, each with 35 test utterances. All speech data were set to a sampling rate of 22,050 Hz and a 16-bit resolution. The waveform signals for the categorical output of the NN-based vocoders were further encoded into 8 bits using the µ-law. The 513-dimensional spectral (sp) and aperiodicity (ap) features and the one-dimensional F0 feature were extracted using WORLD. The sp feature was further parameterized into 34-dimensional mel-cepstral coefficients (mcep), ap was coded into two-dimensional components, and F0 was converted into a continuous F0 and a voiced/unvoiced (U/V) binary code for the auxiliary features [30]. The F0 range of the SPOKE set was around 40–330 Hz, and the F0 mean was around 150 Hz. The unseen outside auxiliary features were simulated by replacing the original F0 values of the acoustic features with scaled F0 values, and the scaling ratios were 1/2, 3/4, 5/4, 3/2, and 2. A demo and an open-source QPNet implementation can be found in [50].

TABLE VII: Comparison of QPNet with Different Dense Factors
MCD (dB): 4.05 / ...
F0 RMSE:  0.23 / 0.17 / 0.15
U/V (%):  21.8 / 16.0 / 14.2

TABLE VIII: Comparison of Effective Receptive Field Length (samples) for Different Dense Factors

C. Objective Evaluation
For the objective evaluations, the ground truth acoustic features were extracted from natural speech utterances using WORLD, and the extraction error of WORLD was neglected. A speaker-dependent F0 range was applied to the feature extraction of each speaker to improve the extraction accuracy, and the F0 range was set following the process in [51]. Since WORLD was developed to extract F0-independent spectral features [5], the WORLD-extracted sp feature was assumed to be independent of the F0 feature in this paper. Therefore, the ground truth acoustic features for the scaled-F0 scenarios were the same natural spectral features with the F0 feature scaled by the assigned ratio. The auxiliary features of the evaluated vocoders were the ground truth acoustic features.

Mel-cepstral distortion (MCD) was applied to measure the spectral reconstruction capability of the vocoders, and the MCD was calculated between the auxiliary mcep and the mcep extracted by WORLD from the generated speech. The pitch accuracy of the generated speech was evaluated using the RMSE between the auxiliary F0 and the F0 extracted by WORLD from the generated speech in the logarithmic domain. The unvoiced/voiced (U/V) decision error, i.e., the percentage of frames in each utterance whose unvoiced/voiced decisions differ, was also taken into account in the evaluation of the prosodic prediction capability.

TABLE IX: MCD (dB) with Frame-based 95% Confidence Interval (CI) of Different Generation Models for Speech Generation
(models: WORLD, WNc, WNf, QPNet, full-size QPNet, rQPNet, and full-size rQPNet, the (r)QPNet variants with continuous or discrete E_t; conditions: 1, 1/2, 3/4, 5/4, 3/2, and 2 × F0, and the average)

Since speech generation is more complicated than sine wave generation, we first conducted an objective evaluation of QPNet models with different dense factors for speech generation to check the consistency of the efficient dense factor value. As shown in Table VII, the tendency of the objective evaluation is similar to the results of the sinusoid generation evaluation. That is, the QPNets with the smaller dense factors achieved similar generative performance, while the speech quality and pitch accuracy of the QPNets with the two largest dense factors markedly degraded because of their much shorter effective receptive field lengths. Specifically, as shown in Table VIII, the average effective receptive field lengths of the QPNets with the largest dense factors are much shorter than the others, and the lengths were too short to cover even one cycle of a 150 Hz signal, which was the F0 mean of the SPOKE set. Furthermore, although the QPNet with a dense factor of 1 had the longest average effective receptive field length and achieved an acceptable MCD, its higher log F0 RMSE and U/V error indicate its instability, which was also observed in the sinusoid generation evaluation. In conclusion, the dense factor of the following QPNet-series models was set to 4 because of the lowest log F0 RMSE and U/V error with an acceptable MCD.
The internal subjective evaluation results also show a preference for the utterances generated by the QPNet with the selected dense factor.

As shown in Table IX, in terms of spectral prediction capability, the compact-size (r)QPNet vocoders with the proposed PDCNNs significantly outperformed the same-size WNc vocoder. The results confirm the effectiveness of the QP structure, which skips redundant samples using prior pitch knowledge for a more efficient receptive field extension. However, the MCDs of the double-size WNf vocoder are lower than those of the compact-size (r)QPNet vocoders, and the full-size (r)QPNet vocoders with the largest network size also outperformed the WNf vocoder in terms of MCD. The results indicate that the MCD values are highly related to the network sizes, so a deeper network attains a more powerful spectral modeling capability. Furthermore, the systems with continuous pitch-dependent dilated factors achieved better MCDs than those with discrete ones, and this result is consistent with our internal subjective evaluation of speech quality. However, the MCD differences between the rQPNet and QPNet vocoders were not reflected in the perceptual quality, and they had similar speech quality according to the internal evaluation.

The log F0 RMSE results in Table X also show that both the compact-size QPNet and rQPNet vocoders attained markedly higher pitch accuracy than the same-size WNc vocoder, particularly when conditioned on unseen F0 with a large shift. The compact-size QPNet vocoder even achieved higher pitch accuracy than the WNf vocoder. The results indicate that the PDCNNs with prior pitch knowledge improved the pitch controllability of these vocoders against unseen F0. However, the pitch accuracies of the full-size QPNet and rQPNet vocoders are lower than those of the compact-size (r)QPNet vocoders. The possible reason is that the unbalanced proportion of adaptive and fixed residual blocks impaired the pitch controllability.
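The MCD measure used above can be written out in a few lines. This is a simplified sketch, not the authors' exact tooling; excluding the 0th (energy) coefficient is a common convention and an assumption here:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_gen):
    """Average frame-wise mel-cepstral distortion (dB):
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    summed over cepstral dimensions d >= 1 and averaged over frames."""
    diff = np.asarray(mcep_ref, dtype=float)[:, 1:] - np.asarray(mcep_gen, dtype=float)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical mcep sequences give 0 dB; a unit offset on one coefficient
# gives (10/ln 10) * sqrt(2), about 6.14 dB.
ref = np.zeros((10, 35))   # 10 frames, energy + 34 mcep dimensions
gen = ref.copy()
gen[:, 1] += 1.0
print(mel_cepstral_distortion(ref, ref))            # -> 0.0
print(round(mel_cepstral_distortion(ref, gen), 3))  # -> 6.142
```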
That is, for the full-size (r)QPNet vocoders, the number of fixed blocks is markedly larger than the number of adaptive blocks. Therefore, the network might be dominated by the fixed blocks, which degrades the influence of the adaptive blocks. Specifically, for the compact-size (r)QPNet vocoders, the receptive field length of the fixed blocks is 46 samples (the details of the receptive field lengths can be found in the Discussion), and the average effective receptive field length of the adaptive blocks is 384 samples, as shown in Table VIII. However, for the full-size (r)QPNet vocoders, the receptive field length of the fixed blocks is 3070 samples, which is much longer than the 384 samples of the extra four adaptive blocks. Therefore, the influence of the adaptive blocks might be very limited.

TABLE X: Log F0 RMSE with Utterance-based 95% CI of Different Generation Models for Speech Generation

TABLE XI: U/V Decision Error Rate (%) with Utterance-based 95% CI of Different Generation Models for Speech Generation

As shown in Table XI, the compact-size QPNet vocoder attained the lowest U/V decision error among all NN-based vocoders, which indicates a higher capability to capture
U/V information. In conclusion, the compact-size QPNet vocoder with the proposed PDCNNs and continuous pitch-dependent dilated factors attained the highest accuracy of pitch and U/V information among the evaluated NN-based vocoders. Although the compact-size QPNet vocoder did not achieve the same spectral prediction capability as the WNf vocoder according to the MCD results, it is difficult to measure a perceptual quality difference only on the basis of MCD. As a result, we subjectively evaluated the compact-size QPNet (with continuous pitch-dependent dilated factors), WNc, and WNf vocoders in the next section. Moreover, although the WORLD vocoder had the best objective evaluation results, WORLD-generated speech usually lacks naturalness and contains buzzy noise, which may not be reflected in the objective measurements. Therefore, we also included the WORLD vocoder in our subjective evaluations.
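The U/V decision error reported in Table XI is simply the frame-level disagreement rate between the reference and generated voicing decisions; as a sketch (function name is ours):

```python
import numpy as np

def uv_decision_error(uv_ref, uv_gen):
    """U/V decision error: percentage of frames whose voiced/unvoiced
    decisions differ between reference and generated speech."""
    uv_ref = np.asarray(uv_ref, dtype=bool)
    uv_gen = np.asarray(uv_gen, dtype=bool)
    return 100.0 * float(np.mean(uv_ref != uv_gen))

# Two of five frames disagree -> 40% error.
print(uv_decision_error([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # -> 40.0
```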
D. Subjective Evaluation
The subjective evaluations included a Mean Opinion Score (MOS) test for speech quality and an ABX preference test for perceptual pitch accuracy. Specifically, in the MOS test, the naturalness of each utterance in the evaluation set was evaluated by several listeners, who assigned a score of 1–5 to each utterance; the higher the score, the greater the naturalness. The MOS evaluation set was composed of randomly selected utterances generated by the WORLD, WNf, WNc, and QPNet vocoders using the auxiliary features with 1/2 F0, 3/2 F0, and unchanged F0. The compact-size QPNet vocoder with continuous dilated factors was adopted and is abbreviated as QPNet in the subjective evaluations. We randomly selected 20 utterances from the 35 test utterances of each condition and each speaker to form the MOS evaluation set, so the number of utterances in the set was 960. The MOS evaluation set was divided into five subsets, and each subset was evaluated by two listeners, so the total number of listeners was 10. All listeners took the test using the same devices in the same quiet room. Although the listeners were not native speakers, they had worked on speech or audio generation research.

In the ABX preference test, the listeners compared two test utterances (A and B) with one reference utterance (X) to evaluate which test utterance had a pitch contour more consistent with that of the reference. Because natural speech with the desired scaled F0 does not actually exist, and conventional vocoders usually have high pitch accuracy, we took the WORLD-generated speech as the reference. The ABX evaluation set consisted of the same generated utterances of the WNf, QPNet, and WORLD vocoders as in the MOS evaluation set. The number of ABX utterance pairs was 240, and each pair was evaluated by two of the same 10 listeners as in the MOS test.

Fig. 7. Sound quality MOS evaluation of female speakers with 95% CI (conditions: unchanged F0, 1/2 F0, and 3/2 F0; vocoders: WORLD, WNf, WNc, and QPNet).

As shown in Fig. 7, for the female speaker set, the QPNet vocoder significantly outperformed the same-size WNc vocoder in all cases. Although the QPNet vocoder achieved slightly lower naturalness than the WNf vocoder in the unchanged F0 (inside) case, it still attained markedly better naturalness than the WNf vocoder in the 1/2 F0 (outside) case. The results indicate that halving the network size markedly degraded the speech modeling capability of the WN vocoder, whereas the proposed PDCNNs significantly compensated for this degradation, especially in the 1/2 F0 case, which gives QPNet a long effective receptive field length.
On the other hand, owing to the small dilated factors caused by the high F0 values, many of the PDCNNs might degenerate to DCNNs in the 3/2 F0 case. Specifically, when the dilated factors are less than or equal to one because of the high F0 values, the dilation sizes of the PDCNNs are less than or equal to those of the corresponding DCNNs. As a result, when conditioned on the auxiliary features with 3/2 F0, although the QPNet vocoder still outperformed the WNc vocoder, the speech quality of the WNf and WORLD vocoders was higher than that of the QPNet vocoder.
Fig. 8. Sound quality MOS evaluation of male speakers with 95% CI (conditions: unchanged F0, 1/2 F0, and 3/2 F0; vocoders: WORLD, WNf, WNc, and QPNet).
In addition, as shown by the results of the male speaker set in Fig. 8, the QPNet vocoder achieved naturalness comparable to that of the WNf vocoder in all F0 cases, which is significantly better than that of the WNc vocoder. Specifically, most of the 3/2 F0 values of the male speakers are still within the range of normal female F0, so the effective receptive field lengths of the QPNet vocoder are apparently longer than the receptive field length of the WNc vocoder in all the subjective evaluations of the male speaker set. On the other hand, the WORLD vocoder shows almost the same tendency in the evaluations of both the female and male speaker sets. That is, it shows lower naturalness than the WNf vocoder in the unchanged F0 case and much lower speech quality than both the WNf and QPNet vocoders in the 1/2 F0 case, whereas its naturalness only slightly degrades in the 3/2 F0 case.

As shown in Figs. 9 and 10, the QPNet vocoder significantly outperformed the WNf vocoder in terms of pitch accuracy in all F0 cases for both the female and male sets, except in the unchanged F0 case of the female set, which may be caused by the naturalness degradation. The results confirm the pitch controllability improvement of the QPNet vocoder with the PDCNNs. In summary, the QPNet vocoder with a more compact network size achieved speech quality comparable to that of the WNf vocoder under most conditions except for the female set with 3/2 F0, because the higher F0 values might make the PDCNNs degenerate to DCNNs. The QPNet vocoder conditioned on unseen F0 also attains markedly higher pitch accuracy than the WNf vocoder. Moreover, the QPNet vocoder achieved higher or comparable speech quality compared with the WORLD vocoder under most conditions except when conditioned on the unseen 3/2 female F0.

E. Discussion
As shown in Fig. 11, the length of the receptive field of WNf is 3070 samples (the receptive field length of the 10 blocks in each chunk is 1 + 2 + 4 + ··· + 512 = 1023, so the total length is 3 × 1023 with an extra one sample from the causal layer), that of WNc is 61 samples (each chunk contributes 1 + 2 + 4 + 8 = 15, so the total receptive field length is 4 × 15 + 1), and that of QPNet is on the order of hundreds of samples (the receptive field length of the fixed blocks and the causal layer is 3 × 15 + 1 = 46, and that of the adaptive blocks is 15 × E_t, where E_t is the pitch-dependent dilated factor given the selected dense factor). The receptive field lengths of WNf and WNc are constant because of their fixed network structures, whereas the receptive field length of QPNet is time-variant and pitch-dependent because of the QP structure.

Fig. 9. Pitch accuracy ABX evaluation of female speakers with 95% CI (preference scores for unchanged, 1/2, and 3/2 F0: WNf 59%, 20%, and 16%; QPNet 41%, 80%, and 84%).

Fig. 10. Pitch accuracy ABX evaluation of male speakers with 95% CI (preference scores for unchanged, 1/2, and 3/2 F0: WNf 32%, 29%, and 39%; QPNet 68%, 71%, and 61%).

Fig. 11 also shows the effective receptive field length distributions of the female and male speakers of the SPOKE set. We find that the effective receptive field lengths of both the male and female speakers of the SPOKE set are apparently longer than the receptive field length of WNc, which concurs with the evaluation results showing that QPNet significantly outperforms WNc. Furthermore, most of the effective receptive field lengths of the female set are shorter than those of the male set, which is caused by the higher F0 values of the female speakers. The distribution results also imply that the effective receptive field length of QPNet is close to the receptive field length of WNc when conditioned on the female 3/2 F0 because most PDCNNs degenerate to DCNNs. In conclusion, the performance of AR models is highly related to the length of the receptive field.

However, the length of the receptive field may be more strongly correlated with the quality of the generated speech, whereas a balanced proportion of the adaptive and fixed modules may be an essential factor for pitch accuracy. Specifically, although the full-size QPNet has the longest effective receptive field length and achieves the lowest MCD, its pitch accuracy is still lower than that of the compact-size QPNet.

Fig. 11. Distributions of receptive field lengths of different vocoders (x-axis: receptive field length; y-axis: probability; curves: WNc, WNf, QPNet-M, and QPNet-F).
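The receptive-field arithmetic quoted in the parentheses above can be reproduced directly (a sketch; the example E_t value of 26 is an illustrative assumption, not a number from the paper):

```python
# Dilations double within a chunk (1, 2, 4, ..., 2**(blocks-1)); adaptive
# blocks multiply every dilation by the pitch-dependent factor E_t.
def receptive_field(chunks, blocks, e_t=1):
    """Receptive field (samples) of stacked dilated-conv chunks."""
    per_chunk = sum(e_t * 2 ** i for i in range(blocks))
    return chunks * per_chunk

# WNf: 3 chunks x 10 blocks, plus 1 sample from the causal layer.
print(receptive_field(3, 10) + 1)   # -> 3070
# WNc: 4 chunks x 4 blocks, plus 1.
print(receptive_field(4, 4) + 1)    # -> 61
# QPNet: fixed 3 chunks x 4 blocks (plus 1) cascaded with 1 adaptive
# chunk x 4 blocks scaled by E_t; e_t = 26 is just an example value.
e_t = 26
print(receptive_field(3, 4) + 1 + receptive_field(1, 4, e_t))  # -> 436
```

The fixed part contributes the constant 46 samples, while the adaptive part contributes 15 × E_t, which is why the QPNet receptive field tracks the conditioned pitch.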
The possible reason is that the full-size QPNet is dominated by the fixed blocks, because its number of fixed blocks is much larger than its number of adaptive blocks, whereas the numbers of fixed and adaptive blocks of the compact-size QPNet are more balanced.

Furthermore, as shown in Tables I and VI, the number of trainable parameters of the compact-size QPNet model is around half of that of the WNf model, so only about 75% of the training time and 40% of the generation time were required. However, because of the very long effective receptive field, the memory usage of QPNet in the training stage was almost the same as that of WNf. The huge memory requirement in the training process limits the possible ratio of fixed to adaptive modules, which leads to the unbalanced proportion problem. Therefore, improving the efficiency of memory usage will be one of the main tasks of future QPNet research.

VII. CONCLUSION
In this paper, we proposed a WaveNet-like audio waveform generation model named QPNet, which models quasi-periodic, high-temporal-resolution audio signals on the basis of an NN-based AR model with a novel PDCNN component and a cascaded AR structure. Specifically, the PDCNN component is a variant of the DCNN that dynamically changes its dilation size according to the conditioned F0 for modeling the long-term correlations of audio samples. The sinusoid generation evaluation results show that the PDCNNs significantly improve the periodicity-modeling capability of the generation network by using the introduced prior frequency information. Furthermore, the QPNet model as a vocoder models the short- and long-term correlations of speech samples on the basis of the cascaded fixed and adaptive macroblocks, respectively. The speech generation evaluation results indicate that the proposed QPNet vocoder attains much higher pitch accuracy than, and comparable speech quality to, the WN vocoder, especially when conditioned on unseen auxiliary F0 values. The network size and generation time of the QPNet vocoder are only around half of those of the WN vocoder. In conclusion, the proposed QPNet model with the novel PDCNN component and compact cascaded network architecture significantly improves the pitch controllability of the vanilla WN model, making the QPNet vocoder more in line with the definition of a vocoder. In our future work, we will explore improvements in memory usage and optimize the proportion of adaptive to fixed blocks.

REFERENCES

[1] H. Dudley, "The vocoder,"
Bell Labs Record , vol. 18, no. 4, pp. 122–126,1939.[2] M. R. Schroeder, “Vocoders: Analysis and synthesis of speech,”
Proc.IEEE , vol. 54, no. 5, pp. 720–734, 1966.[3] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on asinusoidal representation,”
IEEE Transactions on Acoustics, Speech, andSignal Processing , vol. 34, no. 4, pp. 744–754, 1986.[4] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuringspeech representations using a pitch-adaptive time–frequency smoothingand an instantaneous-frequency-based f0 extraction: Possible role of arepetitive structure in sounds,”
Speech Communication , vol. 27, no. 3-4,pp. 187–207, 1999.[5] M. Morise, F. Yokomori, and K. Ozawa, “World: a vocoder-basedhigh-quality speech synthesis system for real-time applications,”
IEICETransactions on Information and Systems , vol. 99, no. 7, pp. 1877–1884,2016.[6] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet:A generative model for raw audio,” in
Proc. SSW9 , Sept. 2016, p. 125.[7] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” in
Proc. ICLR , Apr. 2017.[8] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-timespeaker-dependent neural vocoder,” in
Proc. ICASSP , Apr. 2018, pp.2251–2255.[9] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande,E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, andK. Kavukcuoglu, “Efficient neural audio synthesis,” in
Proc. ICML , July2018, pp. 2415–2424.[10] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthe-sis through linear prediction,” in
Proc. ICASSP , May 2019, pp. 5891–5895.[11] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo,F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, andD. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in
Proc. ICML, July 2018, pp. 3915–3923.
[12] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, May 2019.
[13] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, May 2019, pp. 3617–3621.
[14] S. Kim, S.-G. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet: A generative flow for raw audio,” in Proc. ICML, June 2019, pp. 3370–3378.
[15] K. Oura, K. Nakamura, K. Hashimoto, Y. Nankaku, and K. Tokuda, “Deep neural network based real-time speech vocoder with periodic and aperiodic inputs,” in Proc. SSW10, Sept. 2019, pp. 13–18.
[16] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in Proc. ICASSP, May 2019, pp. 5916–5920.
[17] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform models for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2020.
[18] S. Dieleman, A. van den Oord, and K. Simonyan, “The challenge of realistic music generation: Modelling raw audio at scale,” in Advances in Neural Information Processing Systems, 2018, pp. 7989–7999.
[19] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.
[20] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, Apr. 2018, pp. 4779–4783.
[21] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet based low rate speech coding,” in Proc. ICASSP, Apr. 2018, pp. 676–680.
[22] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, “Speech enhancement using Bayesian WaveNet,” in
Proc. Interspeech, 2017, pp. 2013–2017.
[23] D. Rethage, J. Pons, and X. Serra, “A WaveNet for speech denoising,” in Proc. ICASSP, Apr. 2018, pp. 5069–5073.
[24] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with WaveNet-based waveform generation,” in Proc. Interspeech, 2017, pp. 1138–1142.
[25] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “NU voice conversion system for the voice conversion challenge 2018,” in Proc. Odyssey, 2018, pp. 219–226.
[26] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, “The NU non-parallel voice conversion system for the voice conversion challenge 2018,” in Proc. Odyssey, 2018, pp. 211–218.
[27] Y.-C. Wu, K. Kobayashi, T. Hayashi, P. L. Tobing, and T. Toda, “Collapsed speech segment detection and suppression for WaveNet vocoder,” in Proc. Interspeech, Sept. 2018, pp. 1988–1992.
[28] Y.-C. Wu, P. L. Tobing, K. Kobayashi, T. Hayashi, and T. Toda, “Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression,” IEEE Access, vol. 8, pp. 62094–62106, 2020.
[29] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proc. ICLR, May 2016.
[30] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, Aug. 2017, pp. 1118–1122.
[31] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for WaveNet vocoder,” in Proc. ASRU, Dec. 2017, pp. 712–718.
[32] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation,” in Proc. ICASSP, Apr. 2018, pp. 5664–5668.
[33] N. Adiga, V. Tsiaras, and Y. Stylianou, “On the use of WaveNet as a statistical vocoder,” in Proc. ICASSP, Apr. 2018, pp. 5674–5678.
[34] S. Singhal and B. Atal, “Improving performance of multi-pulse LPC coders at low bit rates,” in
Proc. ICASSP, vol. 9, 1984, pp. 9–12.
[35] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Proc. ICASSP, vol. 10, Apr. 1985, pp. 937–940.
[36] Y.-C. Wu, T. Hayashi, P. L. Tobing, K. Kobayashi, and T. Toda, “Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation,” in Proc. Interspeech, Sept. 2019, pp. 196–200.
[37] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, “Statistical voice conversion with quasi-periodic WaveNet vocoder,” in Proc. SSW10, Sept. 2019.
[38] Y. Ai, H.-C. Wu, and Z.-H. Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in Proc. ICASSP, Apr. 2018, pp. 5659–5663.
[39] Y. Cui, X. Wang, L. He, and F. K. Soong, “A new glottal neural vocoder for speech synthesis,” in Proc. Interspeech, 2018, pp. 2017–2021.
[40] L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku, “Speaker-independent raw waveform model for glottal excitation,” in Proc. Interspeech, 2018.
[41] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
[42] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
[43] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[44] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Advances in Neural Information Processing Systems, 2018, pp. 10215–10224.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, Dec. 2014, pp. 2672–2680.
[46] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
[47] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, May 2015.
[48] J. Kominek and A. W. Black, “The CMU ARCTIC speech databases for speech synthesis research,” Tech. Rep. CMU-LTI-03-177, 2003.
[49] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Odyssey, June 2018, pp. 195–202.
[50] Y.-C. Wu, QPNet demo, Accessed: 2019. [Online]. Available: https://bigpon.github.io/QuasiPeriodicWaveNet_demo/
[51] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in