HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
Jiawei Chen, Xu Tan∗, Jian Luan, Tao Qin, Tie-Yan Liu
Microsoft STC Asia & Microsoft Research Asia
{t-jiawch,xuta,jianluan,taoqin,tyliu}@microsoft.com
(∗ Corresponding author)

ABSTRACT
High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz, compared with 16kHz or 24kHz in speaking voices) with a large range of frequency to convey expression and emotion. However, a higher sampling rate causes a wider frequency band and longer waveform sequences, and throws challenges for singing modeling in both the frequency and time domains in singing voice synthesis (SVS). Conventional SVS systems that adopt a moderate sampling rate (e.g., 16kHz or 24kHz) cannot well address the above challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice using a 48kHz sampling rate. HiFiSinger consists of a FastSpeech based neural acoustic model and a Parallel WaveGAN based neural vocoder to ensure fast training and inference and also high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder to improve singing modeling. Specifically, 1) to handle the larger range of frequencies caused by the higher sampling rate (e.g., 48kHz vs. 24kHz), we propose a novel sub-frequency GAN (SF-GAN) on mel-spectrogram generation, which splits the full 80-dimensional mel-frequency into multiple sub-bands (e.g., low, middle and high frequency bands) and models each sub-band with a separate discriminator; 2) to model the longer waveform sequences caused by the higher sampling rate, we propose a multi-length GAN (ML-GAN) for waveform generation to model different lengths of waveform sequences with separate discriminators; 3) we also introduce several additional designs and findings in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window/hop size for the mel-spectrogram, and increasing the receptive field in the vocoder for long vowel modeling in singing voices. Experiment results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: a 0.32/0.44 MOS gain over the 48kHz/24kHz baseline and a 0.83 MOS gain over a previous SVS system. Audio samples are available at https://speechresearch.github.io/hifisinger/.

1 INTRODUCTION
Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information, and attracts a lot of attention in both industry and academia, especially in the machine learning and speech signal processing community (Umbert et al., 2015; Nishimura et al., 2016; Blaauw & Bonada, 2017; Nakamura et al., 2019; Hono et al., 2019; Chandna et al., 2019; Lee et al., 2019; Lu et al., 2020; Blaauw & Bonada, 2020; Gu et al., 2020; Ren et al., 2020b). Singing voice synthesis shares a similar pipeline with text to speech synthesis, and has achieved rapid progress (Blaauw & Bonada, 2017; Nakamura et al., 2019; Lee et al., 2019; Blaauw & Bonada, 2020; Gu et al., 2020) with the techniques developed in text to speech synthesis (Shen et al., 2018; Ren et al., 2019; 2020a; Yamamoto et al., 2020).

Most previous works on SVS (Lee et al., 2019; Gu et al., 2020) adopt the same sampling rate (e.g., 16kHz or 24kHz) as used in text to speech, where the frequency bands or sampling data points are not enough to convey expression and emotion as in high-fidelity singing voices. However, simply increasing the sampling rate causes several challenges in singing modeling. First, audio with a higher sampling rate contains wider and higher frequency bands, which throws challenges when predicting these frequency spectrums in the acoustic model. Second, audio with a higher sampling rate contains more waveform points and much finer-grained fluctuations in a fixed period of time, which also increases the difficulty of vocoder modeling in the time domain. As a consequence, even if some previous works (Hono et al., 2019; Chandna et al., 2019; Wu et al., 2019; Nakamura et al., 2020; Lu et al., 2020) adopt a higher sampling rate (e.g., 44.1kHz or 48kHz), they either leverage coarse-grained MFCC (Zheng et al., 2001) as acoustic features in a slow autoregressive neural vocoder (Oord et al., 2016), or use a non-neural vocoder such as Griffin-Lim (Griffin & Lim, 1984) and WORLD (Morise et al., 2016) to generate waveform, which do not fully exploit the potential of the high sampling rate and thus cannot yield good voice quality.

In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voices. HiFiSinger adopts FastSpeech (Ren et al., 2019) as the acoustic model and Parallel WaveGAN (Yamamoto et al., 2020) as the vocoder, since they are popular in speech synthesis (Hayashi et al., 2020; Ren et al., 2020a; Blaauw & Bonada, 2020; Lu et al., 2020), to ensure fast training and inference speed and also high quality. To address the challenges of the high sampling rate in singing modeling (wider frequency band and longer waveform), we design multi-scale adversarial training on both the acoustic model and the vocoder, and introduce several additional systematic designs and findings that are crucial to improve singing modeling:

• To handle the larger range of frequencies caused by the higher sampling rate (e.g., 0∼24kHz for 48kHz sampling rate vs. 0∼12kHz for 24kHz), we propose a novel sub-frequency GAN (SF-GAN) on mel-spectrogram generation, which splits the full 80-dimensional mel-frequency into multiple sub-bands (e.g., low, middle and high frequency bands) and models each sub-band with a separate discriminator.

• To model the longer waveform caused by the higher sampling rate, we propose a multi-length GAN (ML-GAN) on waveform generation, which randomly crops different lengths of waveform sequences and models them with separate discriminators. As a result, singing voices can be modeled at different length granularities to avoid the issues (e.g., glitches and vibrations) that occur with a single discriminator on a fixed length of waveform sequence.
• We further introduce several designs and findings in HiFiSinger that are important to achieve high-fidelity synthesis: 1) besides the mel-spectrogram, we add pitch (fundamental frequency, F0) and V/UV (voiced/unvoiced flag) as acoustic features to better model singing voices; 2) we carefully study the window and hop size of the acoustic features and choose appropriate values to better align with the range of pitches in singing voices and to trade off the modeling difficulty between the acoustic model and the vocoder; 3) we increase the receptive field in the vocoder to cover long vowels in singing voices.

We conduct experiments on our internal singing voice synthesis dataset that contains 11 hours of high-fidelity singing recordings with a 48kHz sampling rate. Experiment results demonstrate the advantages of our developed HiFiSinger over previous singing voice synthesis systems. Further ablation studies verify the effectiveness of each design in HiFiSinger to generate high-fidelity voices.
2 BACKGROUND
In this section, we briefly introduce the background of this work, including the comparison between singing voice synthesis (SVS) and text to speech (TTS), and the challenges of high-fidelity singing voice synthesis.
SVS vs. TTS
Text to speech (TTS) aims to synthesize speech voices from given text, and has evolved quickly from early concatenative synthesis (Hunt & Black, 1996) and statistical parametric synthesis to neural network based synthesis.

According to the Nyquist-Shannon sampling theorem (Millette, 2013), a sampling rate $f_s$ can cover the frequency band up to $f_s/2$. Therefore, the frequency band for audio with a 48kHz sampling rate spans 0∼24kHz, compared with 0∼8kHz and 0∼12kHz for 16kHz and 24kHz sampling rates. A higher sampling rate also means more waveform points in a fixed period of time: for example, a 1 second audio waveform contains 48,000 sampling points when the sampling rate is 48kHz.
High-Fidelity SVS
Singing voices usually leverage a high sampling rate to convey high-fidelity expression. For example, popular music websites such as Spotify, Apple Music, SoundCloud, QQ Music and NetEase Music all use a high sampling rate (44.1kHz or higher). However, a high sampling rate increases the difficulty of singing modeling: 1) a high sampling rate causes a wider spectrum band in the frequency domain, where different frequency bands with distinctive characteristics make it hard for the acoustic model; 2) a high sampling rate causes a longer waveform in a fixed period of time, where more sampling points and finer-grained fluctuations make it difficult for the vocoder. Most previous works on SVS adopt a 16kHz or 24kHz sampling rate as used in TTS. There indeed exist some works using a 44.1kHz or 48kHz sampling rate (Hono et al., 2019; Chandna et al., 2019; Wu et al., 2019; Nakamura et al., 2020; Lu et al., 2020). However, they either leverage coarse-grained MFCC (Zheng et al., 2001) as acoustic features in a slow autoregressive neural vocoder (Oord et al., 2016), or use a non-neural vocoder such as Griffin-Lim (Griffin & Lim, 1984) and WORLD (Morise et al., 2016) to generate waveform, which cannot fully exploit the potential of the high sampling rate and thus cannot yield good voice quality.
3 METHOD
In this section, we first introduce the overall architecture of HiFiSinger, and then describe the specific designs to address the distinctive challenges caused by the high sampling rate in singing modeling, including the sub-frequency GAN (SF-GAN) for the wider frequency band, the multi-length GAN (ML-GAN) for the longer waveform, and several systematic designs and findings that are important for high-quality singing voices.
3.1 SYSTEM OVERVIEW
A typical SVS system consists of an acoustic model to convert the music score into acoustic features, and a vocoder to generate the audio waveform from the acoustic features. As illustrated in Figure 1(a), to ensure high-quality synthesized voice and fast training and inference speed, HiFiSinger consists of an acoustic model based on FastSpeech (Ren et al., 2019; 2020a) and a vocoder based on Parallel WaveGAN (Yamamoto et al., 2020), both of which are non-autoregressive generation models. We introduce the details of the data input and model structure as follows.
Music Score Input
In order to generate high-quality singing voices with good pronunciation, tone, rhythm and timbre, we use the music score that contains lyrics, note pitch and note duration as the input of the acoustic model. Specifically, we process the music score as follows: 1) We convert the character (e.g., Chinese) or syllable (e.g., English) in the lyrics into phonemes using grapheme-to-phoneme conversion (Taylor, 2005; Sun et al., 2019). 2) We convert each note into a pitch ID according to the MIDI standard (for example, the pitch ID corresponding to note C4 is 60, about 262Hz). 3) We quantize the note duration according to the music tempo and then convert it to the number of frames of the mel-spectrogram (for example, given a tempo of 120, there are 120 beats per minute and one beat lasts 0.5 seconds; for a 4/4 time signature, a quarter note then has a duration of 0.5 seconds, and with a mel-spectrogram hop size of 5ms it corresponds to 100 frames). We repeat the pitch and duration IDs to match the number of phonemes in each character or syllable. Thus, the musical score input can be represented as $x \in \mathbb{R}^{N \times 3}$, where $N$ is the number of phonemes and 3 represents the three IDs for phoneme, pitch and duration, which are embedded into dense vectors and added together as the input of the acoustic model (a minimal sketch of this encoding is given below).
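To make the input encoding above concrete, below is a minimal sketch (in PyTorch) of how the three ID sequences could be embedded and summed; the vocabulary sizes, hidden size and example IDs are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (PyTorch) of the music-score input encoding: phoneme, note-pitch
# and note-duration IDs are embedded separately and summed. Vocabulary sizes,
# the hidden size (384) and the toy inputs below are illustrative assumptions.
import torch
import torch.nn as nn

class ScoreEmbedding(nn.Module):
    def __init__(self, n_phonemes=100, n_pitches=128, max_dur_frames=600, hidden=384):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, hidden)
        self.pitch_emb = nn.Embedding(n_pitches, hidden)     # MIDI pitch IDs, e.g. C4 -> 60
        self.dur_emb = nn.Embedding(max_dur_frames, hidden)  # note duration quantized to frames

    def forward(self, score_ids):
        # score_ids: LongTensor [N, 3] -> (phoneme ID, pitch ID, duration ID) per phoneme
        phone, pitch, dur = score_ids[:, 0], score_ids[:, 1], score_ids[:, 2]
        return self.phone_emb(phone) + self.pitch_emb(pitch) + self.dur_emb(dur)  # [N, hidden]

# Example: three phonemes of one syllable share the syllable's note pitch (C4 = 60)
# and its duration (a quarter note at tempo 120 = 0.5 s = 100 frames at 5 ms hop).
score_ids = torch.tensor([[11, 60, 100], [25, 60, 100], [37, 60, 100]])
x = ScoreEmbedding()(score_ids)  # acoustic-model input, shape [3, 384]
```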
Acoustic Model and Vocoder

The acoustic model is built on FastSpeech, which uses the feed-forward Transformer (FFT) block (Vaswani et al., 2017) as the basic structure of the encoder and decoder. Since singing voices do not exactly follow the durations in the music score, we need to explicitly predict the duration for natural and expressive singing voices. We use a duration predictor to predict how many frames of mel-spectrogram each phoneme corresponds to in the singing voice, and expand the phoneme hidden sequence to match the length of the mel-spectrogram (a sketch of this expansion is given below). The vocoder consists of a parallel generator as used in Parallel WaveGAN.
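The duration-based expansion can be sketched as follows, in the style of the FastSpeech length regulator; the tensor shapes and values are illustrative assumptions rather than the authors' exact code.

```python
# Sketch of the length-regulator step: repeat each phoneme hidden state by its
# (predicted or ground-truth) duration in frames so that the decoder input matches
# the mel-spectrogram length. Shapes and values are illustrative.
import torch

def length_regulate(phoneme_hidden, durations):
    """phoneme_hidden: [N, hidden]; durations: [N] ints (frames per phoneme)."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)  # [sum(durations), hidden]

hidden = torch.randn(3, 384)               # 3 phoneme hidden states from the encoder
durations = torch.tensor([100, 40, 100])   # frames predicted by the duration predictor
frame_hidden = length_regulate(hidden, durations)  # [240, 384], fed to the decoder
```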
3.2 MODELING WIDE FREQUENCY WITH SF-GAN
In order to generate highly expressive and high-fidelity singing voices, a larger sampling rate is needed to cover more high-frequency details, which results in a wider frequency band in the mel-spectrogram. As a consequence, it increases the difficulty of mel-spectrogram modeling, since the wide frequency band is full of diverse and complicated patterns, especially in the additional high frequency band (12∼24kHz, which is not covered at a 24kHz sampling rate).
A straightforward consideration is to increase the number of mel bins (e.g., from 80 to 120) to cover the additional high-frequency information. However, we have tried this in experiments and found no obvious improvements in voice quality. Actually, the key is not to increase the number of mel-spectrogram bins, but how to better model the diverse frequency details in a wide range of the frequency band. (Simply increasing the mel bins will not bring much information unless the STFT (short-time Fourier transform) filter size is increased at the same time. However, since there is a trade-off between the resolutions of frequency and time (Landau & Pollak, 1961), increasing the frequency bins equals increasing the frequency resolution, which requires sacrificing time resolution (related to the window size). According to our careful experimental studies, the optimal window size is 20ms; using other window sizes causes worse voice quality.)

Therefore, as shown in Figure 1(b), we use a sub-frequency GAN (SF-GAN) that splits the mel-spectrogram into multiple sub-frequency bands and models each band with a separate discriminator. The formulation of SF-GAN is shown in Equations 1 and 2:

$$\min_{G_{am}} \; \mathbb{E}_x \Big[ \sum_{f \in \{\mathrm{low},\,\mathrm{mid},\,\mathrm{high}\}} \big(1 - D_f(G_{am}(x))\big)^2 \Big], \qquad (1)$$

$$\min_{D_f} \; \mathbb{E}_y \big[(1 - D_f(y))^2\big] + \mathbb{E}_x \big[D_f(G_{am}(x))^2\big], \quad \forall f \in \{\mathrm{low},\,\mathrm{mid},\,\mathrm{high}\}, \qquad (2)$$

where the GAN loss follows LS-GAN (Mao et al., 2017), considering it is popular in speech; $x$ and $y$ represent the music score input and the mel-spectrogram output respectively, $G_{am}$ represents the acoustic model and $D_f$ represents the discriminator for frequency band $f$. For example, for an 80-dimensional mel-spectrogram, we split it into low, medium and high frequency bands, where the lowest 40 dimensions (0 to 40) form the low-frequency band, the middle 40 dimensions (20 to 60) the mid-frequency band, and the highest 40 dimensions (40 to 80) the high-frequency band, so that each frequency band overlaps with its adjacent bands (see the sketch below).
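Below is a minimal sketch of the SF-GAN objectives in Equations 1 and 2, with one LS-GAN discriminator per overlapping sub-band; the discriminator architecture (`Disc`) is a placeholder assumption, not the paper's network.

```python
# Sketch of SF-GAN losses (Eq. 1-2): one LS-GAN discriminator per overlapping
# mel sub-band. `Disc` is a placeholder discriminator returning a scalar score.
import torch
import torch.nn as nn

BANDS = {"low": (0, 40), "mid": (20, 60), "high": (40, 80)}  # overlapping mel-bin ranges

class Disc(nn.Module):  # assumed simple convolutional discriminator
    def __init__(self, n_bins):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(n_bins, 64, 3, padding=1), nn.LeakyReLU(),
                                 nn.Conv1d(64, 1, 3, padding=1))
    def forward(self, mel_band):            # mel_band: [B, bins, T]
        return self.net(mel_band).mean(dim=(1, 2))  # one scalar score per sample

discs = nn.ModuleDict({f: Disc(hi - lo) for f, (lo, hi) in BANDS.items()})

def sfgan_losses(mel_pred, mel_gt):
    """mel_pred / mel_gt: [B, 80, T]. Returns (generator loss, discriminator loss)."""
    g_loss, d_loss = 0.0, 0.0
    for f, (lo, hi) in BANDS.items():
        fake, real = mel_pred[:, lo:hi], mel_gt[:, lo:hi]
        g_loss += ((1 - discs[f](fake)) ** 2).mean()                                        # Eq. (1)
        d_loss += ((1 - discs[f](real)) ** 2).mean() + (discs[f](fake.detach()) ** 2).mean()  # Eq. (2)
    return g_loss, d_loss
```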
3.3 MODELING LONG WAVEFORM WITH ML-GAN

For audio with a high sampling rate, it not only has a wider frequency band in the frequency domain, but also a longer waveform in the time domain, which means more fine-grained and complicated fluctuations in a fixed range of time. Previous vocoders (Yamamoto et al., 2020) usually adopt a single discriminator to distinguish the entire audio clip, which cannot well handle the fluctuation patterns in different time ranges of the long waveform sequence. Therefore, we design a multi-length GAN (ML-GAN) in HiFiSinger, as shown in Figure 1(c), which uses multiple discriminators to distinguish waveform segments of different lengths. The formulation of ML-GAN is shown in Equations 3 and 4:

$$\min_{G_{voc}} \; \mathbb{E}_y \Big[ \sum_{t \in (0,\,\mathrm{len}(w))} \big(1 - D_t(G_{voc}(y))\big)^2 \Big], \qquad (3)$$

$$\min_{D_t} \; \mathbb{E}_w \big[(1 - D_t(w))^2\big] + \mathbb{E}_y \big[D_t(G_{voc}(y))^2\big], \quad \forall t \in (0,\,\mathrm{len}(w)), \qquad (4)$$

where the GAN loss follows LS-GAN (Mao et al., 2017); $y$ and $w$ represent the acoustic feature input (including mel-spectrogram, F0 and V/UV) and the waveform output respectively, $G_{voc}$ represents the vocoder and $D_t$ represents the discriminator for time length $t$. The benefits of ML-GAN are twofold: 1) it reduces the difficulty of modeling the longer waveform (caused by the high sampling rate) by modeling shorter waveform sequences; 2) it can better capture the dynamic phoneme durations (too long or too short) in singing voices by modeling different lengths of waveform sequences. A sketch of the ML-GAN losses is given below.
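Below is a minimal sketch of the ML-GAN objectives in Equations 3 and 4, with separate LS-GAN discriminators on randomly cropped waveform segments of different lengths; the crop lengths follow those used in the ablation study (Section 4.3), and the discriminator architecture (`WaveDisc`) is a placeholder assumption.

```python
# Sketch of ML-GAN (Eq. 3-4): separate LS-GAN discriminators on randomly cropped
# waveform segments of different lengths. `WaveDisc` is a placeholder network.
import random
import torch
import torch.nn as nn

SR = 48000
CROP_SECONDS = [0.25, 0.5, 0.75, 1.0]  # crop lengths, as in the ablation study

class WaveDisc(nn.Module):  # assumed simple strided-conv discriminator
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 32, 41, stride=4, padding=20), nn.LeakyReLU(),
                                 nn.Conv1d(32, 1, 41, stride=4, padding=20))
    def forward(self, wav):                # wav: [B, 1, L]
        return self.net(wav).mean(dim=(1, 2))

discs = nn.ModuleList([WaveDisc() for _ in CROP_SECONDS])

def mlgan_losses(wav_pred, wav_gt):
    """wav_pred / wav_gt: [B, 1, L], assumed at least 1 s long. Returns (gen loss, disc loss)."""
    g_loss, d_loss = 0.0, 0.0
    for sec, disc in zip(CROP_SECONDS, discs):
        n = int(sec * SR)
        start = random.randint(0, wav_gt.size(-1) - n)   # same random crop for real and fake
        fake, real = wav_pred[..., start:start + n], wav_gt[..., start:start + n]
        g_loss += ((1 - disc(fake)) ** 2).mean()                                      # Eq. (3)
        d_loss += ((1 - disc(real)) ** 2).mean() + (disc(fake.detach()) ** 2).mean()  # Eq. (4)
    return g_loss, d_loss
```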
3.4 OTHER SYSTEMATIC DESIGNS

Compared with speaking voices, singing voices have a larger range of pitches and phoneme durations, which also throws challenges in singing modeling. Therefore, we further introduce some systematic designs and findings in HiFiSinger that are crucial to improve the voice quality, including using pitch and V/UV as additional acoustic features, carefully studying the window size and hop size to trade off between the acoustic model and the vocoder, and increasing the receptive field in the vocoder to better model long vowels in singing voices. We describe them as follows:

• Pitch and V/UV. Singing voices heavily rely on pitch for voice quality. Therefore, besides the mel-spectrogram, our acoustic model also predicts pitch, where we use the original note pitch in the music score as a shortcut input to let the model focus on learning the residual pitch value, as shown in Figure 1(a). Besides, we also predict a voiced/unvoiced (V/UV) flag to help correct the pitch values and avoid electronic noise, as shown in the experiment section. The vocoder takes the mel-spectrogram, pitch and V/UV as input to generate waveform with better quality.
• Window/hop size. There are two considerations in the choice of window and hop size: 1) The window size of the mel-spectrogram during the short-time Fourier transform needs careful study, since a larger pitch prefers a smaller window size while a smaller pitch prefers a larger window size. The pitch in singing voices is usually higher (and sometimes lower) than in speaking voices, and thus the window size needs to be smaller than that in speaking voices. 2) A smaller hop size makes the acoustic features more fine-grained and longer in sequence length, which is more difficult for the acoustic model to predict but beneficial to the vocoder due to the more fine-grained input. On the other hand, a larger hop size eases acoustic model training but increases the difficulty of vocoder training. After careful study, we set the window size to 20ms and the hop size to 5ms (a 4:1 ratio following the common practice (Shen et al., 2018; Ren et al., 2019)); a worked example of these sizes is given after this list.

• Large receptive field. Furthermore, unlike speaking voices, the durations of the music notes and corresponding lyrics may vary a lot, causing a larger range of phoneme durations (usually longer on vowels). To better model this large range of durations, we use a larger kernel size in the vocoder to enlarge the receptive field to cover such long vowels.
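As a quick worked example of the chosen window/hop size at a 48kHz sampling rate (simple arithmetic, not reported in the paper):

```latex
\begin{align*}
\text{window} &= 0.020\,\mathrm{s}\times 48000\,\mathrm{Hz} = 960~\text{samples},\\
\text{hop}    &= 0.005\,\mathrm{s}\times 48000\,\mathrm{Hz} = 240~\text{samples}\quad(\text{ratio }4{:}1),\\
\text{a 10\,s clip} &= 480{,}000~\text{waveform samples} = 2000~\text{mel frames}.
\end{align*}
```

So the acoustic model predicts 2,000 frames for a 10-second clip while the vocoder upsamples each frame by a factor of 240; a larger hop would shorten the acoustic-feature sequence but enlarge the vocoder's per-frame burden, which is exactly the trade-off discussed above.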
4 EXPERIMENTS AND RESULTS
In this section, we first describe the experimental setup, and then report the experiment results, including audio quality, ablation studies and analyses of our proposed system.
4.1 EXPERIMENTAL SETUP
Datasets
Our singing dataset contains Chinese Mandarin pop songs collected from a female singer, who sings with accompaniment in a professional recording studio. All the singing recordings are sampled at 48kHz, quantized with 16 bits and split into pieces between 3 and 10 seconds. The final dataset contains 6,817 pieces, about 11 hours of data. We randomly choose 340 pieces for validation and 340 for test. When extracting mel-spectrogram features, the window size and hop size are set to 20ms and 5ms and the number of mel bins is set to 80. We extract the F0 and V/UV labels from the singing audio and obtain the phoneme duration labels (used in the duration predictor) with HMM-based forced alignment (Sjölander, 2003). Both the mel-spectrogram and F0 features are normalized to zero mean and unit variance before training. (A rough sketch of this feature extraction is given below.)
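The sketch below illustrates such a feature extraction pipeline, assuming librosa for the mel-spectrogram and Parselmouth for F0; parameters other than the stated window/hop/mel-bin settings (e.g., the log compression and the pitch extraction range) are assumptions, not the paper's exact settings.

```python
# Sketch of acoustic-feature extraction at 48 kHz: 80-bin mel-spectrogram with a
# 20 ms window / 5 ms hop, F0 via Parselmouth, and a V/UV flag. Values other than
# window/hop/mel-bin counts are assumptions.
import librosa
import numpy as np
import parselmouth

SR, WIN, HOP, N_MELS = 48000, int(0.020 * 48000), int(0.005 * 48000), 80

def extract_features(wav_path):
    wav, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=WIN, win_length=WIN,
                                         hop_length=HOP, n_mels=N_MELS)
    log_mel = np.log(np.maximum(mel, 1e-10))              # [80, T]

    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=0.005)                 # 5 ms frames; pitch floor/ceiling
    f0 = pitch.selected_array['frequency']                # left at defaults here (tune for singing)
    vuv = (f0 > 0).astype(np.float32)                     # Parselmouth returns 0 for unvoiced frames

    # zero-mean / unit-variance normalization (real statistics would come from the training set)
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return log_mel, f0, vuv
```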
Model Configuration

The backbone of the acoustic model is based on FastSpeech, where both the encoder and decoder consist of 6 FFT blocks. In each block, the hidden size of self-attention is set to 384 and the kernel width/input size/output size of the two-layer 1D-convolution are set to 3/384/1536 and 1/1536/384 respectively. On top of the last FFT block, a linear layer is used to generate the 80-dimensional mel-spectrogram, a one-dimensional F0 (float value) and a one-dimensional V/UV (0-1 value) as the acoustic features. The basic structure of the vocoder is based on WaveNet, where 10 non-causal dilated 1D-convolution layers with dilations of 1, 2, 4, ..., 512 are stacked 3 times. The channel sizes for dilations, residual blocks, and skip-connections are 64, 128, and 64, respectively. Specially, the kernel size of each 1D-convolution layer is set to 13 to model high sampling rate audio, as described in Section 3.4 (a back-of-the-envelope receptive field calculation is given below).

We then describe the discriminators in SF-GAN (acoustic model) and ML-GAN (vocoder) respectively. SF-GAN consists of three discriminators for the low (0∼40), middle (20∼60) and high (40∼80) mel-frequency bands.

Usually, the window size should cover at least 2 pitch periods; for speaking voices this corresponds to window sizes up to roughly 80ms. As we can see, the window size in speaking voices is usually set to 50ms (Shen et al., 2018; Ren et al., 2019), which falls into this range. We extract F0 using Parselmouth from https://github.com/YannickJadoul/Parselmouth, and mark a frame as voiced if its extracted F0 is above a small threshold, and unvoiced otherwise.
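As a back-of-the-envelope check of why the kernel size matters for long vowels, the receptive field of the stacked dilated convolutions can be estimated as follows; this is an illustrative calculation based on the configuration above, not the authors' exact implementation.

```python
# Back-of-the-envelope receptive field of the vocoder generator: 10 non-causal
# dilated 1D-conv layers (dilations 1, 2, ..., 512) stacked 3 times, kernel size k.
def receptive_field(kernel_size, dilations, sr=48000):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d        # each layer adds (k - 1) * dilation samples
    return rf, rf / sr * 1000              # in samples and in milliseconds

dilations = [2 ** i for i in range(10)] * 3   # 1..512, stacked 3 times (sum = 3069)
print(receptive_field(3, dilations))          # ~6,139 samples, ~128 ms with kernel size 3
print(receptive_field(13, dilations))         # ~36,829 samples, ~767 ms with kernel size 13
```

With the default kernel size of 3 the receptive field covers only about 0.13 seconds of 48kHz audio, while kernel size 13 covers roughly 0.77 seconds, which is much closer to the length of a long sung vowel.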
Training and Inference
We train the acoustic model and vocoder separately. The acoustic model is trained for 60k steps with a minibatch size of 32 using the Adam optimizer (with β1, β2 and ε set following Ren et al. (2019)) and the same learning rate schedule as in Ren et al. (2019). The vocoder is trained for 400k steps with a minibatch size of 4 using the RAdam optimizer (Liu et al., 2019). The initial learning rate is set to 0.0001 and is halved every 200k steps. Note that the discriminators are turned on starting from 10k steps in SF-GAN and 100k steps in ML-GAN to warm up the generators in the acoustic model and vocoder. During training, we use the ground-truth phoneme duration labels in the acoustic model and the ground-truth mel-spectrogram, F0 and V/UV as the input of the vocoder, while during inference we use the corresponding predicted values.

4.2 AUDIO QUALITY
Table 1: The MOS with 95% confidence intervals. A 48kHz sampling rate is used unless otherwise stated.
Method | MOS
Recording | – ± –
Recording (24kHz) | – ± –
XiaoiceSing (Lu et al., 2020) | – ± –
Baseline (24kHz) | – ± –
Baseline (24kHz upsampled) | – ± –
Baseline | – ± –
HiFiSinger (24kHz) | – ± –
HiFiSinger | – ± –

To verify the effectiveness of the proposed HiFiSinger system, we conduct MOS (mean opinion score) evaluation on the test set (we randomly choose 100 pieces from the test set for evaluation) to measure the quality of the synthesized singing voices. Each audio is listened to by at least 20 judges. We mainly compare HiFiSinger with the following settings and systems: 1) Recording, the original singing recordings; 2) Recording (24kHz), the original singing recordings downsampled to 24kHz; 3) XiaoiceSing (Lu et al., 2020), a previous SVS system that also adopts a 48kHz sampling rate but leverages the WORLD vocoder; 4) Baseline (24kHz), a baseline SVS system that uses the basic model backbone of HiFiSinger (FastSpeech based acoustic model and Parallel WaveGAN based vocoder) but without any of our improvements in HiFiSinger (SF-GAN, ML-GAN and the other systematic improvements described in Section 3), and only uses a 24kHz sampling rate; 5) Baseline (24kHz upsampled), waveform generated by Baseline (24kHz) and upsampled to 48kHz; 6) Baseline, the same baseline system as in 4) but with a 48kHz sampling rate; 7) HiFiSinger (24kHz), our proposed HiFiSinger system but with a 24kHz sampling rate; 8) HiFiSinger, our final HiFiSinger system with a 48kHz sampling rate.

The experiment results are shown in Table 1. We have several observations: 1) HiFiSinger outperforms XiaoiceSing and Baseline by 0.83 MOS and 0.32 MOS respectively at the sampling rate of 48kHz, which demonstrates the effectiveness of HiFiSinger for singing voices with a high sampling rate. 2) When increasing the audio sampling rate from 24kHz to 48kHz, Baseline gains only 0.12 MOS (3.44 vs. 3.32) while HiFiSinger gains 0.29 MOS (3.76 vs. 3.47), which also demonstrates the potential of HiFiSinger for the high sampling rate. 3) HiFiSinger with a 48kHz sampling rate even achieves a higher MOS score than the 24kHz recordings, and has only a 0.27 MOS gap to the 48kHz recordings, which verifies the high fidelity of the voices synthesized by HiFiSinger.

4.3 ABLATION STUDIES
We conduct ablation studies to verify the effectiveness of several components in HiFiSinger, including 1) the sub-frequency GAN (SF-GAN), 2) the multi-length GAN (ML-GAN), 3) pitch and V/UV, 4) the window and hop size, and 5) the large receptive field. We mainly conduct CMOS evaluation to compare two different settings, where each of the 100 randomly chosen evaluation pieces in the test set is listened to by 20 judges. The audio samples are available at https://speechresearch.github.io/hifisinger/.

Table 2: The CMOS results for SF-GAN, where n SF-GAN denotes that there are n discriminators handling different frequency bands in SF-GAN.

System | CMOS
HiFiSinger (default, 3 SF-GAN) | 0
HiFiSinger with 0 SF-GAN | -0.22
HiFiSinger with 1 SF-GAN | -0.28
HiFiSinger with 5 SF-GAN | -0.06
SF-GAN
We explore the performance when varying the number of discriminators in SF-GAN (described in Section 3.2). We make the total number of parameters of the discriminators in the different settings comparable (e.g., the total parameters of 3 discriminators are the same as those of 1 discriminator). (We also conducted experiments in which the parameters of 1 SF-GAN were 1/3 of those of 3 SF-GAN, which caused even worse voice quality; therefore, to be fair, we keep the total parameters of each setting the same.) From Table 2, it can be seen that HiFiSinger with 3 SF-GAN (default) outperforms the other settings: 1) 0 SF-GAN (without any discriminator), which shows the advantage of adversarial training; 2) 1 SF-GAN, which shows that a single discriminator cannot handle the complicated and diverse patterns in the low, middle and high frequency bands; 3) 5 SF-GAN (where the 80 mel bins are split into 5 overlapping sub-frequency bands), which shows that using more discriminators slightly hurts the quality. Therefore, we choose 3 discriminators as the default setting. As discussed in Section 3.2, another possible idea is to increase the number of mel bins to cover more high-frequency bands. Therefore, we conduct experiments to evaluate the voice quality when increasing the number of mel bins from 80 to 120 (both using a single discriminator), and find there is only a 0.02 CMOS gain, which demonstrates that simply increasing the number of mel bins cannot well model the diverse frequency details over a wider band.

Figure 2: The mel-spectrogram comparisons for SF-GAN.

Figure 2 shows the generated mel-spectrograms of HiFiSinger, HiFiSinger without discriminator, and the ground truth. It can be seen that HiFiSinger without discriminator generates over-smoothed mel-spectrograms, and after adding SF-GAN, the mel-spectrograms have more high-frequency details and are closer to the ground truth.

Table 3: The CMOS results for ML-GAN and single-length GAN.

System | CMOS
HiFiSinger with ML-GAN | 0
HiFiSinger with 0.25s length | -0.21
HiFiSinger with 0.50s length | -0.38
HiFiSinger with 0.75s length | -0.15
HiFiSinger with 1.00s length | -0.12
ML-GAN
We further study the effectiveness of ML-GAN in modeling the long waveform caused by the high sampling rate. As shown in Table 3, using only a single discriminator on a fixed length of waveform sequence (0.25s, 0.50s, 0.75s or 1.00s, i.e., without ML-GAN) performs worse than HiFiSinger with ML-GAN (multiple discriminators on 0.25/0.5/0.75/1.0s). According to our case analyses, a single discriminator with a small length results in expressive voices but electronic noise, while a single discriminator with a large length results in less expressive voices but little electronic noise, and every single-length setting tends to produce glitches and vibrations in long vowels; ML-GAN combines the advantages of discriminators on different waveform lengths and avoids these issues.

Figure 3 shows a sample case. It can be seen that there is a glitch in the long vowel generated by HiFiSinger without ML-GAN (using a single 1s-length discriminator), while HiFiSinger can generate a stable long vowel similar to the ground truth, thanks to the finer-granularity modeling of long vowels by the multi-length discriminators.
Other System Designs
Next, we study the effectiveness of the other system designs to improve the high-fidelity singing quality, including adding pitch and V/UV as the vocoder input, the choice of window and hop size, and the larger receptive field.

Figure 4: The mel-spectrogram comparisons of HiFiSinger with and without pitch and V/UV, where "silence" and "unvoiced" represent the silence frames and unvoiced frames.
Pitch and V/UV
F0 and V/UV can help the vocoder model the pitch and differentiate the voiced and unvoiced frames. We conduct CMOS evaluation on pitch and V/UV, as shown in Table 4. Removing pitch and V/UV from the vocoder input results in a 0.34 CMOS drop, and removing only V/UV causes a 0.28 CMOS drop, which demonstrates the effectiveness of pitch and V/UV. As shown in Figure 4, removing F0 and V/UV makes the unvoiced part (including silence and unvoiced frames) less informative and over-smoothed, which causes electronic noise according to our experimental observations. Besides, pitch makes the vocoder more controllable and more robust to a larger pitch range. We show on the demo page that we can change the pitch (increase or decrease several semitones, e.g., increasing note C by 4 semitones to get note E) and still obtain high-quality singing voices.

Table 4: CMOS for pitch and V/UV.
System | CMOS
HiFiSinger | 0
HiFiSinger without V/UV input | -0.28
HiFiSinger without F0 and V/UV input | -0.34

Table 5: CMOS under different window/hop sizes.
System | CMOS
HiFiSinger (default, 20ms/5ms) | 0
HiFiSinger with 12ms/3ms | -0.36
HiFiSinger with 50ms/12.5ms | -0.12
Window/Hop Size
As analyzed in Section 3.4, the window/hop size needs to be carefully chosen to account for the characteristics of singing voices as well as the trade-off in modeling difficulty between the acoustic model and the vocoder. We study different window/hop sizes (we always set the ratio between window size and hop size to 4:1 following the common practice) in Table 5. It can be seen that a larger or smaller window/hop size causes a quality drop, which demonstrates the effectiveness of our choice of window and hop size.

Table 6: CMOS under different receptive fields.
System | CMOS
HiFiSinger (default, kernel size 13) | 0
HiFiSinger with kernel size 5 | -0.39
HiFiSinger with kernel size 9 | -0.25
Receptive Field
Unlike speaking voices, the durations of the music notes and corresponding lyrics may vary a lot, causing a large range of phoneme durations (usually longer and sometimes shorter), mainly on vowels. To better model this large range of durations, we use a larger kernel size in the vocoder to enlarge the receptive field to cover such long vowels. We conduct CMOS evaluation on different convolution kernel sizes. As shown in Table 6, a kernel size with a larger receptive field leads to improvement in audio quality.
5 CONCLUSION
In this paper, we have developed HiFiSinger, an SVS system to synthesize high-fidelity singing voices. To address the challenges caused by the high sampling rate, we designed an SF-GAN in the acoustic model to better model the wider frequency band, an ML-GAN in the vocoder to better model the longer waveform sequences, and introduced several systematic designs and findings that are important to improve singing modeling. Experiment results show that HiFiSinger synthesizes singing voices with much higher quality than previous systems. For future work, we will continue to close the quality gap between the synthesized voices and recordings, and also apply our high-fidelity solution in HiFiSinger to text to speech synthesis.

REFERENCES
Sercan Ömer Arik, Mike Chrzanowski, Adam Coates, Gregory Frederick Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Y. Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In ICML, 2017.

Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. In International Conference on Learning Representations, 2019.

Merlijn Blaauw and Jordi Bonada. A neural parametric singing synthesizer modeling timbre and expression from natural songs. Applied Sciences, 7(12):1313, 2017.

Merlijn Blaauw and Jordi Bonada. Sequence-to-sequence singing synthesis using the feed-forward transformer. In ICASSP 2020, pp. 7229–7233. IEEE, 2020.

Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gómez. WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein-GAN. pp. 1–5. IEEE, 2019.

Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, and Zejun Ma. ByteSing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders. arXiv preprint arXiv:2004.11012, 2020.

Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan. ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In ICASSP 2020, pp. 7654–7658. IEEE, 2020.

Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on generative adversarial networks. In ICASSP 2019, pp. 6955–6959. IEEE, 2019.

Andrew J. Hunt and Alan W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. Volume 1, pp. 373–376. IEEE, 1996.

Lauri Juvela, Bajibabu Bollepalli, Manu Airaksinen, and Paavo Alku. High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network. pp. 5120–5124. IEEE, 2016.

Hideki Kawahara. STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds. Acoustical Science and Technology, 27(6):349–353, 2006.

Henry J. Landau and Henry O. Pollak. Prolate spheroidal wave functions, Fourier analysis and uncertainty—II. Bell System Technical Journal, 40(1):65–84, 1961.

Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee. Adversarially trained end-to-end Korean singing voice synthesis system. arXiv preprint arXiv:1908.01919, 2019.

Hao Li, Yongguo Kang, and Zhenyu Wang. EMPHASIS: An emotional phoneme-based acoustic model for speech synthesis system. Proc. Interspeech 2018, pp. 3077–3081, 2018.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2019.

Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. XiaoiceSing: A high-quality and integrated singing voice synthesis system. arXiv preprint arXiv:2006.06261, 2020.

Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Pierre A. Millette. The Heisenberg uncertainty principle and the Nyquist-Shannon sampling theorem. Progress in Physics, 9(3):9–14, 2013.

Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.

Kazuhiro Nakamura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on convolutional neural networks. arXiv preprint arXiv:1904.06868, 2019.

Kazuhiro Nakamura, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Fast and high-quality singing voice synthesis system based on convolutional neural networks. In ICASSP 2020, pp. 7239–7243. IEEE, 2020.

Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on deep neural networks. In Interspeech, pp. 2478–2482, 2016.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3171–3180, 2019.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint, 2020a.

Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, and Tie-Yan Liu. DeepSinger: Singing voice synthesis with data mined from the web. arXiv preprint arXiv:2007.04590, 2020b.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. pp. 4779–4783. IEEE, 2018.

Kåre Sjölander. An HMM-based system for automatic segmentation and alignment of speech. In Proceedings of Fonetik, volume 2003, pp. 93–96, 2003.

Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. Token-level ensemble distillation for grapheme-to-phoneme conversion. In INTERSPEECH, 2019.

Paul Taylor. Hidden Markov models for grapheme to phoneme conversion. In Ninth European Conference on Speech Communication and Technology, 2005.

Marti Umbert, Jordi Bonada, Masataka Goto, Tomoyasu Nakano, and Johan Sundberg. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges. IEEE Signal Processing Magazine, 32(6):55–73, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, and Dong Yu. Synthesising expressiveness in Peking opera via duration informed attention network. arXiv preprint arXiv:1912.12010, 2019.

Zhizheng Wu, Oliver Watts, and Simon King. Merlin: An open source neural network speech synthesis system.

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020, pp. 6199–6203. IEEE, 2020.

Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of MFCC. 2001.