A Deep Learning Based Analysis-Synthesis Framework For Unison Singing
Pritish Chandna, Helena Cuesta, Emilia Gómez
Music Technology Group, Universitat Pompeu Fabra, Barcelona / European Commission, Joint Research Centre, Seville
{pritish.chandna, helena.cuesta, emilia.gomez}@upf.edu
ABSTRACT
Unison singing is the name given to an ensemble of singers simultaneously singing the same melody and lyrics. While each individual singer in a unison sings the same principal melody, there are slight timing and pitch deviations between the singers, which, along with the ensemble of timbres, give the listener a perceived sense of "unison". In this paper, we present a study of unison singing in the context of choirs; utilising some recently proposed deep-learning based methodologies, we analyse the fundamental frequency (F0) distribution of the individual singers in recordings of unison mixtures. Based on the analysis, we propose a system for synthesising a unison signal from an a cappella input and a single voice prototype representative of a unison mixture. We use subjective listening tests to evaluate perceptual factors of our proposed system for synthesis, including quality, adherence to the melody, as well as the degree of perceived unison.
1. INTRODUCTION
Throughout history, singing has been an important cultural activity for humans, serving for the propagation of beliefs and ideas amongst the masses as well as for social entertainment. The social aspect led to gatherings of people singing in a group, which evolved into polyphonic ensemble singing with multiple voices singing counterpoint melodies in complex harmonies. A group of people singing in such an ensemble is commonly termed a choir, and the focus of our study is on one setting of such choirs consisting of four voices known as Soprano, Alto, Tenor and Bass (SATB). Each voice within an SATB ensemble has its own function and melodic range in the whole. SATB is one of the most widely studied, documented, and practiced forms of choirs, with numerous dedicated conservatories across Europe, highlighting the cultural importance of the art form. Within each of the SATB voices, it is common to have multiple singers of similar vocal range singing the same melody simultaneously, in a form known as unison singing.

© P. Chandna, H. Cuesta and E. Gómez. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: P. Chandna, H. Cuesta and E. Gómez, "A Deep Learning Based Analysis-Synthesis Framework For Unison Singing", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

While all the singers in a unison sing the same melody, it is impossible for a group of two or more people to perfectly synchronize and sing the exact same pitch line. Each singer has their own natural micro-deviations, both in terms of timing and pitch, from the prescribed score, and their own distinct timbre. The combination of micro-deviations and the ensemble of timbres leads to the perception of unison, wherein several singers are perceived to be singing a single pitch contour [1], and is the main focus of our study.

Pitch and fundamental frequency (F0) are related but not equivalent terms. While the F0 generally refers to the physical frequency of vibration of the vocal folds for a singing voice signal, pitch refers to an abstract perceptual concept which has been found to be closely correlated to the F0. Frequency is usually measured in Hertz, representing the number of cycles of a periodic signal per second, whereas pitch is described in terms of perceptual units like cents. The cent is a unit defined on a logarithmic scale, as a measure of the ratio between the frequency in Hertz and a base frequency, commonly chosen to be 440 Hz, as shown in Equation 1.

f_cents = 1200 · log2(f_hertz / 440)    (1)

Thus defined, the cent is correlated to the perceptually relevant musical unit of an equally tempered semitone. Specifically, one semitone spans 100 cents. Examined individually, the pitch of the singers in a unison can be represented by the F0 of each individual singer's vocal signal; this can be tested by synthesising a time-varying sinusoid with the frequency of the signal. However, when the individual signals are added, the resultant pitch is not merely the sum of the F0 values, and the methodology of synthesising a sum of sinusoids, as with single singers, fails to produce the same perceptual result due to physical phenomena such as beating, among others. Past studies have utilized artificial unison mixes created with a vowel-only singing voice synthesizer to study the perception of a single pitch contour in a unison [1]. Other areas of past research related to unison singing include single voice to unison synthesis models, based on creating voice clones with variations in the input [2, 3]. Fuelled by the deep learning revolution, singing voice synthesizers have evolved over the past few years, allowing us to take a step further both into exploring the perception of unison and into effective solo voice to unison synthesis. We build on the work done by Ternström [1] by leveraging recently proposed synthesis methodologies to synthesize a single voice prototype representing the melodic and linguistic content of a unison mixture. This allows us to further test the hypothesis of a single F0 contour representative of the perceived pitch of a unison via subjective listening tests. We also verify the author's findings by analysing a set of real recordings of unison singing. In addition, we propose a methodology combining previous research and recently developed techniques to synthesize a unison mixture from a single voice input. We follow the basic methodology used by Schnell et al. [2] to create voice clones with variations in three aspects (pitch, timing and timbre), and use perceptual evaluation tests to evaluate the effect of each of these parameters on the perception of the sense of a unison.

The rest of the paper is structured as follows. Section 2 discusses previous works pertinent to our study. We then present the analysis of the choir recordings in Section 3, including a description of the dataset of choir recordings, the methodology used for the analysis, and the results of the analysis. The synthesis methodology we use for synthesizing voice clones and the single voice prototype of the unison mixture is described in Section 4. Section 4.3 presents the perceptual evaluation methodology used and the results of the perceptual tests. Finally, we present a discussion of our findings in the analysis of the choir and the perceptual evaluation of the synthesis in Section 6.
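Equation 1 amounts to a one-line computation; the sketch below shows it in Python (the helper name `hz_to_cents` is ours).

```python
import math

def hz_to_cents(f_hz: float, f_ref: float = 440.0) -> float:
    """Equation 1: a frequency in Hz expressed in cents relative to f_ref."""
    return 1200.0 * math.log2(f_hz / f_ref)

# One octave above the reference is 1200 cents, and one equally
# tempered semitone (frequency ratio 2**(1/12)) spans 100 cents.
octave = hz_to_cents(880.0)                   # 880 Hz is an octave above 440 Hz
semitone = hz_to_cents(440.0 * 2 ** (1 / 12))
```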
2. RELATED WORK
We divide the description of related works into three sections: past studies on the analysis of the perception of unison, previous works on synthesising unison mixtures for choirs, and recently proposed deep-learning based methodologies that we will use for analysis and synthesis.
The perception of pitch dispersion has previously been studied in [1], wherein the author used synthesized singing voice stimuli to investigate the preferences of expert listeners in unisons. In the study, pitch dispersion is defined as the bandwidth of the fundamental frequency and its harmonic partials across individual singers in a unison. It is suggested that this dispersion is related to flutter: small variations in F0 that are too fast to be perceived as pitch variations. The concept of pitch scatter is presented in the study as the standard deviation over voices in the mean F0, the average F0 computed over the duration of each tone of a song. The study concludes that a small amount of scatter was preferred by the participants, while a scatter of around 14 cents was seen as the limit of tolerance before dissonance was reported. In addition, the author also highlights several differences between solo and ensemble singing. For instance, a single performer produces tones with well-defined properties (pitch, timing, loudness, timbre), while an ensemble of performers produces sounds with statistical distributions of each of these properties.

A similar method for modelling scatter in choir sections was presented by Cuesta et al. [4], using small windows to compute the standard deviation between individual F0s in the unison. This study used real recordings of choral singing instead of synthesised stimuli, presenting a mathematical model for dispersion rather than a perceptual evaluation. For the dataset used in the research, F0 (or pitch) dispersion was found to be in the range of 20–30 cents for all SATB voice sections, being slightly larger in the Bass.

Another recent study focused on the analysis of F0 in vocal music is the work by Weiss et al. [5], where the authors proposed an approach to measure the intonation quality of choir recordings. They create an ideal 12-tone equal temperament grid, and then calculate the deviation of each F0 and its partials from their theoretical positions in the grid. The overall deviation is computed as a weighted sum of each partial's deviation. This method enables the analysis of the overall intonation of a full choir recording, but does not account for the deviations within sections of the choir.

Signal processing techniques have previously been utilised to synthesize choir unison by adding several clones of a monophonic a cappella signal with uncorrelated pitch, timing, and timbre deviations. Most notably, Pitch Synchronous Overlap Add (PSOLA) methods [2] have been exploited as an analysis-synthesis framework to decompose the vocal signal into a set of constituent waveforms representing successive pitch periods of the signal. Pitch and timing deviations are added to the vocal signal using time stretching and pitch shifting techniques to create voice clones, which are combined to form the output unison signal.

Other proposed methodologies for creating a unison output from an a cappella signal include morphing the spectral and pitch components of the vowels of the input signal, as in [3]; the methodology's effectiveness is constrained to low-tempo inputs. Random modulation of beating partials to create a choral effect [6] has also been used.
For our work, we build on the work done in [1] and [4], modelling the perceptual pitch contour of a unison mixture as a single F0 contour. To this end, we use the recently proposed Convolutional Representation for Pitch Estimation (CREPE) methodology [7] for extracting F0 contours from real-world recordings of individual singers in a choir setting, as well as the F0 contour of the unison mixture created by combining the individual voices. This methodology applies a series of convolutional operations to the waveform of the input signal and outputs a probability distribution over a discrete representation of the underlying F0 contour of the signal across a series of time-frames.

To synthesize the single voice prototype representing a unison mixture and the voice clones for creating a unison mixture from a single voice input, we adapt the methodology proposed by Chandna et al. [8], which allows for the re-synthesis of a solo singing voice from a musical mixture input via the underlying linguistic features. This methodology builds on the idea of re-synthesizing a vocal signal from a musical mixture by estimating the parameters of a vocoder synthesizer [9], and uses an encoder built of a bank of bi-directional long short-term memory (LSTM) recurrent neural networks (RNNs) to estimate a continuous representation of the underlying linguistic features present in the input mixture signal. The continuous representation is singer-independent and language-agnostic, and was initially proposed for zero-shot voice conversion via an autoencoder network [10]. The linguistic features can then be used to generate the spectral envelope of the vocal signal in the mixture, providing the singer identity. The authors of [8] proposed two decoders for this process: a Singer Dependent Network (SDN), which takes the singer identity as a one-hot vector, and a Singer Independent Network (SIN), which intrinsically learns the singer identity from the given input. The spectral envelope is then combined with the F0, extracted via an external algorithm, to synthesize the vocal signal. While the original framework was proposed for extracting a singing voice from a pop/rock musical mixture, we adapt the SDN network for synthesising a unison mixture from an a cappella input and the SIN network for synthesizing an a cappella singing voice from a unison mixture. The adaptations we apply are described in Section 4. The SIN and SDN models [8] were trained on a proprietary dataset of songs by male and female singers, and we have obtained a copy of the trained model with permission from the relevant authorities for our study.
3. ANALYSIS OF CHOIR RECORDINGS
We analyse the variations between individual singers in a unison in terms of variance in pitch and timing. Below, we present the dataset that we use in our analysis, followed by the methodology used for the analysis, and finally the results of our analysis.
We analyse the Choral Singing Dataset (CSD) [4], which includes monophonic recordings of three choral pieces: Niño Dios d'Amor Herido, written by Francisco Guerrero; Locus Iste, written by Anton Bruckner; and El Rossinyol, a Catalan popular song. There are different singers for each song, with four singers for each of the four parts: Soprano, Alto, Tenor and Bass. The dataset also includes manually corrected F0 annotations for each track.

To analyse inter-singer variance in pitch, the first step is the extraction of an F0 contour from a unison mixture of singers. We aim to study the behavior of a monophonic F0 extractor in such cases, assuming that we have a sufficiently balanced unison performance, where the contribution of each singer is similar in terms of volume and energy. To this end, we use CREPE [7] to extract the fundamental frequency of the unison mixture created by summing and normalizing all corresponding individual singers in each vocal part of the recordings. This is termed EstF_U. We then measure the resemblance of the estimated EstF_U to each of the manually annotated F0 tracks and to the mean F_m. We use standard evaluation metrics for melody extraction, including Raw Pitch Accuracy (RPA), Overall Accuracy (OA), Voicing Recall (VR) and Voicing False Alarm (VFA), between EstF_U, the mean F_m, and each individual singer curve, GT F_i. Note that the average F0 value has to be adjusted for timing differences between the individual singers; to this end, we define the average to be zero (unvoiced frame) if and only if all individual values for that frame are zero. In all other cases, the average is calculated only accounting for the non-zero values. We use the mir_eval library [11] for this evaluation, with a pitch tolerance of 30 cents.

Once we have verified the accuracy of the extraction system, we build a statistical model for the individual contours in the unison, as suggested by [1]. In our model, the framewise F0 of an individual singer, F_i, can be represented as a distribution of values around the mean F_m with a deviation of F_dev_i, as shown in Equation 2:

F_i = F_m + F_dev_i    (2)

This equation also allows us to define the F_{i+1} of a singer in terms of the F_i of another singer in the unison as:

F_{i+1} = F_m + F_dev_{i+1}
F_{i+1} − F_i = F_dev_{i+1} − F_dev_i
F_{i+1} = F_i + F_dev_{i+1} − F_dev_i
F_{i+1} = F_i + ΔF_s    (3)

where we define ΔF_s as the inter-singer deviation, represented by Equation 4. For each pair of singers in the unison, we compute the frame-wise difference between the corresponding F0 contours in cents. For this calculation, only frames with positive F0 values, also known as voiced frames, were considered. We average these inter-singer deviations across time and songs, and obtain a single value for each group, i.e., SATB.

ΔF_s = [ Σ_{i=1}^{n} Σ_{j=i+1}^{n} |F_i − F_j| ] / C(n, 2)    (4)

where the sub-index s indicates the choir section, s ∈ {S, A, T, B}, and n is the number of singers. In our use case, n = 4.

To study timing deviations, we focus on the transitions from voiced to unvoiced, and vice-versa, where singers are not entirely in sync. We call these regions transition regions: regions where some of the singers in the mixture are voiced and others are unvoiced.
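The mean-F0 rule and the inter-singer deviation of Equation 4 can be sketched in plain Python. This is a minimal illustration under our reading of the text: the function names and the exact frame-wise pairing are ours, and real F0 tracks would come from CREPE and the dataset annotations.

```python
import itertools
import math

def hz_to_cents(f_hz, f_ref=440.0):
    """Frequency in Hz expressed in cents relative to f_ref (Equation 1)."""
    return 1200.0 * math.log2(f_hz / f_ref)

def mean_f0(f0_tracks):
    """Frame-wise mean F0 over singers: a frame is unvoiced (0) only if
    every singer is unvoiced there; otherwise average the non-zero values."""
    means = []
    for frame in zip(*f0_tracks):
        voiced = [f for f in frame if f > 0]
        means.append(sum(voiced) / len(voiced) if voiced else 0.0)
    return means

def inter_singer_deviation(f0_tracks):
    """Equation 4 (our reading): absolute pairwise F0 difference in cents,
    averaged over frames where both singers of a pair are voiced, then
    averaged over the n-choose-2 singer pairs."""
    pair_means = []
    for i, j in itertools.combinations(range(len(f0_tracks)), 2):
        diffs = [abs(hz_to_cents(fi) - hz_to_cents(fj))
                 for fi, fj in zip(f0_tracks[i], f0_tracks[j])
                 if fi > 0 and fj > 0]  # voiced frames only
        if diffs:
            pair_means.append(sum(diffs) / len(diffs))
    return sum(pair_means) / len(pair_means)
```

For example, two tracks a semitone apart on their common voiced frames yield a deviation of 100 cents.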
Figure 1. Resemblance of the estimated unison F0, EstF_U, to each individual GT F_i contour (green) and the average F_m (blue), using the pitch evaluation metrics (RPA, VR, VFA, OA) averaged across each choir section.

Figure 2. Inter-singer F0 deviations in cents, averaged across the whole dataset for each choir section. Deviations are calculated using Equation 4.

We measure the length of all the transition regions in every unison from the CSD, and then average across choir sections. A summary of the results of the comparison between the fundamental frequency extracted by CREPE, EstF_U, and the manually corrected fundamental frequency, GT F_i, is illustrated in Figure 1, along with a comparison with the mean, F_m. We observe that all sections follow the same pattern with similar metric values, and the unison F0 estimated by CREPE, EstF_U, is closer to the average F_m than to the individual contours. In addition, all metrics improve when we compare the average F0 curve to the extracted F0 contour from the unison: RPA, VR and OA are higher in the blue plots, while VFA is lower. We can thus use the pitch estimated by CREPE, EstF_U, as a representative of the mean single pitch contour perceived in a unison mixture [1].

The calculated ΔF_s is shown in Figure 2. We observe an inter-singer deviation of up to 50 cents, with a mean of around 20 cents. This value, representing the inter-singer deviation in the unison mixtures, is comparable to the pitch dispersion studied by Cuesta et al. [4]. While the methodology for modelling is different, these results are in accordance with their reported per-section pitch dispersion: larger in the bass section, smaller in the sopranos, and very similar for altos and tenors.

Section | Average Timing Deviation ± Standard Deviation
Soprano | – ± – sec
Alto    | – ± – sec
Tenor   | – ± – sec
Bass    | – ± – sec

Table 1. Timing deviations averaged across the CSD. These values measure the time span in which all singers in the unison transition from voiced to unvoiced, and vice-versa, averaged across all transitions in each song.

Table 1 shows the results of the timing analysis. We observe a similar average timing deviation between the voices in the unison across all parts of the choir.
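The timing analysis above can be sketched as a scan over per-singer voicing masks, collecting the lengths of the regions where some, but not all, singers are voiced. A minimal illustration (the function name and the frame-hop parameter are ours):

```python
def transition_lengths(voicing_masks, hop_seconds):
    """Lengths (in seconds) of transition regions: runs of frames where
    some, but not all, singers are voiced, given per-singer boolean masks."""
    n_singers = len(voicing_masks)
    n_frames = len(voicing_masks[0])
    partial = [0 < sum(m[t] for m in voicing_masks) < n_singers
               for t in range(n_frames)]
    lengths, run = [], 0
    for is_partial in partial:
        if is_partial:
            run += 1
        elif run:
            lengths.append(run * hop_seconds)
            run = 0
    if run:  # region reaching the end of the track
        lengths.append(run * hop_seconds)
    return lengths
```

Averaging these lengths per choir section would give the per-section values reported in Table 1.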
4. SYNTHESIS METHODOLOGY
We present two synthesis models: Solo To Unison (STU), to synthesize voice clones for creating a unison mixture from a single voice input, and Unison To Solo (UTS), for synthesizing a single voice prototype representing the melodic and linguistic content of a unison mixture input. Similar to the work presented by Schnell et al. [2], we decompose the input signal into the F0, harmonic spectral envelope, and aperiodicity envelope. However, instead of using Pitch Synchronous Overlap Add (PSOLA) methods, we utilize the WORLD vocoder [12], which has been shown to be an effective vocoding system for singing voice synthesis [13, 14]. Similar to [14], we use truncated frequency warping in the cepstral domain [15] to reduce the dimensionality of the harmonic component to a compact set of log Mel-Frequency Spectral Coefficients (MFSCs), using an all-pole filter. In addition, we use band-wise aperiodicity analysis to reduce the dimensionality of the aperiodic features. For the rest of this paper, we refer to these features together as the spectral envelope.

As shown in Figure 3, we first perform a short-time Fourier transform (STFT) to extract a spectrogram from the input waveform. The magnitude part of the spectrogram is passed through the encoder proposed in [8] to extract a continuous representation of the linguistic features present in the unison mixture input. The linguistic features are decoded via the SIN network [8] to generate the spectral envelope for vocal synthesis. This envelope is combined with the pitch contour output from CREPE [7], F_U, to synthesise the single voice prototype representing the unison mixture input. Audio examples are provided as complementary material at https://pc2752.github.io/unison_analysis_synthesis_examples/ and the source code with the trained models is available at https://github.com/MTG/content_choral_separation.

The analysis part of the STU case follows a similar methodology, as we extract the linguistic features and the EstF_i contour from the input a cappella voice signal. To create voice clones with pitch and timing deviations, we add randomly sampled noise from a normal distribution with a mean of zero and a variable standard deviation, termed std. This represents the inter-singer deviation, ΔF_s, and allows us to model the F_{i+1} of the clone as per Equation 3. Timing deviations are added by shifting the voiced portions of the input signal, i.e., the portions between two blocks of silence of more than 80 ms, by a variable amount, randomly sampled from a normal distribution with zero mean and standard deviation ts. The values of std and ts are based on our analysis of the Choral Singing Dataset presented in Section 3.3.

Finally, for variations in timbre, we generate the spectral envelope of a variable number of singers, ns, of the same gender as the input, using the SDN network proposed in [8]. This is based on our analysis presented in Section 3.3. There was no overlap between the singers in the set used for training the synthesis model and the singers in the Choral Singing Dataset used for evaluation. The various voice clones are added together and normalized in amplitude to produce the final unison output. We evaluate various combinations of std, ts, and ns on their impact on the perception of unison.

We used subjective listening tests with low and high anchors, as modified versions of the MUSHRA methodology [16], to evaluate subjective criteria of the synthesis produced by our analysis-synthesis framework. While there are several aspects that could be evaluated, we focused on three key aspects: adherence to melody, perception of unison, and audio quality. For each aspect, the participants were presented with four questions, one for each part of the SATB choir, and were asked to rate the test cases in each question on a continuous scale with respect to a presented reference. The test cases and references provided pertained to the same section of the song and were up to 10 s each. The parameters used for these tests are described below for each aspect.
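The clone-generation step can be sketched on the F0 contour alone. The following is a simplified illustration of the pitch (std) and timing (ts) deviations with helper names of our own; the actual system applies the timing shift to audio segments between silences and renders timbre variations through the SDN decoder, neither of which is modelled here.

```python
import random

def shift_frames(seq, offset):
    """Shift a contour by `offset` frames, padding with unvoiced zeros."""
    if offset == 0:
        return list(seq)
    if offset > 0:
        return [0.0] * offset + list(seq[:len(seq) - offset])
    return list(seq[-offset:]) + [0.0] * (-offset)

def clone_f0(f0_hz, std_cents, ts_seconds, hop_seconds, rng):
    """One voice-clone F0 contour: frame-wise pitch deviations (zero-mean
    Gaussian with standard deviation std, in cents, cf. Equation 3) plus a
    global timing offset (zero-mean Gaussian with standard deviation ts,
    in seconds) rounded to a whole number of frames."""
    pitched = [f * 2 ** (rng.gauss(0.0, std_cents) / 1200.0) if f > 0 else 0.0
               for f in f0_hz]
    offset = round(rng.gauss(0.0, ts_seconds) / hop_seconds)
    return shift_frames(pitched, offset)
```

Summing several such clones (rendered to audio) and normalizing the amplitude would give the final unison output.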
For this aspect, we wanted to assess the similarity of the perceived pitch contour of the output, for both the UTS and STU cases, to that of a ground truth unison mixture. To this end, the reference provided to the participant was a ground truth unison sample made by summing the corresponding four individual singers of a part to form a unison mixture. This reference is referred to as REFU. The participants were asked to rate test samples which included the single voice prototype of the unison as output by the UTS system, referred to as UTS. In addition, we evaluated the output of STU with the pitch deviation parameter std set to 50 cents, the acceptable limit of pitch deviations, as shown by our analysis in Section 3.3 and suggested by [1]. Four singers were used for generating this test case, with parameter ns set to 4, and it is referred to as STU_PS. We also evaluated the output of the STU system with both pitch and timing deviations, with parameter ts set to 40 ms. While our analysis in Section 3.3 suggests that higher values of ts could have been used, we found that increasing the value beyond 40 ms leads to an unacceptable level of degradation in output quality. We refer to this test case as STU_PTS. We also provided a lower anchor of a sample of the same length from another vocal part.
Unison is a loosely defined perceptual aspect, the cognition of which we aim to study here. For this, we provide a reference of a ground truth unison sample created in the same manner as described above, REFU. Given this, participants were asked to rate outputs from the STU system based on their similarity to the reference in terms of the perception of unison. In addition to the STU_PTS and STU_PS cases with pitch, timing and timbre variance, we also tested a case with just timing and singer variation, referred to as STU_TS, and a case with just pitch and timing deviations, referred to as STU_PT; timbral changes were not applied to the voice clones used for creating this last test case. The a cappella sample of a single singer singing the same example as the reference was provided as a lower anchor.
Audio quality is another subjective measure that is well defined in the literature but not easily understood by non-expert participants. For the evaluation, we set the upper limit of audio quality to the resynthesis of a single voice recording with the WORLD vocoder, REFS, and the lower limit to the resynthesis of a unison mixture with the same vocoder, RESSYNTHU. The test cases provided to the participants were the same as those provided for the adherence to melody case, except that the lower anchor was changed.
Our evaluation had participants both with and without prior musical training. To account for inter-participant variance in subjective evaluation, the opinion score for each question was normalized over the ratings for the reference and the lower anchor before calculating the mean opinion scores (MOS) and the standard deviations in opinion scores, presented in Table 2.

Figure 3. The synthesis framework. The magnitude part of the spectrogram of the input is passed through the encoder from [8] to extract linguistic content. For the UTS case, this is passed along with a learned embedding to the SIN decoder [8] to generate the spectral envelope. For STU, the linguistic features are passed to the SDN decoder along with a one-hot vector of the same gender as the input. F0 is extracted from the input waveform via CREPE [7]. Both the envelope and the F0 are used to synthesize the output voice. For the STU case, timing deviations are further added before summing with the input and normalizing.

Test Case | Adherence To Melody | Unison Perception | Audio Quality
UTS       | – ± –               | – ± –             | – ± –
STU_PS    | – ± –               | – ± –             | – ± –
STU_PTS   | – ± –               | – ± –             | – ± –
STU_TS    |                     | – ± –             |
STU_PT    |                     | – ± –             |

Table 2. Mean Opinion Score (MOS) ± Standard Deviation for the perceptual listening tests across the test cases provided. The models shown correspond to the Unison to Solo (UTS) model and the Solo to Unison model with pitch, timing and singer variations, indicated by the letters P, T and S as suffixes to the abbreviation, respectively. The scores for each question were normalized by the responses to the upper and lower limits defined in Section 4.3.

The subjective nature of the perceptual aspects evaluated must be taken into account, and the mean opinion scores are indicative of preferences rather than absolute measures of quantity. It can be observed that the perceived adherence to melody for the prototypical a cappella voice synthesized by the UTS model is preferred over the STU models, although a high variance is observed in its ratings. The unison perception evaluation shows that variations in either timing or pitch alone are not as preferred as variations in both aspects together. Timbre variations do not have as significant an effect on the perception of unison as variances in timing and pitch. The evaluation of audio quality shows room for improvement in the synthesis of the voice signals. This can partly be attributed to the use of the WORLD vocoder [12], and we believe that this can be improved in the future using recently proposed neural synthesis techniques.
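The per-participant normalization described above can be sketched as a min-max rescaling between each participant's own anchor and reference ratings. This is a plausible reading rather than the paper's exact formula, and the function names are ours.

```python
def normalize_ratings(ratings, anchor_rating, reference_rating):
    """Rescale one participant's raw ratings for a question so that their
    lower-anchor rating maps to 0 and their reference rating maps to 1."""
    span = reference_rating - anchor_rating
    return [(r - anchor_rating) / span for r in ratings]

def mean_opinion_score(normalized_ratings):
    """MOS: mean of the already-normalized ratings across participants."""
    return sum(normalized_ratings) / len(normalized_ratings)
```

Normalizing this way removes each participant's individual use of the rating scale before scores are averaged into the MOS.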
5. CONCLUSIONS
We have presented an analysis of the Choral Singing Dataset, building on the work presented in [1]. In accordance with the analysis done by [4], we observe deviations of up to 50 cents between the F0 contours of the individual singers in the unison mixtures in the dataset. We further note a consistent timing deviation between singers in unison in the dataset.

We then used this analysis, along with recently proposed deep-learning based methodologies, to present a synthesis system for a unison mixture from a single voice input and a single voice prototype synthesis representing the melodic and linguistic content of a unison mixture input. Based on these systems, we were able to conduct a perceptual evaluation of the unison, further supporting the claim of [1] that a mixture of different voices singing in unison is perceived to have a single pitch. In addition, we found that pitch and timing deviations together are important for the perception of the unison, and that variations in either aspect alone are insufficient. However, timbre variations were not found to be as relevant.

We present this work as a first step into the analysis of an under-explored research area, hoping to fuel further discussion on the topic. While interesting from an academic standpoint, the systems we present also have several commercial applications, such as creating a unison choral effect for use in music production, as well as for transposition and transcription in conjunction with the work presented in [17]. We also plan to incorporate the presented work with [18], for complete source separation for choral recordings.
6. ACKNOWLEDGEMENTS
The Titan X GPU used for this research was donated by the NVIDIA Corporation. This work is partially supported by the Towards Richer Online Music Public-domain Archives (TROMPA H2020 770376) project. Helena Cuesta is supported by the FI Predoctoral Grant from AGAUR (Generalitat de Catalunya).
7. REFERENCES

[1] S. Ternström, "Perceptual evaluations of voice scatter in unison choir sounds," STL-Quarterly Progress and Status Report, vol. 32, pp. 041–049, 1991.

[2] N. Schnell, G. Peeters, S. Lemouton, P. Manoury, and X. Rodet, "Synthesizing a choir in real-time using pitch synchronous overlap add (PSOLA)," in ICMC, 2000.

[3] J. Bonada, "Voice solo to unison choir transformation," in Proceedings of the 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.

[4] H. Cuesta, E. Gómez, A. Martorell, and F. Loáiciga, "Analysis of intonation in unison choir singing," in Proceedings of the International Conference of Music Perception and Cognition (ICMPC), Graz, Austria, 2018, pp. 125–130.

[5] C. Weiss, S. J. Schlecht, S. Rosenzweig, and M. Müller, "Towards measuring intonation quality of choir recordings: A case study on Bruckner's Locus Iste," in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019, pp. 276–283.

[6] M. B. Dolson, "A tracking phase vocoder and its use in the analysis of ensemble sounds," Ph.D. dissertation, California Institute of Technology, 1983.

[7] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 161–165.

[8] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "Content based singing voice extraction from a musical mixture," in Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2020, pp. 781–785.

[9] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "A vocoder based method for singing voice extraction," in Proceedings of the 44th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 2019. [Online]. Available: https://arxiv.org/abs/1903.07554

[10] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in International Conference on Machine Learning, 2019, pp. 5210–5219.

[11] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.

[12] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[13] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein-GAN," IEEE, 2019, pp. 1–5.

[14] M. Blaauw and J. Bonada, "A neural parametric singing synthesizer modeling timbre and expression from natural songs," Applied Sciences, vol. 7, no. 1313, 2017.

[15] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis: a unified approach to speech spectral estimation," 1994.

[16] C. Völker, T. Bisitz, R. Huber, B. Kollmeier, and S. M. Ernst, "Modifications of the multi stimulus test with hidden reference and anchor (MUSHRA) for use in audiology," International Journal of Audiology, vol. 57, no. sup3, pp. S92–S104, 2018.

[17] H. Cuesta, B. McFee, and E. Gómez, "Multiple F0 estimation in vocal ensembles using convolutional neural networks," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (Virtual), 2020.

[18] D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez, "Deep learning based source separation applied to choir ensembles," in