A Deep Learning Based Analysis-Synthesis Framework For Unison Singing
Pritish Chandna, Helena Cuesta, Emilia Gómez
Music Technology Group, Universitat Pompeu Fabra, Barcelona / European Commission, Joint Research Centre, Seville
{pritish.chandna, helena.cuesta, emilia.gomez}@upf.edu
ABSTRACT
Unison singing is the name given to an ensemble of singers simultaneously singing the same melody and lyrics. While each individual singer in a unison sings the same principal melody, there are slight timing and pitch deviations between the singers, which, along with the ensemble of timbres, give the listener a perceived sense of "unison". In this paper, we present a study of unison singing in the context of choirs; utilising some recently proposed deep-learning based methodologies, we analyse the fundamental frequency (F0) distribution of the individual singers in recordings of unison mixtures. Based on the analysis, we propose a system for synthesising a unison signal from an a cappella input and a single voice prototype representative of a unison mixture. We use subjective listening tests to evaluate perceptual factors of our proposed system for synthesis, including quality, adherence to the melody, as well as the degree of perceived unison.
1. INTRODUCTION
Throughout history, singing has been an important cultural activity for humans, serving for the propagation of beliefs and ideas amongst the masses as well as for social entertainment. The social aspect led to gatherings of people singing in a group, which evolved into polyphonic ensemble singing with multiple voices singing counterpoint melodies in complex harmonies. A group of people singing in such an ensemble is commonly termed a choir, and the focus of our study is on one setting of such choirs consisting of four voices known as Soprano, Alto, Tenor and Bass (SATB). Each voice within an SATB ensemble has its own function and melodic range in the whole. SATB is one of the most widely studied, documented, and practiced forms of choirs, with numerous dedicated conservatories across Europe, highlighting the cultural importance of the art form. Within each of the SATB voices, it is common to have multiple singers of similar vocal range singing the same melody simultaneously, in a form known as unison singing.

© P. Chandna, H. Cuesta and E. Gómez. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: P. Chandna, H. Cuesta and E. Gómez, "A Deep Learning Based Analysis-Synthesis Framework For Unison Singing", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

While all the singers in a unison sing the same melody, it is impossible for a group of two or more people to perfectly synchronize and sing the exact same pitch line. Each singer has their own natural micro-deviations, both in terms of timing and pitch, from the prescribed score, and their own distinct timbre. The combination of micro-deviations and the ensemble of timbres leads to the perception of unison, wherein several singers are perceived to be singing a single pitch contour [1], and is the main focus of our study.

Pitch and fundamental frequency (F0) are related but not equivalent terms. While the F0 generally refers to the physical frequency of vibration of the vocal folds for a singing voice signal, pitch refers to an abstract perceptual concept which has been found to be closely correlated to the F0. Frequency is usually measured in Hertz, representing the number of cycles of a periodic signal per second, whereas pitch is described in terms of perceptual units like cents. The cent is a unit defined on a logarithmic scale, as a measure of the ratio between the frequency in Hertz and a base frequency, commonly chosen to be 440 Hz, as shown in Equation 1.

f_cents = 1200 · log2(f_hertz / 440)    (1)

Thus defined, the cent is correlated to the perceptually relevant musical unit of an equally tempered semitone. Specifically, one semitone spans 100 cents. Examined individually, the pitch of the singers in a unison can be represented by the F0 of each individual singer's vocal signal; this can be tested by synthesising a time-varying sinusoid with the frequency of the signal. However, when the individual signals are added, the resultant pitch is not merely the sum of the F0 values, and the methodology of synthesising a sum of sinusoids, as with single singers, fails to produce the same perceptual result due to physical phenomena such as beating, among others. Past studies have utilized artificial unison mixes created with a vowel-only singing voice synthesizer to study the perception of a single pitch contour in a unison [1]. Other areas of past research related to unison singing include single voice to unison synthesis models, based on creating voice clones with variations in the input [2, 3]. Fuelled by the deep learning revolution, singing voice synthesizers have evolved over the past few years, allowing us to take a step further both into exploring the perception of unison and into effective solo voice to unison synthesis. We build on the work done by Ternström [1] by leveraging recently proposed synthesis methodologies to synthesize a single voice prototype representing the melodic and linguistic content of a unison mixture. This allows us to further test the hypothesis of a single F0 contour representative of the perceived pitch of a unison via subjective listening tests. We also verify the author's findings by analysing a set of real recordings of unison singing. In addition, we propose a methodology combining previous research and recently developed techniques to synthesize a unison mixture from a single voice input. We follow the basic methodology used by Schnell et al. [2] to create voice clones with variations in three aspects (pitch, timing and timbre), and use perceptual evaluation tests to evaluate the effect of each of these parameters on the perception of the sense of a unison.

The rest of the paper is structured as follows. Section 2 discusses previous works pertinent to our study. We then present the analysis of the choir recordings in Section 3, including a description of the dataset of choir recordings, the methodology used for the analysis, and the results of the analysis. The synthesis methodology we use for synthesizing voice clones and the single voice prototype of the unison mixture is described in Section 4. Section 4.3 presents the perceptual evaluation methodology used and the results of the perceptual tests. Finally, we present a discussion of our findings in the analysis of the choir and the perceptual evaluation of the synthesis in Section 6.
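Equation 1 amounts to a one-line computation; the sketch below shows it in Python (the helper name `hz_to_cents` is ours).

```python
import math

def hz_to_cents(f_hz: float, f_ref: float = 440.0) -> float:
    """Equation 1: a frequency in Hz expressed in cents relative to f_ref."""
    return 1200.0 * math.log2(f_hz / f_ref)

# One octave above the reference is 1200 cents, and one equally
# tempered semitone (frequency ratio 2**(1/12)) spans 100 cents.
octave = hz_to_cents(880.0)                   # 880 Hz is an octave above 440 Hz
semitone = hz_to_cents(440.0 * 2 ** (1 / 12))
```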
2. RELATED WORK
We divide the description of related works into three sections: past studies on the analysis of the perception of unison, previous works on synthesising unison mixtures for choirs, and recently proposed deep-learning based methodologies that we will use for analysis and synthesis.
The perception of pitch dispersion has previously been studied in [1], wherein the author used synthesized singing voice stimuli to investigate the preferences of expert listeners in unisons. In the study, pitch dispersion is defined as the bandwidth of the fundamental frequency and its harmonic partials across individual singers in a unison. It is suggested that this dispersion is related to flutter: small variations in F0 that are too fast to be perceived as pitch variations. The concept of pitch scatter is presented in the study as the standard deviation over voices in the mean F0, the average F0 computed over the duration of each tone of a song. The study concludes that a small amount of scatter was preferred by the participants, while a scatter of around 14 cents was seen as the limit of tolerance before dissonance was reported. In addition, the author also highlights several differences between solo and ensemble singing. For instance, a single performer produces tones with well-defined properties (pitch, timing, loudness, timbre), while an ensemble of performers produces sounds with statistical distributions of each of these properties.

A similar method for modelling scatter in choir sections was presented by Cuesta et al. [4], using small windows to compute the standard deviation between individual F0s in the unison. This study used real recordings of choral singing instead of synthesised stimuli, presenting a mathematical model for dispersion rather than a perceptual evaluation. For the dataset used in the research, F0 (or pitch) dispersion was found to be in the range of 20–30 cents for all SATB voice sections, being slightly larger in the Bass.

Another recent study focused on the analysis of F0 in vocal music is the work by Weiss et al. [5], where the authors proposed an approach to measure the intonation quality of choir recordings. They create an ideal 12-tone equal temperament grid, and then calculate the deviation of each F0 and its partials from their theoretical positions in the grid. The overall deviation is computed as a weighted sum of each partial's deviation. This method enables the analysis of the overall intonation of a full choir recording, but does not account for the deviations within sections of the choir.

Signal processing techniques have previously been utilised to synthesize choir unison by adding several clones of a monophonic a cappella signal with uncorrelated pitch, timing, and timbre deviations. Most notably, Pitch Synchronous Overlap Add (PSOLA) methods [2] have been exploited as an analysis-synthesis framework to decompose the vocal signal into a set of constituent waveforms representing successive pitch periods of the signal. Pitch and timing deviations are added to the vocal signal using time stretching and pitch shifting techniques to create voice clones, which are combined to form the output unison signal.

Other proposed methodologies for creating a unison output from an a cappella signal include morphing the spectral and pitch components of the vowels of the input signal, as in [3]; the methodology's effectiveness is constrained to low-tempo inputs. Random modulation of beating partials to create a choral effect [6] has also been used.
For our work, we build on the work done in [1] and [4], modelling the perceptual pitch contour of a unison mixture as a single F0 contour. To this end, we use the recently proposed Convolutional Representation for Pitch Estimation (CREPE) methodology [7] for extracting F0 contours from real-world recordings of individual singers in a choir setting, as well as the F0 contour of the unison mixture created by combining the individual voices. This methodology applies a series of convolutional operations to the waveform of the input signal and outputs a probability distribution over a discrete representation of the underlying F0 contour of the signal across a series of time-frames.

To synthesize the single voice prototype representing a unison mixture and the voice clones for creating a unison mixture from a single voice input, we adapt the methodology proposed by Chandna et al. [8], which allows for the re-synthesis of a solo singing voice from a musical mixture input via the underlying linguistic features. This methodology builds on the idea of re-synthesizing a vocal signal from a musical mixture by estimating the parameters of a vocoder synthesizer [9], and uses an encoder built of a bank of bi-directional long short-term memory (LSTM) recurrent neural networks (RNNs) to estimate a continuous representation of the underlying linguistic features present in the input mixture signal. The continuous representation is singer-independent and language-agnostic, and was initially proposed for zero-shot voice conversion via an autoencoder network [10]. The linguistic features can then be used to generate the spectral envelope of the vocal signal in the mixture, providing the singer identity. The authors of [8] proposed two decoders for this process: a Singer Dependent Network (SDN), which takes the singer identity as a one-hot vector, and a Singer Independent Network (SIN), which intrinsically learns the singer identity from the given input. The spectral envelope is then combined with the F0, extracted via an external algorithm, to synthesize the vocal signal. While the original framework was proposed for extracting a singing voice from a pop/rock musical mixture, we adapt the SDN network for synthesising a unison mixture from an a cappella input and the SIN network for synthesizing an a cappella singing voice from a unison mixture. The adaptations we apply are described in Section 4. The SIN and SDN models [8] were trained on a proprietary dataset of songs by male and female singers, and we have obtained a copy of the trained model with permission from the relevant authorities for our study.
3. ANALYSIS OF CHOIR RECORDINGS
We analyse the variations between individual singers in a unison in terms of variance in pitch and timing. Below, we present the dataset that we use in our analysis, followed by the methodology used for the analysis, and finally the results of our analysis.
We analyse the Choral Singing Dataset (CSD) [4], which includes monophonic recordings of three choral pieces: Niño Dios d'Amor Herido, written by Francisco Guerrero; Locus Iste, written by Anton Bruckner; and El Rossinyol, a Catalan popular song. There are different singers for each song, with four singers for each of the four parts: Soprano, Alto, Tenor and Bass. The dataset also includes manually corrected F0 annotations for each track.

To analyse inter-singer variance in pitch, the first step is the extraction of an F0 contour from a unison mixture of singers. We aim to study the behavior of a monophonic F0 extractor in such cases, assuming that we have a sufficiently balanced unison performance, where the contribution of each singer is similar in terms of volume and energy. To this end, we use CREPE [7] to extract the fundamental frequency of the unison mixture created by summing and normalizing all corresponding individual singers in each vocal part of the recordings. This is termed EstF_U. We then measure the resemblance of the estimated EstF_U to each of the manually annotated F0 tracks and to the mean F_m. We use standard evaluation metrics for melody extraction, including Raw Pitch Accuracy (RPA), Overall Accuracy (OA), Voicing Recall (VR) and Voicing False Alarm (VFA), between EstF_U, the mean F_m, and each individual singer curve, GT F_i. Note that the average F0 value has to be adjusted for timing differences between the individual singers; to this end, we define the average to be zero (unvoiced frame) if and only if all individual values for that frame are zero. In all other cases, the average is calculated only accounting for the non-zero values. We use the mir_eval library [11] for this evaluation, with a pitch tolerance of 30 cents.

Once we have verified the accuracy of the extraction system, we build a statistical model for the individual contours in the unison, as suggested by [1]. In our model, the framewise F0 of an individual singer, F_i, can be represented as a distribution of values around the mean F_m with a deviation of F_dev_i, as shown in Equation 2:

F_i = F_m + F_dev_i    (2)

This equation also allows us to define the F_{i+1} of a singer in terms of the F_i of another singer in the unison as:

F_{i+1} = F_m + F_dev_{i+1}
F_{i+1} − F_i = F_dev_{i+1} − F_dev_i
F_{i+1} = F_i + F_dev_{i+1} − F_dev_i
F_{i+1} = F_i + ΔF_s    (3)

where we define ΔF_s as the inter-singer deviation, represented by Equation 4. For each pair of singers in the unison, we compute the frame-wise difference between the corresponding F0 contours in cents. For this calculation, only frames with positive F0 values, also known as voiced frames, were considered. We average these inter-singer deviations across time and songs, and obtain a single value for each group, i.e., SATB.

ΔF_s = [ Σ_{i=1}^{n} Σ_{j=i+1}^{n} |F_i − F_j| ] / C(n, 2)    (4)

where the sub-index s indicates the choir section, s ∈ {S, A, T, B}, and n is the number of singers. In our use case, n = 4.

To study timing deviations, we focus on the transitions from voiced to unvoiced, and vice-versa, where singers are not entirely in sync. We call these regions transition regions: regions where some of the singers in the mixture are voiced and others are unvoiced.
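The mean-F0 rule and the inter-singer deviation of Equation 4 can be sketched in plain Python. This is a minimal illustration under our reading of the text: the function names and the exact frame-wise pairing are ours, and real F0 tracks would come from CREPE and the dataset annotations.

```python
import itertools
import math

def hz_to_cents(f_hz, f_ref=440.0):
    """Frequency in Hz expressed in cents relative to f_ref (Equation 1)."""
    return 1200.0 * math.log2(f_hz / f_ref)

def mean_f0(f0_tracks):
    """Frame-wise mean F0 over singers: a frame is unvoiced (0) only if
    every singer is unvoiced there; otherwise average the non-zero values."""
    means = []
    for frame in zip(*f0_tracks):
        voiced = [f for f in frame if f > 0]
        means.append(sum(voiced) / len(voiced) if voiced else 0.0)
    return means

def inter_singer_deviation(f0_tracks):
    """Equation 4 (our reading): absolute pairwise F0 difference in cents,
    averaged over frames where both singers of a pair are voiced, then
    averaged over the n-choose-2 singer pairs."""
    pair_means = []
    for i, j in itertools.combinations(range(len(f0_tracks)), 2):
        diffs = [abs(hz_to_cents(fi) - hz_to_cents(fj))
                 for fi, fj in zip(f0_tracks[i], f0_tracks[j])
                 if fi > 0 and fj > 0]  # voiced frames only
        if diffs:
            pair_means.append(sum(diffs) / len(diffs))
    return sum(pair_means) / len(pair_means)
```

For example, two tracks a semitone apart on their common voiced frames yield a deviation of 100 cents.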
Figure 1. Resemblance of the estimated unison F0, EstF_U, to each individual GT F_i contour (green) and the average F_m (blue), using the pitch evaluation metrics (RPA, VR, VFA, OA) averaged across each choir section.

Figure 2. Inter-singer F0 deviations in cents, averaged across the whole dataset for each choir section. Deviations are calculated using Equation 4.

We measure the length of all the transition regions in every unison from the CSD, and then average across choir sections. A summary of the results of the comparison between the fundamental frequency extracted by CREPE, EstF_U, and the manually corrected fundamental frequency, GT F_i, is illustrated in Figure 1, along with a comparison with the mean, F_m. We observe that all sections follow the same pattern with similar metric values, and the unison F0 estimated by CREPE, EstF_U, is closer to the average F_m than to the individual contours. In addition, all metrics improve when we compare the average F0 curve to the extracted F0 contour from the unison: RPA, VR and OA are higher in the blue plots, while VFA is lower. We can thus use the pitch estimated by CREPE, EstF_U, as a representative of the mean single pitch contour perceived in a unison mixture [1].

The calculated ΔF_s is shown in Figure 2. We observe an inter-singer deviation of up to 50 cents, with a mean of around 20 cents. This value, representing the inter-singer deviation in the unison mixtures, is comparable to the pitch dispersion studied by Cuesta et al. [4]. While the methodology for modelling is different, these results are in accordance with their reported per-section pitch dispersion: larger in the bass section, smaller in the sopranos, and very similar for altos and tenors.

Section | Average Timing Deviation ± Standard Deviation
Soprano | – ± – sec
Alto    | – ± – sec
Tenor   | – ± – sec
Bass    | – ± – sec

Table 1. Timing deviations averaged across the CSD. These values measure the time span in which all singers in the unison transition from voiced to unvoiced, and vice-versa, averaged across all transitions in each song.

Table 1 shows the results of the timing analysis. We observe a similar average timing deviation between the voices in the unison across all parts of the choir.
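The timing analysis above can be sketched as a scan over per-singer voicing masks, collecting the lengths of the regions where some, but not all, singers are voiced. A minimal illustration (the function name and the frame-hop parameter are ours):

```python
def transition_lengths(voicing_masks, hop_seconds):
    """Lengths (in seconds) of transition regions: runs of frames where
    some, but not all, singers are voiced, given per-singer boolean masks."""
    n_singers = len(voicing_masks)
    n_frames = len(voicing_masks[0])
    partial = [0 < sum(m[t] for m in voicing_masks) < n_singers
               for t in range(n_frames)]
    lengths, run = [], 0
    for is_partial in partial:
        if is_partial:
            run += 1
        elif run:
            lengths.append(run * hop_seconds)
            run = 0
    if run:  # region reaching the end of the track
        lengths.append(run * hop_seconds)
    return lengths
```

Averaging these lengths per choir section would give the per-section values reported in Table 1.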
4. SYNTHESIS METHODOLOGY
We present two synthesis models: Solo To Unison (STU), to synthesize voice clones for creating a unison mixture from a single voice input, and Unison To Solo (UTS), for synthesizing a single voice prototype representing the melodic and linguistic content of a unison mixture input. Similar to the work presented by Schnell et al. [2], we decompose the input signal into the F0, harmonic spectral envelope, and aperiodicity envelope. However, instead of using Pitch Synchronous Overlap Add (PSOLA) methods, we utilize the WORLD vocoder [12], which has been shown to be an effective vocoding system for singing voice synthesis [13, 14]. Similar to [14], we use truncated frequency warping in the cepstral domain [15] to reduce the dimensionality of the harmonic component to a compact set of log Mel-Frequency Spectral Coefficients (MFSCs), using an all-pole filter. In addition, we use band-wise aperiodicity analysis to reduce the dimensionality of the aperiodic features. For the rest of this paper, we refer to these features together as the spectral envelope.

As shown in Figure 3, we first perform a short-time Fourier transform (STFT) to extract a spectrogram from the input waveform. The magnitude part of the spectrogram is passed through the encoder proposed in [8] to extract a continuous representation of the linguistic features present in the unison mixture input. The linguistic features are decoded via the SIN network [8] to generate the spectral envelope for vocal synthesis. This envelope is combined with the pitch contour output from CREPE [7], F_U, to synthesise the single voice prototype representing the unison mixture input. Audio examples are provided as complementary material at https://pc2752.github.io/unison_analysis_synthesis_examples/ and the source code with the trained models is available at https://github.com/MTG/content_choral_separation.

The analysis part of the STU case follows a similar methodology, as we extract the linguistic features and the EstF_i contour from the input a cappella voice signal. To create voice clones with pitch and timing deviations, we add randomly sampled noise from a normal distribution with a mean of zero and a variable standard deviation, termed std. This represents the inter-singer deviation, ΔF_s, and allows us to model the F_{i+1} of the clone as per Equation 3. Timing deviations are added by shifting the voiced portions of the input signal, i.e., the portions between two blocks of silence of more than 80 ms, by a variable amount, randomly sampled from a normal distribution with zero mean and standard deviation ts. The values of std and ts are based on our analysis of the Choral Singing Dataset presented in Section 3.3.

Finally, for variations in timbre, we generate the spectral envelope of a variable number of singers, ns, of the same gender as the input, using the SDN network proposed in [8]. This is based on our analysis presented in Section 3.3. There was no overlap between the singers in the set used for training the synthesis model and the singers in the Choral Singing Dataset used for evaluation. The various voice clones are added together and normalized in amplitude to produce the final unison output. We evaluate various combinations of std, ts, and ns on their impact on the perception of unison.

We used subjective listening tests with low and high anchors, as modified versions of the MUSHRA methodology [16], to evaluate subjective criteria of the synthesis produced by our analysis-synthesis framework. While there are several aspects that could be evaluated, we focused on three key aspects: adherence to melody, perception of unison, and audio quality. For each aspect, the participants were presented with four questions, one for each part of the SATB choir, and were asked to rate the test cases in each question on a continuous scale with respect to a presented reference. The test cases and references provided pertained to the same section of the song and were up to 10 s each. The parameters used for these tests are described below for each aspect.
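The clone-generation step can be sketched on the F0 contour alone. The following is a simplified illustration of the pitch (std) and timing (ts) deviations with helper names of our own; the actual system applies the timing shift to audio segments between silences and renders timbre variations through the SDN decoder, neither of which is modelled here.

```python
import random

def shift_frames(seq, offset):
    """Shift a contour by `offset` frames, padding with unvoiced zeros."""
    if offset == 0:
        return list(seq)
    if offset > 0:
        return [0.0] * offset + list(seq[:len(seq) - offset])
    return list(seq[-offset:]) + [0.0] * (-offset)

def clone_f0(f0_hz, std_cents, ts_seconds, hop_seconds, rng):
    """One voice-clone F0 contour: frame-wise pitch deviations (zero-mean
    Gaussian with standard deviation std, in cents, cf. Equation 3) plus a
    global timing offset (zero-mean Gaussian with standard deviation ts,
    in seconds) rounded to a whole number of frames."""
    pitched = [f * 2 ** (rng.gauss(0.0, std_cents) / 1200.0) if f > 0 else 0.0
               for f in f0_hz]
    offset = round(rng.gauss(0.0, ts_seconds) / hop_seconds)
    return shift_frames(pitched, offset)
```

Summing several such clones (rendered to audio) and normalizing the amplitude would give the final unison output.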
For this aspect, we wanted to assess the similarity of the perceived pitch contour of the output, for both the UTS and STU cases, to that of a ground truth unison mixture. To this end, the reference provided to the participant was a ground truth unison sample made by summing the corresponding four individual singers of a part to form a unison mixture. This reference is referred to as REFU. The participants were asked to rate test samples which included the single voice prototype of the unison as output by the UTS system, referred to as UTS. In addition, we evaluated the output of STU with the pitch deviation parameter std set to 50 cents, the acceptable limit of pitch deviations, as shown by our analysis in Section 3.3 and suggested by [1]. Four singers were used for generating this test case, with parameter ns set to 4, and it is referred to as STU_PS. We also evaluated the output of the STU system with both pitch and timing deviations, with parameter ts set to 40 ms. While our analysis in Section 3.3 suggests that higher values of ts could have been used, we found that increasing the value beyond 40 ms leads to an unacceptable level of degradation in output quality. We refer to this test case as STU_PTS. We also provided a lower anchor of a sample of the same length from another vocal part.
Unison is a loosely defined perceptual aspect, the cognition of which we aim to study here. For this, we provide a reference of a ground truth unison sample created in the same manner as described above, REFU. Given this, participants were asked to rate outputs from the STU system based on their similarity to the reference in terms of the perception of unison. In addition to the STU_PTS and STU_PS cases with pitch, timing and timbre variance, we also tested a case with just timing and singer variation, referred to as STU_TS, and a case with just pitch and timing deviations, referred to as STU_PT; timbral changes were not applied to the voice clones used for creating this last test case. The a cappella sample of a single singer singing the same example as the reference was provided as a lower anchor.
Audio quality is another subjective measure that is well defined in the literature but not easily understood by non-expert participants. For the evaluation, we set the upper limit of audio quality to the resynthesis of a single voice recording with the WORLD vocoder, REFS, and the lower limit to the resynthesis of a unison mixture with the same vocoder, RESSYNTHU. The test cases provided to the participants were the same as those provided for the adherence to melody case, except that the lower anchor was changed.
Our evaluation had participants both with and without prior musical training. To account for inter-participant variance in subjective evaluation, the opinion score for each question was normalized over the ratings for the reference and the lower anchor before calculating the mean opinion scores (MOS) and the standard deviations in opinion scores, presented in Table 2.

Figure 3. The synthesis framework. The magnitude part of the spectrogram of the input is passed through the encoder from [8] to extract linguistic content. For the UTS case, this is passed along with a learned embedding to the SIN decoder [8] to generate the spectral envelope. For STU, the linguistic features are passed to the SDN decoder along with a one-hot vector of the same gender as the input. F0 is extracted from the input waveform via CREPE [7]. Both the envelope and the F0 are used to synthesize the output voice. For the STU case, timing deviations are further added before summing with the input and normalizing.

Test Case | Adherence To Melody | Unison Perception | Audio Quality
UTS       | – ± –               | – ± –             | – ± –
STU_PS    | – ± –               | – ± –             | – ± –
STU_PTS   | – ± –               | – ± –             | – ± –
STU_TS    |                     | – ± –             |
STU_PT    |                     | – ± –             |

Table 2. Mean Opinion Score (MOS) ± Standard Deviation for the perceptual listening tests across the test cases provided. The models shown correspond to the Unison to Solo (UTS) model and the Solo to Unison model with pitch, timing and singer variations, indicated by the letters P, T and S as suffixes to the abbreviation, respectively. The scores for each question were normalized by the responses to the upper and lower limits defined in Section 4.3.

The subjective nature of the perceptual aspects evaluated must be taken into account, and the mean opinion scores are indicative of preferences rather than absolute measures of quantity. It can be observed that the perceived adherence to melody for the prototypical a cappella voice synthesized by the UTS model is preferred over the STU models, although a high variance is observed in its ratings. The unison perception evaluation shows that variations in either timing or pitch alone are not as preferred as variations in both aspects together. Timbre variations do not have as significant an effect on the perception of unison as variances in timing and pitch. The evaluation of audio quality shows room for improvement in the synthesis of the voice signals. This can partly be attributed to the use of the WORLD vocoder [12], and we believe that this can be improved in the future using recently proposed neural synthesis techniques.
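The per-participant normalization described above can be sketched as a min-max rescaling between each participant's own anchor and reference ratings. This is a plausible reading rather than the paper's exact formula, and the function names are ours.

```python
def normalize_ratings(ratings, anchor_rating, reference_rating):
    """Rescale one participant's raw ratings for a question so that their
    lower-anchor rating maps to 0 and their reference rating maps to 1."""
    span = reference_rating - anchor_rating
    return [(r - anchor_rating) / span for r in ratings]

def mean_opinion_score(normalized_ratings):
    """MOS: mean of the already-normalized ratings across participants."""
    return sum(normalized_ratings) / len(normalized_ratings)
```

Normalizing this way removes each participant's individual use of the rating scale before scores are averaged into the MOS.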
5. CONCLUSIONS
We have presented an analysis of the Choral Singing Dataset, building on the work presented in [1]. In accordance with the analysis done by [4], we observe deviations of up to 50 cents between the F0 contours of the individual singers in the unison mixtures in the dataset. We further note a consistent timing deviation between singers in unison in the dataset.

We then used this analysis, along with recently proposed deep-learning based methodologies, to present a synthesis system for a unison mixture from a single voice input and a single voice prototype synthesis representing the melodic and linguistic content of a unison mixture input. Based on these systems, we were able to conduct a perceptual evaluation of the unison, further supporting the claim of [1] that a mixture of different voices singing in unison is perceived to have a single pitch. In addition, we found that pitch and timing deviations together are important for the perception of the unison, and that variations in either aspect alone are insufficient. However, timbre variations were not found to be as relevant.

We present this work as a first step into the analysis of an under-explored research area, hoping to fuel further discussion on the topic. While interesting from an academic standpoint, the systems we present also have several commercial applications, such as creating a unison choral effect for use in music production, as well as for transposition and transcription in conjunction with the work presented in [17]. We also plan to incorporate the presented work with [18], for complete source separation for choral recordings.
6. ACKNOWLEDGEMENTS
The Titan X GPU used for this research was donated by the NVIDIA Corporation. This work is partially supported by the Towards Richer Online Music Public-domain Archives (TROMPA H2020 770376) project. Helena Cuesta is supported by the FI Predoctoral Grant from AGAUR (Generalitat de Catalunya).
7. REFERENCES

[1] S. Ternström, "Perceptual evaluations of voice scatter in unison choir sounds," STL-Quarterly Progress and Status Report, vol. 32, pp. 041–049, 1991.

[2] N. Schnell, G. Peeters, S. Lemouton, P. Manoury, and X. Rodet, "Synthesizing a choir in real-time using pitch synchronous overlap add (PSOLA)," in ICMC, 2000.

[3] J. Bonada, "Voice solo to unison choir transformation," in Proceedings of the 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.

[4] H. Cuesta, E. Gómez, A. Martorell, and F. Loáiciga, "Analysis of intonation in unison choir singing," in Proceedings of the International Conference of Music Perception and Cognition (ICMPC), Graz, Austria, 2018, pp. 125–130.

[5] C. Weiss, S. J. Schlecht, S. Rosenzweig, and M. Müller, "Towards measuring intonation quality of choir recordings: A case study on Bruckner's Locus Iste," in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019, pp. 276–283.

[6] M. B. Dolson, "A tracking phase vocoder and its use in the analysis of ensemble sounds," Ph.D. dissertation, California Institute of Technology, 1983.

[7] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 161–165.

[8] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "Content based singing voice extraction from a musical mixture," in Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, 2020, pp. 781–785.

[9] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "A vocoder based method for singing voice extraction," in Proceedings of the 44th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, 2019. [Online]. Available: https://arxiv.org/abs/1903.07554

[10] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in International Conference on Machine Learning, 2019, pp. 5210–5219.

[11] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.

[12] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[13] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein-GAN," IEEE, 2019, pp. 1–5.

[14] M. Blaauw and J. Bonada, "A neural parametric singing synthesizer modeling timbre and expression from natural songs," Applied Sciences, vol. 7, no. 1313, 2017.

[15] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis: a unified approach to speech spectral estimation," 1994.

[16] C. Völker, T. Bisitz, R. Huber, B. Kollmeier, and S. M. Ernst, "Modifications of the multi stimulus test with hidden reference and anchor (MUSHRA) for use in audiology," International Journal of Audiology, vol. 57, no. sup3, pp. S92–S104, 2018.

[17] H. Cuesta, B. McFee, and E. Gómez, "Multiple F0 estimation in vocal ensembles using convolutional neural networks," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (Virtual), 2020.

[18] D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez, "Deep learning based source separation applied to choir ensembles," in