Multichannel-based learning for audio object extraction
Daniel Arteaga Jordi Pons
Dolby Laboratories
ABSTRACT
The current paradigm for creating and deploying immersive audio content is based on audio objects, which are composed of an audio track and position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves sound source separation and estimating the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability challenge for audio object extraction. Here, we propose a novel deep learning approach to object extraction that learns from the multichannel renders of object-based productions, instead of directly learning from the audio objects themselves. This approach allows tackling the object scalability challenge and also offers the possibility to formulate the problem in a supervised or an unsupervised fashion. Since, to our knowledge, no other works have previously addressed this topic, we first define the task and propose an evaluation methodology, and then discuss under what circumstances our methods outperform the proposed baselines.
Index Terms — object-based audio, source separation
1. INTRODUCTION
Audio objects, composed of an audio track and position metadata, are rendered to specific listening layouts (e.g., 5.1 or stereo) during playback, offering more flexibility, adaptability, and immersiveness than traditional multichannel productions. The development of audio object extraction technologies enables commercially interesting applications, such as converting old multichannel movies into object-based audio formats, allowing them to get the most out of modern immersive reproduction venues and paving the way for applications such as object modification or remapping. However, object-based productions are composed of dozens of simultaneous audio objects. To overcome this object scalability challenge, several assumptions can be made. For example, some works assume knowing (in advance) which sources are present in the multichannel mix. Accordingly, source-specific models can be trained to extract such objects [1–3]. The difficulty with this approach is that in cinematic object-based productions all kinds of audio objects can be present in the mix, making it hard to assume (in advance) which sources will appear. Another direction relies on the assumption that a fixed set (e.g., 1 or 10) of unknown objects is present in the mix. While these models are not source-specific, the number of audio objects in a realistic production is often large, and such models are limited in the number of simultaneous objects they can extract [4–6].

This object scalability issue also poses challenges from the model optimization perspective: it is difficult to define the reference objects for object-based supervised learning. The problem of permutation ambiguity, described in the speech [7–9] and universal [3, 10]
The authors want to thank Giulio Cengarle for his insightful suggestions.
source separation literature, also arises here: the output to ground-truth pairs required for supervised learning cannot be arbitrarily assigned, due to the source- (or speaker-) independent nature of the task. Note that in cinematic audio productions the number of potential object permutations can be very large, since these typically contain dozens of simultaneous objects. For this reason, permutation invariant training, used for speech and universal source separation, is impractical. Aggravating the problem, since audio engineers can combine several objects to present an audio event, some of the reference objects may only correspond to parts of real auditory events.

Fig. 1. Multichannel-based learning: supervised and unsupervised. For supervised learning, a set of multichannel mixes, rendered from a reference object-based production, are compared. In contrast, unsupervised learning only relies on the 5.1 mixes and does not require a reference object-based production.

To overcome these challenges, we propose an approach based on multichannel-based learning: our references for supervised learning are not objects, but rather the multichannel mixes rendered from those objects. Multichannel-based learning is inspired by the way humans evaluate cinematic productions, by which two object-based productions are considered to be similar if their renderings to multichannel layouts are also similar, even though the number of objects in the two mixes might differ. We propose to extract (i) a small number of objects, typically 1–3, corresponding to the most prominent auditory events; and (ii) a multichannel remainder, called "bed channels", containing the audio not embedded in the objects. Hence, we study source-independent deep learning models extracting up to 3 objects, in addition to the bed channels.

Unsupervised learning has also been useful to extract a fixed set of unknown objects from a mix. For example, low-rank approximation techniques, such as non-negative matrix factorization, have been quite successful at estimating a few object sources from a given mix [4–6]. Given the success of unsupervised methods in the past, we also explore an unsupervised version of the multichannel-based deep learning approach we propose. Our model relies on inductive biases, in the form of loss and architectural constraints, to facilitate disentangling audio objects, including position metadata, from the mix [11]. Our loss constraints impose a set of desired properties for the objects, and our architecture constraints rely on a fully-differentiable (but not learnable) renderer that converts the estimated objects and position metadata into a multichannel mix. As we further discuss in Sect. 2, the renderer (a fully-differentiable implementation [12] of Dolby Atmos [13]) enables multichannel-based learning, as opposed to object-based learning, because it dictates a fully-determined object-based format at the encoder's output. Finally, note that multichannel-based learning also bears resemblance to self-supervised learning [14, 15]: instead of directly learning from objects, our loss targets a proxy signal based on multichannel renders.

In the following, we introduce supervised and unsupervised multichannel-based learning (Sect. 2), explain the inductive biases we employ on both the architecture (Sect. 3) and the loss function (Sect. 4), and introduce a novel evaluation methodology based on Freesound Datasets data (Sect. 5).

Fig. 2. The multichannel-based audio object extraction model. For both inference and training, the learnable object extractor (encoder, see Fig. 3) extracts the objects and bed channels out of the input 5.1. For training, the non-trainable differentiable renderer decodes them to a number of layouts. For unsupervised training, the objective function is based on the 5.1 mixes only (blue boxes). For supervised training, other renders are additionally taken into account (blue and yellow boxes).

Fig. 3. The encoder, which extracts the objects and bed channels from a 5.1 mix. As in Fig. 2, the red box denotes the learnable part of the model, and green boxes represent the non-learnable differentiable digital signal processing parts.
2. MULTICHANNEL-BASED LEARNING
We design and build a neural network which, given a multichannel excerpt, extracts a fixed number of audio objects, position metadata, and a multichannel remainder (the bed channels). For simplicity, we assume a 5.1 input, although our method can be extended to any other multichannel input format. As introduced above, our training objective does not rely on object-based supervised learning. Instead, it is designed to learn from multichannel renders in a supervised or unsupervised fashion (see Fig. 2):
• Supervised learning. An object-based reference mix is required to render a set of pre-defined multichannel layouts (e.g., 2.0, 5.1, 7.1, 9.1). The obtained renders are used as references for a reconstruction loss defined in the multichannel layout domain. This loss consists of a weighted average of the reconstruction losses obtained for each multichannel format. This supervised configuration requires a training set of object-based productions, from which all reference multichannel renders are derived.
• Unsupervised learning. The unsupervised configuration can be understood from the perspective of an autoencoder, where the 5.1 input is encoded into a pre-defined latent space (the space of object-based productions, see Fig. 2) and decoded to reconstruct the 5.1 input. Hence, a training set of object-based productions is not required, since the unsupervised reconstruction loss is defined over 5.1 signals only.

Multichannel-based learning is enabled by the structure and the inductive biases of the model. For example, the neural network we employ dictates a fully-determined object-based format in the latent space, including, among others, the number of objects, the presence of bed channels, and the specifics of the renderer (see Figs. 2 and 3). Furthermore, the position, motion, and content of the extracted objects are further constrained by penalties added to the loss function (see Sect. 4). These penalties encourage the extracted objects to have an independent semantic meaning and to behave consistently with how an audio engineer would mix them. Hence, multichannel-based learning is enabled by the structured way we approach the problem: we enforce such structure via the architecture of the model and via additional regularization terms in the loss function.

Note that unsupervised learning can also be used to fit a specific multichannel excerpt. Fitting a specific 5.1 mix via unsupervised learning enables audio object extraction without requiring any training database. This "unsupervised fit" case can be seen as a 5.1-to-5.1 autoencoder that overfits a specific example, where the structure of the model and the guidance of the regularization loss terms tailor the model towards extracting meaningful audio objects in the latent space. Due to the intriguing properties of the "unsupervised fit" case, our unsupervised learning experiments focus on understanding the viability of this approach. Furthermore, supervised and "unsupervised fit" learning can be combined: a model can be pre-trained with an object-based dataset via supervised learning and then fine-tuned to a specific 5.1 excerpt with "unsupervised fit" training. We refer to this combined configuration as "fine-tuned".
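To make the two configurations concrete, the following is a minimal sketch of how the multichannel-based losses could be computed. This is not the authors' implementation: the `encoder` and `renderer` callables, the layout names, and the weight values are hypothetical placeholders standing in for the modules of Figs. 2 and 3.

```python
import tensorflow as tf

# Hypothetical per-layout weights; denser layouts weigh more (see Sect. 4).
LAYOUT_WEIGHTS = {"2.0": 0.5, "5.1": 1.0, "7.1": 2.0, "9.1": 2.0}

def reconstruction_loss(reference, estimate):
    # Mean absolute error between a reference render and the decoder
    # output for the same layout (waveforms or spectrograms).
    return tf.reduce_mean(tf.abs(reference - estimate))

def supervised_loss(reference_renders, encoder, renderer, mix_51):
    # reference_renders: dict layout -> render of the reference
    # object-based production (the ground-truth side of Fig. 2).
    objects, positions, beds = encoder(mix_51)
    total = 0.0
    for layout, weight in LAYOUT_WEIGHTS.items():
        estimate = renderer(objects, positions, beds, layout=layout)
        total += weight * reconstruction_loss(reference_renders[layout],
                                              estimate)
    return total / sum(LAYOUT_WEIGHTS.values())

def unsupervised_loss(encoder, renderer, mix_51):
    # Autoencoder view: encode the 5.1 into objects + beds, render back
    # to 5.1, and compare against the input itself.
    objects, positions, beds = encoder(mix_51)
    estimate = renderer(objects, positions, beds, layout="5.1")
    return reconstruction_loss(mix_51, estimate)
```

In the "unsupervised fit" configuration, `unsupervised_loss` would simply be minimized over many gradient steps on one fixed `mix_51` excerpt.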
3. MODEL ARCHITECTURE
Our model consists of an object extractor module (encoder) and a renderer module (decoder); see Fig. 2. The encoder (Fig. 3) performs audio object extraction and converts the 5.1 input into an object-based format. The goal of the decoder is to render the extracted objects and bed channels into multichannel mixes, enabling supervised or unsupervised multichannel-based learning.

The encoder (Fig. 3) is composed of (i) the mask-estimation block, a trainable deep neural network that estimates object and bed channel masks; and (ii) the remaining object-extraction blocks, which extract the audio objects (including position metadata) and bed channels out of the estimated masks. The mask-estimation block (i) operates over a 5.1 mel power spectrogram excerpt and extracts n object masks and a bed channel mask. Our source separation model is based on U-Net [16]; however, our method could also be extended to waveform-based models such as Wavenet [1, 17] or Wave-U-Net [18, 19]. The object-extraction blocks (ii) rely on differentiable digital signal processing layers to further process the object and bed channel masks for reconstructing the objects and bed channels. Important blocks among them are the de-panner and the de-trimmer. The de-panner extracts the position metadata from the 5.1 object audio; our current implementation is based on a differentiable digital signal processing layer, but it could also be replaced by a learnable deep neural network. The de-trimmer reverses the "trimming" process (the reduction of the object level for objects far from the frontal position, applied during rendering). The decoder is simply a fully-differentiable audio-object renderer, which renders the objects and bed channels to specific multichannel layouts. While all the layers of our model are differentiable, only the mask-estimation block (i) is trainable.

For inference, only the encoder is used: the outputs of the model are the n objects (including audio and spatial metadata) and the bed channels estimated by the encoder. For training, the objects and bed channels produced by the encoder are rendered by the decoder to a set of different layouts (2.0, 5.1, 7.1 and 9.1 for supervised learning, and 5.1 for unsupervised learning) to compute the reconstruction loss.

In our implementation, the entire model is written in TensorFlow, including both the trainable and non-trainable digital signal processing modules. The mask-estimation block implements the U-Net [16], with minimal adaptations in the final layer to generate 5.1 masks. The renderer is the Dolby Atmos renderer [13] expressed in differentiable form; the de-panner and de-trimmer also correspond to the Dolby Atmos renderer. The model operates on audio excerpts of 5.44 seconds at 48 kHz, with an FFT window length of 2048 samples, leading to audio patches of 256 time bins and 1025 frequency bins, which are grouped into 128 mel bands.
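To illustrate the DSP front-end, the following sketch computes the 5.1 mel power spectrogram with the parameters quoted above (2048-sample FFT at 48 kHz, 1025 frequency bins grouped into 128 mel bands), together with a toy energy-weighted de-panner. The hop size, the speaker coordinates, and the centroid-based position estimate are our assumptions for illustration; the actual de-panner inverts the Dolby Atmos panning law [13], which the text does not spell out.

```python
import tensorflow as tf

SAMPLE_RATE = 48000
FFT_LEN = 2048
HOP = 1024          # assumed hop; yields ~256 frames for a 5.44 s excerpt
N_MELS = 128

# Illustrative xy positions of the 5.1 speakers (L, R, C, Ls, Rs) in a
# unit-square room; LFE is excluded from position estimation.
SPEAKER_XY = tf.constant(
    [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [1.0, 1.0]])

def mel_spectrogram(x):
    # x: (channels, samples) waveform of one excerpt (LFE excluded).
    stft = tf.signal.stft(x, frame_length=FFT_LEN, frame_step=HOP,
                          fft_length=FFT_LEN)            # (ch, frames, 1025)
    power = tf.abs(stft) ** 2
    mel_mat = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=N_MELS, num_spectrogram_bins=FFT_LEN // 2 + 1,
        sample_rate=SAMPLE_RATE)
    return tf.tensordot(power, mel_mat, axes=1)          # (ch, frames, 128)

def depan(object_51):
    # Toy differentiable de-panner: per-frame position estimated as the
    # centroid of the speaker positions, weighted by per-channel energy.
    energy = tf.reduce_sum(mel_spectrogram(object_51), axis=-1)  # (ch, fr)
    weights = energy / (tf.reduce_sum(energy, axis=0, keepdims=True) + 1e-8)
    return tf.tensordot(tf.transpose(weights), SPEAKER_XY, axes=1)  # (fr, 2)
```

Because every operation above is differentiable, gradients from the reconstruction loss can flow through the position estimates back to the mask-estimation network, which is what allows the renderer and de-panner to remain non-trainable.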
4. TRAINING OBJECTIVES
We rely on two main training objectives: reconstruction losses, to match the content of the mix at the multichannel level, and regularization losses, a set of penalties that encourage the extracted objects to behave consistently.
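Schematically, the total objective can be written as follows (the notation is ours, introduced for clarity; the paper does not give an explicit formula):

\[
\mathcal{L} \;=\; \sum_{\ell \in \mathcal{R}} w_\ell \, \mathcal{L}^{(\ell)}_{\mathrm{rec}} \;+\; \sum_{i} \lambda_i \, \mathcal{L}^{(i)}_{\mathrm{reg}},
\]

where \(\mathcal{R}\) is the set of rendered layouts (2.0, 5.1, 7.1 and 9.1 for supervised learning; only 5.1 for unsupervised learning), \(w_\ell\) are the layout weights, and \(\lambda_i\) are the tunable weights of the regularization terms (i)–(vii) described below.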
Reconstruction losses — These are derived from the comparison between the reference renders/mixes and the outputs of the decoder. As discussed in Sects. 2 and 3, in multichannel-based supervised learning we compare several reference renders (2.0, 5.1, 7.1 and 9.1, rendered from the reference object-based production) to the corresponding decoder outputs, whereas in multichannel-based unsupervised learning we compare the reference 5.1 mix to the 5.1 output of the decoder (see Fig. 1). We obtained our best results when using the L loss. For supervised learning, the resulting reconstruction loss is computed as a weighted average of the losses obtained for each multichannel layout. We give a higher weight to denser layouts like 7.1 and 9.1, because they provide more spatial resolution and give better results in practice.

Fig. 4. Spectrograms (top) and xy trajectories (bottom) of a mono downmix of the 5.1 input, an extracted object, and its reference. Setup: supervised multichannel-based learning to extract one object.

Regularization losses — To motivate the need for these regularization terms, notice that, in the unsupervised case, the model could trivially minimize the reconstruction loss to zero by sending all content to the bed channels. In either the supervised or unsupervised case, the network can learn to minimize the cost function in ways that result in extracted objects that are not usable, e.g., objects moving from one position to another distant position without any continuity, or static objects close to the speaker positions of a 5.1 layout (behaving like bed channels). It is therefore necessary to bias the model towards solutions which correspond to the way object-based productions are expected to be. We found that a convenient way to do so is via regularization loss terms. The regularization loss terms we consider penalize: (i) bed channel content, to avoid trivial solutions and encourage object creation; (ii) objects close to the 5.1 loudspeaker positions, to discourage objects playing the role of bed channels; (iii) very slowly moving objects, to limit the number of static objects; (iv) very rapidly accelerating objects, to avoid "jumpy" objects moving instantaneously from one side of the room to the other; (v) objects close to each other; (vi) correlated object trajectories; and (vii) correlated content among objects, or between objects and bed channels, to avoid objects and bed channels sharing the same content. The relative weight of each of these loss penalties is a tunable hyperparameter of the model. By tuning the different loss weights, the model can be made to behave more or less aggressively at extracting objects. The scale of the different regularization loss terms is approximately between 0.1% and 100% of the reconstruction loss magnitude. We found it useful to apply the largest penalization to terms (i) and (iv); two of these penalties are sketched below.
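As an illustration, here is a minimal sketch of two of these penalties: the acceleration penalty (iv) and the content-correlation penalty (vii). The exact functional forms are not given in the text, so the finite-difference and correlation formulations below are assumptions.

```python
import tensorflow as tf

def acceleration_penalty(positions):
    # positions: (n_objects, frames, 2) xy trajectories. Penalizing the
    # second-order finite difference discourages "jumpy" objects that
    # teleport across the room (regularizer iv).
    accel = (positions[:, 2:, :] - 2.0 * positions[:, 1:-1, :]
             + positions[:, :-2, :])
    return tf.reduce_mean(tf.reduce_sum(accel ** 2, axis=-1))

def content_correlation_penalty(stems):
    # stems: (n_stems, samples) mono object signals, optionally including
    # a bed downmix. Penalizing pairwise correlation discourages stems
    # from sharing the same content (regularizer vii).
    z = stems - tf.reduce_mean(stems, axis=-1, keepdims=True)
    z = z / (tf.norm(z, axis=-1, keepdims=True) + 1e-8)
    corr = tf.matmul(z, z, transpose_b=True)          # (n, n) correlations
    off_diag = corr - tf.linalg.diag(tf.linalg.diag_part(corr))
    return tf.reduce_mean(tf.abs(off_diag))
```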
5. EXPERIMENT AND EVALUATION

5.1. Experiment methodology
Since we are not aware of a previously established methodology to evaluate object extraction methods, we propose two experiments, differing in the number of objects to be extracted (1 or 3). The task amounts to extracting 1 or 3 objects and the bed channels from a 5.1 mix, rendered from known objects that are available for evaluation. The first experiment consists of extracting objects from fifty 5.1 mixes containing 1 object plus bed channels, and the second experiment consists of extracting objects from another set of fifty 5.1 mixes containing 3 objects plus bed channels. The objects in the mixes have been created by assigning pseudo-random synthetic trajectories to real audio tracks representative of different sound categories appearing in cinematic mixes (vehicle sounds, special effects, musical instruments, voices, footsteps, etc.), obtained from Freesound Datasets [20–22]. These object-based excerpts also contain bed channels with real surround recordings. All objects and bed channels in a given excerpt are set to the same level. We use the 1-object and 3-object excerpts to evaluate two object extraction systems capable of extracting 1 and 3 objects, respectively. Examples of the reference spectrograms and trajectories, and of the signals inferred by the 1-object model, are shown in Fig. 4. To evaluate the performance of the object and bed channel extraction, we compute the scale-invariant signal-to-distortion ratio (SI-SDR) [23] of the extracted objects with respect to the ground truth audio objects, with the goal of evaluating how well multichannel-based learning methods can extract individual objects.

Fig. 5. Object extraction evaluation: 1-object experiment.
Fig. 6. Object extraction evaluation: 3-object experiment.

Since the order of the extracted objects is arbitrary, in the 3-object case we compute all possible permutations between the extracted objects and the references, and consider the permutation which gives the best median SI-SDR value over the three objects.

As a baseline for the object extraction evaluation, we consider a naive normalized mono downmix (L + R + C + Ls + Rs) for the 1-object experiment, and normalized downmixes to (2L + C), (2R + C) and (Ls + Rs) for the 3-object experiment. These baselines correspond to one of the best possible methods to extract distinct objects in the absence of source separation techniques. For the bed channels, the baseline consists of a bypass of the input 5.1 directly to the bed channels. In Figs. 5–8 we depict the SI-SDR improvement (SI-SDRi) with respect to the baseline (the optimal permutation of the 3-channel downmix baseline is also considered). As an upper reference, we consider the ideal binary mask (IBM). To assess the impact of the mel-scale masks, we compute the IBM with and without mel frequency band grouping.

We evaluate the performance of each model in three different training configurations: "supervised", "unsupervised fit", and "fine-tuned" (see Sect. 2). For the supervised and fine-tuned training configurations, the models are trained, or pre-trained, with object-based excerpts generated on the fly, created from pseudo-random procedurally-generated audio samples and synthetic trajectories, amounting to virtually infinite training and evaluation datasets. The model was trained using the Adam optimizer [24] with a batch size of 8.

5.2. Results

SI-SDRi object extraction results are reported in Figs. 5 and 6. For all three training configurations, both the 1- and 3-object models show a noticeable improvement with respect to the baseline (dashed line at 0 dB). Regarding the 1-object experiment, supervised training gives more modest SI-SDR improvements, but also more robust ones (with a smaller data variability). Further, "unsupervised fit" and "fine-tuned" (which explicitly overfit the 5.1 excerpt) deliver a notable performance, rather close to the ideal mask results, indicating that most of the time the model is able to estimate the right object trajectory and to extract its corresponding audio object. Regarding the 3-object experiment, there is also a noticeable improvement with respect to the baseline. However, our results are far from the IBM results, denoting the difficulty of separating three objects from a complex and dense mix. Informal listening reveals that, sometimes, under challenging conditions, the extracted objects tend to follow a given reference object for just some time, after which the extracted object changes to track another audio source. Additionally, the extracted objects do not include strong separation artifacts. This can be attributed to the fact that our model is based on spectral masks (filtering): it cannot generate any sound which is not already present in the original multichannel mix, so the only possible artifacts are those associated with time-frequency masking.

Fig. 7. Bed channels extraction evaluation: 1-object experiment.

Fig. 8. Bed channels extraction evaluation: 3-object experiment.

While in most cases the main interest is in evaluating the extracted objects, it is also interesting to evaluate the quality of the bed channels (see Figs. 7 and 8). In our 1-object experiments the extracted bed channels clearly outperform the baseline. However, in our 3-object experiments, the "unsupervised fit" and "fine-tuned" configurations cannot outperform the baseline. This result is illustrative, because it reveals the strengths and weaknesses of the "unsupervised fit" and "fine-tuned" approaches. While these achieve among the best results at object extraction by overfitting a specific 5.1 excerpt, the strong inductive biases we introduce (through the regularization losses and the architecture) result in an aggressive object extraction that can compromise the quality of the bed channels. This is particularly noticeable for the "unsupervised fit" case, which is trained from scratch to overfit a given 5.1 without additional training data.
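For reference, the evaluation protocol described above can be sketched as a standard SI-SDR computation [23] plus an exhaustive permutation search keeping the best median score, as used in the 3-object experiment. The array shapes and helper names are hypothetical.

```python
import itertools
import numpy as np

def si_sdr(estimate, reference):
    # Scale-invariant SDR in dB [23]: project the estimate onto the
    # reference, then compare target energy to residual energy.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference)
                                           + 1e-12)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + 1e-12)
                           / (np.sum(noise ** 2) + 1e-12))

def best_permutation_si_sdr(estimates, references):
    # estimates, references: (n_objects, samples). Try every assignment
    # of extracted objects to ground-truth objects and keep the
    # permutation with the best median SI-SDR (3-object evaluation).
    n = len(estimates)
    best = None
    for perm in itertools.permutations(range(n)):
        scores = [si_sdr(estimates[p], references[i])
                  for i, p in enumerate(perm)]
        if best is None or np.median(scores) > np.median(best):
            best = scores
    return best
```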
6. CONCLUSIONS AND DISCUSSION
We proposed a source-independent approach relying on strong inductive biases to learn from multichannel renders. The inductive biases we explore are based on architectural constraints (enforcing the bottleneck of our model to be a specific object-based format) and on additional regularization loss terms (enforcing objects to behave according to object-based production conventions). Multichannel-based learning can be formulated in a supervised or unsupervised fashion. We found that supervised multichannel-based learning delivers solid results, extracting useful and congruent objects, even though the model was optimized to approximate multichannel mixtures, not to extract audio objects. Interestingly, we also found that unsupervised multichannel-based learning can improve the results obtained by supervised multichannel-based learning. Specifically, we looked at the "unsupervised fit" configuration, which trains the model from scratch to overfit a given 5.1 without additional training data. While "unsupervised fit" results are among the best, its separations can be aggressive, harming the quality of the bed channels. Furthermore, it is slow, since the model needs to be optimized for each given example. Finally, we also found that combining supervised pre-training with "unsupervised fit" delivers results that are comparable to "unsupervised fit" alone.

7. REFERENCES

[1] Francesc Lluís, Jordi Pons, and Xavier Serra, "End-to-end music source separation: Is it possible in the waveform domain?," in Interspeech, 2019.
[2] Fabian-Robert Stöter, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji, "Open-Unmix – a reference implementation for music source separation," Journal of Open Source Software, vol. 4, no. 41, p. 1667, 2019.
[3] Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey, "Universal sound separation," in WASPAA, 2019.
[4] Axel Roebel, Jordi Pons, Marco Liuni, and Mathieu Lagrange, "On automatic drum transcription using non-negative matrix deconvolution and Itakura Saito divergence," in ICASSP, 2015.
[5] Yann Salaün, Emmanuel Vincent, Nancy Bertin, Nathan Souviraà-Labastie, Xabier Jaureguiberry, Dung T. Tran, and Frédéric Bimbot, "The flexible audio source separation toolbox version 2.0," in ICASSP, 2014.
[6] Yuki Mitsufuji, Marco Liuni, Alex Baker, and Axel Roebel, "Online non-negative tensor deconvolution for source detection in 3DTV audio," in ICASSP, 2014.
[7] Berkan Kadioglu, Michael Horgan, Xiaoyu Liu, Jordi Pons, Dan Darcy, and Vivek Kumar, "An empirical study of Conv-TasNet," in ICASSP, 2020.
[8] Yi Luo and Nima Mesgarani, "TasNet: time-domain audio separation network for real-time, single-channel speech separation," in ICASSP, 2018.
[9] Yuzhou Liu and DeLiang Wang, "Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.
[10] Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, and John R. Hershey, "Unsupervised sound separation using mixtures of mixtures," in NeurIPS, 2020.
[11] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem, "Challenging common assumptions in the unsupervised learning of disentangled representations," in ICML, 2019.
[12] Jesse Engel, Chenjie Gu, Adam Roberts, et al., "DDSP: Differentiable digital signal processing," in ICLR, 2020.
[13] Mark R. Thomas and Charles Q. Robinson, "Amplitude panning and the interior pan," in Audio Engineering Society Convention 143, 2017.
[14] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP, 2020.
[15] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Interspeech, 2019.
[16] A. Jansson, Eric J. Humphrey, N. Montecchio, Rachel M. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in ISMIR, 2017.
[17] Dario Rethage, Jordi Pons, and Xavier Serra, "A Wavenet for speech denoising," in ICASSP, 2018.
[18] Daniel Stoller, Sebastian Ewert, and Simon Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in ISMIR, 2018.
[19] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach, "Music source separation in the waveform domain," in ICLR, 2020.
[20] Eduardo Fonseca, Jordi Pons Puig, Xavier Favory, Frederic Font Corbera, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: a platform for the creation of open audio datasets," in ISMIR, 2017.
[21] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, and Xavier Serra, "General-purpose tagging of Freesound audio with AudioSet labels: Task description, dataset, and baseline," in Detection and Classification of Acoustic Scenes and Events (DCASE), 2018.
[22] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra, "FSD50K: an open dataset of human-labeled sound events," arXiv:2010.00475, 2020.
[23] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey, "SDR – half-baked or well done?," in ICASSP, 2019.
[24] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.