Designing Neural Speaker Embeddings with Meta Learning

Manoj Kumar, Member, IEEE, Tae Jin-Park, Member, IEEE, Somer Bishop, and Shrikanth Narayanan, Fellow, IEEE
Abstract—Neural speaker embeddings trained using classification objectives have demonstrated state-of-the-art performance in multiple applications. Typically, such embeddings are trained on an out-of-domain corpus on a single task, e.g., speaker classification, albeit with a large number of classes (speakers). In this work, we reformulate embedding training under the meta-learning paradigm. We redistribute the training corpus as an ensemble of multiple related speaker classification tasks, and learn a representation that generalizes better to unseen speakers. First, we develop an open-source toolkit to train x-vectors that is matched in performance with pre-trained Kaldi models for speaker diarization and speaker verification applications. We find that different bottleneck layers in the architecture variedly favor different applications. Next, we use two meta-learning strategies, namely prototypical networks and relation networks, to improve over the x-vector embeddings. Our best performing model achieves a relative improvement of 12.37% and 7.11% in speaker error on the DIHARD II development corpus and the AMI meeting corpus, respectively. We analyze improvements across different domains in the DIHARD corpus. Notably, on the challenging child speech domain, we study the relation between child age and diarization performance. Further, we show reductions in equal error rate for speaker verification on the SITW corpus (7.68%) and the VOiCES challenge corpus (8.78%). We observe that meta-learning particularly offers benefits in challenging acoustic conditions and recording setups encountered in these corpora. Our experiments illustrate the applicability of meta-learning as a generalized learning paradigm for training deep neural speaker embeddings.
I. INTRODUCTION
Audio speaker embeddings refer to fixed-dimensional vector representations extracted from variable duration audio utterances and assumed to contain information relevant to speaker characteristics. In the last decade, speaker embeddings have emerged as the most common representations used for speaker-identity relevant tasks such as speaker diarization (speaker segmentation followed by clustering: who spoke when?) [1] and speaker verification (does an utterance pair belong to the same speaker?) [2]. Such applications are relevant across a variety of domains such as voice biometrics [3], [4], automated meeting analysis [5], [6], and clinical interaction analysis [7], [8]. Recent technology evaluation challenges [9]–[12] have drawn attention to these domains by incorporating natural and simulated in-the-wild speech corpora exemplifying the many diverse technical facets that need to be addressed.
M. Kumar, T. J. Park and S. Narayanan are with the Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA (e-mail: [email protected]; [email protected]; [email protected]). S. Bishop is with the Department of Psychiatry, University of California, San Francisco, USA (e-mail: [email protected]).
While initial efforts toward training speaker embeddings had focused on generative modeling [13], [14] and factor analysis [15], deep neural network (DNN) representations extracted at bottleneck layers have become the standard choice in recent works. The most widely used representations are trained using a classification loss (d-vectors [16], x-vectors [17], [18]), while other training objectives such as triplet loss [19], [20] and contrastive loss [21] have also been explored. More recently, end-to-end training strategies [22]–[24] have been proposed for speaker diarization to address the mismatch between training objective (classification) and test setup (clustering, speaker selection, etc.).

A common factor in the classification formulation is that all the speakers from the training corpora are used throughout the training process for the purpose of loss computation and minimization. Typically, categorical cross-entropy is used as the loss function. While the number of speakers (classes) can often be large in practice ($O(10^3)$), the classification objective represents a single task, i.e., the same speaker set is used to minimize cross-entropy at every training minibatch. This entails limited task diversity during the training process and offers scope for training better speaker-discriminative embeddings by introducing more tasks. We note that a few approaches exist which introduce multiple objectives for embedding training, such as metric learning with cross entropy [25], [26] and speaker classification with domain adversarial learning [27], [28]. While these approaches demonstrate improvements over a single training objective, the speaker set is often common across objectives (except in domain adversarial training, where target speaker labels are assumed unavailable).

In this work we use the classification framework while training neural speaker embeddings; however, we decompose the original classification task into multiple tasks wherein each training step optimizes on a new task. A common encoder is learnt over this ensemble of tasks and used for extracting speaker embeddings during inference. At each step of speaker embedding training, we construct a new task by sampling speakers from the training corpus. For a large training speaker set available in typical training corpora, generating speaker subsets results in a large number of tasks. This provides a natural regularization to prevent task over-fitting. Our approach is inspired by the meta-learning [29] paradigm, also known as learning to learn. Meta-learning optimizes at two levels: within each task and across a distribution of tasks [30]. This is in contrast to conventional supervised learning, which optimizes a single task over a distribution of samples. In addition to benefits from increased task variability, meta-learning has demonstrated success in unseen classes [30]–[32]. This forms a natural fit for applications such as speaker diarization and speaker verification, which often evaluate on speakers unseen during embedding training.

We compare our meta-learned models with x-vectors, which have established state-of-the-art performance in multiple applications [17], [18] including recent evaluation challenges such as DIHARD [33] and VOiCES [10]. First, we develop a competitive wide-band x-vector baseline using the PyTorch toolkit (calibrated with identical performance to the Kaldi Voxceleb recipe). Next, we use two different metric-learning objectives to meta-learn the speaker embeddings: prototypical networks and relation networks.
While both approaches share the task sampling strategy during the training phase, they differ in the choice of the comparison metric between samples. We evaluate our approaches on two different applications, speaker diarization and speaker verification, to illustrate the generalized speaker discriminability of meta-learned embeddings.

The contributions of this work are as follows: we develop new speaker embeddings using meta-learning that are not restricted to a single application. Within each application, we demonstrate improvements using multiple corpora obtained under controlled as well as naturalistic speech interaction settings. Furthermore, we identify conditions where meta-learning demonstrates benefits over the conventional cross-entropy paradigm. We analyze diarization performance across different domains in the DIHARD corpora. We also consider the special case of the impact of child age groups using internal child-adult interaction corpora from the Autism domain. We study the effect of data collection setups (near-field, far-field and obstructed microphones) and the level of degradation artifacts on speaker verification performance. While we present results using prototypical networks and relation networks, the proposed framework is independent of the specific metric-learning approach and hence offers scope for incorporating non-classification objectives such as clustering. It should be noted, however, that the application of relation networks has not been explored in prior speaker embedding research. Finally, we present an open-source implementation of our work, including x-vector baselines, based on a generic machine learning toolkit (PyTorch).

II. BACKGROUND
A. Meta-Learning for Task Generalization
Early works on meta-learning focused on adaptive learning strategies such as combining gradient descent with evolutionary algorithms [34], [35], learning gradient updates using a meta-network [36] and using biologically inspired constraints for gradient descent [37], [38]. Recent meta-learning approaches have addressed the issue of rapid generalization in deep learning by learning to learn for a new task [30]–[32]. This concept is inspired by the human ability to learn using a handful of examples. For instance, children learn to recognize a new animal when presented with a few images, as opposed to conventional DNNs, which require thousands of samples for a new class. The ability to quickly generalize to unseen classes is achieved by generating diversity in training tasks, for instance by using different sets of classes at each training step (see Fig. 1 in [30]). Further, the classification setup (in terms of number of classes and samples per class) is controlled to match with that of the test task [39]. Meta-learning has been successfully applied to achieve task generalization in computer vision [30], [31], [39] and more recently in natural language processing [40]–[42]. Drawing parallels with the above applications, we train speaker embeddings with a large number of speaker classification tasks to improve over the conventional model, which uses a single classification task. Since speaker sets differ between training steps, we replace the conventional softmax nonlinearity and cross-entropy loss combination with metric learning objectives used in previous meta-learning works [39], [43]–[45].

https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb
https://github.com/manojpamk/pytorch_xvectors

B. Meta-Learning Speaker Embeddings
A few recent approaches have used a variant of meta-learning to train speaker embeddings, specifically the metric-learning objective from prototypical networks (protonets). In [46], the authors extend the angular softmax objective to protonets and compare with various metric learning approaches for speaker verification. Across different architectures, the angular prototypical loss outperforms other methods, including the conventional softmax objective. The authors in [47] applied protonets for short utterance speaker recognition and introduced global prototypes that mitigate the need for class sampling. In related applications, [48] and [49] used protonets for small footprint speaker verification and few-shot speaker classification, respectively. In [50], the protonet loss was compared with triplet loss and evaluated on (open and closed set) speaker ID and speaker verification tasks. However, previous approaches seldom compare embeddings trained using protonets with existing benchmarks based on x-vectors, except for [48], where a modified architecture was used owing to the nature of the task. Further, the class sampling strategy is not always used with protonets (e.g., [46], [47]), which might inhibit task diversity during training. An exception from the above metric-learning approaches is [51], where the authors train deep speaker embeddings using the model-agnostic meta-learning strategy to mitigate domain mismatch for speaker verification. To the best of our knowledge, meta-learning is yet to be applied for general-purpose speaker diarization, except for the specific case of dyadic speaker clustering in child-adult interactions in our recent work [52].

III. METHODS
In this section, we introduce the meta-learning setup for neural embedding training, followed by a description of the two metric-learning approaches adopted in this work: prototypical networks and relation networks. Following that, we outline their use in our tasks, speaker diarization and speaker verification, including a description of the choice of clustering algorithm.

Consider a training corpus where C denotes the set of unique speakers, and where each speaker has multiple utterances available. Typically, |C| is a large integer ($O(10^3)$). Here, an utterance might be in the form of a raw waveform or frame-level features such as MFCCs or a Mel spectrogram. Under the meta-learning setup, each episode (a training step, equivalent to a minibatch) consists of two stages of sampling: classes, and utterances conditioned on classes. First, a subset of classes L (speakers) is sampled from C within an episode, with the number of speakers per episode |L| typically held constant during the training process. Next, two disjoint sets from each speaker in L are sampled without replacement from the set of all utterances belonging to that speaker: supports S and queries Q. Within an episode, supports and queries are used for model training and loss computation, respectively, similar to train and test sets in supervised training. This process continues across a large number of episodes with speakers and utterances sampled as explained above. Following terminology from Section I, an episode is equivalent to a task, wherein the model learns to classify speakers from that task. Hence, meta-learning optimizes across tasks, treating each task as a training example. The optimization process is given as:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{L}\!\left[\mathbb{E}_{S,Q}\!\left[\mathbb{E}_{(\mathbf{x},y)\in Q}\left[\log p_{\theta}(y \mid \mathbf{x}, S)\right]\right]\right] \qquad (1)$$

Here, θ denotes the trainable parameters of the neural network, and (x, y) represents an utterance and its corresponding speaker label. In contrast, conventional supervised learning optimizes:

$$\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{B}\!\left[\mathbb{E}_{(\mathbf{x},y)\in B}\left[\log p_{\theta}(y \mid \mathbf{x})\right]\right] \qquad (2)$$

where B denotes a minibatch. Meta-learning approaches are broadly categorized based on the characterization of $p_{\theta}(y \mid \mathbf{x})$: model-based [53], metric-based [44] and optimization-based meta-learning [31]. Of interest in this work are metric-based approaches, where $p_{\theta}(y \mid \mathbf{x})$ is a potentially learnable kernel function between utterances from S and Q. The reasoning is as follows: speaker embeddings trained for classification are bottleneck representations, and the latter is directly optimized using task performance in metric-learning approaches. We now describe the two metric-learning approaches used in this work: prototypical networks and relation networks.
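To make the two-stage sampling concrete, the following is a minimal sketch of episode construction; all names here (e.g., `sample_episode`, `utterances_by_speaker`) are illustrative rather than part of our released toolkit, and each speaker is assumed to have at least `n_support + n_query` utterances.

```python
import random

def sample_episode(utterances_by_speaker, n_way, n_support, n_query):
    """Draw one meta-learning episode (task).

    utterances_by_speaker: dict mapping speaker id -> list of utterance features.
    Returns (supports, queries), each a list of (utterance, episode_label) pairs.
    """
    # Stage 1: sample the speaker subset L from the full training set C.
    speakers = random.sample(list(utterances_by_speaker), n_way)
    supports, queries = [], []
    for label, spk in enumerate(speakers):
        # Stage 2: sample disjoint supports and queries for this speaker.
        utts = random.sample(utterances_by_speaker[spk], n_support + n_query)
        supports += [(u, label) for u in utts[:n_support]]
        queries += [(u, label) for u in utts[n_support:]]
    return supports, queries
```

Because speakers are relabeled 0, ..., |L|−1 within each episode, the loss never depends on a fixed global speaker set, which is what distinguishes this setup from minibatch cross-entropy training over a single classification task.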
A. Prototypical Networks

Protonets learn a non-linear transformation where each class is represented by a single point in the embedding space, namely the centroid (prototype) of training utterances from that class. During inference, a test sample is assigned to the class of the nearest centroid, similar to the nearest class mean method [54].

At training time, consider an episode t, with the support set ($S_t$) and the query set ($Q_t$) sampled as explained above. Supports are used for prototype computation, while queries are used for estimating class posteriors and the loss value. The prototype ($\mathbf{v}_c$) for each class is computed as follows:

$$\mathbf{v}_{c} = \frac{1}{|S_{t,c}|} \sum_{(\mathbf{x}_{i}, y_{i}) \in S_{t,c}} f_{\theta}(\mathbf{x}_{i}) \qquad (3)$$

$f_{\theta} : \mathbb{R}^{M} \rightarrow \mathbb{R}^{P}$ represents the protonet. $\mathbf{x}_i$ represents an M-dimensional utterance representation extracted using a DNN. $S_{t,c}$ is the set of all utterances in $S_t$ belonging to class c. For every test utterance $\mathbf{x}_j \in Q_t$, the posterior probability is computed by applying a softmax activation over the negative distances to the prototypes:

$$p_{\theta}(y_{j} = c \mid \mathbf{x}_{j}, S_{t}) = \frac{\exp\left(-d(f_{\theta}(\mathbf{x}_{j}), \mathbf{v}_{c})\right)}{\sum_{c' \in L} \exp\left(-d(f_{\theta}(\mathbf{x}_{j}), \mathbf{v}_{c'})\right)} \qquad (4)$$

d represents the distance function. Squared Euclidean distance was proposed in the original formulation [39] due to its interpretability as a Bregman divergence [55] as well as supporting empirical results. For the above reasons, we adopt squared Euclidean distance as the metric in this work. The negative log-posterior is treated as the episodic loss function and minimized using gradient descent:

$$\text{Loss} = -\frac{1}{|Q_{t}|} \sum_{(\mathbf{x}_{j}, y_{j}) \in Q_{t}} \log p_{\theta}(y_{j} \mid \mathbf{x}_{j}, S_{t}) \qquad (5)$$
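Eqs. (3)–(5) translate directly into a few lines of PyTorch. The sketch below assumes `f_theta` is the protonet encoder of Fig. 1(c) and that labels are episode-local integer tensors; it mirrors the loss we use but is not a verbatim excerpt of our toolkit.

```python
import torch
import torch.nn.functional as F

def prototypical_loss(f_theta, supports, support_labels, queries, query_labels, n_way):
    """supports: (Ns, M) features; queries: (Nq, M); labels: LongTensors in [0, n_way)."""
    s_emb = f_theta(supports)               # (Ns, P) support embeddings
    q_emb = f_theta(queries)                # (Nq, P) query embeddings
    # Eq. (3): per-class prototypes as centroids of support embeddings.
    protos = torch.stack([s_emb[support_labels == c].mean(dim=0)
                          for c in range(n_way)])
    # Eq. (4): softmax over negative squared Euclidean distances.
    d2 = torch.cdist(q_emb, protos, p=2) ** 2          # (Nq, n_way)
    log_post = F.log_softmax(-d2, dim=1)
    # Eq. (5): mean negative log-posterior over the query set.
    return F.nll_loss(log_post, query_labels)
```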
B. Relation Networks

Relation networks compare supports and queries by learning the kernel function simultaneously with the embedding space [43]. In contrast with protonets, which use squared Euclidean distance, relation networks learn a more complex inductive bias by parameterizing the comparison metric using a neural network. Hence, relation networks attempt to jointly learn the embedding and the metric over an ensemble of tasks such that they generalize to an unseen task. Specifically, there exist two modules: an encoder network that maps utterances into fixed-dimensional embeddings, and a comparison network that computes a scalar relation given pairs of embeddings. Given supports $S_t$ within an episode t, the class representation is taken as the sum of all support embeddings:

$$\mathbf{v}_{c} = \sum_{(\mathbf{x}_{i}, y_{i}) \in S_{t,c}} f_{\theta}(\mathbf{x}_{i}) \qquad (6)$$

$f_{\theta}$ represents the encoder network. For each query embedding $f_{\theta}(\mathbf{x}_j)$, its relation score $r_{c,j}$ with training class c is computed using the comparison network $g_{\phi}$ as follows:

$$r_{c,j} = g_{\phi}\left([\mathbf{v}_{c}, f_{\theta}(\mathbf{x}_{j})]\right) \qquad (7)$$

Here, $[\cdot, \cdot]$ represents the concatenation operation. The original formulation of relation networks [43] treated the relation score as a similarity measure; hence $r_{c,j}$ is trained to match:

$$r_{c,j} = \begin{cases} 1, & \text{if } y_{j} = c \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

In the original formulation [43], the networks $f_{\theta}$ and $g_{\phi}$ were jointly optimized using the mean squared error (MSE) objective, since the predicted relation score was treated similarly to a linear regression output. In this work, we replace MSE with the conventional cross-entropy objective based on empirical results. Hence the posterior probability is computed as:

$$p_{\theta}(y_{j} = c \mid \mathbf{x}_{j}, S_{t}) = \frac{\exp(r_{c,j})}{\sum_{c' \in L} \exp(r_{c',j})} \qquad (9)$$

and the loss function is computed using Eq. 5.
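A minimal sketch of Eqs. (6), (7) and (9) follows. The comparison network sizes match those reported later in Section V-B (1024-unit input, 512-unit hidden layer, scalar output); the module and function names themselves are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Comparison(nn.Module):
    """g_phi of Eq. (7): maps a concatenated (class repr, query) pair to a scalar."""
    def __init__(self, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 512), nn.ReLU(),
            nn.Linear(512, 1))

    def forward(self, pairs):                  # pairs: (N, 2*emb_dim)
        return self.net(pairs).squeeze(-1)     # relation scores r

def relation_loss(f_theta, g_phi, supports, support_labels, queries, query_labels, n_way):
    s_emb, q_emb = f_theta(supports), f_theta(queries)
    # Eq. (6): class representation as the *sum* of support embeddings.
    reprs = torch.stack([s_emb[support_labels == c].sum(dim=0)
                         for c in range(n_way)])
    nq = q_emb.size(0)
    # Concatenate every query with every class representation.
    pairs = torch.cat([reprs.unsqueeze(0).expand(nq, -1, -1),
                       q_emb.unsqueeze(1).expand(-1, n_way, -1)], dim=-1)
    scores = g_phi(pairs.reshape(nq * n_way, -1)).reshape(nq, n_way)   # Eq. (7)
    # Eq. (9): cross-entropy over relation scores, replacing the original MSE.
    return F.cross_entropy(scores, query_labels)
```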
C. Use in Speaker Applications

1) Speaker Diarization: Typically, there exist four steps in a speaker diarization system: speech activity detection, speaker segmentation, embedding extraction and speaker clustering (exceptions include recently proposed end-to-end approaches [23], [24]). In this work, we adopt a uniform segmentation strategy similar to [33], [56], wherein the session is segmented into equal duration segments with overlap. Meta-learned embeddings are extracted from these segments, followed by clustering. We use a recently proposed variant of spectral clustering [57] which uses a binarized version of the affinity matrix between speaker embeddings. The binarization is expressed using a parameter (p) which represents the fraction of non-zero values in every row of the affinity matrix. The clustering algorithm attempts a tradeoff between pruning excessive connections in the affinity matrix (minimizing p) while increasing the normalized maximum eigengap (NME; $g_p$), where the latter is expressed as a function of p (Eq. (10) in [57]). The ratio $p/g_p$ is then minimized to estimate the number of resulting clusters (i.e., speakers) in a session. This process is referred to as binarized spectral clustering with normalized maximum eigengap (NME-SC).

Our choice of NME-SC in this work is motivated by two reasons: (1) We do not require a separate development set to estimate a threshold parameter, as used in the more common agglomerative hierarchical clustering (AHC) method with average linking applied on distances estimated using probabilistic linear discriminant analysis (PLDA) [33]. We choose the binarization parameter (p) for each session by optimizing $p/g_p$ over a pre-determined range for p. (2) Empirical results demonstrate similar performance between AHC tuned on a development set and NME-SC, as reported in [57] and in this work.
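The following sketch illustrates the speaker-count estimation underlying NME-SC under simplifying assumptions: the exact definitions of $g_p$ and the search procedure follow Eq. (10) of [57], whereas the normalization used here is a simplified stand-in, and the function assumes more segments than `max_speakers`.

```python
import numpy as np

def nme_speaker_count(affinity, p_grid, max_speakers=10):
    """Estimate the number of speakers via an NME-SC-style criterion (sketch).

    affinity: (n, n) cosine-affinity matrix between segment embeddings.
    p_grid: candidate values for the row-wise binarization parameter p.
    """
    n = affinity.shape[0]
    best_ratio, best_k = np.inf, 1
    for p in p_grid:
        k = max(1, int(p * n))
        # Keep the k largest affinities per row (binarization), then symmetrize.
        A = np.zeros_like(affinity)
        idx = np.argsort(affinity, axis=1)[:, -k:]
        np.put_along_axis(A, idx, 1.0, axis=1)
        A = np.maximum(A, A.T)
        # Unnormalized graph Laplacian and its smallest eigenvalues.
        L = np.diag(A.sum(axis=1)) - A
        eigvals = np.sort(np.linalg.eigvalsh(L))
        gaps = np.diff(eigvals[:max_speakers + 1])
        g_p = gaps.max() / (eigvals[max_speakers] + 1e-10)  # simplified NME
        ratio = p / (g_p + 1e-10)                            # trade-off to minimize
        if ratio < best_ratio:
            best_ratio, best_k = ratio, int(np.argmax(gaps)) + 1
    return best_k
```

The estimated count then feeds standard spectral clustering on the binarized affinity matrix to assign segments to speakers.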
2) Speaker Verification:
We use the standard protocol for speaker verification, wherein a speaker embedding is extracted from the entire utterance. Subsequently, the embeddings are reduced in dimension using LDA, and trial pairs are scored using a PLDA model trained on the same data used to train the embeddings. Following this, target/imposter pairs are determined using a threshold on the PLDA scores.
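A sketch of this backend is shown below, using scikit-learn's LDA for the dimension reduction and length normalization as a standard pre-PLDA step. The PLDA scorer is treated as a given object: `plda.score` is a hypothetical interface, and PLDA training itself is omitted.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_norm(x):
    """Project embeddings to the unit sphere, a common pre-PLDA step."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def prepare_backend(train_emb, train_spk, lda_dim=200):
    """Fit the LDA projection applied before PLDA scoring."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_emb, train_spk)
    return lda

def score_trials(lda, plda, enroll_emb, test_emb):
    """Return one PLDA log-likelihood-ratio score per trial pair."""
    e = length_norm(lda.transform(enroll_emb))
    t = length_norm(lda.transform(test_emb))
    # plda.score is assumed given (e.g., a trained PLDA model's pair scorer).
    return np.array([plda.score(a, b) for a, b in zip(e, t)])
```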
IV. DATASETS
Since we evaluate meta-learned embeddings on two applications, speaker diarization and speaker verification, we use different corpora commonly used in evaluating these respective applications. We choose corpora obtained from both controlled and naturalistic settings, with the former generally assumed relatively free from noise, reverberation and babble. We further choose additional corpora to assist with application-specific analysis of performance, such as the effect of domains and speaker characteristics (age) on diarization error rate (DER), and of channel conditions on equal error rate (EER). A summary of the corpora used in this work is presented in Table I. Below, we provide details for each corpus.
A. Voxceleb
The Voxceleb corpus [21] consists of YouTube videos and audio of speech from celebrities with a balanced gender distribution. Over a million utterances from approximately 7,300 speakers are available across its two releases (Vox1 and Vox2). We use the dev and test splits from Vox2 and the dev split from Vox1 for embedding training. The test split from Vox1 is reserved for speaker verification. There exists no speaker overlap between the train set and the Vox1-test set.

TABLE I
Overview of training and evaluation corpora

                        Evaluation
Training     Speaker Diarization   Speaker Verification
Vox2         AMI                   Vox1 test
Vox1 dev     DIHARD II dev         VOiCES
             ADOS-Mod3             SITW

B. VOiCES
The VOiCES corpus [10] was released as part of the VOiCES from a distance challenge. It consists of clean audio (from the Librispeech corpus [58]) played inside multiple room configurations and recorded with microphones of different types placed at different locations in the room. In addition, various distractor noise signals were played along with the source audio to simulate acoustically challenging conditions for speaker and speech recognition. Furthermore, the audio source was rotated in its position to simulate a real person. We use the evaluation portion of the corpus, which is expected to contain more challenging room configurations [59] than the development portion.

C. SITW
The Speakers in the Wild (SITW) corpus [60] was released as part of the SITW speaker recognition challenge. It consists of in-the-wild audio collected from a diverse range of recording and background conditions. In addition to speaker identities, the utterances are manually annotated for gender, extent of degradation, microphone type and other noise conditions in order to aid analysis. A subset of the utterances also includes multiple speakers, with timing information available for the speaker with the longest duration. A handful of speakers in the SITW corpus are known to overlap with the Voxceleb corpus. In this work, we remove the utterances corresponding to these speakers before evaluation. Details of the corpora used in speaker verification are provided in Table II.

D. AMI
The AMI Meeting corpus consists of over 100 hours of office meetings recorded in four different locations. The meetings are recorded using both close-talk and far-field microphones; we use the former for diarization. Since each speaker has an individual channel, we beamformed the audio into a single channel. We follow [61], [62] for splitting the sessions into the dev and eval partitions, ensuring that no speakers overlap between them. For our purposes, the AMI sessions represent audio collected in noise-free recording conditions.

https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2
https://voices18.github.io/
~vgg/data/Voxceleb/SITW_overlap.txt
http://groups.inf.ed.ac.uk/ami/corpus/

Fig. 1. Overview of baseline and meta-learning architectures. (a) A time-delay layer F(N, D, K), which forms the basic component across models. At each time-step, activations from the previous layer are computed using a context width of K and a dilation of D. N represents the output embedding dimension. (b) Baseline x-vector model. Kaldi speaker embeddings are extracted at the fc1 layer. We find that fc2 and fc1 embeddings perform better for speaker diarization and speaker verification, respectively. (c) Prototypical network architecture. Layers marked with a dashed boundary are initialized with pre-trained x-vector models, while layers with a solid boundary are randomly initialized. The final layer output is referred to as protonet embeddings. (d) Relation encoder architecture. The final layer output is referred to as relation network embeddings. Relation scores are computed using these embeddings as illustrated in Fig. 2(b).

TABLE II
Statistics of corpora used for speaker verification, including trial subsets created for analysis purposes

Corpus       # Speakers   # Utterances   # Trials (# target)
Vox1 test    40           4715           38K (19K)
VOiCES       100          11392          3.6M (36K)
  close mic  98           1076           0.84M (8.5K)
  far mic    96           1006           0.78M (7.9K)
  obs mic    96           1006           0.77M (7.9K)
SITW         151          1006           0.50M (3K)
  low deg    150          998            0.16M (735)
  high deg   151          1003           0.20M (1.2K)
E. DIHARD
The DIHARD speaker diarization challenges [63] were introduced in order to focus on hard diarization tasks, i.e., in-the-wild data collected in naturalistic background conditions. In this work, we use the development set from the second DIHARD challenge. This corpus consists of data from multiple domains such as clinical interviews, audiobooks, broadcast news, etc. We make use of the 192 sessions in the single-channel task in this work. It is worth noting that a handful of sessions in this corpus contain only a single speaker.
TABLE III
Statistics of corpora used for speaker diarization
One of the most challenging domains in the DIHARD evaluations included speech collected from children. Speaker diarization for these interactions involves additional complexities for two reasons: (1) an intrinsic variability in child speech owing to developmental factors [64], [65], and (2) speech abnormalities due to an underlying neuro-developmental disorder such as autism. To this end, we use 173 child-adult interactions consisting of excerpts from the administration of module 3 of the ADOS (Autism Diagnostic Observation Schedule) [66]. These interactions involve children with sufficiently developed linguistic skills, i.e., the ability to form complete sentences. All the children in this study had a diagnosis of autism spectrum disorder (ASD) or attention deficit hyperactivity disorder (ADHD). The sessions were collected from two different locations and manually annotated using the SALT transcription guidelines. Details of the corpora used for speaker diarization are provided in Table III.
Fig. 2. (a) Illustration of the training step in prototypical networks. Decision regions are indicated using background colors. For each class, prototypes are estimated as the centroid of supports (filled shapes). Given a query (unfilled shape), negative distances to each prototype are treated as logits. Adapted from [52]. (b) Comparison module in relation networks. The sum of support embeddings from class c ($\mathbf{v}_c$) is concatenated with a query embedding ($f_{\theta}(\mathbf{x}_j)$) and input to the comparison network. $r_{c,j}$ is the relation score for query $\mathbf{x}_j$ with respect to class c and is treated as the logit.

V. EXPERIMENTS AND RESULTS
A. Baseline Speaker Embeddings
In order to select a competitive and fair baseline for the meta-learned embeddings, we first developed our own implementation of x-vectors. Our model is similar to the Kaldi Voxceleb recipe with respect to training corpora and network architecture. We compare the reported performance of Kaldi embeddings with our implementation and select the best performing model as the baseline system.

As mentioned in Section IV-A, we use the Vox2 and Vox1-dev corpora for embedding training. Similar to the Kaldi recipe, we extract 30-dimensional MFCC features using a frame width of 25 ms and an overlap of 15 ms. We augment the training data with noise, music and babble speech using the MUSAN corpus [67], and with reverberation using the RIR_NOISES corpus. The augmented data consist of 7323 speakers and 2.2M utterances. Following this, all utterances shorter than 4 seconds in duration and all speakers with fewer than 8 utterances each are removed to assist the training process. Cepstral mean normalization using a sliding window of 3 seconds was performed to remove channel effects.

The model architecture consists of 5 time-delay layers, which model temporal context information, followed by a statistics pooling layer that maps into an utterance-level vector. This is followed by two feed-forward bottleneck layers with 512 units in each layer, and a final layer which outputs speaker posterior probabilities. In contrast with the Kaldi implementation, we use the Adam optimizer (β₁ = 0.9, β₂ = 0.99) to train the model, with an initial learning rate of 1e-3. The learning rate is increased to 2e-3 and then progressively reduced to 1e-6. Dropout and batch normalization are used at all layers for regularization. A minibatch of 32 samples is used at each iteration, while ensuring that utterances in each minibatch are of fixed duration to improve the training process. We accumulated gradients for every 4 minibatches before back-propagation, which was observed to improve model convergence.

https://kaldi-asr.org/models/m7
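For concreteness, a minimal PyTorch sketch of this architecture is shown below, with the time-delay layers realized as dilated 1-D convolutions following Fig. 1. Layer widths, contexts and dilations follow the figure and text; the code itself is illustrative rather than an excerpt of our released toolkit.

```python
import torch
import torch.nn as nn

class StatsPool(nn.Module):
    """Concatenate per-channel mean and std over time: (B, C, T) -> (B, 2C)."""
    def forward(self, x):
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

class XVector(nn.Module):
    def __init__(self, feat_dim=30, num_speakers=7323, emb_dim=512):
        super().__init__()
        # Five time-delay layers F(N, D, K) as dilated Conv1d blocks (Fig. 1a),
        # given here as (in_channels, out_channels, context K, dilation D).
        specs = [(feat_dim, 512, 5, 1), (512, 512, 5, 2), (512, 512, 7, 3),
                 (512, 512, 1, 1), (512, 1500, 1, 1)]
        layers = []
        for cin, cout, k, d in specs:
            layers += [nn.Conv1d(cin, cout, k, dilation=d),
                       nn.ReLU(), nn.BatchNorm1d(cout)]
        self.tdnn = nn.Sequential(*layers)
        self.pool = StatsPool()
        self.fc1 = nn.Linear(3000, emb_dim)      # fc1 embeddings (verification)
        self.fc2 = nn.Linear(emb_dim, emb_dim)   # fc2 embeddings (diarization)
        self.out = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):                         # x: (B, feat_dim, T) MFCCs
        h = self.pool(self.tdnn(x))               # (B, 3000)
        fc1 = self.fc1(h)
        fc2 = self.fc2(torch.relu(fc1))
        return self.out(torch.relu(fc2)), fc1, fc2
```

Returning both bottleneck activations makes it easy to compare fc1 and fc2 embeddings downstream, which matters because the two layers favor different applications (Fig. 1b).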
We select DNN architectures for the meta-learning models similar to the baseline model in order to enable a fair comparison. We use the same network as the x-vector model except for the final layer, i.e., we retain the time-delay layers, the stats pooling layer, and the two fully connected layers with 512 units each. The protonet model uses an additional two fully connected layers with 512 units in each layer. Embeddings extracted at the final layer are used for prototype computation and loss estimation. The relation network uses one additional fully connected layer (512 units) for the encoder network. The comparison network consists of three fully connected layers, with 1024 units at the input, 512 units in the hidden layer and 1 unit at the output. For both networks, we use batch normalization, which was observed to improve convergence. We do not use dropout in the meta-learned models, following their respective original implementations [39], [43]. The number of trainable parameters for the baseline x-vector model, protonet and relation net (encoder + comparison) are 9.8M, 6.6M and 7.1M, respectively. We trained both protonets and relation nets using the Adam optimizer (β₁ = 0.9, β₂ = 0.99). The initial learning rate was set to 1e-4 and exponentially decreased (γ = 0.9) every 10 episodes, where an episode corresponds to a single back-propagation step. The models were trained for 100K episodes, with the stopping point determined based on convergence of the smoothed loss function. The architecture and initialization strategies for all models are presented in Figure 1, while the meta-learning losses are illustrated in Figure 2.
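A sketch of the episodic optimization loop follows, assuming the loss functions from the earlier sketches and an `episode_iter` that yields pre-sampled episodes (both assumptions, not toolkit names); the decay-every-10-episodes pattern follows the text.

```python
import torch

def train_meta(model, loss_fn, episode_iter, num_episodes=100_000):
    """Episodic training: one backward pass per episode (illustrative sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
    for ep in range(1, num_episodes + 1):
        supports, s_lab, queries, q_lab, n_way = next(episode_iter)
        loss = loss_fn(model, supports, s_lab, queries, q_lab, n_way)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if ep % 10 == 0:          # exponential decay every 10 episodes
            sched.step()
```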
Model Initialization: We use part of the pre-trained x-vector model as an initialization for the meta-learning models. Specifically, we initialize the time-delay layers using the pre-trained weights from the corresponding layers of the x-vector model. The fully connected layers are initialized uniformly at random in $[-1/\sqrt{N}, 1/\sqrt{N}]$, where N is the number of parameters in the layer. Empirically, we observed that this initialization scheme provided a significant performance improvement in our experiments.
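A sketch of this partial initialization is shown below, assuming the `XVector` sketch above and a checkpoint saved via `state_dict()`; the `tdnn.` parameter-name prefix is an illustrative convention, not a fixed toolkit interface.

```python
import torch

def init_from_xvector(meta_model, xvector_ckpt_path):
    """Copy pre-trained time-delay weights; keep fully connected layers random.

    Assumes both models name their TDNN parameters with a 'tdnn.' prefix,
    as in the XVector sketch above (an illustrative convention).
    """
    pretrained = torch.load(xvector_ckpt_path, map_location="cpu")
    tdnn_weights = {k: v for k, v in pretrained.items() if k.startswith("tdnn.")}
    result = meta_model.load_state_dict(tdnn_weights, strict=False)
    # PyTorch's default Linear init already draws from U(-1/sqrt(N), 1/sqrt(N))
    # with N the fan-in, matching the scheme described in the text (up to the
    # definition of N), so no extra re-initialization is required here.
    return result
```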
Fig. 3. Speaker diarization performance (% DER) across different corpora (DIHARD, AMI-eval, ADOSMod3; protonets and relation nets) for different combinations of support examples and training classes within an episode. The number of queries per class is always 1 in all experiments.

TABLE V
Speaker diarization results comparing meta-learning models with x-vectors. x-vector+retrain represents mean DER computed over 3 trials.

                   DIHARD           AMI              ADOSMod3
Method             Oracle   Est     Oracle   Est     Oracle   Est
x-vectors          17.62    13.93   6.94     8.47    13.94    17.16
x-vector+retrain   17.39    13.26   7.49     8.52    16.74    16.89
Protonet
Since we borrow part of the pre-trained x-vector model to initialize the meta-learning models, we verify that any gains in performance obtained with the meta-learning models do not arise from simply overtraining the x-vector model. We conduct a sanity check experiment wherein we retrain the x-vector model similarly to the meta-learning models. Specifically, we use the baseline model from Section V-A and retrain it using pre-trained weights for the time-delay layers and random initialization for the fully-connected layers. The model was trained for 100K minibatches, which corresponds to the same number of episodes used for training the meta-learning models.
C. Speaker Diarization Results
We use oracle speech activity detection for speaker diarization in order to exclusively study the speaker errors. We segment the session to be diarized into uniform segments 1.5 seconds in duration with an overlap of 0.75 seconds (a sketch of this segmentation step appears at the end of this subsection). Embedding clustering is performed using the NME-SC method, as described in Section III-C1. During scoring, we do not use a collar, similar to the DIHARD evaluations. However, we discard speaker overlap regions, since neither x-vectors nor meta-learned embeddings are trained to handle overlapping speech.

Table IV presents speaker diarization results for various baseline embeddings. We compare between pre-trained Kaldi embeddings and both feed-forward bottleneck layers in our implementation. In addition to NME-SC for speaker clustering, we use AHC on PLDA scores using two methods for estimating the number of speakers: (1) a fixed threshold parameter of 0, and (2) a threshold parameter tuned using a development set. We tuned the parameter using two-fold cross validation for DIHARD and ADOS-Mod3, and the AMI-dev set for the AMI corpus.

First, we notice that AHC is quite sensitive to the threshold parameter when estimating the number of speakers, across all corpora and clustering methods. The DER reduction using a fine-tuned threshold is particularly significant for the ADOS-Mod3 corpus, with nearly 13% absolute improvement for fc1 and 10% for fc2 embeddings extracted using our network. In some cases on the DIHARD and AMI corpora, the DER obtained by fine-tuning the threshold is lower than when the oracle number of speakers is used, similar to observations in [7]. Next, fc1 embeddings outperform fc2 embeddings when clustering using AHC and PLDA scores, consistent with findings from [17]. However, when cosine affinities are used with NME-SC, we notice that the layer closer to the cross-entropy objective (fc2) results in a lower DER. This is the case both when the oracle number of speakers is used and when it is estimated using the maximum eigengap value. The combination of fc2 embeddings with the NME-SC method returns the lowest DERs for most conditions. Further, NME-SC removes the need for a separate development set for estimating the threshold parameter. Hence, we adopt this as the baseline diarization method in all our experiments.

In Table V, we compare the baseline with the meta-learning models. x-vector+retrain represents mean results from 3 trials of the sanity check experiment described in Section V-B. Both meta-learning models were trained for 100K episodes. Within each episode, 400 classes were randomly chosen without replacement from the training corpus. Following this, 3 samples were chosen without replacement from each class.
1) Effect of classes within a task:
While training meta-learning models, previous works [39], [43], [44] often care-fully control the number of classes ( way ) within an episodeand the number of supports per class ( shot ) so as to match theevaluation scenario. Drawing analogies with speaker diariza-tion, a typical session consists of O (1) speakers ( way ), with O (10) utterances per speaker ( shot ). In this experiment wevary hyper-parameters for both protonets and relation nets, andstudy the effect on DER. We vary the way and shot between25 to 400, and 2 to 50, respectively, and train a new meta-learning model for each configuration. Results are presentedin Fig 3.A common effect across different corpora and models is thatthe number of speakers (classes) is an important parameter fordiarization performance. Increasing the number of speakers inan episode favours DER. This is similar to previous findingsin few-shot image recognition [39], where during training, ahigher way than expected during testing was found to providethe best results. However, the effect of supports per class onDER is not straightforward. When a large number of classes isused, increasing supports provides little to no improvements inboth protonets and relation nets. Upon reducing the number ofclasses, the performance degrades with more supports acrossmost models. This suggests a possibility of over-fitting due tolarge number of supports even though the configuration closelyresembles a test session. It is more beneficial to increase thenumber of classes within an episode during training.
2) Performance across different domains in DIHARD:
Itis often useful to understand the effect of conversation type,including speaker count, spontaneous speech and recordingsetups on the diarization performance. We study this using thedomain labels [9] available for the DIHARD corpus. For eachdomain, we compute the mean DER across sessions using thebaseline model as well as the meta-learning models. Oraclespeaker count is used during clustering in order to exclusivelystudy the effect of domain factors. We do not include theAudiobooks domain in this experiment since all the modelsreturn the same performance on account of sessions consistingof only one speaker. We present the results in Table 4.We note that there exists considerable variation betweendomains in terms of the DER improvement between x-vectorsand meta-learning models. Broadcast news, child, maptask,meeting and socio-field domains show significant gains due tometa-learning models. Specifically, meeting and child domainsbenefit upto 38.31 % and 16.18 % relative DER improvementfrom protonets. Diarization in the court domain degrades inperformance consistently between protonets and relation nets,with up to 20.05 % relative degradation for relation networks.Upon a closer look at the court and meeting domains tounderstand this difference, we note that both domains containsimilar number of speakers per session (Court: 7, Meeting:5.3). However, the domains differ in the data collection setup:court sessions are collected by averaging audio streams fromindividual table-mounted microphones, while meeting sessionsare collected using a single table microphone distant fromall the participants [9]. Among the socio-linguistic interviewdomains, interviews recorded in the field under diverse lo-cations and subject age groups (socio-field) result in a largerDER improvement over those collected under quiet conditions(socio-lab). Socio-lab contains recording from both close-talking and distant microphones, hence it is not immediatelyclear whether microphone placement alone is a factor in DERimprovement. Child and restaurant domains show variation inDER reduction although they perform similar with the baselinemodels, suggesting that background noise types affect benefitsfrom meta-learning. Overall, most domains that include in-the-wild data collection show improvements with meta-learning.
3) Performance across different child age groups:
As men-tioned in Section IV-F, automatic child speech processing has been considered a hard problem when compared to processingadult speech. More recently, the child domain returned oneof the highest DERs during the DIHARD evaluations [68],illustrating the challenges of working with child speech fordiarization. Considering meta-learning models return signifi-cant improvement over x-vectors for child domain, we attemptto understand gains in DER by controlling for the age ofthe child. Children develop linguistic skills as they grow up,hence child age is a reasonable proxy for their linguisticdevelopment. We select sessions from the ADOS-Mod3 corpuswhere we have access to the child age metadata. We computethe DER for each child using the respective baseline and meta-learned models described in Section V-B. For children wheretwo sessions are available, we compute the mean DER perchild. We study the effect of child age on DER by groupingchild age into 3 groups with approximately equal number ofchildren in each set. Children below 7.5 years of age arecollected in the Low age group, children between 7.5 yearsand 9.5 years of age are collected in the Mid age group, andchildren above 9.5 years of age are collected in the High agegroup.
TABLE VI
Analysis of child-adult diarization performance on the ADOS-Mod3 corpus. For each age group, the mean DER (%) of sessions in the group is presented, along with the relative improvement in parentheses.

Model      Low            Mid     High
Baseline   17.36          13.42   13.77
Protonet   15.77 (9.16)
From the results in Table VI, we notice that the Low age group returns the highest DER, while the Mid and High age groups return similar performance across models. Given that children in the Low age group are more likely to exhibit speech abnormalities, this result illustrates the relative difficulty of automatic speech processing under such conditions. Improvements in DER from the meta-learning models are distributed across all age groups. A consistent improvement of 10% relative DER in the Low age group is particularly encouraging given the challenging nature of such sessions. The High age group exhibits similar improvements in DER, with the relation networks providing up to 17.43% relative gains.
D. Speaker Verification Results
We use speaker verification as another application task to illustrate the generalized speaker information captured by meta-learned embeddings. Similar to speaker diarization, we first evaluate our implementation of the baseline against the pre-trained Kaldi embeddings. We use the test partition of the Voxceleb corpus, the eval set of the VOiCES corpus and the eval set of the SITW corpus in our experiments. We use the core-core condition in the SITW corpus, where a single speaker is present in both utterances of a trial. For all models, we score trials using PLDA after reducing the embedding dimension to 200 using LDA and applying length normalization. The PLDA model is trained using the same data used for embedding training, i.e., the Vox2 corpus and the dev set of the Vox1 corpus. Speakers in the SITW corpus which overlap with the Voxceleb corpus were removed from the trials before evaluation. We use the equal error rate (EER) as the metric to select the best performing baseline system. Since cosine scoring returned significantly higher EERs relative to PLDA, we did not investigate it further. Results are provided in Table VII.
TABLE VII
Selecting a baseline system for speaker verification. Results are presented as equal error rate (EER %).

Embedding   Vox1-test   VOiCES   SITW
Kaldi       3.128       10.300   4.054
Ours: fc1
Ours: fc2   3.006       9.854    4.087
We notice that embeddings from both layers in our implementation outperform or closely match the Kaldi implementation. Similar to observations from Section V-A and [17], fc1 embeddings fare better than fc2 embeddings when scored with PLDA. We select fc1 embeddings as the baseline speaker verification method.
TABLE VIII
Speaker verification results comparing meta-learning models with x-vectors. Results are presented using EER and minDCF.

               Vox1-test       VOiCES          SITW
Model          EER     DCF     EER     DCF     EER     DCF
Baseline
Relation Net   2.884   0.313   8.238   0.690   3.725   0.370
When comparing the meta-learning models, we use the same models developed in Section V-C. In addition to EER, we present results using the minimum detection cost function (minDCF). From Table VIII, we note that the meta-learning models outperform x-vectors in most settings, except in the case of the Voxceleb corpus when EER is used. Both protonets and relation nets return similar EER and minDCF on the Voxceleb corpus. Interestingly, we achieve notable improvements on the relatively more challenging corpora. Protonets provide up to 8.78% and 7.68% EER improvements on the VOiCES and SITW corpora, respectively, with similar improvements in minDCF. While relation nets provide better performance than x-vectors on the above corpora, they do not outperform protonets in any setting. This suggests that using a predefined distance function (namely squared Euclidean in protonets) might be beneficial overall, compared to learning a distance metric using relation networks, for the speaker verification application.
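Both metrics can be computed directly from raw trial scores, as in the following sketch; the `p_target` default and the unnormalized DCF are simplifying assumptions on our part, not the exact challenge scoring tools.

```python
import numpy as np

def eer_and_mindcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Compute EER and (unnormalized) minDCF from trial scores.

    scores: higher means more likely a target trial; labels: 1 = target, 0 = imposter.
    p_target defaults to 0.01 here for illustration only.
    """
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_imp = labels.sum(), (1 - labels).sum()
    # Sweeping the threshold upward: misses accumulate, false alarms shrink.
    p_miss = np.cumsum(labels) / n_tar
    p_fa = 1.0 - np.cumsum(1 - labels) / n_imp
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    return eer, dcf.min()
```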
1) Robust Speaker Verification:
Since the VOiCES and SITW corpora return the most improvement for speaker verification, we take a closer look at which factors benefit meta-learning. For each corpus, we make use of annotations for the microphone location and channel degradation to create new trials for speaker verification.

TABLE IX
Analysis of speaker verification based on microphone location (Near: near-field, Far: far-field, Obs: fully obscured) in the VOiCES corpus, and level of degradation artifacts in the SITW corpus.

               VOiCES (mic location)                          SITW (degradation level)
               Near           Far            Obs              Low            High
Model          EER    DCF     EER    DCF     EER     DCF      EER    DCF     EER    DCF
Baseline       3.907
Relation Net   3.872  0.3521  7.618  0.6282  21.24   0.9527   3.81   0.3467
In the VOiCES corpus, we collect playback recordings from rooms 3 and 4 in the eval subset. Within these recordings, we distinguish between the utterances based on the microphone placement with respect to the loudspeaker (audio source). Specifically, we create three categories: (1) utterances collected using mic1 and mic18 are treated as near-field, being closest to the source, (2) utterances collected from mic19 are treated as far-field, and (3) utterances collected from mic12 are treated as obscured, since the microphone is fully obscured by the wall. While creating the trials for each category, we ensure that the ratio of target to non-target pairs remains approximately equal to that of the overall eval set trials. An example room configuration is presented in Fig. 5.
Fig. 5. An example room configuration from the VOiCES corpus. Microphones are represented using circles. (Figure adapted from https://voices18.github.io/rooms/)

From the SITW corpus, we use the metadata annotations for the level of degradation. The corpus includes multiple degradation artifacts: reverberation, noise, compression, etc., among others. The level of degradation for the most prominent artifact was annotated manually on a scale of 0 (least) to 4 (maximum). We use the trials available as part of the eval set which are annotated with the degradation level. We group the trials into two levels: low (deg0 and deg1) and high (deg3 and deg4). Note that the utterances contain multiple types of degradation at each level. Details of the target and imposter pairs for the SITW corpus (degradation level) and the VOiCES corpus (microphone placement) are presented in Table II. Speaker verification results using EER and minDCF are presented in Table IX.
We notice that no single model performs best across all conditions. When controlled for microphone placement in VOiCES, protonets return the best EER at all locations. The margin of improvement remains approximately the same when only the distance from the source is considered: 2.71% for near-field and 2.45% for far-field. The margin improves to 9.14% when the microphone is fully obscured by a wall and placed close to distractor noises. Interestingly, these improvements are not reflected in the minDCF scores in the absence of noise, where x-vectors outperform both meta-learning models. We believe that the improvements in EER and minDCF on the VOiCES corpus primarily arise from utterances collected at obstructed locations and in close vicinity of distractor noises. The experiments on the SITW corpus focus on the strength of such noise conditions. Under low degradation levels, we see that x-vectors return the lowest EER, although their performance is not consistent with minDCF. The meta-learning models continue to work better at higher degradation levels, providing 8.3% and 4.1% reductions in EER and minDCF, respectively.

VI. CONCLUSIONS
We proposed neural speaker embeddings trained with themeta-learning paradigm, and evaluated on corpora represent-ing different tasks and settings. In contrast to conventionalspeaker embedding training which optimizes on a singleclassification task, we simulate multiple tasks by samplingspeakers during the training process. Meta-learning optimizeson a new task at every training iteration, thus improvinggeneralizability to an unseen task. We evaluate two variantsof meta-learning, namely prototypical networks and relationnetworks on speaker diarization and speaker verification. Weanalyze the performance of meta-learned speaker embeddingsin challenging settings such as far-field recordings, childspeech, fully obstructed microphone collection and in thepresence of high noise degradation levels. The results indicatethe potential of meta-learning as a framework for trainingmulti-purpose speaker embeddings.In the future, we plan to investigate combining clusteringobjectives such as deep clustering [69], [70] with meta-learning. A combination of protonets and relation networkswith similar metric learning approaches such as matchingnetworks and induction networks will also be explored tostudy complementary information between them. Further gen-eralization to unseen classes can be obtained by incorporating domain adversarial learning techniques with the meta-learningparadigm. R EFERENCES[1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, andO. Vinyals, “Speaker diarization: A review of recent research,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 20, no. 2,pp. 356–370, 2012.[2] J. P. Campbell, “Speaker recognition: a tutorial,”
Proceedings of theIEEE , vol. 85, no. 9, pp. 1437–1462, 1997.[3] Y. Rahulamathavan, K. R. Sutharsini, I. G. Ray, R. Lu, and M. Rajara-jan, “Privacy-preserving ivector-based speaker verification,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 27, no. 3,pp. 496–506, 2019.[4] N. Scheffer, L. Ferrer, A. Lawson, Y. Lei, and M. McLaren, “Recentdevelopments in voice biometrics: Robustness and high accuracy,” in , 2013, pp. 447–452.[5] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming forspeaker diarization of meetings,”
IEEE Transactions on Audio, Speech,and Language Processing , vol. 15, no. 7, pp. 2011–2022, 2007.[6] D. A. van Leeuwen and M. Huijbregts, “The ami speaker diarizationsystem for nist rt06s meeting data,” in
Machine Learning for MultimodalInteraction , 2006, pp. 371–384.[7] M. Pal, M. Kumar, R. Peri, T. J. Park, S. Hyun Kim, C. Lord, S. Bishop,and S. Narayanan, “Speaker diarization using latent space clusteringin generative adversarial network,” in
ICASSP 2020 - 2020 IEEEInternational Conference on Acoustics, Speech and Signal Processing(ICASSP) , 2020, pp. 6504–6508.[8] B. Xiao, C. Huang, Z. E. Imel, D. C. Atkins, P. Georgiou, and S. S.Narayanan, “A technology prototype system for rating therapist empa-thy from audio recordings in addiction counseling,”
PeerJ ComputerScience , vol. 2, p. e59, 2016.[9] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, andM. Liberman, “The second dihard diarization challenge: Dataset, task,and baselines,” 2019.[10] C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco,M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout,P. Gamble, J. Hetherly, C. Stephenson, and K. Ni, “Voices obscured incomplex environmental settings (voices) corpus,” 2018.[11] J. H. Hansen, A. Sangwan, A. Joglekar, A. E. Bulut, L. Kaushik,and C. Yu, “Fearless steps: Apollo-11 corpus advancements for speechtechnologies from earth to the moon,” in
Proc. Interspeech 2018 ,2018, pp. 2758–2762. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1942[12] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The 2016speakers in the wild speaker recognition evaluation,” in
Interspeech2016 , 2016, pp. 823–827. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2016-1137[13] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verificationusing adapted gaussian mixture models,”
Digital signal processing ,vol. 10, no. 1-3, pp. 19–41, 2000.[14] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff,“Svm based speaker verification using a gmm supervector kernel andnap variability compensation,” in , vol. 1, 2006,pp. I–I.[15] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,”
IEEE Transactions onAudio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798,2011.[16] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependentspeaker verification,” in . IEEE, 2014, pp. 4052–4056.[17] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deepneural network embeddings for text-independent speaker verification,”in
Proc. Interspeech 2017 , 2017, pp. 999–1003.[18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in . IEEE, 2018, pp. 5329–5333.[19] H. Bredin, “Tristounet: Triplet loss for speaker turn embedding,” in , 2017, pp. 5430–5434. [20] C. Zhang, K. Koishida, and J. H. L. Hansen, “Text-independent speakerverification based on triplet convolutional neural network embeddings,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing ,vol. 26, no. 9, pp. 1633–1644, 2018.[21] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speakerrecognition,” 2018.[22] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe,“End-to-End Neural Speaker Diarization with Permutation-FreeObjectives,” in
Proc. Interspeech 2019 , 2019, pp. 4300–4304.[Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2899[23] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, “End-to-end speaker diarization for an unknown number of speakers withencoder-decoder based attractors,” 2020.[24] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, and K. Nagamatsu, “End-to-end neural diarization: Reformulating speaker diarization as simplemulti-label classification,” 2020.[25] J. Xu, X. Wang, B. Feng, and W. Liu, “Deep multi-metric learning fortext-independent speaker verification,”
Neurocomputing , vol. 410, pp.394 – 400, 2020.[26] Z. Ren, Z. Chen, and S. Xu, “Triplet based embedding distance andsimilarity learning for text-independent speaker verification,” in , 2019, pp. 558–562.[27] J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, “Training multi-task adversarial network for extracting noise-robust speaker embedding,”in
[27] J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, "Training multi-task adversarial network for extracting noise-robust speaker embedding," in Proc. ICASSP, 2019, pp. 6196–6200.
[28] Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li, "Unsupervised domain adaptation via domain adversarial training for speaker recognition," in Proc. ICASSP, 2018, pp. 4889–4893.
[29] J. Schmidhuber, "Evolutionary principles in self-referential learning," Ph.D. dissertation, Technische Universität München, 1987.
[30] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," in Proc. ICLR, 2017.
[31] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, ser. Proceedings of Machine Learning Research, vol. 70. PMLR, 2017, pp. 1126–1135.
[32] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 3988–3996.
[33] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Proc. Interspeech 2018, 2018, pp. 2808–2812. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1893
[34] X. Yao, "Evolving artificial neural networks," Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
[35] A. Abraham, "Meta learning evolutionary artificial neural networks," Neurocomputing, vol. 56, pp. 1–38, 2004.
[36] D. K. Naik and R. J. Mammone, "Meta-neural networks that learn by learning," in [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, vol. 1, 1992, pp. 437–442.
[37] Y. Bengio, S. Bengio, and J. Cloutier, "Learning a synaptic learning rule," in IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, 1991, p. 969.
[38] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei, "On the optimization of a synaptic learning rule," vol. 2. Univ. of Texas, 1992.
[39] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 4080–4090.
[40] M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou, "Diverse few-shot text classification with multiple metrics," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 2018, pp. 1206–1215.
[41] T. Gao, X. Han, Z. Liu, and M. Sun, "Hybrid attention-based prototypical networks for noisy few-shot relation classification," in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, Jan. 2019, pp. 6407–6414.
[42] Z.-Y. Dou, K. Yu, and A. Anastasopoulos, "Investigating meta-learning algorithms for low-resource natural language understanding tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 1192–1197.
[43] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[44] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[45] R. Geng, B. Li, Y. Li, X. Zhu, P. Jian, and J. Sun, "Induction networks for few-shot text classification," 2019.
[46] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, "In defence of metric learning for speaker recognition," 2020.
[47] S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, and H. Kim, "Meta-learning for short utterance speaker recognition with imbalance length pairs," 2020.
[48] T. Ko, Y. Chen, and Q. Li, "Prototypical networks for small footprint text-independent speaker verification," in Proc. ICASSP, 2020, pp. 6804–6808.
[49] P. Anand, A. K. Singh, S. Srivastava, and B. Lall, "Few shot speaker recognition using deep neural networks," 2019.
[50] J. Wang, K. Wang, M. T. Law, F. Rudzicz, and M. Brudno, "Centroid-based deep metric learning for speaker recognition," in Proc. ICASSP, 2019, pp. 3652–3656.
[51] J. Kang, R. Liu, L. Li, Y. Cai, D. Wang, and T. F. Zheng, "Domain-invariant speaker vector projection by model-agnostic meta-learning," 2020.
[52] N. R. Koluguri, M. Kumar, S. H. Kim, C. Lord, and S. Narayanan, "Meta-learning for robust child-adult classification from speech," in Proc. ICASSP, 2020, pp. 8094–8098.
[53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, "Meta-learning with memory-augmented neural networks," in International Conference on Machine Learning, 2016, pp. 1842–1850.
[54] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, "Distance-based image classification: Generalizing to new classes at near-zero cost," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2624–2637, 2013.
[55] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705–1749, Dec. 2005.
[56] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in Proc. ICASSP, 2017, pp. 4930–4934.
[57] T. J. Park, K. J. Han, M. Kumar, and S. Narayanan, "Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381–385, 2020.
[58] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
[59] M. K. Nandwana, J. van Hout, M. McLaren, C. Richey, A. Lawson, and M. A. Barrios, "The VOiCES from a distance challenge 2019 evaluation plan," 2019.
[60] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The Speakers in the Wild (SITW) speaker recognition database," in Proc. Interspeech 2016, 2016, pp. 818–822.
[61] G. Sun, C. Zhang, and P. C. Woodland, "Speaker diarisation using 2D self-attentive combination of embeddings," in Proc. ICASSP, 2019, pp. 5801–5805.
[62] M. Pal, M. Kumar, R. Peri, T. J. Park, S. Hyun Kim, C. Lord, S. Bishop, and S. Narayanan, "Speaker diarization using latent space clustering in generative adversarial network," in Proc. ICASSP, 2020, pp. 6504–6508.
[63] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD challenge evaluation plan," 2018.
[64] S. Lee, A. Potamianos, and S. S. Narayanan, "Acoustics of children's speech: Developmental changes of temporal and spectral parameters," Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, Mar. 1999.
[65] S. Lee, A. Potamianos, and S. Narayanan, "Developmental acoustic study of American English diphthongs," The Journal of the Acoustical Society of America, vol. 136, no. 4, pp. 1880–1894, 2014.
[66] C. Lord, S. Risi, L. Lambrecht, E. H. Cook, B. L. Leventhal, P. C. DiLavore, A. Pickles, and M. Rutter, "The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism," Journal of Autism and Developmental Disorders, vol. 30, no. 3, pp. 205–223, Jun. 2000.
[67] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," 2015.
[68] J. Xie, L. P. García-Perera, D. Povey, and S. Khudanpur, "Multi-PLDA diarization on children's speech," in Proc. Interspeech 2019, 2019, pp. 376–380. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2961
[69] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, 2016, pp. 31–35.
[70] M. T. Law, R. Urtasun, and R. S. Zemel, "Deep spectral clustering learning," in