FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data
Aditya Joglekar, John H.L. Hansen, Meena Chandra Shekar, Abhijeet Sangwan
Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering, The University of Texas at Dallas (UTD), Richardson, Texas, USA
{aditya.joglekar, john.hansen, meena.chandrashekar, abhijeet.sangwan}@utdallas.edu

Abstract
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2) Challenge is the second annual challenge held for the Speech and Language Technology community to motivate supervised learning algorithm development for multi-party and multi-stream naturalistic audio. In this paper, we present an overview of the challenge sub-tasks, data, performance metrics, and lessons learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present advancements made in FS-2 through extensive community outreach and feedback. We describe innovations in the challenge corpus development, and present revised baseline results. We finally discuss the challenge outcome and general trends in system development across both phases (Phase FS-1 Unsupervised, and Phase FS-2 Supervised) of the challenge, and its continuation into multi-channel challenge tasks for the upcoming Fearless Steps Challenge Phase-3.
Index Terms: NASA Apollo 11 mission, corpus, speech activity detection, speaker diarization, speaker identification, speech recognition, multi-channel audio streams, diarized segments.
1. Introduction
Recent decades have seen tremendous improvements to Speech and Language Technology (SLT) systems. This has only been possible due to thoroughly curated speech and language corpora that have been made publicly available [1, 2, 3, 4, 5]. The ability of systems to adapt to, and extract meaningful information from, unlabeled data using limited ground-truth knowledge remains a challenge in machine learning and AI [6, 7, 8]. Unfortunately, the amount of unstructured and unsupervised data far exceeds that of high-quality human-annotated data. To effectively address this reality, development of solutions will require consistent improvements to SLT systems. The initially digitized 19,000 hours from the NASA Apollo-11 and Apollo-13 missions [9, 10] represent the largest naturalistic time-synchronized multi-channel data resource of its kind. This corpus will be supplemented in continuing efforts with an additional 150,000 hours, enabling research on the largest publicly available corpus to date. Structuring this data through pipeline diarization transcripts, automatic speaker/sentiment tagging, etc., will enable preservation and archiving of historical data. These efforts will massively increase research opportunities, and be of significant benefit to the STEM community. As an initial step to motivate this streamlined and collaborative effort from the SLT community, UTDallas-CRSS has been hosting a series of progressively complex tasks to promote advanced research on naturalistic Big Data corpora. This began with the inaugural FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-1). The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the First Step towards extracting high-level information from such massive unlabeled corpora [11, 12, 13, 14, 15]. As a natural progression following the successful inaugural FS-1 challenge, the FEARLESS STEPS Challenge Phase-2 (FS-2) focuses on the development of single-channel supervised learning strategies. The FS-2 Challenge provides 80 hours of ground-truth data through training (Train) and development (Dev) sets, with an additional 20 hours of blind-set evaluation (Eval) data. Based on feedback from the Fearless Steps participants, additional tracks for streamlined speech recognition and speaker diarization have been included in FS-2. To encourage diversified research interests, participants were also encouraged to utilize the FS-2 corpus to explore additional problems dealing with naturalistic data. The results for this challenge will be presented at the ISCA INTERSPEECH-2020 Special Session.
2. Community Outreach & Feedback
The NASA Apollo Mission Control recordings are a rich source of time-critical, team-based communications. The complex communication characteristics in this corpus can be explored through multiple avenues, and doing so requires vast resource utilization [16, 17].

Figure 1: Analysis of community feedback. (top) Participant breakdown; (bottom) most requested areas of interest.
To ensure optimum long-term benefits of exploring this corpus, feedback from researchers in multiple intersecting disciplines is crucial. An essential component of corpus development following the completion of FS-1 was therefore a focus on community outreach and feedback. Multiple community engagement sessions were conducted with the aim of gathering essential future directions for the evolving FS Apollo corpus. The three communities directly benefiting from this corpus research and development, consulted through workshop engagements, are: (i) Speech Processing Technology (SpchTech), (ii) Communication Science and History (CommSciHist), and (iii) Education/STEM, Preservation/Archives, and Community-use (EducArch). Community engagements are illustrated in Figure 1. User feedback was primarily collected through 6 workshops at STEM and archival events (including IS-19, JSALT-19, ASA-19, and ASRU-19). Online surveys of researchers downloading the FS corpus enabled access to feedback globally. A sample of the feedback from the above-mentioned communities is illustrated in Figure 1. The salient responses across all communities focused on the availability of more labeled data for system development, linking unstructured audio data with relevant meta-data through robust semi-supervised SLT systems, convenient data access, and retrieval through pipeline diarization transcripts.
Over 170,000 hours of synchronized audio data were collected by NASA during the Apollo missions. Digitizing this audio with synchronized SLT pipeline processing would enable streamlined information access and retrieval for all communities. Due to resource limitations on developing manual annotations, speech and language systems capable of extracting meaningful information using limited ground-truth resources are necessary. FS-1 was designed with this premise, providing 20 hours of development set ground-truth and 20 hours of evaluation set data for five tasks: Speech Activity Detection (SAD), Speaker Diarization (SD), Speaker Identification (SID), Speech Recognition (ASR), and Sentiment Detection. These 40 hours of data were selected from channels with comparatively lower levels of degradation. A lexicon and language model based on 4.2 billion words of NASA mission text content were also freely provided [18, 19]. Semi-supervised and unsupervised systems optimized for the Apollo data were used as baseline systems [20, 21, 22, 23, 24]. These systems have been used as benchmarks for evaluating the variability introduced in FS-2 by an additional 60 hours of audio from highly degraded channels.
Table 1: Comparison of baseline results for the FS-1 and FS-2 evaluation sets. Evaluation metrics for FS-1 and FS-2: SAD: DCF (%), SID: Top-5 Acc. (%), SD: DER (%), ASR: WER (%), with the relative degradation in performance of the same systems (%).
Fearless Steps System(s) Performance on Eval Set
Task | FS-1 (%) | FS-2 (%) | Rel. Degradation (%)
SAD | 11.70 | 13.60 |
SD | 68.23 | 88.27 |
SID | 47.00 | 41.70 |
ASR | 88.42 | 84.05 | -4.90
With a goal of maintaining competitiveness in FS-2, a higher content of degraded audio was selected to form the FS-2 Eval set, to offset the advantage of Train set ground-truth availability. This is detailed in Section 4.1. Table 1 provides a comparison of baseline system performance for all tasks over the Eval sets of FS-1 and FS-2. Significant degradation in system performance is observed in three out of four tasks. The evaluation metrics used for the SAD, SD, SID, and ASR tasks were the detection cost function (DCF), diarization error rate (DER), top-5 accuracy (%), and word error rate (WER), respectively [7, 25, 26].

The Sentiment Detection task in FS-1 provided participants with rudimentary labels of 'positive', 'neutral', and 'negative'. However, all communities expressed interest in descriptive labels for emotion and behavioral analysis, as seen in Figure 1. Hence, the sentiment detection task was removed from FS-2, and will be reintroduced in FS-3 as an emotion detection task with 100 hours of improved labels.
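For reference, the SAD and diarization metrics referred to above can be written out as follows; the specific weights shown assume the OpenSAT-style detection cost function [7] and the NIST RT diarization scoring convention [25], which the challenge scoring is understood to follow:

\[
\mathrm{DCF}(\theta) = 0.75\,P_{\mathrm{FN}}(\theta) + 0.25\,P_{\mathrm{FP}}(\theta), \qquad
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{spkr\text{-}conf}}}{T_{\mathrm{scored\ speech}}},
\]

where \(P_{\mathrm{FN}}\) and \(P_{\mathrm{FP}}\) are the frame-level missed-speech and false-alarm probabilities at decision threshold \(\theta\), and the DER numerator accumulates missed-speech, false-alarm, and speaker-confusion time over the scored speech duration.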
3. FS-2 Challenge Tasks
The consensus from the community on the need for increased transcribed data and incremental task-targeted labeling prompted focused efforts on providing more variety in core-speech tasks. Hence, for FS-2, two separate challenge tracks were introduced for diarization and speech recognition. The speaker diarization track SD track2 focuses on developing robust speaker embedding and clustering algorithms, while SD track1 caters to the more challenging task of diarization from scratch. Equivalently, the speech recognition track ASR track2 focuses on transcribing diarized speech segments (each segment contains noisy speech from a single speaker), while ASR track1 incorporates the broader scope of transcribing noisy, overlapped, multi-speaker continuous streams. All challenge tasks for FS-2 are given in the following list:
• TASK 1: Speech Activity Detection (SAD)
• TASK 2: Speaker Identification (SID)
• TASK 3: Speaker Diarization
  ◦ (3.a) Track 1: using system SAD (SD track1)
  ◦ (3.b) Track 2: using reference SAD (SD track2)
• TASK 4: Automatic Speech Recognition (ASR)
  ◦ (4.a) Track 1: using system SAD (ASR track1)
  ◦ (4.b) Track 2: using diarized audio (ASR track2)

The evaluation metrics for all tasks are consistent with the previous challenge, and are described in Section 2 [26, 27, 28]. A scoring toolkit was made publicly available for this challenge (https://github.com/aditya-joglekar/FS02_Scoring_Toolkit).
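Both ASR tracks are scored with word error rate. As a point of reference only (official scoring is performed by the toolkit above, not by this snippet), a minimal sketch of the underlying edit-distance computation is shown below; the function name and example strings are illustrative:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: 1 deletion + 1 substitution over 5 reference words -> WER = 0.4
print(word_error_rate("go for main engine start", "go for engine stop"))
```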
4. Corpus Re-Deployment (FS-2)
The five selected channels Flight Director (FD), Mission Operations Control Room (MOCR), Guidance Navigation and Control (GNC), Network Controller (NTWK), and Electrical, Environmental, and Consumables Manager (EECOM) from FS-1 were preserved with improved labeling for FS-2. The high degree of variability in speech and noise characteristics across these five channels has been explored previously [1, 2, 19, 29]. In FS-2, we introduce 60 hours of additional speech transcriptions and speaker labels from these channels to the existing 40 hours to provide sufficient data for supervised system training.
The Dev and Eval sets provided through FS-1 were developed using 70% of audio streams selected from clean channels and 30% selected from degraded channels. The Train, Dev, and Eval sets for FS-2 were categorized with scope to introduce multi-channel tasks in future challenges, while maintaining progressive difficulty in the verification sets. The intention behind this data set design was to replicate naturalistic system development processes [5, 6, 30]. The FS-2 Challenge Corpus audio is divided into (i) audio streams, and (ii) audio segments. Audio streams reflect unaltered digitized audio from the Apollo missions. Audio segments are short-duration speech sections diarized from the audio streams. Each segment contains a continuous speech utterance from a single speaker. Section 4.2 describes the process of splitting the 100 hours into Train, Dev, and Eval sets. Section 4.3 provides more insight into the development of the segment-based tasks SID and ASR track2.

Figure 2: Probability distributions of decision parameters for the Train, Dev, and Eval sets.

Table 2: General statistics for the SID task. The mean, median, minimum, and maximum values for cumulative speaker durations and individual speaker utterances are all expressed in seconds.

Data set | Speakers | Speaker Duration: mean / median / (min, max) | Utterance Duration: mean / (min, max) | Utterances
Train | 218 | 505.5 / 106.7 / (6.89, 11254.36) | 4.03 / (1.84, 16.95) | 27,336
Dev | 218 | 118.1 / 24.2 / (3.13, 2596.18) | 4.04 / (1.78, 16.95) | 6,373
Eval | 218 | 156.9 / 31.5 / (3.19, 3460.41) | 4.04 / (1.8, 16.22) | 8,466
Performance of SLT systems depends on factors such as the amount of overlapped speech present in the data, the amount of unintelligible speech, speech density variation, the amount of data with unknown speakers, etc. In addition, the unsupervised baseline systems are useful for providing a measure of degradation for a given audio stream. We use the term 'decision parameters' to cumulatively describe the above measures. Using this methodology, it is possible to provide sets with progressive levels of difficulty across multi-channel audio streams in spite of the inter-channel variations found in the Apollo data [1]. We perform this process by calculating all decision parameters individually for the 100 hours of audio streams. These parameters are then normalized to generate degradation scores across the 100 hours. The scores are time-aligned across the 5 channels and averaged to provide a single degradation score per 30-minute time chunk. These scores are finally categorized into three sets by progressive order of degradation: the 5-channel segments with the cumulatively highest degradation across all decision parameters are included in the Eval set, followed by the Dev set, while the streams with the least performance degradation are selected into the Train set. Trends observed from Figure 2 show that even when the overall degradation across multiple channels is large, due to the variances in channel characteristics, the distributions for the Train, Dev, and Eval sets have similar means but differing shapes. Such varying distributions across decision parameters can aid in assessing the robustness of systems and their ability to generalize to data with a high degree of cross-channel variability.
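A simplified sketch of this set-assignment procedure is given below, assuming equal weighting of the decision parameters and illustrative split fractions; the parameter names, array shapes, and weighting are hypothetical, since the exact parameters and weights used for FS-2 are only described qualitatively above:

```python
import numpy as np

def assign_sets(params, train_frac=0.6, dev_frac=0.2):
    """Rank 30-minute chunks by channel-averaged degradation and split them.

    params: dict mapping a decision-parameter name (e.g. overlap ratio,
            unintelligible-speech ratio, speech-density variance) to an
            array of shape (n_chunks, n_channels=5).
    Returns a list of 'Train' / 'Dev' / 'Eval' labels, one per chunk.
    """
    # Normalize each decision parameter to [0, 1] so they are comparable.
    normed = [(p - p.min()) / (p.ptp() + 1e-12) for p in params.values()]
    # Average over parameters, then over the 5 time-aligned channels,
    # giving one degradation score per 30-minute chunk.
    score = np.mean(normed, axis=0).mean(axis=1)
    order = np.argsort(score)                  # least degraded first
    n = len(score)
    n_train, n_dev = int(train_frac * n), int(dev_frac * n)
    labels = np.empty(n, dtype=object)
    labels[order[:n_train]] = "Train"          # least degraded chunks
    labels[order[n_train:n_train + n_dev]] = "Dev"
    labels[order[n_train + n_dev:]] = "Eval"   # most degraded chunks
    return labels.tolist()
```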
Table 3: Duration statistics of audio segments for ASR track2. The mean, min, and max values are expressed in seconds.

Data set | Segments | Utterance Duration (s): mean | min | max
Train | 35,474 | 2.85 | 0.10 | 70.37
Dev | 9,203 | 2.97 | 0.12 | 67.39
Eval | 13,714 | 2.78 | 0.10 | 53.04
The SID task in the FS-1 challenge provided a minimum of 10 seconds of training data for each of its 183 speakers. The FS-2 SID task extends this set by adding over 30,000 additional utterances for 218 speakers. With shorter utterance durations and larger variations in per-speaker durations, as seen in Table 2, FS-2 provides a more challenging task than FS-1. This data also encapsulates the challenges faced in speaker tagging for the Apollo corpora. While a few personnel had major speaking roles, most backroom staff in the mission control audio recordings had limited but integral speaking roles, making unbalanced and low-resource speaker identification essential for this real-world scenario. Table 3 illustrates the general duration statistics of the audio segments provided for the ASR track2 task. While this task has the advantage of fully diarized segments, single-word utterances shorter than 0.2 seconds pose a challenge to ASR systems.
5. Baseline Systems
The SAD, ASR, and speaker diarization baseline systems from the first challenge were retrained and optimized for usage in this challenge [2]. Both tracks for the SD and ASR tasks were evaluated using the same system, with differing configurations. Baseline results for all tasks are provided in Table 4.
Table 4: Baseline results for the development and evaluation sets.
Fearless Steps Phase-02 Baseline Results
Task | Metric | Dev (%) | Eval (%)
SAD | DCF | 12.50 | 13.60
SD track1 | DER | 79.72 | 88.27
SD track2 | DER | 68.68 | 67.91
SID | Top-5 Acc. | 75.20 | 72.46
ASR track1 | WER | 83.80 | 84.05
ASR track2 | WER | 80.50 | 82.23
The SID baseline system developed for FS-1 used i-Vectors for front-end processing [23]. This system was more suited to the FS-1 SID data, since that data provided at least 10 seconds of speech content per speaker. Due to the challenging nature of the current FS-2 SID data, with short utterances and limited speech for many speakers, i-Vector and x-Vector embeddings [31, 32] provide limited speaker separability, as illustrated by the t-SNE plots in Figure 3.

Figure 3: Reduced-dimensional i-Vector embedding (left) and x-Vector embedding (right) t-SNE plots for 140 speakers [33].
To provide an alternate baseline system more suited to the revised SID data, a SincNet system was used [34]. Input data were normalized and preprocessed to provide speech frames using the rVAD system (which ranked 4th in the FS-1 SAD task) [35]. The rVAD threshold was optimized to provide strict speech boundaries, and the SincNet was trained for 360 epochs. This system (shown in Figure 4) provided a Top-5 accuracy of 72.46%, which is a 30% absolute improvement over the FS-1 SID baseline system.
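Top-5 accuracy counts a trial as correct when the reference speaker appears among the system's five highest-scoring speaker labels; a minimal sketch of that computation follows (array names are illustrative, not part of the released scoring toolkit):

```python
import numpy as np

def top5_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: (n_trials, n_speakers) similarity/posterior matrix,
    labels: (n_trials,) integer reference speaker indices."""
    # Indices of the five highest-scoring speakers for each trial.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    hits = (top5 == labels[:, None]).any(axis=1)
    return float(hits.mean())
```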
6. Discussion
Figure 4: rVAD-SincNet based SID baseline system [34, 35].

The FS-2 Challenge concluded with 111 system submissions across all tasks. While this is similar to the 116 system submissions received for the FS-1 challenge, participation in both tracks of the SD and ASR tasks was noticeably higher. The systems developed for FS-2 also exhibited vast improvements in performance compared to the best systems developed for the FS-1 challenge [2, 11, 12, 13, 15], as seen in Table 5. We observed relative improvements of 67%, 57%, and 62% for the SAD, speaker diarization from scratch, and speech recognition from audio streams tasks, respectively. These top-ranked systems from the community will be used to develop baselines for the next phase of the challenge, FS-3.
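The relative improvements quoted above follow directly from the error rates in Table 5; for example, for ASR track1:

\[
\text{Rel. Imp.} = \frac{\mathrm{WER}_{\mathrm{FS\text{-}1}} - \mathrm{WER}_{\mathrm{FS\text{-}2}}}{\mathrm{WER}_{\mathrm{FS\text{-}1}}} \times 100\% = \frac{63.97 - 24.01}{63.97} \times 100\% \approx 62\%.
\]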
Table 5: Comparison of the best systems developed for all FS-1 and FS-2 challenge tasks. The relative improvement of the top-ranked system per task in FS-2 over FS-1 is shown.
Comparison of Best System Submissions
Task | FS-1 (%) | FS-2 (%) | Rel. Imp. (%)
SAD | 3.31 | 1.07 | 67
SID | 89.94 | 92.39 | 2.72
SD track1 | 68.23 | 28.85 | 57
SD track2 | N/A | | N/A
ASR track1 | 63.97 | 24.01 | 62
ASR track2 | N/A | | N/A
7. Conclusions
The FEARLESS STEPS Challenge phases are aimed at developing robust speech and language systems for multi-party naturalistic audio. FS-2 enabled the development of new state-of-the-art supervised systems for core-speech tasks on Apollo data through its Challenge Corpus. Train, Dev, and Eval sets compatible with multi-channel challenges were also developed. The final phase (FS-3) of the Fearless Steps initiative will include single- and multi-channel core-speech tasks on the available 100 hours, plus 20 hours of as yet unreleased Apollo-13 multi-channel audio ("Houston, we've had a problem!"). System advancements through FS-2 have also accelerated the development of conversational analysis and natural language understanding tasks for FS-3, such as hot-spot detection, topic summarization, and emotion detection.
8. Acknowledgements
This project was supported in part by AFRL under contract FA8750-15-1-0205, NSF-CISE Project 1219130, and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen. We would also like to thank Tatiana Korelsky and the National Science Foundation (NSF) for their support on this scientific and historical project. A special thanks to Katelyn Foxworth (CRSS Transcription Team) for leading the ground-truth development efforts on the FS-2 Challenge Corpus.

9. References

[1] J. H. Hansen, A. Sangwan, A. Joglekar, A. E. Bulut, L. Kaushik, and C. Yu, "Fearless Steps: Apollo-11 Corpus Advancements for Speech Technologies from Earth to the Moon," in Proc. Interspeech 2018, 2018, pp. 2758-2762. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1942
[2] J. H. Hansen, A. Joglekar, M. C. Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 Inaugural Fearless Steps Challenge: A Giant Leap for Naturalistic Audio," in Proc. Interspeech 2019, 2019, pp. 1851-1855. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2301
[3] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181-190, 2007.
[4] M. Harper, "The Automatic Speech Recognition in Reverberant Environments (ASpIRE) challenge," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 547-554.
[5] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines," in Proc. Interspeech 2018, 2018, pp. 1561-1565. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1768
[6] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2013, pp. 1-4.
[7] F. R. Byers, J. G. Fiscus, S. O. Sadjadi, G. A. Sanders, and M. A. Przybocki, "Open Speech Analytic Technologies Pilot Evaluation (OpenSAT) Pilot," NIST, Tech. Rep., 2019.
[8] G. E. Hinton, T. J. Sejnowski, T. A. Poggio et al., Unsupervised Learning: Foundations of Neural Computation. MIT Press, 1999.
[9] A. Sangwan, L. Kaushik, C. Yu, J. H. Hansen, and D. W. Oard, "Houston, We Have a Solution: Using NASA Apollo Program to Advance Speech and Language Processing Technology," in INTERSPEECH, 2013.
[11] Proc. Interspeech 2019, pp. 2015-2019, 2019.
[12] A. Vafeiadis, E. Fanioudakis, I. Potamitis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, and R. Hamzaoui, "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," in Proc. Interspeech 2019. International Speech Communication Association, 2019.
[13] G. Deshpande, V. S. Viraraghavan, and R. Gavas, "A Successive Difference Feature for Detecting Emotional Valence from Speech," in Proc. SMM19, Workshop on Speech, Music and Mind 2019, 2019, pp. 36-40.
[14] P. Fallgren, Z. Malisz, and J. Edlund, "How to annotate 100 hours in 45 minutes," in Proc. Interspeech 2019, 2019, pp. 341-345.
[15] V. Manohar, "Semi-Supervised Training for Automatic Speech Recognition," Ph.D. dissertation, Johns Hopkins University, 2019.
[16] J. H. Hansen, A. Joglekar, A. Sangwan, and C. Yu, "Fearless Steps: Taking the next step towards advanced speech technology for naturalistic audio," The Journal of the Acoustical Society of America, vol. 146, no. 4, pp. 2956-2956, 2019.
[17] A. Joglekar and J. H. Hansen, "Fearless Steps, NASA's first heroes: Conversational speech analysis of the Apollo-11 mission control personnel," The Journal of the Acoustical Society of America, vol. 146, no. 4, pp. 2956-2956, 2019.
[18] A. Stolcke, "SRILM - an Extensible Language Modeling Toolkit," in Seventh International Conference on Spoken Language Processing, 2002.
[19] L. N. Kaushik, "Conversational Speech Understanding in Highly Naturalistic Audio Streams," Ph.D. dissertation, University of Texas at Dallas, 2018.
[20] S. O. Sadjadi and J. H. L. Hansen, "Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux," IEEE Signal Processing Letters, vol. 20, no. 3, pp. 197-200, March 2013.
[21] V. Kothapally and J. H. Hansen, "Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments," in INTERSPEECH, 2017, pp. 1948-1952.
[22] H. Dubey, A. Sangwan, and J. H. Hansen, "Robust Speaker Clustering using Mixtures of von Mises-Fisher Distributions for Naturalistic Audio Streams," in Proc. Interspeech 2018, 2018, pp. 3603-3607. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-50
[23] F. Bahmaninezhad and J. H. L. Hansen, "i-Vector/PLDA Speaker Recognition using Support Vectors with Discriminant Analysis," in IEEE ICASSP, March 2017, pp. 5410-5414.
[24] W. Xia, J. Huang, and J. H. Hansen, "Cross-lingual Text-independent Speaker Verification Using Unsupervised Adversarial Discriminative Domain Adaptation," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5816-5820.
[25] "NIST Rich Transcription Spring 2003 Evaluation," https://catalog.ldc.upenn.edu/LDC2007S10, accessed: 2019-03-01.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[27] C. S. Greenberg, D. Bansé, G. R. Doddington, D. Garcia-Romero, J. J. Godfrey, T. Kinnunen, A. F. Martin, A. McCree, M. Przybocki, and D. A. Reynolds, "The NIST 2014 Speaker Recognition i-Vector Machine Learning Challenge," in Odyssey: The Speaker and Language Recognition Workshop, 2014, pp. 224-230.
[28] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD Challenge Evaluation Plan," 2018.
[29] A. Ziaei, L. Kaushik, A. Sangwan, J. H. Hansen, and D. W. Oard, "Speech Activity Detection for NASA Apollo Space Missions: Challenges and Solutions," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[30] N. Bendre, N. Ebadi, J. J. Prevost, and P. Najafirad, "Human Action Performance Using Deep Neuro-Fuzzy Recurrent Attention Model," IEEE Access, vol. 8, pp. 57749-57761, 2020.
[31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
[32] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in IEEE ICASSP. IEEE, 2018, pp. 5329-5333.
[33] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[34] M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021-1028.
[35] Z.-H. Tan, N. Dehak et al., "rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method," Computer Speech & Language, 2020.