An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Daniel Michelsanti, Student Member, IEEE, Zheng-Hua Tan, Senior Member, IEEE, Shi-Xiong Zhang, Member, IEEE, Yong Xu, Member, IEEE, Meng Yu, Dong Yu, Fellow, IEEE, and Jesper Jensen
Abstract — Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. More recently, visual information from the target speakers, such as lip movements and facial expressions, has been introduced to speech enhancement and speech separation systems, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving state-of-the-art performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions. We also survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation.

Index Terms — Speech enhancement, speech separation, speech synthesis, sound source separation, deep learning, audio-visual processing.
D. Michelsanti and Z.-H. Tan are with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark (e-mail: {danmi, zt}@es.aau.dk). Shi-Xiong Zhang, Yong Xu, Meng Yu, and Dong Yu are with Tencent AI Lab, Bellevue, WA, USA (e-mail: {auszhang, lucayongxu, raymondmyu, dyu}@tencent.com). J. Jensen is with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark (e-mail: [email protected]).

I. INTRODUCTION

SPEECH is one of the primary ways in which humans share information. A model that describes human speech communication is the so-called speech chain, which consists of two stages: speech production and speech perception [49]. Speech production is the set of voluntary and involuntary actions that allow a person, i.e. a speaker, to convert an idea expressed through a linguistic structure into a sound pressure wave. On the other hand, speech perception is the process, happening mostly in the auditory system of a listener, of interpreting the sound pressure wave coming from the speaker. Some external factors, such as acoustic background noise, can have an impact on the speech chain. Usually, normal-hearing listeners are able to focus on a specific acoustic stimulus, in our case the target speech or speech of interest, while filtering out other sounds [25], [228]. This well-known phenomenon is called the cocktail party effect [34], because it resembles the situation occurring at a cocktail party. Generally, the presence of high-level acoustic environmental noise or competing speakers poses several challenges to the effectiveness of speech communication, especially for hearing-impaired listeners. Similarly, the performance of automatic speech recognition (ASR) systems can be severely impacted by a high level of acoustic noise. Therefore, several signal processing and machine learning techniques to be employed in e.g. hearing aids and ASR front-end units have been developed to perform speech enhancement (SE), which is the task of recovering the clean speech of a target speaker immersed in a noisy environment. On the other hand, some applications require the estimation of multiple target signals. This task is known in the literature as source separation, or speech separation (SS) when the signals of interest are all speech signals.

Classical SE and SS approaches (cf. [162], [260] and references therein) make assumptions regarding the statistical characteristics of the signals involved and aim at estimating the underlying target speech signal(s) according to mathematically tractable criteria. More recent methods based on deep learning tend to depart from this knowledge-based modelling, embracing a data-driven paradigm, where SE and SS are treated as supervised learning problems [261].

The techniques mentioned above consider only acoustic signals, so we refer to them as audio-only SE (AO-SE) and audio-only SS (AO-SS) systems. However, speech perception is inherently multimodal, in particular audio-visual (AV), because in addition to the acoustic speech signal reaching the ears of the listeners, the location and movements of some articulatory organs that contribute to speech production, e.g. tongue, teeth, lips, jaw and facial expressions, may also be visible to the receiver. Studies in neuroscience [78], [202] and speech perception [174], [236] have shown that the visual aspect of speech has a potentially strong impact on the ability of humans to focus their auditory attention on a particular stimulus. These findings inspired the first audio-visual SE (AV-SE) and audio-visual SS (AV-SS) works [47], [73], which demonstrated the benefit of using features extracted from the video of a speaker.
Later, more complex frameworks based on classical statistical approaches have been proposed [2], [14], [137], [155], [159], [171], [186], [187], [231], [232], but they have very recently been outperformed by deep learning methods, such as [7], [10], [12], [55], [66], [77], [85], [99], [122], [128], [164], [165], [178], [195], [222], [244], [273], [274].

Despite the large amount of recent research and the interest in AV methods, no overview article currently focuses on deep-learning-based AV-SE and AV-SS. The survey article by Wang and Chen [261] is the most extensive overview on deep-learning-based AO-SE and AO-SS for both single-microphone and multi-microphone settings, but it does not cover AV methods. The overview article by Rivet et al. [213] surveys AV-SS techniques, but it dates back to 2014, when deep learning was still not adopted for the task. Multimodal methods are also covered by Taha and Hussain [242] in their survey on SE techniques. However, six AV-SE papers are discussed in total, and only one of these is based on deep learning. A limited number of deep learning approaches for AV-SE and AV-SS were described in [212], [289]. In the first case, Rincón-Trujillo and Córdova-Esparza [212] performed an analysis of deep-learning-based SS methods. They considered both AO-SS and AV-SS, with only five AV papers discussed. In the second case, Zhu et al. [289] provided a bird's-eye view of several AV tasks to which deep learning has been applied. Although AV-SE and AV-SS are discussed, the presentation covers only five approaches.

In this paper, we present an extensive survey of recent advances in AV methods for SE and SS, with a specific focus on deep-learning-based techniques. Our goal is to help the reader to navigate through the different approaches in the literature. Given this objective, we try not to recommend one approach over another based on its performance, because a comparison of systems designed for a heterogeneous set of applications might be unfair. Instead, we provide a systematic description of the main ideas and components that characterise deep-learning-based AV-SE and AV-SS systems, hoping to inspire and stimulate new research in the field. This is also the reason why current challenges and possible future directions are presented and discussed throughout the paper, generally at the end of a section or a subsection. Furthermore, we review AV datasets and evaluation methods, because they are two important elements used to train and assess the performance of the systems, respectively. In the final part of the paper, an overview of two strongly related research topics, i.e. speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, is also provided, because methods and ideas in these areas could be directly applied to AV-SE and AV-SS. A list of resources for datasets, objective measures and several AV approaches can be accessed at the following link: https://github.com/danmic/av-se. There, we provide direct links to available demos and source code, which it would not be possible to include in this paper due to space limitations. Our goal is to allow both beginners and experts in the field to easily access a collection of relevant resources.

The rest of this paper is organised as follows. Section II introduces the basic signal model to provide a formulation of the AV-SE and AV-SS problems. Section III surveys relevant AV speech datasets that can be used to train deep-learning-based models.
Section IV presents a range of methodologies that may be considered for performance assessment. Section V reviews deep-learning-based AV-SE and AV-SS systems, focusing on the main elements that characterise them. Section VI deals with speech reconstruction from silent videos and AV sound source separation for non-speech signals. Finally, Section VII provides a conclusion, summarising the principal concepts and the potential future research directions presented throughout the paper.

II. SIGNAL MODEL AND PROBLEM FORMULATION
Let h_s[n] denote the impulse response from the spatial position of the s-th target source to the microphone, with n indicating a discrete-time index. Furthermore, let h_s[n] = h^e_s[n] + h^l_s[n], where h^e_s[n] is the early part of h_s[n] (containing the direct sound and low-order reflections) and h^l_s[n] is the late part of h_s[n]. Assuming a total number of S target speech signals and a number of C additive noise sources, the observed acoustic mixture signal can be modelled as:

y[n] = \sum_{s=1}^{S} \underbrace{x'_s[n] \ast h^{e}_{s}[n]}_{x_s[n]} + \underbrace{\sum_{s=1}^{S} x'_s[n] \ast h^{l}_{s}[n] + \sum_{c=1}^{C} d_c[n]}_{d[n]},    (1)

where x'_s[n] is the speech signal emitted at the s-th target speaker position, x_s[n] is the clean speech signal from the s-th target speaker at the microphone (including low-order reflections), d_c[n] is the signal from the c-th noise source as observed at the microphone, and d[n] indicates the total contribution from noise and late reverberation. (While preserving early reflections is important in some applications, e.g. hearing aids, in other cases the goal is to determine only estimates of x'_s[n]. This observation does not have a big impact on the formulation of the problem, therefore we do not make a distinction between the two cases.) Besides y[n], let v[m] indicate the observed two-dimensional visual signal, with m denoting a discrete-time index different from n, because the acoustic and the visual signals are usually not sampled at the same sampling rate.

Given y[n] and v[m], the task of AV-SS consists of determining estimates x̂_s[n] of x_s[n], with s = 1, ..., S. In some setups, additional information is available, for example a speaker's enrolment acoustic signal and a training set collected at a time and location different from the recordings of y[n] and v[m].

When S = 1, we refer to the task as AV-SE and rewrite Eq. (1) as:

y[n] = x[n] + d[n],    (2)

with x[n] denoting x_1[n].

Due to the linearity of the short-time Fourier transform (STFT), it is possible to express the acoustic signal model of Eqs. (1) and (2) in the time-frequency (TF) domain as:

Y(k, l) = \sum_{s=1}^{S} X_s(k, l) + D(k, l),    (3)

for SS, and as:

Y(k, l) = X(k, l) + D(k, l),    (4)

for SE, where k denotes a frequency bin index, l indicates a time frame index, and Y(k, l), X_s(k, l) and D(k, l) are the STFT coefficients of the mixture, the s-th target signal, and the noise, respectively.

The definitions provided above are valid for single-microphone single-camera AV-SE and AV-SS. It is possible to extend all the concepts to the case of multiple acoustic and visual signals. Let F and P be the number of cameras and microphones of a system, respectively. We denote as v_f[m] the visual signal observed with the f-th camera. Assuming S speakers to separate, the acoustic mixture as received by the p-th microphone can be modelled as:

y_p[n] = \sum_{s=1}^{S} \underbrace{x'_s[n] \ast h^{e}_{ps}[n]}_{x_{ps}[n]} + \underbrace{\sum_{s=1}^{S} x'_s[n] \ast h^{l}_{ps}[n] + \sum_{c=1}^{C} d_{pc}[n]}_{d_p[n]}.    (5)

In this case, the SS task consists of determining estimates x̂_s[n] of x_{p*s}[n] for s = 1, ..., S, given v_f[m] with f = 1, ..., F, y_p[n] with p = 1, ..., P and any other additional information, assuming that the microphone with index p = p* is a pre-defined reference microphone.
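For readers who prefer a computational view of Eqs. (1)–(4), the following minimal NumPy/SciPy sketch simulates a single-microphone mixture and verifies the additivity of its STFT representation. The impulse response, noise term and all parameter values are placeholders, not taken from any of the surveyed works.

```python
import numpy as np
from scipy.signal import fftconvolve, stft

fs = 16000                                   # assumed sampling rate
rng = np.random.default_rng(0)

# Placeholder signals: in practice x_prime is a clean utterance and
# h_e an estimated or simulated early room impulse response.
x_prime = rng.standard_normal(3 * fs)        # speech at the speaker position
h_e = np.zeros(800)                          # assumed early impulse response:
h_e[0], h_e[200] = 1.0, 0.4                  # direct path plus one early reflection
d = 0.1 * rng.standard_normal(3 * fs + len(h_e) - 1)  # noise + late reverberation

x = fftconvolve(x_prime, h_e)                # clean speech at the microphone, Eq. (1)
y = x + d                                    # observed mixture, Eq. (2)

# Eq. (4): the STFT is linear, so Y(k, l) = X(k, l) + D(k, l).
_, _, Y = stft(y, fs, nperseg=512)
_, _, X = stft(x, fs, nperseg=512)
_, _, D = stft(d, fs, nperseg=512)
assert np.allclose(Y, X + D)
```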
III. AUDIO-VISUAL CORPORA

One of the key aspects that allowed the recent progress and adoption of deep learning techniques for a range of different tasks is the availability of large-scale datasets. Therefore, the choice of a database is critical and is determined by the specific purpose of the research that needs to be conducted. With this Section, our goal is to provide a non-exhaustive overview of existing resources which will hopefully help the reader to choose AV datasets that suit their purpose. In the following, we present the most commonly used databases for AV-SE and AV-SS, highlighting their main characteristics.

Table I shows information regarding AV speech datasets, including: year of publication, number of speakers, linguistic content, video resolution and frame rate, audio sample frequency, additional information (e.g. recording settings) and the AV-SE and AV-SS papers in which the datasets are used for the experiments. We notice that the two most commonly used databases in the area of deep-learning-based AV-SE and AV-SS are GRID [43] and TCD-TIMIT [89]. GRID consists of audio and video recordings where 34 speakers (18 males and 16 females) pronounce 1000 sentences each. The data was collected in a controlled environment: the speakers were placed in front of a plain blue wall inside an acoustically isolated booth and their face was uniformly illuminated. A GRID sentence has the following structure: <command(4)> <color(4)> <preposition(4)> <letter(25)> <digit(10)> <adverb(4)>, where the number of choices for each word is indicated in parentheses. Although the number of possible command combinations using such a sentence structure is high, the vocabulary is small, with only 51 words. This may pose limitations to the generalisation performance of a deep learning model trained with this database. Similar to GRID, TCD-TIMIT consists of recordings in a controlled environment, where the speaker is in front of a plain green wall and their face is evenly illuminated. Compared to GRID, TCD-TIMIT has more speakers, 62 in total (32 males and 30 females, three of which are lipspeakers, i.e. professionals trained to make their mouth movements more distinctive: they silently repeat a spoken talk, making lipreading easier for hearing-impaired listeners [89]), and they pronounce a phonetically balanced group of sentences from the TIMIT corpus. Other databases have characteristics similar to GRID and TCD-TIMIT (e.g. [1], [13], [16], [204], [223]), but these two are still the most adopted ones, probably for two reasons: the amount of data in them is suitable to train reasonably large deep-learning-based models; their adoption in early AV-SE and AV-SS techniques has made them benchmark datasets for these tasks.

More recently, the research community has put effort into gathering data in the wild, in other words recordings from different sources without the careful planning and setting of the controlled environment used in conventional datasets, like the already mentioned GRID and TCD-TIMIT. The goal of collecting such large-scale datasets, characterised by a vast variety of speakers, sentences, languages and visual/auditory environments, not necessarily in a controlled lab setup, is to have data that resemble real-world recordings. One of the first in-the-wild AV speech databases is LRW [40]. LRW was specifically collected for visual speech recognition (VSR) and consists of around 170 hours of AV material from British television programs. The utterances in the dataset are spoken by hundreds of speakers and are divided into 500 classes. Each sentence of a class contains a non-isolated keyword between 5 and 10 characters.
The trend of collecting larger datasets has continued in subsequent collections, which consist of material from British television programs [8], [39], [41], generic YouTube videos [38], [185], TED talks [9], [55], movies [217] or lectures [55], [244]. Among them, AVSpeech is the largest dataset used for AV-SE and AV-SS, with its 4,700 hours of AV material. It consists of a wide range of speakers (150,000 in total), languages (mostly English, but also Portuguese, Russian, Spanish, German and others) and head poses (with different pan and tilt angles). Each video clip contains only one talking person and does not have acoustic background interferences.

The large-scale in-the-wild databases, as opposed to the ones containing recordings in controlled environments, are particularly suitable for training deep models that must perform robustly in real-world situations. Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine the importance that supplementary information from the depth modality might have; ASPIRE [77], to evaluate the systems in real noisy environments.
TABLE I: Main audio-visual speech datasets. The last column indicates the audio-visual speech enhancement and separation articles where the database has been used. (For each dataset the table reports: year, number of speakers, linguistic content, video resolution and frame rate, audio sample frequency, notes on the recording conditions, and the AV-SE/SS papers adopting it. The entries include, among others, GRID [43], OuluVS [286], LDC2009V01 [211], TCD-TIMIT [89], OuluVS2 [16], KinectDigits [224], LRW [40], MODALITY [46], NTCD-TIMIT [1], LRS [39], MV-LRS [41], VoxCeleb [185], Lombard GRID [13], RAVDESS [161], LRS2 [8], LRS3 [9], AVSpeech [55], AV Chinese Mandarin [244], AVA-ActiveSpeaker [82], [217], ASPIRE [77] and two Mandarin sentence corpora [66], [100].)

IV. PERFORMANCE ASSESSMENT
The main aspects generally of interest for SE and SS are quality and intelligibility. Speech quality is largely subjective [49], [162] and can be defined as the result of a judgement based on the characteristics that allow speech to be perceived according to the expectations of a listener [123]. Given the high number of dimensions that the quality attribute possesses and each person's different subjective concept of what is high and low quality, a large variability is usually observed in the results of speech quality assessments [162]. On the other hand, intelligibility can be considered a more objective attribute, because it refers to the speech content [162]. Still, a variability in the results of intelligibility assessments can be observed due to individual speech perception, which has an impact on the ability to recognise words and/or phonemes in different situations.

In the rest of this Section, we review how AV-SE and AV-SS systems are evaluated, with a particular focus on speech quality and speech intelligibility. A summary of the different methods and measures used in the literature is shown in Table II.
A. Listening Tests
A proper assessment of SE and SS systems should be conducted on the actual receiver of the signals. In many scenarios (e.g. hearing assistive devices, teleconferences etc.), the receiver is a human user and listening tests, performed with a population of the expected end users, are the most reliable way to carry out the evaluation.

The tests that are currently employed for the assessment of AV-SE and AV-SS systems are typically adopted from the audio-only domain, i.e. they follow procedures validated for AO-SE and AO-SS techniques. Although different kinds of listening tests exist, some general recommendations include:
• Several subjects are required to be part of the assessment. The number depends on the task, the listeners' experience and the magnitude of the performance differences (e.g. between a system under development and its predecessor) that one wishes to detect. Generally, fewer subjects are required if they are expert listeners.
• Before the actual test, a training phase allows the subjects to familiarise themselves with the material and the task.
• The speech signals are presented to the listeners in a random order.
• To reduce the impact of listening fatigue, long test sessions are avoided.

The most common method used in SE [162] to assess speech quality is the mean opinion score (MOS) test [113], [115], [116]. This test is characterised by a five-point rating scale (cf. the 'OVRL' column of Table III) and was adopted in three AV works [3], [5], [6]. However, the MOS scale was originally designed for speech coders, which introduce different distortions than the ones found in SE [162]. Therefore, an extended standard [118] was proposed and five-point discrete scales were used to rate not only the overall (OVRL) quality (like in the MOS test), but also the signal (SIG) distortion and the background (BAK) noise intrusiveness (cf. Table III). This kind of assessment was adopted to evaluate the AV system in [99].

A distinct quality assessment procedure, the multi stimulus test with hidden reference and anchor (MUSHRA) [114], was used in [75], [77], [178]. In this case, the listeners are presented with speech signals to be rated using a continuous scale from 0 to 100, consisting of 5 equal intervals labelled as 'bad', 'poor', 'fair', 'good', and 'excellent'. The test is divided into several sessions. In each session, the subjects are asked to rate a fixed number of signals under test (processed and/or noisy speech signals), one hidden reference (the underlying clean speech signal) and at least one hidden anchor (a low-quality version of the reference). In addition, the clean speech signal (i.e. the unhidden reference) is provided. The hidden reference allows one to understand whether the subject is able to detect the artefacts of the processed signals, while the hidden anchor provides a lowest-quality fixed point in the MUSHRA scale, determining the dynamic range of the test. Having the possibility to switch among the signals at will, the listeners can make comparisons with a high degree of resolution.

Together with speech quality, intelligibility should also be assessed with appropriate listening tests. It is possible to group the intelligibility tests into three classes, based on the speech material adopted [162]:
• Nonsense syllable tests – Listeners need to recognise nonsense syllables drawn from a list [64], [180]. Usually, it is hard to build such a list of syllables where each item is equally difficult to identify, hence these tests are not very common.
• Word tests – Listeners are asked to identify words drawn from a phonetically balanced list [53] or rhyming words [60], [101], [255]. Among these tests, the diagnostic rhyme test (DRT) [255] is extensively adopted to evaluate speech coders [162]. The main criticism about word tests is that they may be unable to predict the intelligibility in real-world scenarios, where a listener is usually exposed to sentences, not single words.
• Sentence tests – Listeners are presented with sentences and are asked to identify keywords or recognise the whole utterances. It is possible to distinguish between the tests that use everyday sentences [129], [191] and the ones that use sentences with a fixed syntactical structure, known as matrix tests [86], [95], [197], [257]–[259]. One of the most commonly used sentence tests is the hearing in noise test (HINT) [191], also adapted for different languages, including Canadian-French [248], Cantonese [272], Danish [189], [190] and Swedish [87].

A simple way to quantify the intelligibility for the previously mentioned tests is to calculate the so-called percentage intelligibility [162]. This measure indicates the percentage of correctly identified syllables, words or sentences at a fixed signal-to-noise ratio (SNR). The main drawback is that it might be hard to find the SNR at which the test can be optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNRs within a pre-determined range, at the expense of the time needed to conduct the listening experiments. As an alternative, speech intelligibility can be measured in terms of the so-called speech reception threshold (SRT), which is the SNR at which listeners correctly identify the material they are exposed to with a 50% accuracy [162] (variants exist where a different percentage is used). The SRT is determined with an adaptive procedure, where the SNR of the presented stimuli increases or decreases by a fixed amount at every trial based on the subject's previous response (a schematic sketch of such a procedure is given at the end of this subsection). In this case, the main drawback is that the test is not informative for SNRs that substantially differ from the determined SRT.

Speech intelligibility tests are yet to be adopted by the AV-SE and AV-SS community. In fact, an intelligibility evaluation involving human subjects for AV-SE can only be found in [178]. There, listeners were exposed to speech signals from the Lombard GRID corpus [13] processed with several systems and were asked to determine three keywords in each sentence. The results were reported in terms of percentage intelligibility for four different SNRs distributed in uniform steps, the lowest being −20 dB. Compared to audio-only experiments, AV listening tests pose additional challenges:
• There is a big difference among individuals in lip-reading abilities. This difference is not reflected in the variation in auditory perception skills [237].
• The per-subject fusion response to discrepancies between the auditory and the visual syllables is large and unpredictable [172].
• The availability of visual information makes ceiling effects more probable to occur.

These considerations suggest a strong need for exploration and development of ecologically valid paradigms for AV listening tests [106], which should reduce the variability of the results and provide a robust and reliable estimation of the performance in real-world scenarios. A first step towards achieving this goal is to perform tests in which the subjects are carefully selected within a homogeneous group and exposed to AV speech signals that resemble actual conversational settings from a visual and an acoustic perspective.

TABLE II: Main performance assessment methods for audio-visual speech enhancement and separation. The last column indicates the audio-visual speech enhancement and separation articles where the evaluation method has been used. (Columns: type of method, evaluation method, year, notes, and the AV-SE/SS papers adopting it.)

TABLE III: Signal (SIG), background (BAK) and overall (OVRL) quality rating scales according to [118]. The overall quality scale is the same as the mean opinion score scale.

Rating | SIG                | BAK                          | OVRL
5      | Not distorted      | Not noticeable               | Excellent
4      | Slightly distorted | Slightly noticeable          | Good
3      | Somewhat distorted | Noticeable but not intrusive | Fair
2      | Fairly distorted   | Somewhat intrusive           | Poor
1      | Very distorted     | Very intrusive               | Bad
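As a schematic illustration of the adaptive SRT procedure described earlier in this subsection, the sketch below runs a simple 1-up/1-down staircase, which converges towards the 50%-correct point of an assumed psychometric function. The present_trial function, the true SRT of −8 dB and all step sizes are hypothetical stand-ins for an actual listening trial, not values from the surveyed works.

```python
import numpy as np

def present_trial(snr_db, rng):
    """Stand-in for a real listening trial: returns True if the listener
    correctly identifies the material presented at this SNR."""
    # Hypothetical psychometric function with a true SRT of -8 dB.
    p_correct = 1.0 / (1.0 + np.exp(-(snr_db + 8.0)))
    return rng.random() < p_correct

def estimate_srt(start_snr_db=0.0, step_db=2.0, n_trials=30, seed=0):
    """1-up/1-down staircase: lower the SNR after a correct response,
    raise it after an incorrect one; the track oscillates around the
    SNR yielding 50% correct responses (the SRT)."""
    rng = np.random.default_rng(seed)
    snr, track = start_snr_db, []
    for _ in range(n_trials):
        correct = present_trial(snr, rng)
        track.append(snr)
        snr += -step_db if correct else step_db
    # A simple estimate: average the SNRs visited after an initial run-in.
    return float(np.mean(track[n_trials // 2:]))

print(f"Estimated SRT: {estimate_srt():.1f} dB")
```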
B. Objective Measures
Listening tests are ideal for assessing the performance of SE and SS systems. However, conducting such tests can be time consuming and costly [162], in addition to requiring access to a representative group of end users. Therefore, researchers have developed algorithmic methods for repeatable and fast evaluation, able to estimate the results of listening tests without listening fatigue effects. Such methods are often called objective measures and most of them exploit knowledge of low-level (e.g. psychoacoustics) and high-level (e.g. linguistics) human processing of speech [162] (cf. Table II).

The most widely used objective measure to assess speech quality for AV-SE and AV-SS is the perceptual evaluation of speech quality (PESQ) measure [117], [119], [120], [214]. PESQ was originally designed for telephone networks and codecs. It is a fairly complex algorithm consisting of several components, including level equalisation, pre-processing filtering, time alignment, perceptual filtering, disturbance processing and time averaging. All these steps are used to take into account relevant psychoacoustic principles:
• The frequency resolution of the human auditory system is not uniform, showing a higher discrimination for low frequencies [235].
• Human loudness perception is not linear, meaning that the ability to perceive changes in sound level varies with frequency [292].
• Masking effects might hinder the perception of weak sounds [71].

The output of PESQ is supposed to approximate the MOS score and is a value generally ranging between 1 and 4.5, although a lower score can be observed for extremely distorted speech signals. Rix et al. [214] reported a high correlation with listening tests in several conditions, i.e. mobile, fixed, voice over IP (VoIP) and multiple type networks. A later study [104] showed that PESQ also correlates well with the overall quality of signals processed with common SE algorithms.

As new network and headset technologies were introduced, PESQ was not able to accurately predict speech quality. Therefore, a new measure, the perceptual objective listening quality assessment (POLQA) [121], was introduced. POLQA is considered the successor of PESQ and is particularly recommended in scenarios where its predecessor performs poorly or cannot be used, e.g. for high background noise, super-wideband speech, variable delay and time scaling. Although POLQA correlates well with listening test results, outperforming PESQ [22], [121], it has not been used to evaluate AV-SE and AV-SS systems yet.

For SS techniques, assessing the overall quality of the processed signals might not be sufficient, because it is desirable to have measures that characterise different speech quality degradation factors. For this reason, the majority of AV-SS systems are evaluated using the set of measures contained in the blind source separation (BSS) Eval toolkit [252]. The computation of these measures consists of two steps. First, each of the processed signals is decomposed into four terms, representing the components perceived as coming from: the desired speaker, other target speakers (generating cross-talk artefacts), noise sources and other causes (e.g. processing artefacts). The second step provides performance criteria from the computation of energy ratios related to the previous four terms: source to distortion ratio (SDR), source to interferences ratio (SIR), sources to noise ratio and sources to artefacts ratio (SAR). Although a reasonable correlation was found between SIR and human ratings of interference [268], other experiments [27], [268] showed that energy-based measures are not ideal for determining perceptual sound quality for SS algorithms.

Besides speech quality estimators, objective intelligibility measures have also been developed. Among them, the short-time objective intelligibility (STOI) measure [241] is the most commonly used for AV-SE and AV-SS. STOI is based on the computation of a correlation coefficient between the short-time overlapping temporal envelope segments of the clean and the degraded/processed speech signals. It has been shown that STOI correlates well with the results of intelligibility listening experiments [61], [241], [275]. An extension of STOI, ESTOI, was later proposed [124] to provide a more accurate prediction of speech intelligibility in the presence of highly modulated noise sources.

Table II also indicates other measures that we have not presented above, because they are less adopted in AV-SE and AV-SS works. However, it is worth mentioning some of them, since they can be used by researchers to evaluate the systems for specific purposes. For example, the hearing-aid speech quality index (HASQI) [131], [133] and the hearing-aid speech perception index (HASPI) [132] are two measures that have been specifically designed to evaluate speech quality and intelligibility as perceived by hearing-impaired listeners. Sometimes, the evaluation of a system is expressed in terms of word error rate (WER) as measured by an ASR system (cf. Table II). This measure assumes that the receiver of the signals is a machine, not a human, and it provides additional performance information for specific applications, e.g. video captioning for teleconferences or augmented reality.

Most of the objective measures used to evaluate AV-SE and AV-SS systems have two main limitations that should desirably be addressed in future works. First, they require the target speech signal(s) in order to produce a quality or an intelligibility estimate of the degraded/processed signal(s). These measures are known as intrusive estimators. For algorithm development, where a clean speech reference is readily available, this assumption is reasonable. However, for in-the-wild tests, it is not possible to collect reference signals and intrusive estimators cannot be adopted.

The other limitation is the use of audio-only signals in all the objective measures. As already pointed out for listening tests, ignoring the visual component of speech may cause an erroneous estimation of the system performance in many real-world situations, where the listener is able to look at the speaker. In order to develop new predictors of quality and intelligibility in an AV context, a substantial amount of data from AV listening tests is required. When such AV data is available, it would be possible to understand the factors influencing human AV perception of processed speech and properly design and validate new objective measures.
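To make the energy-ratio measures concrete, the sketch below implements SI-SDR [150] directly from its definition and indicates how STOI/ESTOI and PESQ scores are commonly obtained from the third-party pystoi and pesq Python packages. The signals are placeholders, and the package calls are an assumption of this overview rather than a requirement of the cited measures.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR [150] in dB: the estimate is compared against
    an optimally scaled version of the reference signal."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(4 * fs)                      # stand-in for the clean target
processed = clean + 0.2 * rng.standard_normal(4 * fs)    # stand-in for an enhanced signal

print(f"SI-SDR: {si_sdr(clean, processed):.1f} dB")

# Intrusive quality/intelligibility estimators (assumed third-party packages):
# from pystoi import stoi
# from pesq import pesq
# print(stoi(clean, processed, fs, extended=True))   # ESTOI
# print(pesq(fs, clean, processed, 'wb'))            # wideband PESQ
```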
C. Beyond Speech Quality and Intelligibility
When considering SE and SS systems, aspects other than speech quality and intelligibility might be of interest to assess. Some systems, like hearing assistive devices and teleconference systems, have a low-latency requirement, because they need to deliver processed signals to allow real-time conversations. In this case, it might be relevant to report a measure of the computational efficiency of the approach under analysis. An example is the so-called real-time factor (RTF), used in [85] and defined as the ratio between the processing time and the duration of the signal.

Sometimes, a given processed speech signal could be fully intelligible, but the effort that the listener must put into the listening task could be substantial in order to be able to understand the speech content. Therefore, it might be important to measure the energy that a subject needs to invest in a listening task, i.e. the listening effort. As for speech quality and intelligibility, the listening effort may be measured with listening tests [63], [283].

Moreover, speech carries a lot of additional information, e.g. about the speaker, including gender, age, emotional state, mood, their location in the scene, etc. These aspects might be important, and SE or SS systems should ideally preserve them even after the processing of a heavily corrupted speech signal (cf. [88], in which the proposed system is specifically tested for its ability to preserve spatial cues). Standardised methods for the assessment of these aspects of AV speech are currently lacking, but they would be important to develop in order to guarantee high performance to the end users.

V. AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION SYSTEMS
The problems of AV-SE and AV-SS have recently been tackled with supervised learning techniques, specifically deep learning methods. Supervised deep-learning-based models can automatically learn how to perform SE or SS after a training procedure, in which pairs of degraded and clean speech signals, together with the video of the speakers, are presented to them. Ideally, deep-learning-based systems should be trained using data that is representative of the settings in which they are deployed. This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic scenes that are synthetically generated by adding target speech signals and signals from sources of interference at several SNRs (a minimal sketch of this procedure is given below, just before Section V-A). This way of generating synthetic training material has empirically shown its effectiveness in both audio-only (AO) and AV settings, since speech signals processed with systems trained in this way improve in terms of both estimated speech quality and intelligibility [7], [55], [143], [282].

In this Section, we focus on deep-learning-based AV-SE and AV-SS systems. Specifically, we describe the main elements that characterise these systems: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions (the last two are used only during training, not during inference). Figure 1 provides a conceptual block diagram illustrating the interconnections of these elements. Further details are reported in Table IV, where, for each element, the possibilities explored in the literature are reported.
TABLE IV: Main elements of deep-learning-based audio-visual speech enhancement and speech separation systems. The last column shows the papers in which the elements have appeared. (For each category of elements — visual features, acoustic features, deep learning methods, fusion techniques, training targets and objective functions — the table lists the options explored in the literature, e.g. raw pixels of the mouth or face, optical flow, landmark-based features, face recognition and VSR embeddings; magnitude and complex spectrograms, raw waveforms, speaker embeddings, IPD and angle features; FFNN, CNN, autoencoder, (Bi)LSTM and VAE architectures; concatenation-, addition-, product- and attention-based fusion; spectrogram, mask and waveform training targets; MSE, MAE, cosine distance, cross entropy, SI-SDR and other objective functions, together with the corresponding AV-SE/SS papers.)
Fig. 1. Interconnections between the main elements of a generic audio-visual speech enhancement/separation system based on deep learning. White boxes represent data, while grey boxes represent processing blocks.
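The following minimal sketch illustrates the synthetic mixing procedure mentioned at the beginning of this Section: an interference signal is scaled so that the resulting mixture has a desired SNR with respect to the target speech. Signal names, lengths and the sampling rate are placeholders.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db, eps=1e-12):
    """Scale `interference` so that the target-to-interference energy ratio
    of the returned mixture equals `snr_db`."""
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_target = np.mean(target**2)
    p_interf = np.mean(interference**2)
    gain = np.sqrt(p_target / (p_interf * 10.0**(snr_db / 10.0) + eps))
    return target + gain * interference

fs = 16000
rng = np.random.default_rng(0)
speech = rng.standard_normal(3 * fs)   # placeholder for a clean target utterance
noise = rng.standard_normal(3 * fs)    # placeholder for noise or a competing speaker

# Training pairs are typically generated over a range of SNRs.
mixtures = [mix_at_snr(speech, noise, snr) for snr in (-5, 0, 5)]
```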
A. Visual Features

Given a video recording, the first step of most AV-SE and AV-SS systems is to determine the number of speakers in it and track their faces across the visual frames. This is usually performed by face detection [140], [160], [254] and tracking [166], [246] algorithms. This approach allows one to considerably reduce the dimensionality of the input and, as a consequence, the number of parameters of the SE and the SS models, because only crops of the target faces are considered. In addition, face detection is one way to determine the number of speakers in a scene, and can be helpful for SS systems that can handle only a fixed number of speakers (e.g. [55]), because a priori knowledge of the number of target speech signals is needed to choose a specific trained multi-speaker model. From these considerations, we can understand the critical importance of face detection and tracking algorithms: if they fail, all the later modules would fail as well. Therefore, robust face tracking, in particular under varying light conditions, occlusions etc., is essential to guarantee high performance in real-world scenarios.

Once the video frames of the speaker's face are available, visual features can be used by AV-SE and AV-SS approaches (cf. Table IV). Many systems, such as [65] and [66], directly use a crop around the face or the mouth of the target speaker(s) as input. This approach is not always convenient: learning to perform a task from high-dimensional input consisting of raw pixels with a neural network is usually challenging and requires a large amount of data [109], [169]. Hence, several approaches are employed to reduce the input dimensions by extracting different types of features from the raw pixel input, as we report in the following.

Khan et al. [136] reduced the dimensionality of the visual information with an active appearance model (AAM) [44], which is a framework that combines appearance-based and shape-based features through principal component analysis (PCA). Other classical approaches have also been used for visual feature extraction. For example, some works [3]–[6] produced a vector of pixel intensities from the lip region of the speaker with a 2-D discrete cosine transform (DCT). Alternatively, optical flow features were used as an additional input in [65], [154], [164], [165] to explicitly incorporate motion information in the system.

Research has also been conducted to investigate the use of facial landmark points. Hou et al. [100] considered a representation of the speaker's mouth consisting of the coordinates of 18 points. Distances for each pair of these points were computed and the 20 elements with the highest variance across an utterance were provided to the SE network. Instead of the distance for each pair of landmark points, Morrone et al. [183] obtained a differential motion feature vector by subtracting the face landmark points of a video frame from the points extracted from the previous one. Motion of landmark points was also exploited by Li et al. [154], who first computed the distance for every symmetric pair of lip landmark points in the vertical and the horizontal directions, and then defined a variation vector of the lip movements consisting of the differences between the distance vectors of two contiguous video frames. This distance-based motion vector was finally combined with aspect ratio features.
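To make the landmark-based features described above more concrete, the sketch below computes pairwise lip-landmark distances (cf. [100]) and frame-to-frame differential motion features in the spirit of [183] and [154]. The landmark array is a random placeholder, and the exact landmark sets, pair selections and normalisations differ between the cited works.

```python
import numpy as np

def pairwise_distances(landmarks):
    """landmarks: (n_points, 2) lip landmark coordinates for one video frame.
    Returns the Euclidean distance of every pair of points (cf. [100])."""
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(landmarks), k=1)
    return dist[iu]

# Placeholder: 75 video frames (3 s at 25 FPS) with 18 mouth landmarks each.
rng = np.random.default_rng(0)
seq = rng.random((75, 18, 2))

# [183]-style: differential motion of the landmark coordinates themselves.
coord_motion = np.diff(seq, axis=0).reshape(len(seq) - 1, -1)   # (74, 36)

# [154]-style: distances per frame, then differences between contiguous frames.
dists = np.stack([pairwise_distances(frame) for frame in seq])  # (75, 153)
dist_motion = np.diff(dists, axis=0)                            # (74, 153)
```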
A different approach consists of extracting embeddings, i.e. meaningful representations in a typically low-dimensional projected space, with a neural network pre-trained on a related task. For example, Owens and Efros [195] proposed to use multisensory features. They designed a deep-learning-based system that could recognise whether the audio and the video streams of a recording were synchronised. The features extracted from such a network provided an AV representation that allowed superior performance compared to an AO-SE approach. Besides multisensory features, embeddings extracted with face recognition [55] or VSR [7] models have been shown to be effective. İnan et al. [109] performed a study to evaluate the differences between these two kinds of embeddings. Their results showed that VSR embeddings were able to separate voice activity and silence regions better than face recognition embeddings, which could instead provide a better distinction between speakers. Overall, the performance obtained with VSR embeddings was superior, because they made it easier to characterise lip movements. Another study [273] further investigated VSR embeddings, showing that the use of features extracted with a model trained for phone-level classification led to better results when compared to the adoption of word-level embeddings.

Attempts [42], [208] have been made to exploit the information of a still image of the target speaker instead of a video. This approach outperformed a system that used only the audio signals, because there exists a cross-modal relationship between the voice characteristics of a speaker and their facial appearance [139], [194]. This explains why facial features can guide the extraction of the target speech from a mixture. The advantage of using a still image is the reduced complexity of the overall system, although the dynamic information of the video is not exploited.

When the information from multiple microphones is available, the location of the target speaker with respect to the microphone array can be used for spatial filtering, i.e. beamforming (cf. Section V-B). In [85], [244], the target direction is estimated with a face detection method. In more complicated scenarios, where people move and turn their heads, face detection might fail over several visual frames. The use of features from the speaker's body might help in building a more robust target source tracker.

In general, the use of visual features allows AV systems to obtain a performance improvement over AO systems. A more detailed analysis regarding the actual contribution of vision for AV-SE was conducted in [12]. In particular, visual features were shown to be important to get not only high-level information about speech and silence regions of an utterance, but also fine-grained information about articulation. Although improvements were shown for all visemes (a viseme is the basic unit of visual speech and represents what a phoneme is for acoustic speech [173]), sounds that are easier to distinguish visually were the ones that improved the most with an AV-SE system.

Future challenges include the extraction of features with low-complexity algorithms that can be robust to illumination changes, occlusion and pose variations. At the moment, these robustness issues are tackled by training the systems with data artificially modified to include such perturbations [10]. New opportunities to build low-latency systems that are energy-efficient and robust to light changes are given by event cameras. In contrast to conventional frame-based cameras, event cameras are asynchronous sensors that output changes in brightness for each pixel only when they occur.
They have low latency, high dynamic range and very low power consumption [156]. Arriandiaga et al. [17] showed that the SE results obtained with optical flow features extracted from an event camera are on par with a frame-based approach. The main limitation in exploiting the full potential of event cameras is that existing image processing algorithms cannot be employed, due to the inherently different nature of the data they produce. Research in this area is expected to bring novel algorithms and performance improvements.

B. Acoustic Features
Besides the video stream, AV-SE and AV-SS systems also process acoustic information (cf. Figure 1). As can be seen in Table IV, the predominant acoustic input feature is the potentially transformed magnitude spectrogram of a single-microphone recording, sometimes in the log mel domain, as in [66]. However, a magnitude spectrogram is generally an incomplete representation of the acoustic signal, because it is computed from STFT coefficients, which are complex-valued.

(A viseme is the basic unit of visual speech and represents what a phoneme is for acoustic speech [173].)
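To make this typical single-channel acoustic front end concrete, the following minimal NumPy sketch computes the complex STFT, its magnitude spectrogram and a log-compressed version of it. The frame length, hop size and the use of a Hann window are illustrative assumptions, not values prescribed by any of the surveyed systems.

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=512, hop=128, eps=1e-8):
    """Compute the complex STFT, |STFT| and log-|STFT| of a mono waveform x
    (hypothetical parameter choices)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=-1)   # complex STFT coefficients Y(k, l)
    mag = np.abs(stft)                    # magnitude spectrogram R_{k,l}
    log_mag = np.log(mag + eps)           # log compression, as used by several systems
    return stft, mag, log_mag

# Example: one second of noise at 16 kHz stands in for a noisy utterance.
stft, mag, log_mag = magnitude_spectrogram(np.random.randn(16000))
print(mag.shape)  # (n_frames, frame_len // 2 + 1)
```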
Recent works have used as acoustic input to the AV system either the magnitude spectrogram and the respective phase [7], [10], [153], the real and the imaginary parts of the complex spectrogram [55], [107], [109], [169], [239], or directly the raw waveform [108], [273]. Although these approaches make it possible to incorporate and process the full information of an acoustic signal, research in this area is still active and suggests that there is still room for improvement in exploiting the full information of the noisy speech signal [168], [281].

Since Wang et al. [262] showed that an AO system can successfully extract the speech of interest from a mixture signal when conditioned on the speaker embedding vector of an enrolment audio signal of the target speaker, several AV-SE and AV-SS systems have made use of a similar idea. Luo et al. [169] showed that i-vectors [48], a low-dimensional representation of a speech signal effective in speaker verification, recognition and diarisation [250], were particularly effective for AV-SS of same-gender speakers, obtaining a large improvement over an AV baseline model that did not incorporate speaker embeddings. Afouras et al. [10] extracted a compact speaker representation from an enrolment speech signal with the deep-learning-based method in [276] and obtained good performance for mixtures of two and three speakers, especially when face occlusions occurred. In addition, their system could learn the speaker representation on the fly by using the enhanced magnitude spectrogram obtained from a first run of the algorithm without speaker embedding. This essentially bypassed the need for enrolment audio, which is cumbersome or even impossible to collect in certain applications. The approach in [85] also used a pre-trained deep-learning-based model [284] to extract a speaker representation from an additional audio recording. The results indicate that visual information of the speaker's lips is more important than the information contained in the speaker embedding vector, and that their combination led to a general performance improvement. Instead of adopting a pre-trained model, Ochiai et al. [192] decided to use a sequence summarising neural network (SSNN) [251], which was jointly trained with the main separation model. Their experiments showed that similar outcomes could be obtained when the enrolment audio and the visual information were used as input in isolation, but better performance was achieved when they were used at the same time. In general, all these approaches show that speaker embeddings, when extracted from an available additional speech utterance of the target speaker, can be useful, confirming the results obtained in the AO domain [262].

The spatial information contained in multi-channel acoustic recordings provides an informative cue complementary to spectral information for separating multiple speakers. Specifically, inter-channel phase differences (IPDs) [84], inter-channel time differences (ITDs) [126], inter-channel level differences (ILDs) [126], directional statistics [33] or simply mixture STFT vectors [193] are used in multi-channel deep-learning-based systems to perform SE or SS. Among these features, IPDs are widely applied due to their robustness to reverberation and microphone sensitivities [85]. However, because of the well-known issues of spatial aliasing and phase wrapping, IPDs can be the same even for spatially separated sources with different time delays at particular frequencies. This causes fundamental difficulties in separating one source from another.
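Before turning to the cosIPD/sinIPD variant discussed next, the following NumPy sketch illustrates how IPD features and their cosine/sine versions can be computed from the STFTs of a microphone pair. It is a minimal illustration, not the implementation of any specific surveyed system.

```python
import numpy as np

def ipd_features(stft_ref, stft_other):
    """Inter-channel phase differences between two microphone STFTs
    (complex arrays of shape [frames, bins])."""
    ipd = np.angle(stft_other) - np.angle(stft_ref)  # raw IPD, wraps outside (-pi, pi]
    cos_ipd = np.cos(ipd)                            # continuous along frequency
    sin_ipd = np.sin(ipd)                            # resolves the sign ambiguity of cosIPD
    return ipd, cos_ipd, sin_ipd

# Two random complex spectrograms stand in for a microphone pair.
rng = np.random.default_rng(0)
stft0 = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
stft1 = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
ipd, cos_ipd, sin_ipd = ipd_features(stft0, stft1)
```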
Wang et al. [267] proposed to concatenate cosine IPDs (cosIPDs) and sine IPDs (sinIPDs) with log magnitudes as input of their AO system. With this strategy, spectral features can help to resolve the IPD ambiguity. In addition, the combination of cosIPDs and sinIPDs is preferred over IPDs, because the former exhibits a continuous helix structure along frequency due to the Euler formula [266], while the latter suffers from abrupt discontinuities caused by phase wrapping. In AV-SE and AV-SS, systems have used IPDs [85], cosIPDs [107], [244] and sinIPDs [107]. Some AV multi-microphone approaches [85], [244] also effectively included an angle feature [33], which computes the averaged cosine distance between the target speaker steering vector and the IPD over all selected microphone pairs.

C. Deep Learning Methods
As illustrated in Figure 1, after the feature extraction stage, the actual processing and fusion of acoustic and visual information is performed with a combination of deep neural network models. Although a detailed exposition of general deep learning architectures and concepts [79] is outside the scope of this paper, here we provide a brief presentation of the deep neural network models used in existing AV-SE and AV-SS systems and listed in Table IV.

One of the most used architectures is the feedforward fully-connected neural network (FFNN), also known as multilayer perceptron (MLP). A FFNN consists of several artificial neurons, or nodes, organised into a number of layers. The network is fully-connected because each node shares a connection with every node belonging to the previous layer. In addition, it is feedforward since the information flows only in one direction, from the input layer to the output layer, through the intermediate layers, called hidden layers. In order to act as a universal approximator [45], [97], [98], i.e. to be able to approximate arbitrarily well any function which maps intervals of real numbers to some real interval, a FFNN also needs to include activation functions, like sigmoid or ReLU, which make it possible to model potential non-linearities of the function to approximate.

Another kind of feedforward network is the convolutional neural network (CNN) [151]. While in FFNNs each node is connected with all the nodes of the previous layer, CNNs are based on the convolution operation, which leverages sparse connectivity, parameter sharing and equivariance to translation [79]. Sometimes, a convolutional layer is followed by a pooling operation, which performs a downsampling, for example by local maximisation, to reduce the number of parameters and obtain invariance to local transformations. In AV-SE and AV-SS systems, CNNs are generally used to process the visual frames and automatically extract visual features [274]. They are also adopted for the acoustic signals, to process either the spectrogram [66] or the raw waveform [273]. Since in SE and SS the acoustic input and the output share a similar structure, some approaches, such as [122], [176], [195], adopted a convolutional autoencoder (AE) architecture, sometimes including skip-connections as in U-Net [216] to allow the information to flow despite the bottleneck.

The training of feedforward neural networks, i.e. the update of the network parameters, is performed e.g. using stochastic gradient descent (SGD) [138], [215] to minimise an objective function (see Section V-E for further details), using the backpropagation algorithm [219] for gradient computation. Variations of SGD are also adopted, in particular RMSProp [245] and Adam [141]. Although increasing the number of hidden layers, i.e. the network depth, usually leads to a performance increase [229], two issues often arise: vanishing/exploding gradients [23], [74] and the degradation problem [91]. These issues are generally addressed with batch normalisation [111] and residual connections [91], respectively, both extensively adopted in AV-SE and AV-SS systems.
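As a rough illustration of how such building blocks are typically combined, the PyTorch sketch below pairs a small 3-D CNN video encoder with a fully-connected audio encoder and concatenates their embeddings to predict a time-frequency mask. The layer sizes, kernel choices and the assumption that the visual frame rate has already been matched to the spectrogram frame rate are illustrative; this is not the architecture of any specific surveyed system.

```python
import torch
import torch.nn as nn

class TinyAVNet(nn.Module):
    """Illustrative CNN (video) + FFNN (audio) model with concatenation fusion."""
    def __init__(self, n_freq=257, emb=128):
        super().__init__()
        # CNN encoder for a stack of grey-scale mouth crops: (B, 1, T, H, W) -> (B, T, emb)
        self.video_enc = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(1, 5, 5), stride=(1, 2, 2), padding=(0, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, emb, kernel_size=(1, 5, 5), stride=(1, 2, 2), padding=(0, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time
        )
        # FFNN encoder for log-magnitude frames: (B, T, n_freq) -> (B, T, emb)
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, emb), nn.ReLU())
        # Fusion head: concatenated embeddings -> TF mask in [0, 1]
        self.head = nn.Sequential(nn.Linear(2 * emb, emb), nn.ReLU(),
                                  nn.Linear(emb, n_freq), nn.Sigmoid())

    def forward(self, video, audio):
        # Assumes the video has been resampled so that T matches the number of audio frames.
        v = self.video_enc(video).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, emb)
        a = self.audio_enc(audio)                                          # (B, T, emb)
        return self.head(torch.cat([v, a], dim=-1))                        # (B, T, n_freq)

mask = TinyAVNet()(torch.randn(2, 1, 25, 64, 64), torch.randn(2, 25, 257))
```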
When dealing with speech signals, a different family of neural networks is also used: recurrent neural networks (RNNs) [219], which were designed to process sequential data. Therefore, they are particularly suitable for speech signals, in which the temporal dimension is important. The training of RNNs is performed with backpropagation through time [270] and, similarly to feedforward neural networks, vanishing/exploding gradient issues are common. The most effective solution to the problem is to introduce paths in which the gradient can flow through time and to regulate the propagation of information with gates. This class of networks is called gated RNNs, and among them the most adopted are the long short-term memory (LSTM) [72], [96] and the gated recurrent unit (GRU) [35]. Although these models have a causal structure, architectures in which the output at a given time step depends on the whole sequence, including past and future observations, are also common, and they are known as bidirectional RNNs (BiRNNs) [225], bidirectional LSTMs (BiLSTMs) and bidirectional GRUs (BiGRUs).

Compared to knowledge-based approaches, deep learning methods have some disadvantages that we expect to be addressed in future works. First of all, neural network architectures need to be trained with a large amount of data to generalise well to a wide variety of speakers, languages, noise types, SNRs, illumination conditions and face poses. A big step in the evolution of AV-SE and AV-SS systems occurred when researchers started to train the models with large-scale AV datasets [7], [55], [195]. An interesting research direction would be to study the possibility of training deep-learning-based systems with a smaller amount of data without degrading the performance in unknown scenarios [76], [77]. In this context, it would be relevant to explore unsupervised learning techniques, such as the one proposed by Sadeghi et al. [220]–[222], who extended a previous work on AO-SE [152] and adopted variational auto-encoders (VAEs) for AV-SE. In their approach, there was no need to mix many different noise types with the speech of interest at several SNRs, because the system modelled the clean speech directly. Despite this attempt, a supervised learning approach that learns a mapping from noisy to clean speech or from a mixture to separated speech signals is still the preferred way to tackle AV-SE and AV-SS, because it allows state-of-the-art performance to be reached.

Furthermore, typical paradigms employed for training AV-SE and AV-SS systems assume that the sound sources of a scene are independent of each other. This assumption is adopted for convenience, because collecting actual speech-in-noise data is costly. However, it is often wrong, since speakers tend to change the way they speak when they are immersed in a noisy environment, in order to make their speech more intelligible. This phenomenon is known in the literature as the
Lombard effect [26], [163]. Recent work [178], [179] investigated the impact of this effect on data-driven AV-SE models, showing that training a system with Lombard speech is beneficial especially at low SNRs. Therefore, the performance of most deep-learning-based AV-SE systems is affected by the fact that the data used for training do not match real conditions.

Another issue, especially for low-resource devices, is that deep learning models are usually computationally expensive, because data needs to be processed with an algorithm consisting of millions of parameters in order to achieve satisfactory performance. It is important to explore novel ways to reduce the model complexity without reducing the speech quality and intelligibility of the processed signals.
D. Fusion Techniques
As previously mentioned, AV-SE and AV-SS systems typically consist of a combination of the neural network architectures presented above, which allows the acoustic and visual information to be fused in several ways. The traditional multimodal fusion approaches are generally grouped into two classes, based on the processing level at which the fusion occurs [158], [209]: early fusion and late fusion (cf. Figure 2). In early fusion, the information of the different modalities is combined into a joint representation at the feature level. The main advantage is that the correlation between audio and video can be exploited with a single model at a very early stage, making the system more robust compared to one that processes the two modalities separately and combines them only at a later stage. Evidence in speech perception suggests that also in humans the AV integration occurs at a very early stage [226]. The disadvantage of early fusion is that the features of the two modalities are usually inherently different. Therefore, appropriate techniques for feature normalisation, transformation and synchronisation need to be developed. Late fusion, on the other hand, consists of combining the modalities only at the decision level, after the acoustic and visual information has been processed separately with two different models. Although, from a theoretical perspective, early fusion would be preferable for the reasons mentioned above, late fusion is often used in practice for two reasons: it is possible to use unimodal models designed and validated over the years to achieve the best performance for each modality [128]; and it is easy to perform late fusion, because the data processed from the two modalities belong to the same domain, being different estimates of the same quantity.

Although some AV-SE and AV-SS works showed that deep learning offers the possibility to perform both early [183] and late [65], [136] fusion, the majority of existing systems (e.g.
[7], [55], [66], [128]) exploited the flexibility of deep learning techniques and fused the different unimodal representations into a single hidden layer. This fusion strategy is known as intermediate fusion [209] (cf. Figure 2).

Fig. 2. AV fusion strategies. (a) Early fusion. (b) Late fusion. (c) Intermediate fusion. DNN model indicates a generic deep neural network model.

Besides the level at which the AV integration occurs, it is important to consider the way in which this integration is performed. As indicated in Table IV, the preferred way to fuse the information in AV-SE and AV-SS systems is through concatenation. Although this approach is easy to implement, it comes with some potential problems. When two modalities are concatenated, the system uses them simultaneously and treats them in the same way. This means that although, in principle, a deep-learning-based system trained with a very large amount of data should be able to distinguish the cases in which the two modalities are complementary or in conflict [158], in practice we often experience that one modality (not necessarily the most reliable in a given scenario) tends to dominate over the other [62], [66], causing a performance degradation. In AV-SE and AV-SS, the acoustic modality is the one that dominates [66], [99]. This might also happen for approaches that employ an addition-based fusion, in which the representations of the multimodal signals are added, with or without weights, without dealing explicitly with the aforementioned issues. Research has been conducted to investigate several possible methods to prevent one modality from dominating over the other. We provide some examples in the following.

Hou et al. [99] adopted two strategies. First, they forced the system under development to use both modalities by learning the target speech and the video frames of the speaker's mouth at the same time. However, this approach alone does not guarantee that the network discovers AV correlations: it might happen that the network automatically learns to use some hidden nodes to process only the audio modality, and other nodes to process only the video modality. To avoid this selective behaviour, the second strategy adopted in [99] was a multi-style training approach [41], [188], in which one of the input modalities could be randomly zeroed out. Gabbay et al. [66] introduced a new training procedure, which consisted of including training samples in which the noise signal added to the target speech was, in fact, another utterance from the target speaker. Since it is hard to separate overlapping sentences from the same speaker using only the acoustic modality, the network learned to exploit the visual features better. Morrone et al. [183] proposed a two-stage training procedure: first, a network was forced to use visual information because it was trained to learn a mapping between the visual features and a target mask to be applied to the noisy spectrogram; then, a new network used the acoustic features together with the visually-enhanced spectrogram obtained from the previous stage to further enhance the speech signal. Wang et al. [263] trained two networks separately for each modality to learn target masks and used a gating network to perform a product-based fusion, keeping the system performance lower-bounded by the results of the AO network.
This approach guaranteed good performance also at high SNRs, where many AV systems fail because the acoustic information, which is very strong, and the visual information, which is rather weak, are strongly coupled with early or intermediate fusion [263]. Joze et al. [128] and Iuzzolino and Koishida [122] proposed the use of squeeze-excitation blocks, which generalised the work in [102] for multimodal applications. In particular, each block consisted of two units [128]: a squeeze unit that provided a joint representation of the features from each modality, and an excitation unit which emphasised or suppressed the multimodal features from the joint representation based on their importance.

In order to softly select the more informative modality for AV-SE and AV-SS, attention-based fusion mechanisms have also been investigated in several works [42], [85], [153], [192], [239]. The attention mechanism [19] was introduced in the field of natural language processing to improve sequence-to-sequence models [35], [240] for neural machine translation. A sequence-to-sequence architecture consists of RNNs organised into an encoder, which reads an input sequence and compresses it into a context vector of a fixed length, and a decoder, which produces an output (i.e. the translated input sequence) considering the context vector generated by the encoder. Such a model fails when the input sequence is long, because the fixed-length context vector acts as a bottleneck. Therefore, Bahdanau et al. [19] proposed to use a context vector that preserved the information of all the encoder hidden cells and allowed source and target sequences to be aligned. In this case, the model could attend to salient parts of the input. Besides neural machine translation [19], [170], [249], attention was later successfully applied to various tasks, like image captioning [253], [277], speech recognition [36] and speaker verification [285]. In the context of AV-SE and AV-SS, two representative works are [42] and [85]. In [42], temporal attention [157] was used, motivated by the fact that different acoustic frames need different degrees of separation. For example, the frames where only the target speech is present should be treated differently from the frames containing overlapped speech or only the interfering speech. In [83], several information cues were used by a single system: the multi-channel acoustic mixture, the speaker direction, lip movements and an enrolled utterance of the speaker. A rule-based attention mechanism [83] was employed to take into account the fact that the significance of each information cue depends on the specific situation that the system needs to analyse. For example, when the speakers were close to each other, spatial and directional features did not provide high discriminability. Therefore, when the angle difference between the speakers was small, the attention weights allowed the model to selectively attend to the more salient cues, i.e. the spectral content of the audio and the lip movements. In addition, a factorised attention was adopted to fuse spatial information, speaker characteristics and lip information at the embedding level. The model first factorised the acoustic embeddings into a set of subspaces (e.g., phone and speaker subspaces) and then used information from other cues to fuse them with selective attention.
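To give a concrete flavour of attention-based fusion, the PyTorch sketch below lets each acoustic frame attend over a sequence of visual embeddings and concatenates the attended visual context with the audio embedding. The single-head dot-product formulation and the embedding size are illustrative assumptions; none of the surveyed systems is reproduced here.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Audio frames attend over visual frames; the attended visual context is
    concatenated with the audio embedding. Dimensions are illustrative."""
    def __init__(self, dim=128):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # queries from audio embeddings
        self.key = nn.Linear(dim, dim)    # keys from visual embeddings
        self.value = nn.Linear(dim, dim)  # values from visual embeddings
        self.scale = dim ** -0.5

    def forward(self, audio_emb, video_emb):
        # audio_emb: (B, Ta, dim), video_emb: (B, Tv, dim); Ta and Tv may differ.
        q, k, v = self.query(audio_emb), self.key(video_emb), self.value(video_emb)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, Ta, Tv)
        context = attn @ v                                                # (B, Ta, dim)
        return torch.cat([audio_emb, context], dim=-1)                    # (B, Ta, 2*dim)

fused = CrossModalAttentionFusion()(torch.randn(2, 100, 128), torch.randn(2, 25, 128))
```

A side benefit of this formulation is that the two modalities do not need to share the same frame rate, since the attention weights implicitly align the visual sequence to each acoustic frame.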
For completeness, it is relevant to mention approaches that tried to leverage both deep-learning-based and knowledge-based models. For example, Adeel et al. [6] used a deep-learning-based model to learn a mapping between the video frames of the target speaker and the filterbank audio features of the clean speech. The estimated speech features were subsequently used in a Wiener filtering framework to get enhanced short-time magnitude spectra of the speech of interest. This approach was extended in [5], where both acoustic and visual modalities were used to estimate the filterbank audio features of the clean speech to be employed by the Wiener filter. The combination of deep-learning-based and knowledge-based approaches was leveraged not only in a single-microphone setup, but also for multi-microphone AV-SS. In [279], a jointly trained combination of a deep learning model and a beamforming module was used. Specifically, a multi-tap minimum variance distortionless response (MVDR) beamformer was proposed with the goal of reducing the nonlinear speech distortions that are avoided by an MVDR beamformer [28], but are inevitable for purely neural-network-based methods. With the jointly trained multi-tap MVDR, significant improvements in ASR accuracy could be achieved compared to purely neural-network-based methods for the SS task.

The fusion strategies and the design of neural network architectures experimented with by researchers still require a lot of expertise. This means that, despite the number of works on AV-SE and AV-SS, researchers might not have explored the best architectures for data fusion. A way to deal with this issue is to investigate the possibility of a more general learning paradigm that focuses not only on determining the parameters of a model, but also on automatically exploring the space of possible fusion architectures [209].

E. Training Targets and Objective Functions
As shown in Figure 1, two other important elements of AV-SE and AV-SS systems are training targets, i.e. the desired outputs of deep-learning-based models, and objective functions, which provide a measure of the distance between the training targets and the actual outputs of the systems. Here, we discuss the adoption of the various training targets and objective functions for AV-SE and AV-SS comprehensively listed in Table IV, using the taxonomy proposed in [176].

Following the terminology of Eq. (4) introduced in Section II (the extension to SS is straightforward), let $A_{k,l} = |X(k,l)|$, $V_{k,l} = |D(k,l)|$ and $R_{k,l} = |Y(k,l)|$ indicate the magnitudes of the STFT coefficients of the clean speech, the noise and the noisy speech signals, respectively.

Fig. 3. Illustration of direct mapping (DM), mask approximation (MA) and indirect mapping (IM) approaches. In the specific case of IM, the figure shows the estimation of the ideal amplitude mask. Similar illustrations can be made for different masks.

A common way to perform the enhancement is by direct mapping (DM) [238] (cf. Figure 3): a system is trained to minimise an objective function reflecting the difference between the output, $\hat{A}_{k,l}$, and the ground truth, $A_{k,l}$. The most frequently used objective function is the mean squared error (MSE), whose minimisation is equivalent to maximising the likelihood of the data under the assumption of normally distributed errors. Alternatively, some AV models, such as [195], have been trained with the mean absolute error (MAE), experimentally shown to increase the spectral detail of the estimates and to obtain higher performance than MSE [175], [198].

In order to reconstruct the time-domain signal, an estimate of the target short-time phase is also needed. The noisy phase is usually combined with $\hat{A}_{k,l}$, since it is the optimal estimator of the target short-time phase [54] under the assumption of a Gaussian distribution of speech and noise. However, choosing the noisy phase for speech reconstruction poses limitations to the achievable performance of a system. Iuzzolino and Koishida [122] reported a significant improvement in terms of PESQ and STOI when their system used the target phase instead of the noisy phase to reconstruct the signal. This suggests that modelling the phase could be important in AV applications, and some research [7], [10], [153], [195] has moved in this direction. Specifically, Owens and Efros [195] predicted both the target magnitude log spectrogram and the target phase with their model. Afouras et al.
[7] designed a sub-network to specifically predict a residual which, when added to the noisy phase, allowed the target phase to be estimated. In this case, the phase sub-network was trained to maximise the cosine similarity between the prediction and the target phase, in order to take into account the angle between the two. The experiments showed that using the phase estimate was better than using the phase of the input mixture, although there was still room for improvement to match the performance obtained with the ground truth phase.

An alternative approach to DM consists of using a deep-learning-based model to get an estimate $\hat{M}_{k,l}$ of a mask, $M_{k,l}$. To reconstruct the clean speech signal during inference, $\hat{M}_{k,l}$ needs to be element-wise multiplied with a TF representation of the noisy signal [176], [264]. This approach is known as mask approximation (MA), and an illustration of it is shown in Figure 3.

In the literature, several masks have been defined in the context of AO-SE [261], [264] and then adopted for AV-SE and AV-SS. One way to build a TF mask is by setting its TF units to binary values according to some criterion. An example is the ideal binary mask (IBM) [264], defined as:

$$M^{\mathrm{IBM}}_{k,l} = \begin{cases} 1 & \text{if } A_{k,l} / V_{k,l} > \Gamma(k) \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $\Gamma(k)$ indicates a predefined threshold. Later, other binary masks have been defined, such as the target binary mask (TBM) [142], [264] and the power binary mask (PBM) [263]. They have all been adopted as training targets in AV approaches [76], [77], [183], [263] using the cross entropy loss as objective function.

Besides binary masks, which are based on the principle of classifying each TF unit of a spectrogram as speech or noise dominated, continuous masks have been introduced for soft decisions. An example is the ideal ratio mask (IRM) [264]:

$$M^{\mathrm{IRM}}_{k,l} = \left( \frac{A_{k,l}}{A_{k,l} + V_{k,l}} \right)^{\beta}, \qquad (7)$$

where $\beta$ is a scaling parameter. It is worth mentioning that this mask is heuristically motivated, although its form for $\beta = 1$ has some resemblance to the Wiener filter [162], [261]. The IRM has been adopted as a training target for a few AV models [12], [263], using either MSE or MAE as objective function. Aldeneh et al. [12] proposed the use of a hybrid loss which combined MAE and cosine distance to overcome the limitations of MSE, getting sharp results and bypassing the assumption of statistical independence of the IRM components that the use of MSE or MAE alone would imply.

The IRM does not allow the magnitude spectrogram of the target speech signal to be perfectly recovered when multiplied with the noisy spectrogram. Hence, the ideal amplitude mask (IAM) [264] was introduced:

$$M^{\mathrm{IAM}}_{k,l} = \frac{A_{k,l}}{R_{k,l}}. \qquad (8)$$

As discussed previously, the noisy phase is often used to reconstruct the time-domain speech signal. None of the masks mentioned above take the phase mismatch between the noisy and target signals into account. Therefore, the phase sensitive mask (PSM) [58], [261] and the complex ratio mask (CRM) [261], [271] have been proposed. The PSM is defined as:

$$M^{\mathrm{PSM}}_{k,l} = \frac{A_{k,l}}{R_{k,l}} \cos(\theta_{k,l}), \qquad (9)$$

and tries to compensate for the phase mismatch by introducing a factor, $\cos(\theta_{k,l})$, which is the cosine of the phase difference between the noisy and the clean signals. The CRM is the only mask that allows the complex spectrogram of the clean speech to be perfectly reconstructed when applied to the complex noisy spectrogram, i.e.:

$$X(k,l) = M^{\mathrm{CRM}}_{k,l} * Y(k,l), \qquad (10)$$

where $*$ denotes complex multiplication and $M^{\mathrm{CRM}}_{k,l}$ indicates the CRM.
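For illustration, the NumPy sketch below computes the IBM, IRM, IAM and PSM defined in Eqs. (6)-(9) from clean-speech and noise STFTs. The exponent $\beta$ and the use of a local SNR criterion in dB in place of the threshold $\Gamma(k)$ are illustrative assumptions.

```python
import numpy as np

def ideal_masks(stft_clean, stft_noise, beta=0.5, threshold_db=0.0):
    """Compute IBM, IRM, IAM and PSM from clean-speech and noise STFTs
    (complex arrays of the same shape). Threshold and beta are illustrative."""
    stft_noisy = stft_clean + stft_noise
    A = np.abs(stft_clean)                 # clean magnitude A_{k,l}
    V = np.abs(stft_noise)                 # noise magnitude V_{k,l}
    R = np.abs(stft_noisy) + 1e-8          # noisy magnitude R_{k,l}
    ibm = (20 * np.log10(A / (V + 1e-8) + 1e-8) > threshold_db).astype(float)  # Eq. (6)
    irm = (A / (A + V + 1e-8)) ** beta                                          # Eq. (7)
    iam = A / R                                                                 # Eq. (8)
    theta = np.angle(stft_noisy) - np.angle(stft_clean)
    psm = (A / R) * np.cos(theta)                                               # Eq. (9)
    return ibm, irm, iam, psm
```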
IAM, PSM and CRM can be found in several AV systems [169], [176], [178], [179], adopting MSE as objective function.

MA is usually preferred to DM. The reason is that a mask is easier to estimate with a neural network [59], [176], [264]. An exception is represented by speech dereverberation (SD). In this case, the use of a mask, specifically a ratio mask, is discouraged for the two reasons reported in [244]. First, as seen in Eq. (1), reverberation is a convolutive distortion and as such it does not justify the use of ratio masking, which assumes that target speech and interference are uncorrelated [261]. In addition, if a system consists of a cascade of SS and SD modules, such as [244], a ratio mask applied in the SD stage would not be able to easily reduce the artefacts often introduced by SS, because they are correlated with the target speech signal [244].

An attempt to exploit the advantages of DM and MA at the same time is made by indirect mapping (IM) [238], [269] (cf. Figure 3). In IM, the model outputs a mask, as in MA, because it is easier to estimate than a spectrogram as mentioned above, but the objective function is defined in the signal domain, as in DM. A comparison between DM, MA and IM for AV-SE was conducted in [176]. In contrast to what one might expect, the results showed that IM did not obtain the best performance among the three paradigms, as also observed in [238], [269] for AO systems. Weninger et al. [269] experimentally showed for AO-SS that IM alone performed worse than MA, but it was beneficial when used to fine-tune a system previously trained with the MA objective. Despite these results, AV-SE and AV-SS systems were often trained from scratch with the IM paradigm (cf. Table IV), obtaining good results. The reason is probably the use of large-scale datasets, which allowed an optimal convergence of the models.

Researchers also experimented with ways other than DM, MA and IM to estimate training targets with a neural network model. For example, Gabbay et al. [65] and Khan et al. [136] used an estimate of the clean magnitude spectrogram, obtained from visual features with a deep-learning-based model, to build a binary mask that could be applied to the noisy spectrogram. The approaches in [85], [244] can be considered an extension of IM to a time-domain objective. Specifically, a system was trained to output a TF ratio mask using SI-SDR as objective function applied to the reconstructed time-domain signals. The ratio mask obtained with this approach was different from the IAM, because it was not necessarily the one that allowed a perfect reconstruction of the clean magnitude spectrogram. In [244], the system was also trained with an objective that combined MSE on the magnitude spectrograms and SI-SDR on the waveform signals. An objective function in the time domain was also used in [108], [273]. In this case, a system, inspired by [168], was used to directly estimate the waveform of the target speech signal with the SI-SDR training objective (a sketch of SI-SDR is given below).

Other AV systems [42], [99], [192], [203] tried to improve SE and SS performance with multitask learning (MTL) [29], which consists of training a learning model to perform multiple related tasks. Pasa et al. [203] investigated MTL using a joint system for AV-SE and ASR. They tried to either jointly minimise a SE objective, MSE, and an ASR objective, the connectionist temporal classification (CTC) loss [80], or to alternate the training between an AV-SE phase and an ASR phase. The alternated training was reported to be the most effective.
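Returning to the time-domain objectives mentioned above, the following NumPy sketch computes SI-SDR between an estimated and a reference waveform; its negation can serve as a training loss. The zero-mean normalisation and the epsilon terms are common but illustrative choices rather than the definition used by any specific surveyed system.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a time-domain estimate and target
    (1-D arrays of equal length)."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

clean = np.sin(np.linspace(0, 100, 16000))
print(si_sdr(clean + 0.1 * np.random.randn(16000), clean))
```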
Chung et al. [42] used two objective functions to train their system: one was the MSE on magnitude spectrograms and the other was the speaker representation loss [184] on embeddings from a network that extracted the speaker identity. Finally, Ochiai et al. [192] used a combination of losses that allowed their system to work even when acoustic or visual cues of the speaker were not available.

A typical issue for SS is the so-called source permutation [92], [282]. This problem occurs in speaker-independent SS systems and is characterised by an inconsistent assignment over time of the separated speech signals to the sources. Two solutions have been proposed in AO settings: permutation invariant training (PIT) [144], [282] and deep clustering (DC) [32], [92], [112], [167]. The idea behind PIT is to calculate the objective function for all the possible permutations of the sources and use the permutation associated with the lowest error to update the model parameters (a sketch is provided below). In DC, an embedding vector is learned for each TF unit of the mixture spectrogram and is used to perform clustering to learn an IBM for SS. An extension of DC is the deep attractor network [32], which creates attractor points in the embedding space learned from the TF representation of the signal and estimates a soft mask from the similarity between the attractor points and the TF embeddings. Although some AV-SS systems used PIT or DC (cf. Table IV), source permutation is less of a problem in AV-SS, assuming that the target speakers are visible while they talk: visual information is a strong guidance for the systems and allows the separated speech signals to be automatically assigned to the correct sources.

Although many training targets and objective functions have already been investigated for AV-SE and AV-SS, we expect further improvements following several research directions, such as: the use of perceptually motivated objective functions; the estimation of binaural cues to preserve the spatial dimension also at the receiver end; and a greater effort in the design and estimation of time-domain training targets to perform end-to-end training.
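As a concrete illustration of the PIT idea described above, the PyTorch sketch below computes a permutation-invariant loss over a small number of estimated sources. The use of a per-source MSE on waveforms is a simplifying assumption; other losses, such as negative SI-SDR, are used in practice.

```python
import itertools
import torch

def pit_mse_loss(estimates, targets):
    """Permutation invariant MSE for a small number of sources.
    estimates, targets: tensors of shape (B, n_sources, T)."""
    n_src = estimates.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_src)):
        # MSE between each estimate and the permuted targets, averaged over sources and time.
        permuted = targets[:, list(perm), :]
        losses.append(((estimates - permuted) ** 2).mean(dim=(1, 2)))  # (B,)
    # Keep, for every mixture in the batch, the permutation with the lowest error.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()

loss = pit_mse_loss(torch.randn(4, 2, 16000), torch.randn(4, 2, 16000))
```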
VI. RELATED RESEARCH

In this section, we consider two problems, speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, because the first is a particular case of AV-SE in which the acoustic input is missing, while the second is a general case of AV-SS in which the target sources are not human speakers. Speech reconstruction models can easily be adopted to estimate a mask for SE and SS, as shown in [65]. On the other hand, some generic sound source separation techniques can be adopted also for speech signals by re-training the deep-learning-based models on an AV speech dataset. In some cases, these techniques are domain-specific, such as [67], making the adoption in the speech domain hard. Nevertheless, the ways in which multimodal data are processed and fused can be a source of inspiration also for AV-SE and AV-SS.

A. Speech Reconstruction from Silent Videos
In some circumstances, the only reliable and accessible modality for understanding the speech of interest is the visual one. Real-world situations of this kind include, for example: conversations in acoustically demanding situations like those occurring during a concert, where the sound from the loudspeakers tends to dominate over the target speech; teleconferences, in which sound segments are missing, e.g. due to audio packet loss; and surveillance videos, generally recorded in a situation where the target speaker is acoustically shielded (e.g. by a window) from camera(s) and microphone(s). All these scenarios might be considered as an extreme case of AV-SE where the goal is to estimate the speech of interest from the silent video of a talking face.

In the literature, the problem of estimating speech from visual information is known as speech reconstruction from silent videos and it can be addressed with a system performing two tasks in cascade:
• VSR - It consists of the prediction of text from the video of a speaker's face or lips area.
• Text-to-speech (TTS) synthesis - It consists of generating a speech signal from the reconstructed text.
A two-stage approach may seem convenient, because these two tasks have already been studied in isolation and systems performing reasonably well in both cases exist [18], [41], [205], [227], [233], [234]. However, when VSR and TTS are combined, some problems may arise. First of all, a critical delay is introduced, because the TTS model can generate a speech segment only at the end of each word recognised by the VSR module. Then, speech characteristics, like emotion cues and prosody, get lost when video frames are converted into text. Finally, AV recordings are not sufficient to train the VSR and the TTS systems: text transcriptions are required, and they are costly to obtain because manual annotation is needed.

As an alternative to VSR-TTS systems, techniques that directly perform a video-to-speech mapping have been proposed (cf. Table V). Although some attempts were made to reconstruct intelligible speech from silent articulations captured with several sensors [50], [105], [127], [135] (the data used in these works include, but are not limited to, recordings obtained with electromagnetic articulography, electropalatography and laryngography sensors), Le Cornu and Milner [148] were the first to employ a neural network using only the silent video of a speaker's frontal face. They decided to base their system on STRAIGHT [134], a vocoder which allows speech synthesis to be performed from three time-varying parameters describing fundamental aspects of a given speech signal: fundamental frequency (F0), aperiodicity (AP) and spectral envelope (SP). Supported by the results of some previous works [15], [20], [280], they assumed that only SP could be inferred from visual features. Therefore, AP and F0 were not estimated from the silent video, but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and a FFNN within a regression-based framework.
TABLE V
DEEP-LEARNING-BASED APPROACHES FOR SPEECH RECONSTRUCTION FROM SILENT VIDEOS. MV: MULTI-VIEW. SI: SPEAKER-INDEPENDENT. VSR: VISUAL SPEECH RECOGNITION.

Paper  Year  Input                             Output                                    Model                 MV   SI   VSR
[148]  2015  2-D DCT / AAM (mouth)             LPC or mel-filterbank amplitudes          GMM / FFNN            ✗    ✗    ✗
[149]  2017  AAM (mouth)                       Codebook entries (mel-filterbank ampl.)   FFNN / RNN            ✗    ✗    ✗
[57]   2017  Raw pixels (face)                 LSP of LPC                                CNN, FFNN             ✗    ✗    ✗
[56]   2017  Raw pixels, optical flow (face)   Mel-scale and linear-scale spectrograms   CNN, FFNN, BiGRU      ✗    ✗    ✗
[11]   2018  Raw pixels (face)                 AE features, spectrogram                  CNN, LSTM, FFNN, AE   ✗    ✗    ✗
[145]  2018  Raw pixels (mouth)                LSP of LPC                                CNN, LSTM, FFNN       ✓    ✗    ✗
[147]  2018  Raw pixels (mouth)                LSP of LPC                                CNN, BiGRU, FFNN      ✓    ✗    ✗
[146]  2019  Raw pixels (mouth)                LSP of LPC                                CNN, BiGRU, FFNN      ✓    ✓    ✓
[243]  2019  Raw pixels (mouth)                WORLD spectrum                            CNN, FFNN             ✗    ✗    ✗
[256]  2019  Raw pixels (mouth)                Raw waveform                              GAN, CNN, GRU         ✗    ✓    ✗
[247]  2019  Raw pixels (mouth)                AE features, spectrogram                  CNN, LSTM, FFNN, AE   ✓    ✓    ✗
[177]  2020  Raw pixels (mouth / face)         WORLD features                            CNN, GRU, FFNN        ✗    ✓    ✓
[206]  2020  Raw pixels (face)                 Mel-scale spectrogram                     CNN, LSTM             ✓    ✗    ✗

As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. While the choice of visual features did not have a big impact on the results, the use of mel-filterbank amplitudes led to better performance than the LPC-based systems.

This work was extended in [149], where two improvements were proposed. First, instead of adopting a regression framework, visual features were used to predict a class label, which in turn was used to estimate audio features from a codebook. Secondly, the influence of temporal information was explored from a feature-level point of view, by grouping multiple frames, and from a model-level point of view, by using RNNs. The obtained improvement in terms of intelligibility was substantial, but the speech quality was still low, mainly because the excitation parameters, i.e. F0 and AP, were produced without exploiting visual cues.

Ephrat and Peleg [57] moved away from a classification-based method like the one presented in [149] and went back to a regression-based framework. Their approach consisted of predicting a line spectrum pairs (LSP) representation of LPC coefficients directly from raw visual data with a CNN, followed by two fully connected layers. Their findings demonstrated that: no hand-crafted visual features were needed to reconstruct the speaker's voice; using the whole face instead of the mouth area as input improved the performance of the system; and a regression-based method was effective in reconstructing out-of-vocabulary words. Although the results were promising in terms of intelligibility, the signals sounded unnatural because Gaussian white noise was used as excitation to reconstruct the waveform from LPC features. Therefore, a subsequent study [56] focused on speech quality improvements. In particular, the proposed system was designed to get a linear-scale spectrogram from a learned mel-scale one with the post-processing network in [265]. The time-domain signal was then reconstructed by combining an example-based technique similar to [196] with the Griffin-Lim algorithm [81]. Furthermore, a marginal performance improvement was obtained by providing not only raw video frames as input, but also optical flow fields computed from the visual feed.

Another system was developed by Akbari et al.
[11], who tried to reconstruct natural-sounding speech by learning a mapping between the speaker's face and speech-related features extracted by a pre-trained deep AE. The approach was effective and outperformed the method in [57] in terms of speech quality and intelligibility.

The main limitation of these techniques was that they were employed to reconstruct the speech of talkers observed by the model at training time. The first step towards a system that could generate speech from various speakers was taken by Takashima et al. [243]. They proposed an exemplar-based approach, where a CNN was trained to learn a high-level acoustic representation from visual frames. This representation was used to estimate the target spectrogram with the help of an audio dictionary. The approach could generate a different voice without re-training the neural network model, but by simply changing the dictionary with that of another speaker.

Prajwal et al. [206] developed a sequence-to-sequence system adapted from Tacotron 2 [227]. Although their goal was to learn speech patterns of a specific speaker from videos recorded in unconstrained settings, obtaining state-of-the-art performance, they also proposed a multi-speaker approach. In particular, they conditioned their system on speaker embeddings extracted from a reference speech signal as in [125]. Although they could synthesise speech of different speakers, prior information was needed to get speaker embeddings. Therefore, this method cannot be considered a speaker-independent approach, but a speaker-adaptive one.

The challenge of building a speaker-independent system was addressed by Vougioukas et al. [256], who developed a generative adversarial network (GAN) that could directly estimate time-domain speech signals from the video frames of the talker's mouth region. Although this approach was capable of reconstructing intelligible speech also in a speaker-independent scenario, the speech quality estimated with PESQ was lower than that in [11]. The generated speech signals were characterised by a low-power hum, presumably because the model output was a raw waveform, for which suitable loss functions are hard to find [57].

The method proposed in [177] was intended to still be able to reconstruct speech in a speaker-independent scenario, but also to avoid artefacts similar to the ones introduced by the model in [256]. Therefore, vocoder features were used as training target instead of raw waveforms. Differently from [148], [149], the system adopted the WORLD vocoder [182], which was shown to achieve better performance than STRAIGHT [181], and was trained to predict all the vocoder parameters, instead of SP only. In addition, it also provided a VSR module, useful for all those applications requiring captions. The results showed that a MTL approach, where VSR and speech reconstruction were combined, was beneficial for both the estimated quality and the estimated intelligibility of the generated speech signal.

Most of the systems described above assumed that the speaker constantly faced the camera. This is reasonable in some applications, e.g. teleconferences. Other situations may require robustness to multiple views and face poses. Kumar et al. [145] were the first to make experiments in this direction. Their model was designed to take as input multiple views of the talker's mouth and to estimate a LSP representation of LPC coefficients for the audio feed. The best results in terms of estimated speech quality were obtained when two different views were used as input.
The work was extended in [147], where results from extensive experiments with a model adopting several view combinations were reported. The best performance was achieved with the combination of three angles of view (0°, 45° and 60°).

The systems in [145] and [147] were personalised, meaning that they were trained and deployed for a particular speaker. Multi-view speaker-independent approaches were proposed in [146] and [247]. In both cases, a classifier took as input the multi-view videos of a talker and determined the angles of view from a discrete set of lip poses. Then, a decision network chose the best view combination and the reconstruction model to generate the speech signal. The main difference between the two systems was the audio representation used. While Uttam et al. [247] decided to work with features extracted by a pre-trained deep AE, similarly to [11], the approach in [146] estimated a LSP representation of LPC coefficients. In addition, Kumar et al. [146] provided a VSR module, as in [177]. However, this module was trained separately from the main system and was designed to provide only one among ten possible sentence transcriptions, making it database-dependent and not feasible for real-time applications.

Despite the research done in this area, several critical points need to be addressed before speech reconstruction from silent videos reaches the maturity required for commercial deployment. All the approaches in the literature except [206] presented experiments conducted in controlled environments. Real-world situations pose many challenges that need to be taken into account, e.g. the variety of lighting conditions and occlusions. Furthermore, systems that directly reconstruct speech from videos should, at least in principle, have the advantage of preserving speech characteristics, like emotion, compared with a two-stage approach consisting of a VSR and a TTS module. However, no experiments with expressive datasets, such as RAVDESS [161], have ever been conducted. Finally, before a practical system can be employed for unseen speakers, performance needs to improve considerably. At the moment, the results for the speaker-independent case are unsatisfactory, probably due to the limited number of speakers used in the training phase.

B. Audio-Visual Sound Source Separation for Non-Speech Signals
Sound source separation might involve signals different from speech. Imagine, for example, the task of extracting the individual sounds coming from different music instruments playing together. Although the signal of interest is not speech in this case, the approaches developed in this area can provide useful insights also for AV-SE and AV-SS.
TABLE VI
DEEP-LEARNING-BASED APPROACHES FOR AUDIO-VISUAL SOURCE SEPARATION FOR NON-SPEECH SIGNALS. L: LOCALISATION.

Paper  Year  Key Idea                                                                                              L
[68]   2018  Guide source separation with audio frequency bases learned with a framework that maps to visual objects.   ✗
[288]  2018  Separate audio sources into components that can be localised in the video frames.                          ✓
[218]  2019  Perform independent image co-segmentation and sound source separation for unsynchronised data.             ✓
[69]   2019  Use predicted binaural audio to aid sound source separation.                                               ✓
[201]  2019  Use a multiple instance learning paradigm for separation and localisation of weakly-labelled data.         ✓
[287]  2019  Incorporate temporal motion information and employ a curriculum learning scheme for training.              ✓
[278]  2019  Do not separate the sounds independently, to avoid that acoustic components from the original mixture get lost.  ✓
[70]   2019  Devise a new paradigm to use videos with multiple (correlated) sounds during training.                     ✗
[230]  2020  Explore conditioning techniques with video stream and weak labels.                                         ✗
[67]   2020  Use keypoint-based structured visual representations to model human-object interactions.                   ✗
[291]  2020  Refine the separated sounds with cascaded opponent filtering.                                              ✓
[290]  2020  Use an appearance attention module for separation.                                                         ✓
Several works have addressed AV source separation for non-speech signals. Similarly to other fields, classical methods [21], [30], [199], [200], [207] were recently replaced by deep-learning-based approaches (cf. Table VI). The first two works that concurrently proposed deep processing stages for the task under analysis were [68] and [288].

In [68], a novel neural network for multi-instance multi-label learning (MIML) was used to learn a mapping between audio frequency bases and visual object categories. Disentangled audio bases were used to guide a non-negative matrix factorisation (NMF) framework for source separation. The method was successfully employed for in-the-wild videos containing a broad set of object sounds, such as musical instruments, animals and vehicles. NMF was also adopted in a later work by Parekh et al. [201], where both audio frequency bases and their activations were used, leveraging temporal information. In contrast to [68], the system could also perform visual localisation, which is the task of detecting the sound sources in the visual input.

In [288], audio and video information were jointly used by a deep system called PixelPlayer to simultaneously localise the sound sources in the visual frames and acoustically separate them. The results of this technique sparked a particular interest in the research community, leading to the development of several methods aiming to improve it further.

First of all, Rouditchenko et al. [218] extended the work in [288] to unsynchronised audio and video data. Their approach consisted of a network able to disentangle acoustic and visual representations to independently perform visual object co-segmentation and sound source separation.

Then, PixelPlayer only considered semantic features extracted from the video frames. Appearance information is important, as highlighted in [290], where the separation was guided with a single image, but higher performance is expected to be achieved when motion information is also exploited. Zhao et al. [287] proposed to combine trajectory and semantic features to condition a source separation network. The system was trained with a curriculum learning scheme, consisting of three consecutive stages characterised by increasing levels of difficulty. This approach showed its effectiveness even for separating sounds of the same kind of musical instrument, an achievement not possible in [288]. However, the trajectory motion cues are not able to accurately model the interactions between a human and an object, e.g. a musical instrument. For this reason, Gan et al. [67] proposed to use keypoint-based structured visual representations together with the visual semantic context. In this way, they were able to achieve state-of-the-art performance. Motion information was also used in the form of optical flow and dynamic images [24] by Zhu and Rahtu [291]. Their approach refined the separated sounds in multiple stages within a framework called cascaded opponent filter (COF). In addition, they could achieve accurate sound source localisation with a sound source location masking (SSLM) network, following the idea in [103].

Especially when dealing with musical instruments, having a priori knowledge of the presence or absence of a particular instrument in a recording, i.e. weak labels, might be advantageous. Slizovskaia et al.
[230] studied the problem of source separation conditioned on additional information, which included not only visual cues but also weak labels. Their investigation covered, among other aspects, neural network architectures (either U-Net [216] or multi-head U-Net (MHU-Net) [51]), conditioning strategies (either feature-wise linear modulation (FiLM) [52] or multiplicative conditioning), places of conditioning (at the bottleneck, at all the encoder layers or at the final decoder layer), context vectors (static visual context vector, visual-motion context vector and binary indicator vector encoding the instruments in the mixture) and training targets (binary mask or ratio mask).

The audio signals used in these systems are generally monaural. Inspired by the fact that humans benefit from binaural cues [90], Gao and Grauman [69] proposed a method to exploit visual information with the aim of converting monaural audio into binaural audio. This conversion exposed acoustic spatial cues that turned out to be helpful for sound source separation.

Zhao et al. [288] separated the sound sources in the observed mixture assuming that they were independent. This assumption can generate two main issues. The first is that the sum of the separated sounds might be different from the actual mixture, i.e. some acoustic components of the actual mixture might not be found in any output of the separation system. Therefore, Xu et al. [278] proposed a novel method called MinusPlus network. The idea was to have a two-stage system in which a minus stage recursively identified the sound with the highest energy and removed it from the mixture, and a plus stage refined the removed sounds. The recursive procedure based on sound energy made it possible to automatically handle a variable number of sound sources and let the sounds with less energy emerge.

The second issue is related to the fact that training is usually performed following a paradigm in which distinct AV clips are randomly mixed. However, sounds that appear in the same scene are usually correlated, e.g. two musical instruments playing the same song. The use of training material consisting of independent videos might hinder a deep network from capturing such correlations. Hence, Gao and Grauman [70] introduced a new training paradigm, called co-separation, in which an association between consistent sounds and visual objects across pairs of training videos was learned. Exploring this aspect further, and possibly overcoming the supervision paradigm used in most of the works in the literature by using real-world recordings and not only synthetic mixtures for training, is an interesting future research direction that can easily be adopted also for AV-SE and AV-SS.

VII. CONCLUSION
In this paper, we presented an overview of deep-learning-based approaches for audio-visual speech enhancement (AV-SE) and audio-visual speech separation (AV-SS). The survey was organised to allow for a description and a discussion of the main elements characterising state-of-the-art systems, namely: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions.

We saw that, although raw visual data is often used as visual input, low-dimensional features are preferred in several works. This choice serves two purposes: first, it reduces the complexity of AV-SE and AV-SS algorithms, since the dimensionality of the data to process is lower; secondly, it makes it possible to train deep learning models for AV-SE and AV-SS with less data, because the low-dimensional features usually adopted are already somewhat robust to several factors, such as illumination conditions, face poses, etc. Besides visual features, AV-SE and AV-SS systems generally use the short-time magnitude spectrogram as acoustic input. Since the short-time magnitude spectrogram is not a complete representation of the acoustic signal, some methods exploit the phase information, the complex spectrogram or directly the time-domain signal. Future systems, where development data and computational resources may be abundant, might aim for end-to-end training using raw visual and acoustic signals directly as input.

In state-of-the-art AV-SE and AV-SS systems, the actual data processing is performed with deep-learning-based techniques. Generally, acoustic and visual features are processed separately using two neural network models. Then, the output vectors of these models are fused, often by concatenation, and, afterwards, used as input to another deep learning model. This strategy is convenient, because it is very easy to implement. However, it comes with a major drawback: a simple concatenation does not allow control over how the information from the acoustic and the visual modalities is treated. As a consequence, one of the two modalities may dominate over the other, causing a decrease in the total system performance. Among the strategies adopted to tackle this problem, attention-based mechanisms, which allow the systems to attend to relevant parts of the input, mitigate the potential imbalance caused by concatenation-based fusion.

The last two elements of AV-SE and AV-SS systems are training targets, i.e. the desired output of a deep learning model, and objective functions, i.e. functions that measure the distance between the desired output of a model and its actual output. Although a few approaches tried to directly approximate the target speech signal(s) in the time domain, more often a time-frequency (TF) representation of the signals is used. In particular, the deep-learning-based systems are generally trained to minimise the mean squared error (MSE) between the network output and the (potentially transformed) TF coefficients of the training target, which can be either the clean magnitude spectrogram or a mask that is applied to the noisy spectrogram to obtain an enhanced speech signal. Among the two training targets, the latter is usually preferred, because a mask has been empirically found to be easier to estimate with deep learning than the clean magnitude spectrogram.

We also presented three other aspects related to AV-SE and AV-SS, since they can provide additional insights.
We also presented three other aspects related to AV-SE and AV-SS, since they can provide additional insights.

First, we surveyed audio-visual (AV) speech datasets, since data-driven methods, like the ones based on deep learning, heavily rely on them. We saw that AV-SE and AV-SS research can still benefit from data collected in a controlled environment to study specific phenomena, like the Lombard effect. The general tendency, however, is to use large-scale in-the-wild datasets to make deep-learning-based systems robust to the variety of conditions that may be present in real-world applications.

Second, we reviewed the principal methodologies used to assess the performance of AV-SE and AV-SS. Specifically, we considered listening tests and objective measures. The former are the ideal way to assess processed speech signals and should eventually be employed for a realistic system evaluation. However, they are generally time-consuming and costly to conduct. The latter make it possible to estimate some speech aspects, like quality and intelligibility, in a quick and easily repeatable way, which is highly desirable in the development phase of AV systems. Although many objective measures exist, it might be reasonable to choose the ones that are widely adopted, to allow comparisons with previous approaches, and that are reported to correlate well with listening test results. Examples include PESQ, STOI (or its extended version, ESTOI), and SDR (or its scale-invariant definition, SI-SDR; a minimal sketch of the SI-SDR computation is given at the end of this section). Currently, such objective measures are audio-only (AO). This is in contrast to human communication, which is generally AV.

Third, we presented deep-learning-based methods used to solve two related tasks: speech reconstruction from silent videos and AV sound source separation for non-speech signals. In particular, we reported the chronological evolution of these fields, because they influenced the first AV-SE and AV-SS approaches and they may still provide a source of inspiration for AV-SE and AV-SS research, and vice versa.

Finally, we identified several future research directions throughout the paper. Some of them address aspects such as robustness to a variety of acoustic and visual conditions, relevant e.g. in teleconferencing, and the reduction of the computational complexity of deep learning algorithms, which is especially important for low-resource devices like hearing aids. Others, like the investigation of new paradigms for AV fusion, are more focused on a better exploitation of properties and constraints in multimodal systems, and they could, for example, also benefit AV speech recognition, AV emotion recognition and AV temporal synchronisation.
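As an illustration of one of the objective measures mentioned above, the snippet below sketches the SI-SDR computation in the spirit of its definition in [150]: the estimate is projected onto the (zero-mean) reference to obtain a scaled target component, and the energy ratio between this component and the residual is expressed in dB. The NumPy implementation and the toy signals are illustrative assumptions.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR (in dB) between an estimated and a reference signal."""
    # Remove the means, as commonly done in SI-SDR implementations.
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the reference to obtain the scaled target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target**2) + eps) / (np.sum(e_noise**2) + eps))

# Toy usage: a slightly noisy copy of a random reference signal.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
degraded = reference + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(degraded, reference):.1f} dB")
```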
ACKNOWLEDGMENT

This research is partially funded by the William Demant Foundation.
REFERENCES

[1] A. H. Abdelaziz, “NTCD-TIMIT: A new database and baseline for noise-robust audio-visual speech recognition,” in
Proc. of Interspeech ,2017.[2] A. Abel and A. Hussain, “Novel two-stage audiovisual speech filteringin noisy environments,”
Cognitive Computation , vol. 6, no. 2, 2014.[3] A. Adeel, J. Ahmad, H. Larijani, and A. Hussain, “A novel real-time,lightweight chaotic-encryption scheme for next-generation audio-visualhearing aids,”
Cognitive Computation , vol. 12, no. 3, pp. 589–601,2019.[4] A. Adeel, M. Gogate, and A. Hussain, “Towards next-generation lip-reading driven hearing-aids: A preliminary prototype demo,” in
Proc.of CHAT , 2017.[5] ——, “Contextual deep learning-based audio-visual switching forspeech enhancement in real-world environments,”
Information Fusion ,vol. 59, pp. 163–170, 2020.[6] A. Adeel, M. Gogate, A. Hussain, and W. M. Whitmer, “Lip-readingdriven deep learning approach for speech enhancement,”
IEEE Trans-actions on Emerging Topics in Computational Intelligence , 2019.[7] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deepaudio-visual speech enhancement,”
Proc. of Interspeech , 2018.[8] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman,“Deep audio-visual speech recognition,”
IEEE Transactions on PatternAnalysis and Machine Intelligence , 2018.[9] T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: alarge-scale dataset for visual speech recognition,” arXiv preprintarXiv:1809.00496 , 2018.[10] ——, “My lips are concealed: Audio-visual speech enhancementthrough obstructions,” in
Proc. of Interspeech , 2019.[11] H. Akbari, H. Arora, L. Cao, and N. Mesgarani, “Lip2AudSpec: Speechreconstruction from silent lip movements video,” in
Proc. of ICASSP ,2018.[12] Z. Aldeneh, A. P. Kumar, B.-J. Theobald, E. Marchi, S. Kajarekar,D. Naik, and A. H. Abdelaziz, “Self-supervised learning of visualspeech features with audiovisual speech enhancement,” arXiv preprintarXiv:2004.12031 , 2020.[13] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown, “Acorpus of audio-visual Lombard speech with frontal and profile views,”
The Journal of the Acoustical Society of America , vol. 143, no. 6, pp.EL523–EL529, 2018.[14] I. Almajai and B. Milner, “Visually derived Wiener filters for speechenhancement,”
IEEE Transactions on Audio, Speech, and LanguageProcessing , vol. 19, no. 6, pp. 1642–1651, 2010.[15] I. Almajai, B. Milner, and J. Darch, “Analysis of correlation betweenaudio and visual speech features for clean audio feature prediction innoise,” in
Proc. of Interspeech , 2006.[16] I. Anina, Z. Zhou, G. Zhao, and M. Pietik¨ainen, “OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis,” in
Proc. of FG , 2015.[17] A. Arriandiaga, G. Morrone, L. Pasa, L. Badino, and C. Bartolozzi,“Audio-visual target speaker extraction on multi-talker environmentusing event-driven cameras,” arXiv preprint arXiv:1912.02671 , 2019.[18] Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, “LipNet:End-to-end sentence-level lipreading,” in
Proc. of GTC , 2017.[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation byjointly learning to align and translate,” in
Proc. of ICLR , 2015.[20] J. P. Barker and F. Berthommier, “Evidence of correlation betweenacoustic and visual features of speech,” in
Proc. of ICPhS , 1999.[21] Z. Barzelay and Y. Y. Schechner, “Harmony in motion,” in
Proc. ofCVPR , 2007.[22] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann,J. Pomy, and M. Keyhl, “Perceptual objective listening quality assess-ment (POLQA), the third generation ITU-T standard for end-to-endspeech quality measurement Part II - Perceptual model,”
The Journalof the Audio Engineering Society , vol. 61, no. 6, pp. 385–402, 2013. [23] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependen-cies with gradient descent is difficult,”
IEEE Transactions on NeuralNetworks , vol. 5, no. 2, pp. 157–166, 1994.[24] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamicimage networks for action recognition,” in
Proc. of CVPR , 2016.[25] A. W. Bronkhorst, “The cocktail party phenomenon: A review ofresearch on speech intelligibility in multiple-talker conditions,”
ActaAcustica united with Acustica , vol. 86, no. 1, pp. 117–128, 2000.[26] H. Brumm and S. A. Zollinger, “The evolution of the Lombard effect:100 years of psychoacoustic research,”
Behaviour , vol. 148, no. 11-13,pp. 1173–1198, 2011.[27] E. Cano, D. FitzGerald, and K. Brandenburg, “Evaluation of quality ofsound source separation algorithms: Human perception vs quantitativemetrics,” in
Proc. of EUSIPCO , 2016.[28] J. Capon, “High-resolution frequency-wavenumber spectrum analysis,”
Proceedings of the IEEE , vol. 57, no. 8, pp. 1408–1418, 1969.[29] R. Caruana, “Multitask learning,”
Machine learning , vol. 28, no. 1, pp.41–75, 1997.[30] A. L. Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval,“Blind audiovisual source separation based on sparse redundant rep-resentations,”
IEEE Transactions on Multimedia , vol. 12, no. 5, pp.358–371, 2010.[31] J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New insights into thenoise reduction Wiener filter,”
IEEE Transactions on Audio, Speech,and Language Processing , vol. 14, no. 4, pp. 1218–1234, 2006.[32] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in
Proc. of ICASSP , 2017.[33] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multi-channel overlapped speech recognition with location guided speechextraction network,” in
Proc. of SLT , 2018.[34] E. C. Cherry, “Some experiments on the recognition of speech, with oneand with two ears,”
The Journal of the Acoustical Society of America ,vol. 25, no. 5, pp. 975–979, 1953.[35] K. Cho, B. van Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares,H. Schwenk, and Y. Bengio, “Learning phrase representations usingRNN encoder–decoder for statistical machine translation,” in
Proc. ofEMNLP , 2014.[36] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,“Attention-based models for speech recognition,” in
Proc. of NIPS ,2015.[37] S.-Y. Chuang, Y. Tsao, C.-C. Lo, and H.-M. Wang, “Lite audio-visualspeech enhancement,” in
Proc. of Interspeech (to appear) , 2020.[38] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speakerrecognition,”
Proc. of Interspeech , 2018.[39] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip readingsentences in the wild,” in
Proc. of CVPR , 2017.[40] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in
Proc. ofACCV , 2016.[41] ——, “Lip reading in profile,” in
Proc. of BMVC , 2017.[42] S.-W. Chung, S. Choe, J. S. Chung, and H.-G. Kang, “Facefilter:Audio-visual speech separation using still images,” arXiv preprintarXiv:2005.07074 , 2020.[43] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visualcorpus for speech perception and automatic speech recognition,”
TheJournal of the Acoustical Society of America , vol. 120, no. 5, pp. 2421–2424, 2006.[44] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance mod-els,”
IEEE Transactions on Pattern Analysis and Machine Intelligence ,vol. 23, no. 6, pp. 681–685, 2001.[45] G. Cybenko, “Approximation by superpositions of a sigmoidal func-tion,”
Mathematics of Control, Signals and Systems , vol. 2, no. 4, pp.303–314, 1989.[46] A. Czyzewski, B. Kostek, P. Bratoszewski, J. Kotus, and M. Szykulski,“An audio-visual corpus for multimodal automatic speech recognition,”
Journal of Intelligent Information Systems , vol. 49, no. 2, pp. 167–192,2017.[47] T. Darrell, J. W. Fisher, and P. Viola, “Audio-visual segmentation and“the cocktail party effect”,” in
Proc. of ICMI , 2000.[48] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,”
IEEE Transactions onAudio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798,2010.[49] J. R. Deller, J. H. L. Hansen, and J. G. Proakis,
Discrete-TimeProcessing of Speech Signals . Wiley-IEEE Press, 2000.[50] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S.Brumberg, “Silent speech interfaces,”
Speech Communication , vol. 52,no. 4, pp. 270–287, 2010. [51] C. S. Doire and O. Okubadejo, “Interleaved multitask learning foraudio source separation with independent databases,” arXiv preprintarXiv:1908.05182 , 2019.[52] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. d. Vries, A. Courville,and Y. Bengio, “Feature-wise transformations,” Distill , vol. 3, no. 7, p.e11, 2018.[53] J. P. Egan, “Articulation testing methods,”
The Laryngoscope , vol. 58,no. 9, pp. 955–991, 1948.[54] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,”
IEEETransactions on Acoustics, Speech, and Signal Processing , vol. 32,no. 6, pp. 1109–1121, 1984.[55] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T.Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: Aspeaker-independent audio-visual model for speech separation,”
ACMTransactions on Graphics , vol. 37, no. 4, pp. 112:1–112:11, 2018.[56] A. Ephrat, T. Halperin, and S. Peleg, “Improved speech reconstructionfrom silent video,” in
Proc. of CVAVM , 2017.[57] A. Ephrat and S. Peleg, “Vid2Speech: Speech reconstruction from silentvideo,” in
Proc. of ICASSP , 2017.[58] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recur-rent neural networks,” in
Proc. of ICASSP , 2015.[59] ——, “Deep recurrent networks for separation and recognition ofsingle-channel speech in nonstationary background audio,” in
New Erafor Robust Speech Recognition . Springer, 2017, pp. 165–186.[60] G. Fairbanks, “Test of phonemic differentiation: The rhyme test,”
TheJournal of the Acoustical Society of America , vol. 30, no. 7, pp. 596–600, 1958.[61] T. H. Falk, V. Parsa, J. F. Santos, K. Arehart, O. Hazrati, R. Huber, J. M.Kates, and S. Scollie, “Objective quality and intelligibility predictionfor users of assistive listening devices: Advantages and limitations ofexisting tools,”
IEEE Signal Processing Magazine , vol. 32, no. 2, pp.114–124, 2015.[62] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in
Proc. of CVPR ,2016.[63] J. F. Feuerstein, “Monaural versus binaural hearing: Ease of listening,word recognition, and attentional effort,”
Ear and Hearing , vol. 13,no. 2, pp. 80–86, 1992.[64] H. Fletcher and J. Steinberg, “Articulation testing methods,”
The BellSystem Technical Journal , vol. 8, no. 4, pp. 806–854, 1929.[65] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing throughnoise: Visually driven speaker separation and enhancement,” in
Proc.of ICASSP , 2018.[66] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,”
Proc. of Interspeech , 2018.[67] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Musicgesture for visual sound separation,” in
Proc. of CVPR , 2020.[68] R. Gao, R. Feris, and K. Grauman, “Learning to separate object soundsby watching unlabeled video,” in
Proc. of ECCV , 2018.[69] R. Gao and K. Grauman, “2.5D visual sound,” in
Proc. of CVPR , 2019.[70] ——, “Co-separating sounds of visual objects,” in
Proc. of ICCV , 2019.[71] S. A. Gelfand,
Hearing: An introduction to Psychological and Physi-ological Acoustics . CRC Press, 2016.[72] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:Continual prediction with LSTM,”
Neural Computation , vol. 12, no. 10,pp. 2451–2471, 2000.[73] L. Girin, J.-L. Schwartz, and G. Feng, “Audio-visual enhancement ofspeech in noise,”
The Journal of the Acoustical Society of America ,vol. 109, no. 6, pp. 3007–3020, 2001.[74] X. Glorot and Y. Bengio, “Understanding the difficulty of training deepfeedforward neural networks,” in
Proc. of AISTATS , 2010.[75] M. Gogate, A. Adeel, K. Dashtipour, P. Derleth, and A. Hussain,“AV speech enhancement challenge using a real noisy corpus,” arXivpreprint arXiv:1910.00424 , 2019.[76] M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, “DNNdriven speaker independent audio-visual mask estimation for speechseparation,” in
Proc. of Interspeech , 2018.[77] M. Gogate, K. Dashtipour, A. Adeel, and A. Hussain, “Cochleanet: Arobust language-independent audio-visual model for speech enhance-ment,”
Information Fusion , vol. 63, pp. 273–285, 2020.[78] E. Z. Golumbic, G. B. Cogan, C. E. Schroeder, and D. Poeppel, “Visualinput enhances selective speech envelope tracking in auditory cortexat a “cocktail party”,”
Journal of Neuroscience , vol. 33, no. 4, pp.1417–1426, 2013. [79] I. Goodfellow, Y. Bengio, and A. Courville,
Deep learning . MITpress, 2016.[80] A. Graves, S. Fern´andez, F. Gomez, and J. Schmidhuber, “Connection-ist temporal classification: labelling unsegmented sequence data withrecurrent neural networks,” in
Proc. of ICML , 2006.[81] D. Griffin and J. Lim, “Signal estimation from modified short-timefourier transform,”
IEEE Transactions on Acoustics, Speech, and SignalProcessing , vol. 32, no. 2, pp. 236–243, 1984.[82] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li,S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar et al. , “Ava:A video dataset of spatio-temporally localized atomic visual actions,”in
Proc. of CVPR , 2018.[83] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou,and D. Yu, “Neural spatial filter: Target speaker speech separationassisted with directional information,” in
Proc. of Interspeech , 2019.[84] R. Gu, J. Wu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, andD. Yu, “End-to-end multi-channel speech separation,” arXiv preprintarXiv:1905.06286 , 2019.[85] R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, “Multi-modalmulti-channel target speech separation,”
IEEE Journal of SelectedTopics in Signal Processing , 2020.[86] B. Hagerman, “Sentences for testing speech intelligibility in noise,”
Scandinavian Audiology , vol. 11, no. 2, pp. 79–87, 1982.[87] M. H¨allgren, B. Larsby, and S. Arlinger, “A Swedish version of thehearing in noise test (HINT) for measurement of speech recognition,”
International Journal of Audiology , vol. 45, no. 4, pp. 227–237, 2006.[88] C. Han, Y. Luo, and N. Mesgarani, “Real-time binaural speech sepa-ration with preserved spatial cues,” in
Proc. of ICASSP , 2020.[89] N. Harte and E. Gillen, “TCD-TIMIT: An audio-visual corpus ofcontinuous speech,”
IEEE Transactions on Multimedia , vol. 17, no. 5,pp. 603–615, 2015.[90] M. L. Hawley, R. Y. Litovsky, and J. F. Culling, “The benefit of binauralhearing in a cocktail party: Effect of location and type of interferer,”
The Journal of the Acoustical Society of America , vol. 115, no. 2, pp.833–843, 2004.[91] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
Proc. of the CVPR , 2016.[92] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering:Discriminative embeddings for segmentation and separation,” in
Proc.of ICASSP , 2016.[93] A. Hines, J. Skoglund, A. Kokaram, and N. Harte, “ViSQOL: Thevirtual speech quality objective listener,” in proc. of IWAENC , 2012.[94] A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: anobjective speech quality model,”
EURASIP Journal on Audio, Speech,and Music Processing , vol. 2015, no. 1, pp. 1–18, 2015.[95] S. Hochmuth, T. Brand, M. A. Zokoll, F. Z. Castro, N. Wardenga, andB. Kollmeier, “A spanish matrix sentence test for assessing speechreception thresholds in noise,”
International Journal of Audiology ,vol. 51, no. 7, pp. 536–544, 2012.[96] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
NeuralComputation , vol. 9, no. 8, pp. 1735–1780, 1997.[97] K. Hornik, “Approximation capabilities of multilayer feedforwardnetworks,”
Neural Networks , vol. 4, no. 2, pp. 251–257, 1991.[98] K. Hornik, M. Stinchcombe, H. White et al. , “Multilayer feedforwardnetworks are universal approximators.”
Neural Networks , vol. 2, no. 5,pp. 359–366, 1989.[99] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using multimodal deepconvolutional neural networks,”
IEEE Transactions on Emerging Topicsin Computational Intelligence , vol. 2, no. 2, pp. 117–128, 2018.[100] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang,and H.-M. Wang, “Audio-visual speech enhancement using deep neuralnetworks,” in
Proc. of APSIPA , 2016.[101] A. S. House, C. E. Williams, M. H. Hecker, and K. D. Kry-ter, “Articulation-testing methods: consonantal differentiation with aclosed-response set,”
The Journal of the Acoustical Society of America ,vol. 37, no. 1, pp. 158–166, 1965.[102] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proc. of CVPR , 2018.[103] J. Hu, Y. Zhang, and T. Okatani, “Visualization of convolutional neuralnetworks for monocular depth estimation,” in
Proc. of ICCV , 2019.[104] Y. Hu and P. C. Loizou, “Evaluation of objective quality measuresfor speech enhancement,”
IEEE Transactions on Audio, Speech, andLanguage Processing , vol. 16, no. 1, pp. 229–238, 2007.[105] T. Hueber and G. Bailly, “Statistical conversion of silent articulationinto audible speech using full-covariance HMM,”
Computer Speech &Language , vol. 36, pp. 274–293, 2016. [106] A. Hussain, J. Barker, R. Marxer, A. Adeel, W. Whitmer, R. Watt, andP. Derleth, “Towards multi-modal hearing aid design and evaluation inrealistic audio-visual settings: Challenges and opportunities,” in Proc.of CHAT , 2017.[107] E. Ideli, “Audio-visual speech processing using deep learning tech-niques.” MSc thesis, Applied Sciences: School of EngineeringScience, 2019.[108] E. Ideli, B. Sharpe, I. V. Baji´c, and R. G. Vaughan, “Visually assistedtime-domain speech enhancement,” in
Proc. of GlobalSIP , 2019.[109] B. ˙Inan, M. Cernak, H. Grabner, H. P. Tukuljac, R. C. Pena, andB. Ricaud, “Evaluating audiovisual source separation in the contextof video conferencing,”
Proc. of Interspeech , 2019.[110] A. N. S. Institute,
American National Standard: Methods for Calcula-tion of the Speech Intelligibility Index . Acoustical Society of America,1997.[111] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” in
Proc. of ICML ,2015.[112] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey,“Single-channel multi-speaker separation using deep clustering,”
Proc.of Interspeech , 2016.[113] ITU-R, “Recommendation BS.562: Subjective assessment of soundquality,” 1990.[114] ——, “Recommendation BS.1534-1: Method for the subjective assess-ment of intermediate quality levels of coding systems,” 2003.[115] ——, “Recommendation BS.1284-2: General methods for the subjec-tive assessment of sound quality,” 2019.[116] ITU-T, “P.830 : Subjective performance assessment of telephone-bandand wideband digital codecs,” 1996.[117] ——, “Recommendation P.862: Perceptual evaluation of speech quality(PESQ): an objective method for end-to-end speech quality assessmentof narrow-band telephone networks and speech codecs,” 2001.[118] ——, “Recommendation P.835: Subjective test methodology for eval-uating speech communication systems that include noise suppressionalgorithm,” 2003.[119] ——, “Recommendation P.862.1: Mapping function for transformingP.862 raw result scores to MOS-LQO,” 2003.[120] ——, “Recommendation P.862.2: Wideband extension to recommen-dation P.862 for the assessment of wideband telephone networks andspeech codecs,” 2005.[121] ——, “Recommendation P.863: Perceptual objective listening qualityassessment,” 2011.[122] M. L. Iuzzolino and K. Koishida, “AV(SE) : Audio-visual squeeze-excite speech enhancement,” in Proc. of ICASSP , 2020.[123] U. Jekosch,
Voice and Speech Quality Perception: Assessment andEvaluation . Springer Science & Business Media, 2006.[124] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibilityof speech masked by modulated noise maskers,”
IEEE/ACM Transac-tions on Audio, Speech, and Language Processing , vol. 24, no. 11, pp.2009–2022, 2016.[125] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen,R. Pang, I. L. Moreno, Y. Wu et al. , “Transfer learning from speakerverification to multispeaker text-to-speech synthesis,” in
Proc. ofNeurIPS , 2018.[126] Y. Jiang and R. Liu, “Binaural deep neural network for robust speechenhancement,” in
Proc. of ICSPCC , 2014.[127] C. Jorgensen and S. Dusan, “Speech interfaces based upon surfaceelectromyography,”
Speech Communication , vol. 52, no. 4, pp. 354–366, 2010.[128] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, “MMTM:Multimodal transfer module for CNN fusion,”
Proc. of CVPR , 2020.[129] D. N. Kalikow, K. N. Stevens, and L. L. Elliott, “Development ofa test of speech intelligibility in noise using sentence materials withcontrolled word predictability,”
The Journal of the Acoustical Societyof America , vol. 61, no. 5, pp. 1337–1351, 1977.[130] J. Kates and K. Arehart, “Coherence and the speech intelligibilityindex,”
The Journal of the Acoustical Society of America , vol. 115,no. 5, pp. 2604–2604, 2004.[131] J. M. Kates and K. H. Arehart, “The hearing-aid speech quality index(HASQI),”
Journal of the Audio Engineering Society , vol. 58, no. 5,pp. 363–381, 2010.[132] ——, “The hearing-aid speech perception index (HASPI),”
SpeechCommunication , vol. 65, pp. 75–93, 2014.[133] ——, “The hearing-aid speech quality index (HASQI) version 2,”
Journal of the Audio Engineering Society , vol. 62, no. 3, pp. 99–117,2014. [134] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuringspeech representations using a pitch-adaptive time–frequency smooth-ing and an instantaneous-frequency-based F0 extraction: Possible roleof a repetitive structure in sounds,”
Speech Communication , vol. 27,no. 3-4, pp. 187–207, 1999.[135] C. T. Kello and D. C. Plaut, “A neural network model of thearticulatory-acoustic forward mapping trained on recordings of articu-latory parameters,”
The Journal of the Acoustical Society of America ,vol. 116, no. 4, pp. 2354–2364, 2004.[136] F. U. Khan, B. P. Milner, and T. Le Cornu, “Using visual speech infor-mation in masking methods for audio speaker separation,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 26,no. 10, pp. 1742–1754, 2018.[137] M. S. Khan, S. M. Naqvi, W. Wang, J. Chambers et al. , “Video-aided model-based source separation in real reverberant rooms,”
IEEETransactions on Audio, Speech, and Language Processing , vol. 21,no. 9, pp. 1900–1912, 2013.[138] J. Kiefer, J. Wolfowitz et al. , “Stochastic estimation of the maximum ofa regression function,”
The Annals of Mathematical Statistics , vol. 23,no. 3, pp. 462–466, 1952.[139] C. Kim, H. V. Shin, T.-H. Oh, A. Kaspar, M. Elgharib, and W. Matusik,“On learning associations of faces and voices,” in
Proc. of ACCV , 2018.[140] D. E. King, “Dlib-ml: A machine learning toolkit,”
Journal of MachineLearning Research , vol. 10, pp. 1755–1758, 2009.[141] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”in
Proc. of ICLR , 2015.[142] U. Kjems, J. B. Boldt, M. S. Pedersen, T. Lunner, and D. Wang, “Roleof mask pattern in intelligibility of ideal binary-masked noisy speech,”
The Journal of the Acoustical Society of America , vol. 126, no. 3, pp.1415–1426, 2009.[143] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Speech intelligibility potential ofgeneral and specialized deep neural network based speech enhancementsystems,”
IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing , vol. 25, no. 1, pp. 153–167, 2017.[144] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speechseparation with utterance-level permutation invariant training of deeprecurrent neural networks,”
IEEE/ACM Transactions on Audio, Speechand Language Processing , vol. 25, no. 10, pp. 1901–1913, 2017.[145] Y. Kumar, M. Aggarwal, P. Nawal, S. Satoh, R. R. Shah, and R. Zim-mermann, “Harnessing AI for speech reconstruction using multi-viewsilent video feed,” in
Proc. of ACM-MM , 2018.[146] Y. Kumar, R. Jain, K. M. Salik, R. R. Shah, Y. Yin, and R. Zimmer-mann, “Lipper: Synthesizing thy speech using multi-view lipreading,”in
Proc. of AAAI , 2019.[147] Y. Kumar, R. Jain, M. Salik, R. R. Shah, R. Zimmermann, and Y. Yin,“MyLipper: A personalized system for speech reconstruction usingmulti-view visual feeds,” in
Proc. of ISM , 2018.[148] T. Le Cornu and B. Milner, “Reconstructing intelligible audio speechfrom visual speech features,” in
Proc. of Interspeech , 2015.[149] ——, “Generating intelligible audio speech from visual speech,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing ,vol. 25, no. 9, pp. 1751–1761, 2017.[150] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in
Proc. of ICASSP , 2019.[151] Y. LeCun et al. , “Generalization and network design strategies,”
Connectionism in Perspective , vol. 19, pp. 143–155, 1989.[152] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling frameworkbased on variational autoencoders for speech enhancement,” in
Proc.of MLSP , 2018.[153] C. Li and Y. Qian, “Deep audio-visual speech separation with attentionmechanism,” in
Proc. of ICASSP , 2020.[154] Y. Li, Z. Liu, Y. Na, Z. Wang, B. Tian, and Q. Fu, “A visual-pilot deepfusion for target speech separation in multitalker noisy environment,”in
Proc. of ICASSP , 2020.[155] Y. Liang, S. M. Naqvi, and J. A. Chambers, “Audio video basedfast fixed-point independent vector analysis for multisource separationin a room environment,”
EURASIP Journal on Advances in SignalProcessing , vol. 2012, no. 1, p. 183, 2012.[156] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 ×
128 120 dB 15 µ slatency asynchronous temporal contrast vision sensor,” IEEE Journalof Solid-State Circuits , vol. 43, no. 2, pp. 566–576, 2008.[157] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, andY. Bengio, “A structured self-attentive sentence embedding,”
Proc. ofICLR , 2017.[158] K. Liu, Y. Li, N. Xu, and P. Natarajan, “Learn to combine modalitiesin multimodal deep learning,”
Proc. of KDD BigMine , 2018. [159] Q. Liu, W. Wang, P. J. Jackson, M. Barnard, J. Kittler, and J. Chambers,“Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking,” IEEE Transactions on Signal Processing , vol. 61, no. 22, pp. 5520–5535, 2013.[160] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, andA. C. Berg, “SSD: Single shot multibox detector,” in
Proc. of ECCV ,2016.[161] S. R. Livingstone and F. A. Russo, “The Ryerson audio-visual databaseof emotional speech and song (RAVDESS): A dynamic, multimodalset of facial and vocal expressions in North American English,”
PLOSONE , vol. 13, no. 5, 2018.[162] P. C. Loizou,
Speech Enhancement: Theory and Practice . CRC press,2013.[163] E. Lombard, “Le signe de l’elevation de la voix,”
Annales des Maladiesde L’Oreille et du Larynx , vol. 37, no. 2, pp. 101–119, 1911.[164] R. Lu, Z. Duan, and C. Zhang, “Listen and look: Audio–visualmatching assisted speech source separation,”
IEEE Signal ProcessingLetters , vol. 25, no. 9, pp. 1315–1319, 2018.[165] ——, “Audio–visual deep clustering for speech separation,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 27,no. 11, pp. 1697–1712, 2019.[166] B. D. Lucas and T. Kanade, “An iterative image registration techniquewith an application to stereo vision,” in
Proc. of IJCAI , 1981.[167] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speechseparation with deep attractor network,”
IEEE/ACM Transactions onAudio, Speech, and Language Processing , vol. 26, no. 4, pp. 787–796,2018.[168] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 27,no. 8, pp. 1256–1266, 2019.[169] Y. Luo, J. Wang, X. Wang, L. Wen, and L. Wang, “Audio-visual speechseparation using i-Vectors,” in
Proc. of ICICSP , 2019.[170] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches toattention-based neural machine translation,” in
Proc. of EMNLP , 2015.[171] H. K. Maganti, D. Gatica-Perez, and I. McCowan, “Speech en-hancement and recognition in meetings with an audio–visual sensorarray,”
IEEE Transactions on Audio, Speech, and Language Processing ,vol. 15, no. 8, pp. 2257–2269, 2007.[172] D. B. Mallick, J. F. Magnotti, and M. S. Beauchamp, “Variability andstability in the McGurk effect: Contributions of participants, stimuli,time, and response type,”
Psychonomic Bulletin & Review , vol. 22,no. 5, pp. 1299–1307, 2015.[173] D. W. Massaro and J. A. Simpson,
Speech perception by ear and eye:A paradigm for psychological inquiry . Psychology Press, 2014.[174] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,”
Nature , vol. 264, no. 5588, pp. 746–748, 1976.[175] D. Michelsanti and Z.-H. Tan, “Conditional generative adversarial net-works for speech enhancement and noise-robust speaker verification,”in
Proc. of Interspeech , 2017.[176] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “On trainingtargets and objective functions for deep-learning-based audio-visualspeech enhancement,” in
Proc. of ICASSP , 2019.[177] D. Michelsanti, O. Slizovskaia, G. Haro, E. G´omez, Z.-H. Tan, andJ. Jensen, “Vocoder-based speech synthesis from silent videos,” in
Proc.of Interspeech (to appear) , 2020.[178] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, “Deep-learning-based audio-visual speech enhancement in presence of Lom-bard effect,”
Speech Communication , vol. 115, pp. 38–50, 2019.[179] ——, “Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems,” in
Proc. of ICASSP ,2019.[180] G. A. Miller and P. E. Nicely, “An analysis of perceptual confusionsamong some English consonants,”
The Journal of the AcousticalSociety of America , vol. 27, no. 2, pp. 338–352, 1955.[181] M. Morise and Y. Watanabe, “Sound quality comparison among high-quality vocoders by using re-synthesized speech,”
Acoustical Scienceand Technology , vol. 39, no. 3, pp. 263–265, 2018.[182] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-basedhigh-quality speech synthesis system for real-time applications,”
IEICETransactions on Information and Systems , vol. 99, no. 7, pp. 1877–1884, 2016.[183] G. Morrone, S. Bergamaschi, L. Pasa, L. Fadiga, V. Tikhanoff, andL. Badino, “Face landmark-based speaker-independent audio-visualspeech enhancement in multi-talker environments,” in
Proc. of ICASSP ,2019. [184] S. Mun, S. Choe, J. Huh, and J. S. Chung, “The sound of my voice:speaker representation loss for target voice separation,” in
Proc. ofICASSP , 2020.[185] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A large-scalespeaker identification dataset,”
Proc. of Interspeech , 2017.[186] S. M. Naqvi, W. Wang, M. S. Khan, M. Barnard, and J. A. Chambers,“Multimodal (audio–visual) source separation exploiting multi-speakertracking, robust beamforming and time–frequency masking,”
IET Sig-nal Processing , vol. 6, no. 5, pp. 466–477, 2012.[187] S. M. Naqvi, M. Yu, and J. A. Chambers, “A multimodal approach toblind source separation of moving sources,”
IEEE Journal of SelectedTopics in Signal Processing , vol. 4, no. 5, pp. 895–910, 2010.[188] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,“Multimodal deep learning,” in
Proc. of ICML , 2011.[189] J. B. Nielsen and T. Dau, “Development of a Danish speech intelli-gibility test,”
International Journal of Audiology , vol. 48, no. 10, pp.729–741, 2009.[190] ——, “The Danish hearing in noise test,”
International Journal ofAudiology , vol. 50, no. 3, pp. 202–208, 2011.[191] M. Nilsson, S. D. Soli, and J. A. Sullivan, “Development of the hearingin noise test for the measurement of speech reception thresholds inquiet and in noise,”
The Journal of the Acoustical Society of America ,vol. 95, no. 2, pp. 1085–1099, 1994.[192] T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, and T. Nakatani,“Multimodal SpeakerBeam: Single channel target speech extractionwith audio-visual speaker clues,”
Proc. Interspeech , 2019.[193] T. Ochiai, S. Watanabe, and S. Katagiri, “Does speech enhancementwork with end-to-end ASR objectives?: Experimental analysis ofmultichannel end-to-end ASR,” in
Proc. of MLSP , 2017.[194] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein,and W. Matusik, “Speech2Face: Learning the face behind a voice,” in
Proc. of CVPR , 2019.[195] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in
Proc. of ECCV , 2018.[196] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, andW. T. Freeman, “Visually indicated sounds,” in
Proc. of CVPR , 2016.[197] E. Ozimek, A. Warzybok, and D. Kutzner, “Polish sentence matrix testfor speech intelligibility measurement in noise,”
International Journalof Audiology , vol. 49, no. 6, pp. 444–454, 2010.[198] A. Pandey and D. Wang, “On adversarial training and loss functionsfor speech enhancement,” in
Proc. of ICASSP , 2018.[199] S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. P´erez, and G. Richard,“Guiding audio source separation by video object information,” in
Proc.of WASPAA , 2017.[200] ——, “Motion informed audio source separation,” in
Proc. of ICASSP ,2017.[201] S. Parekh, A. Ozerov, S. Essid, N. Q. Duong, P. P´erez, and G. Richard,“Identify, locate and separate: Audio-visual object extraction in largevideo collections using weak supervision,” in
Proc. of WASPAA , 2019.[202] S. Partan and P. Marler, “Communication goes multimodal,”
Science ,vol. 283, no. 5406, pp. 1272–1273, 1999.[203] L. Pasa, G. Morrone, and L. Badino, “An analysis of speech enhance-ment and recognition losses in limited resources multi-talker singlechannel audio-visual ASR,” in
Proc. of ICASSP , 2020.[204] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, “CUAVE: Anew audio-visual database for multimodal human-computer interfaceresearch,” in
Proc. of ICASSP , 2002.[205] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang,J. Raiman, and J. Miller, “Deep Voice 3: 2000-speaker neural text-to-speech,” in
Proc. of ICLR , 2018.[206] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar,“Learning individual speaking styles for accurate lip to speech syn-thesis,” in
Proc. of CVPR , 2020.[207] J. Pu, Y. Panagakis, S. Petridis, and M. Pantic, “Audio-visual objectlocalization and separation using low-rank and sparsity,” in
Proc. ofICASSP , 2017.[208] L. Qu, C. Weber, and S. Wermter, “Multimodal target speech separationwith voice and face references,” arXiv preprint arXiv:2005.08335 ,2020.[209] D. Ramachandram and G. W. Taylor, “Deep multimodal learning:A survey on recent advances and trends,”
IEEE Signal ProcessingMagazine , vol. 34, no. 6, pp. 96–108, 2017.[210] K. S. Rhebergen and N. J. Versfeld, “A speech intelligibility index-based approach to predict the speech reception threshold for sentencesin fluctuating noise for normal-hearing listeners,”
The Journal of theAcoustical Society of America , vol. 117, no. 4, pp. 2181–2192, 2005. [211] C. Richie, S. Warburton, and M. Carter, Audiovisual database of spokenAmerican English . Linguistic Data Consortium, 2009.[212] J. Rinc´on-Trujillo and D. M. C´ordova-Esparza, “Analysis of speechseparation methods based on deep learning,”
International Journal ofComputer Applications , vol. 148, no. 9, pp. 21–29, 2019.[213] B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, “Audiovisualspeech source separation: An overview of key methodologies,”
IEEESignal Processing Magazine , vol. 31, no. 3, pp. 125–134, 2014.[214] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptualevaluation of speech quality (PESQ) - A new method for speech qualityassessment of telephone networks and codecs,” in
Proc. of ICASSP ,2001.[215] H. Robbins and S. Monro, “A stochastic approximation method,”
TheAnnals of Mathematical Statistics , pp. 400–407, 1951.[216] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net-works for biomedical image segmentation,” in
MICCAI , 2015.[217] J. Roth, S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver,S. Ramaswamy, A. Stopczynski, C. Schmid, Z. Xi et al. , “Ava activespeaker: An audio-visual dataset for active speaker detection,” in
Proc.of ICASSP , 2020.[218] A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba,“Self-supervised audio-visual co-segmentation,” in
Proc. of ICASSP ,2019.[219] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre-sentations by back-propagating errors,”
Nature , vol. 323, no. 6088, pp.533–536, 1986.[220] M. Sadeghi and X. Alameda-Pineda, “Mixture of inference networksfor VAE-based audio-visual speech enhancement,” arXiv preprintarXiv:1912.10647 , 2019.[221] ——, “Robust unsupervised audio-visual speech enhancement using amixture of variational autoencoders,” in
Proc. of ICASSP , 2020.[222] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud,“Audio-visual speech enhancement using conditional variational auto-encoders,”
IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing , vol. 28, pp. 1788–1800, 2020.[223] C. Sanderson and K. K. Paliwal, “Noise compensation in a personverification system using face and multiple speech features,”
PatternRecognition , vol. 36, no. 2, pp. 293–302, 2003.[224] L. Sch¨onherr, D. Orth, M. Heckmann, and D. Kolossa, “Environmen-tally robust audio-visual speaker identification,” in
Proc. of SLT , 2016.[225] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-works,”
IEEE Transactions on Signal Processing , vol. 45, no. 11, pp.2673–2681, 1997.[226] J.-L. Schwartz, F. Berthommier, and C. Savariaux, “Audio-visual sceneanalysis: Evidence for a “very-early” integration process in audio-visualspeech perception,” in
Proc. of ICSLP - Interspeech , 2002.[227] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen,Y. Zhang, Y. Wang, R. Skerrv-Ryan et al. , “Natural TTS synthesis byconditioning WaveNet on mel spectrogram predictions,” in
Proc. ofICASSP , 2018.[228] B. G. Shinn-Cunningham and V. Best, “Selective attention in normaland impaired hearing,”
Trends in Amplification , vol. 12, no. 4, pp. 283–299, 2008.[229] K. Simonyan and A. Zisserman, “Very deep convolutional networksfor large-scale image recognition,” in
Proc. of ICLR , 2015.[230] O. Slizovskaia, G. Haro, and E. G´omez, “Conditioned source separationfor music instrument performances,” arXiv preprint arXiv:2004.03873 ,2020.[231] D. Sodoyer, L. Girin, C. Jutten, and J.-L. Schwartz, “Developing anaudio-visual speech source separation algorithm,”
Speech Communica-tion , vol. 44, no. 1-4, pp. 113–125, 2004.[232] D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten,“Separation of audio-visual speech sources: A new approach exploitingthe audio-visual coherence of speech stimuli,”
EURASIP Journal onAdvances in Signal Processing , no. 11, pp. 1165–1173, 2002.[233] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville,and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in
Proc. ofICLR Workshop , 2017.[234] T. Stafylakis and G. Tzimiropoulos, “Combining residual networkswith LSTMs for lipreading,” in
Proc. of Interspeech , 2017.[235] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for themeasurement of the psychological magnitude pitch,”
The Journal ofthe Acoustical Society of America , vol. 8, no. 3, pp. 185–190, 1937.[236] W. H. Sumby and I. Pollack, “Visual contribution to speech intelli-gibility in noise,”
The Journal of the Acoustical Society of America ,vol. 26, no. 2, 1954. [237] Q. Summerfield, “Lipreading and audio-visual speech perception,”
Philosophical Transactions of the Royal Society of London. Series B:Biological Sciences , vol. 335, no. 1273, pp. 71–78, 1992.[238] L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, “Multiple-target deep learningfor LSTM-RNN based speech enhancement,” in
HSCMA , 2017.[239] Z. Sun, Y. Wang, and L. Cao, “An attention based speaker-independentaudio-visual deep learning model for speech enhancement,” in
Proc. ofMMM , 2020.[240] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learningwith neural networks,” in
Proc. of NIPS , 2014.[241] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithmfor intelligibility prediction of time–frequency weighted noisy speech,”
IEEE Transactions on Audio, Speech, and Language Processing ,vol. 19, no. 7, 2011.[242] T. M. F. Taha and A. Hussain, “A survey on techniques for enhancingspeech,”
International Journal of Computer Applications , vol. 179,no. 17, pp. 1–14, 2018.[243] Y. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based lip-to-speech synthesis using convolutional neural networks,” in
Proc. of IW-FCV , 2019.[244] K. Tan, Y. Xu, S.-X. Zhang, M. Yu, and D. Yu, “Audio-visual speechseparation and dereverberation with a two-stage multimodal network,”
IEEE Journal of Selected Topics in Signal Processing , vol. 14, no. 3,pp. 542–553, 2020.[245] T. Tieleman and G. Hinton, “Lecture 6.5 - RmsProp: Divide thegradient by a running average of its recent magnitude,” COURSERA:Neural Networks for Machine Learning, 2012.[246] C. Tomasi and T. Kanade, “Detection and tracking of point features,”
Technical Report CMU-CS-91-132 , 1991.[247] S. Uttam, Y. Kumar, D. Sahrawat, M. Aggarwal, R. R. Shah, D. Mahata,and A. Stent, “Hush-hush speak: Speech reconstruction using silentvideos,” in
Proc. of Interspeech , 2019.[248] V. Vaillancourt, C. Laroche, C. Mayer, C. Basque, M. Nali, A. Eriks-Brophy, S. D. Soli, and C. Gigu`ere, “Adaptation of the HINT (hearingin noise test) for adult Canadian Francophone populations,”
Interna-tional Journal of Audiology , vol. 44, no. 6, pp. 358–361, 2005.[249] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Proc. of NIPS , 2017.[250] P. Verma and P. K. Das, “i-Vectors in speech processing applications:A survey,”
International Journal of Speech Technology , vol. 18, no. 4,pp. 529–546, 2015.[251] K. Vesel`y, S. Watanabe, K. ˇZmol´ıkov´a, M. Karafi´at, L. Burget, andJ. H. ˇCernock`y, “Sequence summarizing neural network for speakeradaptation,” in
Proc. of ICASSP , 2016.[252] E. Vincent, R. Gribonval, and C. F´evotte, “Performance measurementin blind audio source separation,”
IEEE Transactions on Audio, Speech,and Language Processing , vol. 14, no. 4, pp. 1462–1469, 2006.[253] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Aneural image caption generator,” in
Proc. of CVPR , 2015.[254] P. Viola and M. J. Jones, “Robust real-time face detection,”
Interna-tional Journal of Computer Vision , vol. 57, no. 2, pp. 137–154, 2004.[255] W. D. Voiers, “Evaluating processed speech using the diagnostic rhymetest,”
Speech Technology , pp. 30–39, 1983.[256] K. Vougioukas, P. Ma, S. Petridis, and M. Pantic, “Video-drivenspeech reconstruction using generative adversarial networks,” in
Proc.of Interspeech , 2019.[257] K. Wagener, T. Brand, and B. Kollmeier, “Entwicklung und evaluationeines satztests in deutscher sprache - Teil II: Optimierung des Olden-burger satztests,”
Zeitschrift f¨ur Audiologie , no. 38, pp. 44–56, 1999.[258] ——, “Entwicklung und evaluation eines satztests in deutscher sprache- Teil III: Evaluierung des Oldenburger satztests,”
Zeitschrift f¨urAudiologie , no. 38, pp. 86–95, 1999.[259] K. Wagener, V. K¨uhnel, and B. Kollmeier, “Entwicklung und evaluationeines satztests in deutscher sprache - Teil I: Design des Oldenburgersatztests,”
Zeitschrift f¨ur Audiologie , no. 38, pp. 4–15, 1999.[260] D. L. Wang and G. J. Brown,
Computational Auditory Scene Analysis:Principles, Algorithms, and Applications . Wiley-IEEE Press, 2006.[261] D. L. Wang and J. Chen, “Supervised speech separation based on deeplearning: An overview,”
IEEE/ACM Transactions on Audio, Speech,and Language Processing , 2018.[262] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey,R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “VoiceFilter: Tar-geted voice separation by speaker-conditioned spectrogram masking,”
Proc. Interspeech , 2019.[263] W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, “A robust audio-visual speech enhancement model,” in
Proc. of ICASSP , 2020. [264] Y. Wang, A. Narayanan, and D. L. Wang, “On training targets forsupervised speech separation,” IEEE/ACM Transactions on Audio,Speech and Language Processing , vol. 22, no. 12, pp. 1849–1858,2014.[265] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al. , “Tacotron: Towards end-to-end speech synthesis,”
Proc. of Interspeech , 2017.[266] Z.-Q. Wang, “Deep learning based array processing for speech separa-tion, localization, and recognition,” Ph.D. dissertation, The Ohio StateUniversity, 2020.[267] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deepclustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in
Proc. of ICASSP , 2018.[268] D. Ward, H. Wierstorf, R. D. Mason, E. M. Grais, and M. D. Plumbley,“BSS Eval or PEASS? Predicting the perception of singing-voiceseparation,” in
Proc. of ICASSP , 2018.[269] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discrim-inatively trained recurrent neural networks for single-channel speechseparation,” in
Proc. of GlobalSIP , 2014.[270] P. J. Werbos, “Backpropagation through time: what it does and how todo it,”
Proceedings of the IEEE , vol. 78, no. 10, pp. 1550–1560, 1990.[271] D. S. Williamson, Y. Wang, and D. L. Wang, “Complex ratio maskingfor monaural speech separation,”
IEEE/ACM Transactions on Audio,Speech and Language Processing , vol. 24, no. 3, pp. 483–492, 2016.[272] L. L. Wong and S. D. Soli, “Development of the Cantonese hearing innoise test (CHINT),”
Ear and Hearing , vol. 26, no. 3, pp. 276–289,2005.[273] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu,“Time domain audio visual speech separation,” in
Proc. of ASRU , 2019.[274] Z. Wu, S. Sivadas, Y. K. Tan, M. Bin, and R. S. M. Goh, “Multi-modalhybrid deep neural network for speech enhancement,” arXiv preprintarXiv:1606.04750 , 2016.[275] R. Xia, J. Li, M. Akagi, and Y. Yan, “Evaluation of objective intelli-gibility prediction measures for noise-reduced signals in Mandarin,” in
Proc. of ICASSP , 2012.[276] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-levelaggregation for speaker recognition in the wild,” in
Proc. of ICASSP ,2019.[277] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov,R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image captiongeneration with visual attention,” in
Proc. of ICML , 2015.[278] X. Xu, B. Dai, and D. Lin, “Recursive visual sound separation usingminus-plus net,” in
Proc. of ICCV , 2019.[279] Y. Xu, M. Yu, S.-X. Zhang, L. Chen, C. Weng, J. Liu, and D. Yu,“Neural spatio-temporal beamformer for target speech separation,”
Proc. of Interspeech (to appear) , 2020.[280] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, “Quantitative associationof vocal-tract and facial behavior,”
Speech Communication , vol. 26,no. 1, 1998.[281] D. Yin, C. Luo, Z. Xiong, and W. Zeng, “PHASEN: A phase-and-harmonics-aware speech enhancement network,” in
Proc. of AAAI ,2020.[282] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invarianttraining of deep models for speaker-independent multi-talker speechseparation,” in
Proc. of ICASSP , 2017.[283] A. A. Zekveld, S. E. Kramer, and J. M. Festen, “Pupil responseas an indication of effortful listening: The influence of sentenceintelligibility,”
Ear and Hearing , vol. 31, no. 4, pp. 480–490, 2010.[284] C. Zhang, K. Koishida, and J. H. Hansen, “Text-independent speakerverification based on triplet convolutional neural network embeddings,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing ,vol. 26, no. 9, pp. 1633–1644, 2018.[285] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-endattention based text-dependent speaker verification,” in
Proc. of SLT ,2016.[286] G. Zhao, M. Barnard, and M. Pietikainen, “Lipreading with localspatiotemporal descriptors,”
IEEE Transactions on Multimedia , vol. 11,no. 7, pp. 1254–1265, 2009.[287] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,”in
Proc. of ICCV , 2019.[288] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, andA. Torralba, “The sound of pixels,” in
Proc. of ECCV , 2018.[289] H. Zhu, M. Luo, R. Wang, A. Zheng, and R. He, “Deep audio-visuallearning: A survey,” arXiv preprint arXiv:2001.04758 , 2020.[290] L. Zhu and E. Rahtu, “Separating sounds from a single image,” arXivpreprint arXiv:2007.07984 , 2020. [291] ——, “Visually guided sound source separation using cascaded oppo-nent filter network,” arXiv preprint arXiv:2006.03028 , 2020.[292] E. Zwicker and H. Fastl,