CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Alexander Chao-Fu Kang, Kuo-Hsuan Hung, Yu-Wen Chen, You-Jin Li, Ya-Hsin Lai, Kai-Chun Liu, Sze-Wei Fu, Syu-Siang Wang, Yu Tsao, Senior Member, IEEE
Abstract—In this paper, we present a deep learning-based speech signal-processing mobile application, CITISEN, which can perform three functions: speech enhancement (SE), acoustic scene conversion (ASC), and model adaptation (MA). For SE, CITISEN can effectively reduce noise components from speech signals and accordingly enhance their clarity and intelligibility. For ASC, CITISEN can convert the current background sound to a different background sound. Finally, for MA, CITISEN can effectively adapt an SE model, with a few audio files, when it encounters unknown speakers or noise types; the adapted SE model is used to enhance the upcoming noisy utterances. Experimental results confirmed the effectiveness of CITISEN in performing these three functions via objective evaluation and subjective listening tests. The promising results reveal that the developed CITISEN mobile application can potentially be used as a front-end processor for various speech-related services such as voice communication, assistive hearing devices, and virtual reality headsets.
Index Terms—speech enhancement, deep learning, model adaptation, acoustic scene conversion.
I. INTRODUCTION
In recent years, a wide variety of speech-related applications have been developed. Most of these applications have been highly convenient for human-human and human-machine communications. However, the following long-existing and critical issue, which may notably limit the achievable performance of these applications, remains to be solved: speech distortions caused by additive/convolutional noises and channel/device effects [1]–[6]. Identifying an effective method of addressing this distortion issue is a critical and challenging task, and numerous approaches have been proposed to this end; among these approaches, speech enhancement (SE) is notable. The goal of SE is to transform noisy speech into enhanced speech with improved quality and intelligibility [7], [8]. In the past several decades, SE has been widely used as a front-end unit in many voice-based applications such as automatic speech recognition [9], [10], speaker recognition [11], speech coding [12], hearing aids [13], [14], and cochlear implants [15], [16]. Existing SE methods can be roughly divided into three classes. SE methods in the first class design a filter or gain function to attenuate noise components; notable techniques include the Wiener filter and its extensions [17]–[19], such as the minimum mean square error spectral estimator (MMSE) [20]–[22], maximum a posteriori spectral amplitude estimator (MAPA) [23], [24], and maximum likelihood spectral amplitude estimator (MLSA) [25], [26]. SE methods in the second class adapt speech models to extract pure speech signals from noisy inputs; well-known methods include harmonic models [27], linear prediction (LP) models [28], [29], and hidden Markov models [30]. SE methods of the first and second classes share a common limitation: the inability to effectively handle non-stationary noise signals of real-world scenarios under unexpected acoustic conditions. SE methods in the third class are based on machine-learning algorithms; these methods typically prepare a model for noisy-to-clean transformation in a data-driven manner without imposing strong statistical constraints. Notable SE methods belonging to this class include non-negative matrix factorization [31]–[33], compressive sensing [34], sparse coding [35], [36], and robust principal component analysis (RPCA) [37].

An artificial neural network (ANN), as a successful machine-learning model, has also been used for SE because of its powerful nonlinear transformation capability. In [38]–[41], a shallow ANN is used to map noisy speech signals to clean ones. More recently, various types of ANNs featuring deep structures have been used for SE (e.g., deep recurrent neural networks and long short-term memory (LSTM) networks [42], [43], convolutional neural networks [44], and deep feedforward neural networks [45], [46]). Although the effectiveness of these deep-learning-based SE approaches has been verified, their performance in a mobile application is yet to be confirmed. In this paper, we present our developed speech signal-processing mobile application, CITISEN, which supports SE to improve speech quality and intelligibility. Based on SE, two extended functions, acoustic scene conversion (ASC) and model adaptation (MA), are also implemented in CITISEN. We conducted a series of experiments to verify the effectiveness of these three functions. Two standard measurement methods, perceptual evaluation of speech quality (PESQ) [47] and short-time objective intelligibility (STOI) [48], were used to test the SE and MA.
Experimental results confirm the effectiveness of the SE and MA with notable PESQ and STOI score improvements. Further, we conducted listening tests for intelligibility and acoustic scene identification to test the ASC performance. The results reveal that the intelligibility scores did not drop significantly after the ASC was performed on the original noisy speech, and the converted scene could be accurately identified.

The remainder of this paper is organized as follows. Section II reviews related works. Section III presents the functions and user interface of the CITISEN application. Section IV presents the experimental setup and results. Finally, Section V presents the conclusions of this study.

Fig. 1. Traditional filter-based SE architecture. FFT and IFFT denote the fast Fourier transform and inverse FFT, respectively.

Fig. 2. The DDAE-based SE architecture.
II. RELATED WORKS

In this section, we first review the traditional filter-based SE method, which will be used for comparisons in the experiments. Then, we introduce the deep denoising autoencoder (DDAE)-based and fully convolutional network (FCN)-based SE methods, which are used as the default SE models in CITISEN.
A. Traditional Gain Function-Based SE Method
For the SE task, we generally assume that the noisy speech signal $y[n]$ contains a clean speech signal $s[n]$ and a noise signal $v[n]$:

$$y[n] = s[n] + v[n], \quad (1)$$

where $n$ is a time index. For the MMSE SE approach, the time-domain signal $y[n]$ is first converted to a spectral feature $Y[m, l]$ by a short-time Fourier transform (STFT), where $m$ and $l$ denote the $m$-th frequency bin and the $l$-th frame in the entire set of noisy spectral features, $Y$. From Eq. (1), $Y[m, l]$ can be expressed as

$$Y[m, l] = S[m, l] + V[m, l]. \quad (2)$$

By estimating a priori SNR and a posteriori SNR statistics based on a noise-estimation approach [49], we can estimate a gain function $G[m, l]$. The enhanced speech, $\hat{S}[m, l]$, is obtained by filtering $Y[m, l]$ through $G[m, l]$. Finally, an inverse FFT (IFFT) is applied to convert $\hat{S}[m, l]$ to $\hat{s}[n]$, as shown in Fig. 1.
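To make this pipeline concrete, the following minimal Python sketch implements a generic spectral-gain enhancer with a decision-directed a priori SNR estimate and a Wiener-type gain. The initial noise estimate, smoothing constant, and window settings are illustrative assumptions rather than the exact estimator of [49].

```python
import numpy as np
from scipy.signal import stft, istft

def gain_based_se(y, fs=16000, n_fft=512, hop=256, alpha=0.98, noise_frames=6):
    # STFT: Y[m, l] as in Eq. (2)
    _, _, Y = stft(y, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Y) ** 2
    # Crude initial noise PSD estimate from the first few (assumed speech-free) frames.
    noise_psd = power[:, :noise_frames].mean(axis=1)
    S_hat = np.zeros_like(Y)
    prev_power = np.zeros(Y.shape[0])
    for l in range(Y.shape[1]):
        post_snr = power[:, l] / np.maximum(noise_psd, 1e-10)            # a posteriori SNR
        prio_snr = alpha * prev_power / np.maximum(noise_psd, 1e-10) \
                   + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)       # decision-directed a priori SNR
        G = prio_snr / (1.0 + prio_snr)                                  # Wiener-type gain G[m, l]
        S_hat[:, l] = G * Y[:, l]
        prev_power = np.abs(S_hat[:, l]) ** 2
    # Inverse transform back to the time domain (Fig. 1).
    _, s_hat = istft(S_hat, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat[: len(y)]
```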
B. Deep Learning-Based SE Method

In the CITISEN application, we included two deep learning-based SE methods: DDAE and FCN. These two methods have been confirmed to yield promising results in several SE tasks [50]–[52].
1) Deep Denoising Autoencoders:
The DDAE model was first applied to SE in [50]. During training, noisy-clean speech pairs are used to compute the mapping function from noisy to clean spectral (logarithmic amplitude in this study) features. The aim of a DDAE is to transform the noisy speech signal to a clean speech signal by minimizing the reconstruction error between the predicted spectral features $\hat{S}$ and the reference clean spectral features $S$, such that

$$\theta^{*} = \arg\min_{\theta} \, E(\theta) + \rho C(\theta), \quad (3)$$

with

$$E(\theta) = \| \phi(Y) - S \|_{F}, \quad (4)$$

where $\rho$ is a constant that controls the tradeoff between the reconstruction accuracy and the regularization term $C(\theta)$ [53], and $\phi(\cdot)$ denotes the transformation function of the DDAE. Given noisy spectral features, the DDAE estimates clean speech by

$$
\begin{aligned}
h_{1}(Y[l]) &= \sigma(W_{1} Y[l] + b_{1}), \\
&\;\;\vdots \\
h_{D-1}(Y[l]) &= \sigma(W_{D-1} h_{D-2}(Y[l]) + b_{D-1}), \\
\hat{S}[l] &= W_{D} h_{D-1}(Y[l]) + b_{D}, \quad (5)
\end{aligned}
$$

where $Y[l]$ and $\hat{S}[l]$ are the $l$-th spectral feature vectors of the input noisy and estimated clean spectral features, respectively; $W_{1}, \ldots, W_{D}$ and $b_{1}, \ldots, b_{D}$ are the weight matrices and bias vectors, respectively; and $\sigma$ is the vector-wise non-linear activation function. To incorporate contextual information, we may concatenate several frames of feature vectors to form the input and output for training the DDAE model. During testing, noisy speech signals are processed by the trained DDAE model to reconstruct the enhanced speech signals [50].
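A compact sketch of this architecture, assuming PyTorch and illustrative layer sizes (the hidden width, depth, context window, and weight-decay value are not taken from the paper), could look as follows; the weight-decay term here plays the role of the regularizer $\rho C(\theta)$ in Eq. (3).

```python
import torch
import torch.nn as nn

class DDAE(nn.Module):
    def __init__(self, n_freq=257, context=5, hidden=512, depth=3):
        super().__init__()
        dim_in = n_freq * context                  # concatenated context frames
        layers = []
        for d in range(depth):
            layers += [nn.Linear(dim_in if d == 0 else hidden, hidden), nn.Sigmoid()]
        layers += [nn.Linear(hidden, n_freq)]      # linear output layer, as in the last line of Eq. (5)
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_frames):               # (batch, n_freq * context)
        return self.net(noisy_frames)

model = DDAE()
# Weight decay acts as a simple regularizer, analogous to rho * C(theta).
optim = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.MSELoss()                           # squared reconstruction error, cf. Eq. (4)

def train_step(noisy_batch, clean_batch):
    optim.zero_grad()
    loss = criterion(model(noisy_batch), clean_batch)
    loss.backward()
    optim.step()
    return loss.item()
```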
2) Fully Convolutional Network:
Fig. 3 shows an FCN model, which is similar to a conventional CNN, except that all the fully connected layers are removed. As reported in [51], the FCN model can deal with the high- and low-frequency components of the raw waveform at the same time. The relation between the output sample $\hat{s}[n]$ and the connected hidden nodes $R[n]$ can be represented by

$$\hat{s}[n] = Q^{\top} R[n], \quad (6)$$

where $Q \in \mathbb{R}^{q \times 1}$ denotes one of the learned filters, and $q$ is the size of the filter. For details on the structure of the FCN model for waveform enhancement, please refer to previous works [44], [51]. Using a norm-based distance between the estimated and reference waveforms, the objective function is defined as

$$L(\theta) = \frac{1}{U} \sum_{u=1}^{U} \| w_{y}(u) - w_{q}(u) \|, \quad (7)$$

where $\theta$ denotes the model parameters of the FCN, and $w_{y}(u)$ and $w_{q}(u)$ are the $u$-th estimated utterance and clean reference, respectively.

Fig. 3. The FCN-based SE architecture.
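The following minimal PyTorch sketch illustrates such a waveform-in, waveform-out FCN with an L1 training criterion; the channel count, kernel size, depth, activation, and the choice of L1 loss are illustrative assumptions, not the exact configuration of [44], [51].

```python
import torch
import torch.nn as nn

class WaveFCN(nn.Module):
    def __init__(self, channels=32, kernel=55, depth=4):
        super().__init__()
        layers = []
        for d in range(depth):
            in_ch = 1 if d == 0 else channels
            # 1-D convolutions only; no fully connected layers anywhere.
            layers += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2),
                       nn.LeakyReLU(0.1)]
        layers += [nn.Conv1d(channels, 1, kernel, padding=kernel // 2)]  # back to one waveform channel
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_wave):                 # (batch, 1, samples)
        return self.net(noisy_wave)

model = WaveFCN()
l1 = nn.L1Loss()                                   # utterance-level norm-based objective, cf. Eq. (7)
noisy = torch.randn(2, 1, 16000)                   # two dummy 1-second utterances at 16 kHz
clean = torch.randn(2, 1, 16000)
loss = l1(model(noisy), clean)
```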
C. Model Adaptation

When operating SE systems in real-world scenarios, unknown noise types and new users are often encountered. In such cases, the testing data may not be well covered by the trained SE model. The resulting differences in acoustic characteristics, i.e., training/testing mismatches, may considerably degrade the SE performance. To effectively address this mismatch issue, adaptation of the SE model is required. Thus far, various MA approaches have been proposed [54]–[59]. The main concept of MA is to adjust the parameters of a pre-trained model (prepared using training data) based on a set of adaptation data to match the testing condition.

For the SE MA task, we first need to prepare adaptation data that cover new noise types and/or speakers [60]–[62]. The parameters of the original SE model are then adjusted based on the adaptation data. Because the adapted SE model matches the testing condition, the SE performance can be improved.
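In its simplest form, this adaptation amounts to continuing training of the pre-trained SE model on a small set of adaptation pairs, as in the hypothetical sketch below; the loss, epoch count, and learning rate are assumptions, and the actual adaptation recipes follow [60]–[62].

```python
import torch

def adapt_se_model(model, adapt_noisy, adapt_clean, epochs=10, lr=1e-4):
    """Fine-tune a pre-trained SE model on paired adaptation data.

    adapt_noisy / adapt_clean: tensors of paired adaptation features or waveforms
    built from user-recorded noise and/or speaker audio.
    """
    model.train()
    # Small learning rate keeps the adapted model close to the pre-trained weights.
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        optim.zero_grad()
        loss = loss_fn(model(adapt_noisy), adapt_clean)
        loss.backward()
        optim.step()
    return model   # adapted SE model matched to the new noise/speaker condition
```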
III. CITISEN APP

In this section, we introduce the concepts of our mobile application, explain how all the functions are implemented, and demonstrate the user interface.

A. Speech Enhancement (SE) Function
SE is a major function of CITISEN. As shown in the blue block of Fig. 4, given the noisy speech, the SE function removes background noises and generates enhanced speech with improved quality and intelligibility. We train the SE models on a cloud server, and the trained models are then loaded into mobile devices. Because the models are trained on a cloud server, heavy computational resources are not required on the mobile devices. As mentioned earlier, two deep learning-based SE methods, DDAE and FCN, are implemented in CITISEN. To reduce the latency, a small window size is used when implementing these SE systems. As described in the previous section, DDAE and FCN perform SE in the spectral and raw-waveform domains, respectively.
B. Acoustic Scene Conversion (ASC) Function
Fig. 4. The SE, ASC, and MA functions in CITISEN.

Because the SE function can extract pure speech by removing background noises, the ASC is implemented based on SE. After SE extracts pure speech from noisy speech, ASC mixes the pure speech with a new background noise; in other words, we can artificially convert the acoustic scene of the original audio. The overall ASC function is illustrated in the orange block of Fig. 4. The main concept of ASC is similar to changing the background of an image or a video [63], and ASC is a new topic in the speech signal research field. Based on our literature survey, there is no standard method to evaluate this task. Therefore, we invited human listeners to conduct listening tests. Our goal is not only to mix clean speech with a new acoustic scene but also to ensure that the same levels of clarity and intelligibility are maintained. Accordingly, we designed two listening tests: one for the speech intelligibility scores and the other for the scene identification rate (SIR) of the original/converted acoustic scenes.
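The mixing step can be summarized by the standard SNR scaling relation; the sketch below, with illustrative function and variable names, adds the selected background noise to the enhanced speech at a user-chosen SNR (the volume bar in the app serves this purpose).

```python
import numpy as np

def convert_scene(enhanced, new_noise, snr_db=5.0):
    # Loop or trim the new background noise to the length of the enhanced speech.
    reps = int(np.ceil(len(enhanced) / len(new_noise)))
    noise = np.tile(new_noise, reps)[: len(enhanced)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(enhanced ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return enhanced + scale * noise
```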
C. Model Adaptation (MA) Function
The MA function of CITISEN aims to adapt the SE model to fit unknown noises and/or speakers. The procedure of MA is illustrated in the green block of Fig. 4. We provide three different MA modes: noise only, speaker only, and noise and speaker. Based on the user environment, users can choose the most suitable MA mode and then upload a short recorded audio clip to a cloud server for adapting the SE models. Our experiments reveal that the MA function notably increases the SE performance when unknown noise types are encountered, with improvements observed in both STOI and PESQ scores.
D. CITISEN User Interface and Usage
The CITISEN application has four pages: "Speech Enhancement," "Acoustic Scene Conversion," "Model Adaptation," and "Recording," as shown in Fig. 5. The page name and navigation buttons of each page are placed at the top-left and bottom of the application, respectively.

Fig. 5. Four main pages in CITISEN ("Speech Enhancement", "Acoustic Scene Conversion", "Model Adaptation", and "Recording"). The page name and the navigator buttons of each page are listed on the top-left and bottom in the application, respectively.

On the "Speech Enhancement" page, a user first specifies her/his gender identity ("Gender Identity" in Fig. 6). Then, by pressing the "SE Model Switch" button, the user can select a suitable SE model from a list of saved models. CITISEN provides several default SE models trained using our own collected speech datasets. Users can also run MA to prepare adapted SE models and save them as new SE models. Then, by pressing the SE button, the noisy speech is transformed into clean speech online.

Fig. 6. CITISEN: the "Speech Enhancement" page.

On the "Acoustic Scene Conversion" page, CITISEN mixes an acoustic scene with the enhanced speech to generate new speech signals with the converted acoustic scene; the user interface of this page is shown in Fig. 7. The "Acoustic Scene Conversion" page has a "Record Noise" button, by which users can record and save noise signals for the ASC. The page also has a volume bar, which allows users to adjust the volume of the background noise and accordingly specify the SNR level of the converted speech. To change the acoustic scene, users first press the "SE Model Switch" button to select an SE model. Then, by pressing the "Background Noise Switch" button, as shown on the left side of Fig. 7, an acoustic scene selection window pops up and lists all the acoustic scene options, as shown on the right side of Fig. 7. Users can select the target scene for the ASC, and the speech with the converted scene will be generated accordingly.

Fig. 7. CITISEN: the "Acoustic Scene Conversion" page.

On the "Model Adaptation" page, there are two file upload buttons: "Record Noise" and "Record Speech," as shown on the left side of Fig. 8. By pressing one of these buttons, users can record pure noise or speaker speech signals and upload the recorded audio to our server. To start recording, users simply press one of the buttons; after finishing the recording, by pressing the button again, CITISEN pops up a submission window, as shown on the right side of Fig. 8. The submission window asks the user to name the audio file, and the audio is then sent to the server. After receiving the audio file, the server estimates an adapted SE model by fine-tuning the original SE model using the recorded audio data. The name of the audio file can also be used to name the adapted SE model, which is later sent from the server to the mobile device and appears on the "Speech Enhancement" and "Acoustic Scene Conversion" pages. Accordingly, users can run the SE and ASC functions using the adapted SE model.

Fig. 8. CITISEN: the "Model Adaptation" page.

The "Recording" page is used for users to record speech and noise in the current environment and to save the enhanced or converted audio files. On the "Speech Enhancement" and "Acoustic Scene Conversion" pages, users can immediately listen to enhanced or converted speech online; the "Recording" page, on the other hand, allows users to save the processed audio files and play them back later. Users first record (upper path in Fig. 9) or load an existing (bottom path in Fig. 9) audio file and then press the "SE Model Switch" button. Then, an SE model selection window pops up, as shown on the right side of Fig. 10. By selecting a suitable SE model and then pressing the run button (as shown on the left side of Fig. 10), enhanced speech is generated. CITISEN displays two spectrogram plots, for the noisy and enhanced speech (as shown on the right side of Fig. 11), so that users can visually check the SE results. In addition to these two plots, users can press the "Play" and "Stop" buttons on top of the spectrogram plots to play and listen to the original and processed audio files.

Fig. 9. CITISEN: the "Recording" page (recording or loading saved audio files).

Fig. 10. CITISEN: the "Recording" page (selecting a model to perform SE).

Fig. 11. CITISEN: the "Recording" page (demonstrating the processed speech by spectrogram plots).

IV. EXPERIMENTS
A. Experimental Setup
We conducted three sets of experiments. First, we tested the performance of the SE and ASC functions using STOI and PESQ metrics and listening tests. Next, we conducted a listening test to examine the intelligibility and SIR of the speech before and after ASC. Finally, as mentioned earlier, we implemented the MA function by fine-tuning the original SE model to fit unseen noise types and new speakers. Accordingly, we obtained three sets of results for MA followed by SE (termed MA+SE): "MA+SE(N)", "MA+SE(S)", and "MA+SE(N+S)", denoting model adaptation on noise type, speaker, and both noise type and speaker, respectively.

In this study, TMHINT utterances [64] were used to prepare the training and testing sets. More specifically, the training set was prepared using speech utterances from six speakers, three males and three females. Each speaker read 200 TMHINT utterances in a quiet room, amounting to a total of 1200 clean utterances. Noisy utterances were generated by artificially contaminating these 1200 clean training utterances with randomly sampled noise types from a 100-noise-type dataset [65] at 8 different SNR levels. Consequently, 48000 noisy-clean utterance pairs were obtained. To construct the testing set, we used the speech utterances of another two speakers (one male and one female, termed testing speakers in the following discussion), with 120 utterances for each speaker. We generated noisy utterances by artificially contaminating these clean utterances with another set of 5 noise types (car, sea wave, take-off, train, and song) at 4 different SNR levels. Notably, the speakers, speech contents, and noise types were different between the training and testing sets. All the training and testing utterances were recorded at a 16 kHz sampling rate in a 16-bit format. The hyper-parameters for both the DDAE and FCN SE models were as follows: the number of training epochs was 40, the batch size was 1, and the optimizer was Adam with a learning rate of 0.001. A validation set was prepared and used to determine the best model configurations for SE. To avoid unstable communication and computation, we conducted the experiments offline. More specifically, we ran CITISEN to obtain the processed speech. Subsequently, objective evaluations and listening tests were conducted using the processed speech offline.

We first tested the performance of the SE and ASC functions using both objective evaluations and subjective listening tests. For the objective evaluations, the PESQ [66] and STOI [67] metrics were used (a minimal scoring sketch is given at the end of this subsection). PESQ was designed to evaluate the quality of the processed speech, and the score ranges from -0.5 to 4.5. A higher PESQ score indicates that the enhanced speech is closer to the clean speech. On the other hand, STOI was designed to compute the speech intelligibility, and the score ranges from 0 to 1. A higher STOI score indicates better speech intelligibility.

To evaluate the SE function, we tested the performance of the DDAE and FCN SE models using the STOI and PESQ scores. The MMSE approach, which is a well-known traditional SE method, was also tested for comparison. For the listening tests, we recruited twenty participants (40% males), aged between 20 and 38 years with a mean age of 21.50 (standard deviation; SD = 3.97). All the participants were native Mandarin speakers with normal hearing abilities and were therefore able to effectively perceive the stimuli during the test.
Each participant listened to only 80 testing utterances (40 at 0 dB SNR and 40 at 5 dB SNR) spoken by one male and one female testing speaker. These 80 sentences had different contents, and each consisted of 10 Chinese characters with one of the 5 assigned background noises (car, sea wave, take-off, train, and song). During testing, each participant was asked to listen and respond to 40 lower-SNR tasks, followed by 40 higher-SNR tasks, under four conditions (original noisy (denoted as Noisy in the following discussion), MMSE, DDAE, and FCN). To evaluate the SE function, the subjects were instructed to verbally repeat what they had heard and were allowed to perceive each stimulus at most twice. The character correct rate (CCR) was used as the evaluation metric; the CCR was calculated by dividing the number of correctly identified characters by the total number of characters under each test condition.

To test the ASC function, we requested the listeners to identify one out of six acoustic scenes after listening to the converted or original noisy speech. The original and converted noisy utterances were all at 5 dB SNR. Twenty participants, all native Mandarin speakers with normal hearing, were recruited to participate in this set of listening tests. During the tests, each participant was asked to listen to 80 utterances, where each utterance was first processed by one of the three SE methods (i.e., MMSE, DDAE, and FCN) and then mixed with a different noise type at 5 dB SNR. In each task, the SIRs were calculated from the participants' identification results with respect to the ground-truth assigned background noises.

Finally, we evaluated the performance of the MA function. Based on the recorded pure noise and speaker speech signals, we performed MA in three modes, termed MA(N), MA(S), and MA(N+S). For MA(N), the recorded noise signals were mixed with the clean training speech (from the training set) to form new noisy-clean speech pairs, which were then used to fine-tune the SE model. For MA(S), the recorded speaker speech signals were mixed with 5 pure noise signals (from the training set) to form new noisy-clean speech pairs, which were used to fine-tune the SE model. For MA(N+S), the recorded speaker speech and new noise signals were mixed to form new noisy-clean speech pairs, which were then used to fine-tune the SE model.
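For reference, a minimal objective-scoring sketch of the kind mentioned above, assuming the open-source `pesq` and `pystoi` Python packages and 16 kHz mono recordings (the file names are hypothetical), is shown below; the scores follow the definitions of [66], [67].

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read("clean.wav")         # reference clean utterance (hypothetical file name)
enhanced, _ = sf.read("enhanced.wav")    # CITISEN-processed utterance (hypothetical file name)

# Align lengths before scoring.
length = min(len(clean), len(enhanced))
clean, enhanced = clean[:length], enhanced[:length]

pesq_score = pesq(fs, clean, enhanced, "wb")             # -0.5 to 4.5, higher means better quality
stoi_score = stoi(clean, enhanced, fs, extended=False)   # 0 to 1, higher means better intelligibility
print(f"PESQ: {pesq_score:.2f}, STOI: {stoi_score:.3f}")
```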
B. Experimental Results

1) The SE experiment: This subsection presents the performance of the SE function in CITISEN. Table I presents the STOI and PESQ scores (the first and second columns, respectively) of Noisy and of the enhanced speech processed using the MMSE, DDAE, and FCN methods. From the results, FCN provides the highest PESQ and STOI scores among the four methods, which is consistent with the findings presented in our previous study [51].

Table II presents the subjective listening test results for Noisy and the three SE methods. From the table, it can be observed that MMSE yields lower CCRs than Noisy for both 0 dB and 5 dB SNRs, which is consistent with the findings of previous research; in other words, although traditional SE methods effectively remove background noise, speech intelligibility may be degraded. Next, FCN outperforms DDAE and achieves CCRs that are comparable to Noisy. These results are consistent with the STOI results reported in Table I. We further conducted an independent t-test to verify the significance of the testing results. The independent t-test results confirm that the average CCRs of FCN are significantly better than those of Noisy and the other two SE methods (with p < .01) for both the 0 dB and 5 dB SNR conditions.

TABLE I. Average STOI and PESQ scores for Noisy and the three SE methods over the tested SNR conditions. Noisy denotes the results of the original noisy speech without performing SE.

TABLE II. Average speech recognition results (CCRs in %) for Noisy and the three SE methods at 0 dB and 5 dB SNR conditions.
2) The ASC experiment:
In this subsection, we present the evaluation results of the ASC function in CITISEN. As mentioned earlier, the ASC consists of two parts: SE, followed by mixing a new background noise with the enhanced speech. In our implementation, after performing SE, a particular noise type is added to the enhanced speech to generate new noisy speech with a converted background. To avoid the fatigue effect, we only tested the 5 dB SNR condition. In this way, each subject listened to 80 utterances, repeated what they heard, and was asked to indicate one out of six background scenes. Based on the three SE methods, namely MMSE, DDAE, and FCN, three sets of ASC speech were obtained, denoted as ASC(MMSE), ASC(DDAE), and ASC(FCN), respectively. The recruited participants then listened to these three sets of ASC speech and responded to the speech contents and the acoustic scene in the background. Table III lists the CCR (in %) and SIR (in %) results of these three setups.

From Table III, we first note that ASC(MMSE), ASC(DDAE), and ASC(FCN) give similar CCR scores. It is also noted that the CCRs are not significantly degraded by running the ASC, as compared to the CCR of Noisy reported in Table II. Furthermore, ASC(FCN) yields a higher SIR than both ASC(MMSE) and ASC(DDAE), suggesting that the FCN serves as a better SE model for the ASC function.

TABLE III. The scores of CCR (in %) and SIR (in %) based on the ASC function in CITISEN.
3) The MA+SE experiment:
Next, we investigated the effectiveness of the MA function. For this set of experiments, we used two other noise types (machine beeping and air flowing) from a real hospital scenario; these noise types are significantly different from those in the training set. Table IV presents the STOI and PESQ scores of MA(N)+SE, MA(S)+SE, and MA(N+S)+SE, where the FCN model is used as the SE model in this set of experiments. From Table IV, it can be seen that SE yields higher STOI and PESQ scores than Noisy, thereby confirming that the SE model used in CITISEN can improve speech quality and intelligibility over noisy speech even though the noise types are unknown and greatly different from those used in the training set. Next, as compared with SE (without MA), all three MA approaches are capable of achieving higher PESQ and STOI scores; MA(N)+SE, MA(S)+SE, and MA(N+S)+SE each yielded noticeable relative improvements over SE (FCN) alone in terms of both STOI and PESQ. The results obtained therefore confirm the effectiveness of the MA function. Moreover, MA(N)+SE gives higher scores than MA(S)+SE, suggesting that noise-type adaptation is more effective in improving SE performance. Finally, MA(N+S)+SE outperforms both MA(N)+SE and MA(S)+SE in terms of STOI, showing that intelligibility improvements can be attained by adapting the SE model based on both noise and speaker information.

TABLE IV. Average STOI and PESQ scores for different SE models over -2, 0, 2, and 5 dB SNR conditions. Noisy denotes the results of the original noisy speech without performing SE, and SE denotes the FCN-based SE results. MA(N)+SE, MA(S)+SE, and MA(N+S)+SE denote the results of SE with the SE model adapted using recorded noise, speaker, and noise + speaker audio files, respectively.
4) Qualitative Analyses:
Finally, we present examples of CITISEN-processed speech in Fig. 12. Figs. 12(a), (b), (c), and (d) depict the spectrogram and waveform plots of the clean, noisy, enhanced, and ASC speech signals, respectively. For each sub-figure in Fig. 12, the left column depicts the spectrogram, while the right column depicts the associated waveform. In this example, car noise was used to contaminate the clean speech to produce the noisy speech, and a new train background noise was used as the converted noise for the enhanced speech to perform the ASC.

The enhanced spectrogram illustrated in Fig. 12(c) preserves several harmonic structures of the clean speech when compared with those presented in Fig. 12(a). In addition, when comparing the waveforms in Figs. 12(a), (b), and (c), the enhanced waveform presented in Fig. 12(c) contains only small residual noise components. Both observations demonstrate the effectiveness of CITISEN in reducing the noise from the noisy input while preserving detailed speech structures. In contrast, the spectra presented in Fig. 12(d) clearly illustrate different noise patterns compared with those presented in Fig. 12(b). This result qualitatively confirms that CITISEN is capable of effectively performing the ASC task.

V. CONCLUSION
In this paper, we presented a speech signal-processing mobile application called CITISEN, comprising three main functions: SE, ASC, and MA. CITISEN allows users to run SE and ASC on input speech and immediately obtain enhanced and converted speech, respectively. Experimental results first confirmed that the SE function provides improved STOI and PESQ scores. Next, the effectiveness of the ASC function was verified based on listening tests. Finally, the MA function was confirmed to provide notable STOI and PESQ improvements as compared with the results without MA. To the best of our knowledge, this study is the first attempt at an ASC function based on an SE-plus-added-noise strategy, and it is worthy of further investigation. Moreover, we confirmed the effectiveness of the MA function using recorded noise and speaker audio files online. In this study, we only reported the results of DDAE and FCN for the SE function. In fact, CITISEN can incorporate other SE models with novel architectures, such as transformers [68], [69], and advanced objective functions, such as those based on the STOI or PESQ metrics [70]. Users can choose suitable SE models based on the use scenarios. The experimental results confirm the feasibility of implementing SE and several extended functions on mobile devices. Moreover, it is verified that CITISEN can be suitably used as an effective front-end processor for various speech-related applications.
REFERENCES

[1] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
[2] A. L. Giraud, S. Garnier, C. Micheyl, G. Lina, A. Chays, and S. Chéry-Croze, "Auditory efferents involved in speech-in-noise intelligibility," Neuroreport, vol. 8, no. 7, pp. 1779–1783, 1997.
[3] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. WASPAA, pp. 1–4, 2013.
[4] R. Beutelmann and T. Brand, "Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners," The Journal of the Acoustical Society of America, vol. 120, no. 1, pp. 331–342, 2006.
[5] H. J. Steeneken and T. Houtgast, "A physical method for measuring speech-transmission quality," The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
[6] A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 3, pp. 190–202, 1996.
[7] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[8] J. Benesty, S. Makino, and J. Chen, Speech Enhancement. Springer Science & Business Media, 2005.
[9] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press, 2015.
[10] A. El-Solh, A. Cuhadar, and R. A. Goubran, "Evaluation of speech enhancement techniques for speaker identification in noisy environments," in Proc. ISM, pp. 235–239, 2007.
[11] J. Li, L. Yang, J. Zhang, Y. Yan, Y. Hu, M. Akagi, and P. C. Loizou, "Comparative intelligibility investigation of single-channel noise-reduction algorithms for Chinese, Japanese, and English," The Journal of the Acoustical Society of America, vol. 129, no. 5, pp. 3291–3301, 2011.
[12] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, "Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication," Speech Communication, vol. 53, no. 5, pp. 677–689, 2011.
[13] T. Venema, "Compression for clinicians, chapter 7," The Many Faces of Compression. Thomson Delmar Learning, 2006.
[14] H. Levit, "Noise reduction in hearing aids: An overview," J. Rehabil. Res. Develop., vol. 38, no. 1, pp. 111–121, 2001.
[15] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568–1578, 2016.
[16] F. Chen, Y. Hu, and M. Yuan, "Evaluation of noise reduction methods for sentence recognition by Mandarin-speaking cochlear implant listeners," Ear and Hearing, vol. 36, no. 1, pp. 61–71, 2015.
[17] P. Scalart et al., "Speech enhancement based on a priori signal to noise estimation," in Proc. ICASSP, vol. 2, pp. 629–632, 1996.
[18] E. Hänsler and G. Schmidt, Topics in Acoustic Echo and Noise Control: Selected Methods for the Cancellation of Acoustical Echoes, the Reduction of Background Noise, and Speech Processing. Springer Science & Business Media, 2006.
[19] J. Chen, J. Benesty, Y. A. Huang, and E. J. Diethorn, "Springer handbook of speech processing," pp. 843–872, Springer, 2008.
[20] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[21] T. F. Quatieri and R. J. McAulay, "Shape invariant time-scale and pitch modification of speech," IEEE Transactions on Signal Processing, vol. 40, no. 3, pp. 497–510, 1992.
[22] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[23] S. Suhadi, C. Last, and T. Fingscheidt, "A data-driven approach to a priori SNR estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 186–195, 2010.
[24] T. Lotter and P. Vary, "Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model," EURASIP Journal on Advances in Signal Processing, vol. 2005, no. 7, p. 354850, 2005.
[25] U. Kjems and J. Jensen, "Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement," in Proc. EUSIPCO, pp. 295–299, 2012.
[26] R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.
[27] R. Frazier, S. Samsam, L. Braida, and A. Oppenheim, "Enhancement of speech by adaptive filtering," in Proc. ICASSP, vol. 1, pp. 251–253, 1976.
[28] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
[29] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 3, pp. 247–254, 1979.
[30] L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[31] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001.
[32] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, "Speech denoising using nonnegative matrix factorization with priors," in Proc. ICASSP, 2008.
[33] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[34] J.-C. Wang, Y.-S. Lee, C.-H. Lin, S.-F. Wang, C.-H. Shih, and C.-H. Wu, "Compressive sensing-based speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2122–2131, 2016.
[35] J. Eggert and E. Korner, "Sparse coding and NMF," in Proc. IJCNN, 2004.
[36] Y.-H. Chin, J.-C. Wang, C.-L. Huang, K.-Y. Wang, and C.-H. Wu, "Speaker identification using discriminative features and sparse representation," IEEE Transactions on Information Forensics and Security, vol. 12, pp. 1979–1987, 2017.
[37] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
[38] S. Tamura, "An analysis of a noise reduction neural network," in Proc. ICASSP, pp. 2001–2004, 1989.
[39] F. Xie and D. Van Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. ICASSP, vol. 2, pp. II-53, 1994.
[40] E. A. Wan and A. T. Nelson, "Networks for speech enhancement," Handbook of Neural Networks for Speech Processing. Artech House, Boston, USA, vol. 139, p. 1, 1999.
[41] J. Tchorz and B. Kollmeier, "SNR estimation based on amplitude modulation analysis with applications to noise suppression," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184–192, 2003.
[42] A. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," 2012.
[43] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. ICASSP, pp. 6822–6826, 2013.
[44] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in Proc. APSIPA ASC, pp. 006–012, 2017.
[45] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[46] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[47] ITU-T Recommendation, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[48] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[49] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in Proc. INTERSPEECH, pp. 3768–3772, 2016.
[50] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH, pp. 436–440, 2013.
[51] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[52] S. Gong, Z. Wang, T. Sun, Y. Zhang, C. D. Smith, L. Xu, and J. Liu, "Dilated FCN: Listening longer to hear better," in Proc. WASPAA, pp. 254–258, 2019.
[53] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. 12, 2010.
[54] S. Chopra, S. Balakrishnan, and R. Gopalan, "DLID: Deep learning for domain adaptation by interpolating between domains," in Proc. ICML, vol. 2, 2013.
[55] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," arXiv preprint arXiv:1703.03400, 2017.
[56] R. Laroche and M. Barlier, "Transfer reinforcement learning with shared dynamics," in Proc. AAAI, 2017.
[57] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," J. Mach. Learn. Res., vol. 17, pp. 2096–2030, 2016.
[58] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Proc. NeurIPS, 2014.
[59] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[60] S. Wang, W. Li, S. M. Siniscalchi, and C.-H. Lee, "A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers," in Proc. ICASSP, pp. 6219–6223, 2020.
[61] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, "Noise adaptive speech enhancement using domain adversarial training," in Proc. INTERSPEECH, 2019.
[62] C.-C. Lee, Y.-C. Lin, H.-T. Lin, H.-M. Wang, and Y. Tsao, "SERIL: Noise adaptive speech enhancement using regularization-based incremental learning," arXiv preprint arXiv:2005.11760, 2020.
[63] M. Seki, H. Fujiwara, and K. Sumi, "A robust background subtraction method for changing background," in Proc. WACV, pp. 207–213, 2000.
[64] M. Huang, "Development of Taiwan Mandarin hearing in noise test," Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Science, 2005.
[65] G. Hu, "100 nonspeech environmental sounds," The Ohio State University, Department of Computer Science and Engineering, 2004.
[66] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, vol. 2, pp. 749–752, 2001.
[67] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[68] J. Kim, M. El-Khamy, and J. Lee, "Transformer with Gaussian weighted self-attention for speech enhancement," arXiv preprint arXiv:1910.06762, 2019.
[69] S.-W. Fu, C.-F. Liao, T.-A. Hsieh, K.-H. Hung, S.-S. Wang, C. Yu, H.-C. Kuo, R. E. Zezario, Y.-J. Li, S.-Y. Chuang, et al., "Boosting objective scores of speech enhancement model through MetricGAN post-processing," arXiv preprint arXiv:2006.10296, 2020.
[70] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," arXiv preprint arXiv:1905.04874, 2019.