CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Alexander Chao-Fu Kang, Kuo-Hsuan Hung, Yu-Wen Chen, You-Jin Li, Ya-Hsin Lai, Kai-Chun Liu, Sze-Wei Fu, Syu-Siang Wang, Yu Tsao, Senior Member, IEEE
Abstract—In this paper, we present a deep learning-based speech signal-processing mobile application, CITISEN, which can perform three functions: speech enhancement (SE), acoustic scene conversion (ASC), and model adaptation (MA). For SE, CITISEN can effectively reduce noise components from speech signals and accordingly enhance their clarity and intelligibility. For ASC, CITISEN can convert the current background sound to a different background sound. Finally, for MA, CITISEN can effectively adapt an SE model, with a few audio files, when it encounters unknown speakers or noise types; the adapted SE model is used to enhance the upcoming noisy utterances. Experimental results confirmed the effectiveness of CITISEN in performing these three functions via objective evaluation and subjective listening tests. The promising results reveal that the developed CITISEN mobile application can potentially be used as a front-end processor for various speech-related services such as voice communication, assistive hearing devices, and virtual reality headsets.
Index Terms—speech enhancement, deep learning, model adaptation, acoustic scene conversion.
I. INTRODUCTION
In recent years, a wide variety of speech-related applications have been developed. Most of these applications have been highly convenient for human-human and human-machine communications. However, the following long-existing and critical issue, which may notably limit the achievable performance of these applications, remains to be solved: speech distortions caused by additive/convolutional noises and channel/device effects [1]–[6]. Identifying an effective method of addressing this distortion issue is a critical and challenging task, and numerous approaches have been proposed to this end; among these approaches, speech enhancement (SE) is notable. The goal of SE is to transform noisy speech into enhanced speech with improved quality and intelligibility [7], [8]. In the past several decades, SE has been widely used as a front-end unit in many voice-based applications such as automatic speech recognition [9], [10], speaker recognition [11], speech coding [12], hearing aids [13], [14], and cochlear implants [15], [16]. Existing SE methods can be roughly divided into three classes. SE methods in the first class design a filter or gain function to attenuate noise components; notable techniques include the Wiener filter and its extensions [17]–[19], such as the minimum mean square error spectral estimator (MMSE) [20]–[22], maximum a posteriori spectral amplitude estimator (MAPA) [23], [24], and maximum likelihood spectral amplitude estimator (MLSA) [25], [26]. SE methods in the second class adapt speech models to extract pure speech signals from noisy inputs; well-known methods include harmonic models [27], linear prediction (LP) models [28], [29], and hidden Markov models [30]. SE methods of the first and second classes share a common limitation: the inability to effectively handle non-stationary noise signals of real-world scenarios under unexpected acoustic conditions. SE methods in the third class are based on machine-learning algorithms; these methods typically prepare a model for noisy-to-clean transformation in a data-driven manner without imposing strong statistical constraints. Notable SE methods belonging to this class include non-negative matrix factorization [31]–[33], compressive sensing [34], sparse coding [35], [36], and robust principal component analysis (RPCA) [37].

An artificial neural network (ANN), as a successful machine-learning model, has also been used for SE because of its powerful nonlinear transformation capability. In [38]–[41], a shallow ANN is used to map noisy speech signals to clean ones. More recently, various types of ANNs featuring deep structures have been used for SE (e.g., deep recurrent neural networks and long short-term memory (LSTM) networks [42], [43], convolutional neural networks [44], and deep feedforward neural networks [45], [46]). Although the effectiveness of these deep-learning-based SE approaches has been verified, their performance in a mobile application is yet to be confirmed. In this paper, we present our developed speech signal-processing mobile application, CITISEN, which supports SE to improve speech quality and intelligibility. Based on SE, two extended functions, acoustic scene conversion (ASC) and model adaptation (MA), are also implemented in CITISEN. We conducted a series of experiments to verify the effectiveness of these three functions. Two standard measurement methods, perceptual evaluation of speech quality (PESQ) [47] and short-time objective intelligibility (STOI) [48], were used to test the SE and MA.
Experimental results confirm the effectiveness of the SE and MA with notable PESQ and STOI score improvements. Further, we conducted listening tests for intelligibility and acoustic scene identification to test the ASC performance. The results reveal that the intelligibility scores did not drop significantly after the ASC was performed on the original noisy speech, and the converted scene could be accurately identified.

The remainder of this paper is organized as follows. Section II reviews related works. Section III presents the functions and user interface of the CITISEN application. Section IV presents the experimental setup and results. Finally, Section V presents the conclusions of this study.

Fig. 1. Traditional filter-based SE architecture. FFT and IFFT denote the fast Fourier transform and inverse FFT, respectively.

Fig. 2. The DDAE-based SE architecture.
II. RELATED WORKS

In this section, we first review the traditional filter-based SE method, which will be used for comparisons in the experiments. Then, we introduce the deep denoising autoencoder (DDAE)-based and fully convolutional network (FCN)-based SE methods, which are used as the default SE models in CITISEN.
A. Traditional Gain Function-Based SE Method
For the SE task, we generally assume that the noisy speech signal $y[n]$ contains a clean speech signal $s[n]$ and a noise signal $v[n]$:

$$y[n] = s[n] + v[n], \quad (1)$$

where $n$ is a time index. For the MMSE SE approach, the time-domain signal $y[n]$ is first converted to a spectral feature $Y[m, l]$ by a short-time Fourier transform (STFT), where $m$ and $l$ denote the $m$-th frequency bin and the $l$-th frame in the entire set of noisy spectral features, $Y$. From Eq. (1), $Y[m, l]$ can be expressed as

$$Y[m, l] = S[m, l] + V[m, l]. \quad (2)$$

By estimating a priori SNR and a posteriori SNR statistics based on a noise-estimation approach [49], we can estimate a gain function $G[m, l]$. The enhanced speech, $\hat{S}[m, l]$, is obtained by filtering $Y[m, l]$ through $G[m, l]$. Finally, an inverse FFT (IFFT) is applied to convert $\hat{S}[m, l]$ to $\hat{s}[n]$, as shown in Fig. 1.
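To make this pipeline concrete, the following minimal Python sketch implements a generic spectral-gain enhancer with a decision-directed a priori SNR estimate and a Wiener-type gain. The initial noise estimate, smoothing constant, and window settings are illustrative assumptions rather than the exact estimator of [49].

```python
import numpy as np
from scipy.signal import stft, istft

def gain_based_se(y, fs=16000, n_fft=512, hop=256, alpha=0.98, noise_frames=6):
    # STFT: Y[m, l] as in Eq. (2)
    _, _, Y = stft(y, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Y) ** 2
    # Crude initial noise PSD estimate from the first few (assumed speech-free) frames.
    noise_psd = power[:, :noise_frames].mean(axis=1)
    S_hat = np.zeros_like(Y)
    prev_power = np.zeros(Y.shape[0])
    for l in range(Y.shape[1]):
        post_snr = power[:, l] / np.maximum(noise_psd, 1e-10)            # a posteriori SNR
        prio_snr = alpha * prev_power / np.maximum(noise_psd, 1e-10) \
                   + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)       # decision-directed a priori SNR
        G = prio_snr / (1.0 + prio_snr)                                  # Wiener-type gain G[m, l]
        S_hat[:, l] = G * Y[:, l]
        prev_power = np.abs(S_hat[:, l]) ** 2
    # Inverse transform back to the time domain (Fig. 1).
    _, s_hat = istft(S_hat, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat[: len(y)]
```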
B. Deep Learning-Based SE Method

In the CITISEN application, we included two deep learning-based SE methods: DDAE and FCN. These two methods have been confirmed to yield promising results in several SE tasks [50]–[52].
1) Deep Denoising Autoencoders:
The DDAE model was first applied to SE in [50]. During training, noisy-clean speech pairs are used to compute the mapping function from noisy to clean spectral (logarithmic amplitude in this study) features. The aim of a DDAE is to transform the noisy speech signal to a clean speech signal by minimizing the reconstruction error between the predicted spectral features $\hat{S}$ and the reference clean spectral features $S$, such that

$$\theta^{*} = \arg\min_{\theta} \, E(\theta) + \rho C(\theta), \quad (3)$$

with

$$E(\theta) = \| \phi(Y) - S \|_{F}, \quad (4)$$

where $\rho$ is a constant that controls the tradeoff between the reconstruction accuracy and the regularization term $C(\theta)$ [53], and $\phi(\cdot)$ denotes the transformation function of the DDAE. Given noisy spectral features, the DDAE estimates clean speech by

$$
\begin{aligned}
h_{1}(Y[l]) &= \sigma(W_{1} Y[l] + b_{1}), \\
&\;\;\vdots \\
h_{D-1}(Y[l]) &= \sigma(W_{D-1} h_{D-2}(Y[l]) + b_{D-1}), \\
\hat{S}[l] &= W_{D} h_{D-1}(Y[l]) + b_{D}, \quad (5)
\end{aligned}
$$

where $Y[l]$ and $\hat{S}[l]$ are the $l$-th spectral feature vectors of the input noisy and estimated clean spectral features, respectively; $W_{1}, \ldots, W_{D}$ and $b_{1}, \ldots, b_{D}$ are the weight matrices and bias vectors, respectively; and $\sigma$ is the vector-wise non-linear activation function. To incorporate contextual information, we may concatenate several frames of feature vectors to form the input and output for training the DDAE model. During testing, noisy speech signals are processed by the trained DDAE model to reconstruct the enhanced speech signals [50].
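A compact sketch of this architecture, assuming PyTorch and illustrative layer sizes (the hidden width, depth, context window, and weight-decay value are not taken from the paper), could look as follows; the weight-decay term here plays the role of the regularizer $\rho C(\theta)$ in Eq. (3).

```python
import torch
import torch.nn as nn

class DDAE(nn.Module):
    def __init__(self, n_freq=257, context=5, hidden=512, depth=3):
        super().__init__()
        dim_in = n_freq * context                  # concatenated context frames
        layers = []
        for d in range(depth):
            layers += [nn.Linear(dim_in if d == 0 else hidden, hidden), nn.Sigmoid()]
        layers += [nn.Linear(hidden, n_freq)]      # linear output layer, as in the last line of Eq. (5)
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_frames):               # (batch, n_freq * context)
        return self.net(noisy_frames)

model = DDAE()
# Weight decay acts as a simple regularizer, analogous to rho * C(theta).
optim = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.MSELoss()                           # squared reconstruction error, cf. Eq. (4)

def train_step(noisy_batch, clean_batch):
    optim.zero_grad()
    loss = criterion(model(noisy_batch), clean_batch)
    loss.backward()
    optim.step()
    return loss.item()
```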
2) Fully Convolutional Network:
Fig. 3 shows an FCN model, which is similar to a conventional CNN, except that all the fully connected layers are removed. As reported in [51], the FCN model can deal with the high- and low-frequency components of the raw waveform at the same time. The relation between the output sample $\hat{s}[n]$ and the connected hidden nodes $R[n]$ can be represented by

$$\hat{s}[n] = Q^{\top} R[n], \quad (6)$$

where $Q \in \mathbb{R}^{q \times 1}$ denotes one of the learned filters, and $q$ is the size of the filter. For details on the structure of the FCN model for waveform enhancement, please refer to previous works [44], [51]. Using a norm-based distance between the estimated and reference waveforms, the objective function is defined as

$$L(\theta) = \frac{1}{U} \sum_{u=1}^{U} \| w_{y}(u) - w_{q}(u) \|, \quad (7)$$

where $\theta$ denotes the model parameters of the FCN, and $w_{y}(u)$ and $w_{q}(u)$ are the $u$-th estimated utterance and clean reference, respectively.

Fig. 3. The FCN-based SE architecture.
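The following minimal PyTorch sketch illustrates such a waveform-in, waveform-out FCN with an L1 training criterion; the channel count, kernel size, depth, activation, and the choice of L1 loss are illustrative assumptions, not the exact configuration of [44], [51].

```python
import torch
import torch.nn as nn

class WaveFCN(nn.Module):
    def __init__(self, channels=32, kernel=55, depth=4):
        super().__init__()
        layers = []
        for d in range(depth):
            in_ch = 1 if d == 0 else channels
            # 1-D convolutions only; no fully connected layers anywhere.
            layers += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2),
                       nn.LeakyReLU(0.1)]
        layers += [nn.Conv1d(channels, 1, kernel, padding=kernel // 2)]  # back to one waveform channel
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_wave):                 # (batch, 1, samples)
        return self.net(noisy_wave)

model = WaveFCN()
l1 = nn.L1Loss()                                   # utterance-level norm-based objective, cf. Eq. (7)
noisy = torch.randn(2, 1, 16000)                   # two dummy 1-second utterances at 16 kHz
clean = torch.randn(2, 1, 16000)
loss = l1(model(noisy), clean)
```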
C. Model Adaptation

When operating SE systems in real-world scenarios, unknown noise types and new users are often encountered. In such cases, the testing data may not be well covered by the trained SE model. The resulting differences in acoustic characteristics, i.e., training/testing mismatches, may considerably degrade the SE performance. To effectively address this mismatch issue, adaptation of the SE model is required. Thus far, various MA approaches have been proposed [54]–[59]. The main concept of MA is to adjust the parameters of a pre-trained model (prepared using training data) based on a set of adaptation data to match the testing condition.

For the SE MA task, we first need to prepare adaptation data that cover new noise types and/or speakers [60]–[62]. The parameters of the original SE model are then adjusted based on the adaptation data. Because the adapted SE model matches the testing condition, the SE performance can be improved.
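In its simplest form, this adaptation amounts to continuing training of the pre-trained SE model on a small set of adaptation pairs, as in the hypothetical sketch below; the loss, epoch count, and learning rate are assumptions, and the actual adaptation recipes follow [60]–[62].

```python
import torch

def adapt_se_model(model, adapt_noisy, adapt_clean, epochs=10, lr=1e-4):
    """Fine-tune a pre-trained SE model on paired adaptation data.

    adapt_noisy / adapt_clean: tensors of paired adaptation features or waveforms
    built from user-recorded noise and/or speaker audio.
    """
    model.train()
    # Small learning rate keeps the adapted model close to the pre-trained weights.
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        optim.zero_grad()
        loss = loss_fn(model(adapt_noisy), adapt_clean)
        loss.backward()
        optim.step()
    return model   # adapted SE model matched to the new noise/speaker condition
```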
III. CITISEN APP

In this section, we introduce the concepts of our mobile application, explain how all the functions are implemented, and demonstrate the user interface.

A. Speech Enhancement (SE) Function
SE is a major function of CITISEN. As shown in the blue block of Fig. 4, given the noisy speech, the SE function removes background noises and generates enhanced speech with improved quality and intelligibility. We train the SE models on a cloud server, and the trained models are then loaded into mobile devices. Because the models are trained on a cloud server, heavy computational resources are not required on the mobile devices. As mentioned earlier, two deep learning-based SE methods, DDAE and FCN, are implemented in CITISEN. To reduce the latency, a small window size is used when implementing these SE systems. As described in the previous section, DDAE and FCN perform SE in the spectral and raw-waveform domains, respectively.
B. Acoustic Scene Conversion (ASC) Function
Fig. 4. The SE, ASC, and MA functions in CITISEN.

Because the SE function can extract pure speech by removing background noises, the ASC is implemented based on SE. After SE extracts pure speech from noisy speech, ASC mixes the pure speech with a new background noise; in other words, we can artificially convert the acoustic scene of the original audio. The overall ASC function is illustrated in the orange block of Fig. 4. The main concept of ASC is similar to changing the background of an image or a video [63], and ASC is a new topic in the speech signal research field. Based on our literature survey, there is no standard method to evaluate this task. Therefore, we invited human listeners to conduct listening tests. Our goal is not only to mix clean speech with a new acoustic scene but also to ensure that the same levels of clarity and intelligibility are maintained. Accordingly, we designed two listening tests: one for the speech intelligibility scores and the other for the scene identification rate (SIR) of the original/converted acoustic scenes.
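The mixing step can be summarized by the standard SNR scaling relation; the sketch below, with illustrative function and variable names, adds the selected background noise to the enhanced speech at a user-chosen SNR (the volume bar in the app serves this purpose).

```python
import numpy as np

def convert_scene(enhanced, new_noise, snr_db=5.0):
    # Loop or trim the new background noise to the length of the enhanced speech.
    reps = int(np.ceil(len(enhanced) / len(new_noise)))
    noise = np.tile(new_noise, reps)[: len(enhanced)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(enhanced ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return enhanced + scale * noise
```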
C. Model Adaptation (MA) Function
The MA function of CITISEN aims to adapt the SE model to fit unknown noises and/or speakers. The procedure of MA is illustrated in the green block of Fig. 4. We provide three different MA modes: noise only, speaker only, and noise and speaker. Based on the user environment, users can choose the most suitable MA mode and then upload a short recorded audio clip to a cloud server for adapting the SE models. Our experiments reveal that the MA function notably increases the SE performance when unknown noise types are encountered, with improvements observed in both STOI and PESQ scores.
D. CITISEN User Interface and Usage
The CITISEN application has four pages: "Speech Enhancement," "Acoustic Scene Conversion," "Model Adaptation," and "Recording," as shown in Fig. 5. The page name and navigation buttons of each page are placed at the top-left and bottom of the application, respectively.

Fig. 5. Four main pages in CITISEN ("Speech Enhancement", "Acoustic Scene Conversion", "Model Adaptation", and "Recording"). The page name and the navigator buttons of each page are listed on the top-left and bottom in the application, respectively.

On the "Speech Enhancement" page, a user first specifies her/his gender identity ("Gender Identity" in Fig. 6). Then, by pressing the "SE Model Switch" button, the user can select a suitable SE model from a list of saved models. CITISEN provides several default SE models trained using our own collected speech datasets. Users can also run MA to prepare adapted SE models and save them as new SE models. Then, by pressing the SE button, the noisy speech is transformed into clean speech online.

Fig. 6. CITISEN: the "Speech Enhancement" page.

On the "Acoustic Scene Conversion" page, CITISEN mixes an acoustic scene with the enhanced speech to generate new speech signals with the converted acoustic scene; the user interface of this page is shown in Fig. 7. The "Acoustic Scene Conversion" page has a "Record Noise" button, by which users can record and save noise signals for the ASC. The page also has a volume bar, which allows users to adjust the volume of the background noise and accordingly specify the SNR level of the converted speech. To change the acoustic scene, users first press the "SE Model Switch" button to select an SE model. Then, by pressing the "Background Noise Switch" button, as shown on the left side of Fig. 7, an acoustic scene selection window pops up and lists all the acoustic scene options, as shown on the right side of Fig. 7. Users can select the target scene for the ASC, and the speech with the converted scene will be generated accordingly.

Fig. 7. CITISEN: the "Acoustic Scene Conversion" page.

On the "Model Adaptation" page, there are two file upload buttons: "Record Noise" and "Record Speech," as shown on the left side of Fig. 8. By pressing one of these buttons, users can record pure noise or speaker speech signals and upload the recorded audio to our server. To start recording, users simply press one of the buttons; after finishing the recording, by pressing the button again, CITISEN pops up a submission window, as shown on the right side of Fig. 8. The submission window asks the user to name the audio file, and the audio is then sent to the server. After receiving the audio file, the server estimates an adapted SE model by fine-tuning the original SE model using the recorded audio data. The name of the audio file can also be used to name the adapted SE model, which is later sent from the server to the mobile device and appears on the "Speech Enhancement" and "Acoustic Scene Conversion" pages. Accordingly, users can run the SE and ASC functions using the adapted SE model.

Fig. 8. CITISEN: the "Model Adaptation" page.

The "Recording" page is used for users to record speech and noise in the current environment and to save the enhanced or converted audio files. On the "Speech Enhancement" and "Acoustic Scene Conversion" pages, users can immediately listen to enhanced or converted speech online; the "Recording" page, on the other hand, allows users to save the processed audio files and play them back later. Users first record (upper path in Fig. 9) or load an existing (bottom path in Fig. 9) audio file and then press the "SE Model Switch" button. Then, an SE model selection window pops up, as shown on the right side of Fig. 10. By selecting a suitable SE model and then pressing the run button (as shown on the left side of Fig. 10), enhanced speech is generated. CITISEN displays two spectrogram plots, for the noisy and enhanced speech (as shown on the right side of Fig. 11), so that users can visually check the SE results. In addition to these two plots, users can press the "Play" and "Stop" buttons on top of the spectrogram plots to play and listen to the original and processed audio files.

Fig. 9. CITISEN: the "Recording" page (recording or loading saved audio files).

Fig. 10. CITISEN: the "Recording" page (selecting a model to perform SE).

Fig. 11. CITISEN: the "Recording" page (demonstrating the processed speech by spectrogram plots).

IV. EXPERIMENTS
A. Experimental Setup
We conducted three sets of experiments. First, we tested the performance of the SE and ASC functions using STOI and PESQ metrics and listening tests. Next, we conducted a listening test to examine the intelligibility and SIR of the speech before and after ASC. Finally, as mentioned earlier, we implemented the MA function by fine-tuning the original SE model to fit unseen noise types and new speakers. Accordingly, we obtained three sets of results for MA followed by SE (termed MA+SE): "MA+SE(N)", "MA+SE(S)", and "MA+SE(N+S)", denoting model adaptation on noise type, speaker, and both noise type and speaker, respectively.

In this study, TMHINT utterances [64] were used to prepare the training and testing sets. More specifically, the training set was prepared using speech utterances from six speakers, three males and three females. Each speaker read 200 TMHINT utterances in a quiet room, amounting to a total of 1200 clean utterances. Noisy utterances were generated by artificially contaminating these 1200 clean training utterances with randomly sampled noise types from a 100-noise-type dataset [65] at 8 different SNR levels. Consequently, 48000 noisy-clean utterance pairs were obtained. To construct the testing set, we used the speech utterances of another two speakers (one male and one female, termed testing speakers in the following discussion), with 120 utterances for each speaker. We generated noisy utterances by artificially contaminating these clean utterances with another set of 5 noise types (car, sea wave, take-off, train, and song) at 4 different SNR levels. Notably, the speakers, speech contents, and noise types were different between the training and testing sets. All the training and testing utterances were recorded at a 16 kHz sampling rate in a 16-bit format. The hyper-parameters for both the DDAE and FCN SE models were as follows: the number of training epochs was 40, the batch size was 1, and the optimizer was Adam with a learning rate of 0.001. A validation set was prepared and used to determine the best model configurations for SE. To avoid unstable communication and computation, we conducted the experiments offline. More specifically, we ran CITISEN to obtain the processed speech. Subsequently, objective evaluations and listening tests were conducted using the processed speech offline.

We first tested the performance of the SE and ASC functions using both objective evaluations and subjective listening tests. For the objective evaluations, the PESQ [66] and STOI [67] metrics were used (a minimal scoring sketch is given at the end of this subsection). PESQ was designed to evaluate the quality of the processed speech, and the score ranges from -0.5 to 4.5. A higher PESQ score indicates that the enhanced speech is closer to the clean speech. On the other hand, STOI was designed to compute the speech intelligibility, and the score ranges from 0 to 1. A higher STOI score indicates better speech intelligibility.

To evaluate the SE function, we tested the performance of the DDAE and FCN SE models using the STOI and PESQ scores. The MMSE approach, which is a well-known traditional SE method, was also tested for comparison. For the listening tests, we recruited twenty participants (40% males), aged between 20 and 38 years with a mean age of 21.50 (standard deviation; SD = 3.97). All the participants were native Mandarin speakers with normal hearing abilities and were therefore able to effectively perceive the stimuli during the test.
Each participant listened to only 80 testing utterances (40 at 0 dB SNR and 40 at 5 dB SNR) spoken by one male and one female testing speaker. These 80 sentences had different contents, and each consisted of 10 Chinese characters with one of the 5 assigned background noises (car, sea wave, take-off, train, and song). During testing, each participant was asked to listen and respond to 40 lower-SNR tasks, followed by 40 higher-SNR tasks, under four conditions (original noisy (denoted as Noisy in the following discussion), MMSE, DDAE, and FCN). To evaluate the SE function, the subjects were instructed to verbally repeat what they had heard and were allowed to perceive each stimulus at most twice. The character correct rate (CCR) was used as the evaluation metric; the CCR was calculated by dividing the number of correctly identified characters by the total number of characters under each test condition.

To test the ASC function, we requested the listeners to identify one out of six acoustic scenes after listening to the converted or original noisy speech. The original and converted noisy utterances were all at 5 dB SNR. Twenty participants, all native Mandarin speakers with normal hearing, were recruited to participate in this set of listening tests. During the tests, each participant was asked to listen to 80 utterances, where each utterance was first processed by one of the three SE methods (i.e., MMSE, DDAE, and FCN) and then mixed with a different noise type at 5 dB SNR. In each task, the SIRs were calculated from the participants' identification results with respect to the ground-truth assigned background noises.

Finally, we evaluated the performance of the MA function. Based on the recorded pure noise and speaker speech signals, we performed MA in three modes, termed MA(N), MA(S), and MA(N+S). For MA(N), the recorded noise signals were mixed with the clean training speech (from the training set) to form new noisy-clean speech pairs, which were then used to fine-tune the SE model. For MA(S), the recorded speaker speech signals were mixed with 5 pure noise signals (from the training set) to form new noisy-clean speech pairs, which were used to fine-tune the SE model. For MA(N+S), the recorded speaker speech and new noise signals were mixed to form new noisy-clean speech pairs, which were then used to fine-tune the SE model.
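For reference, a minimal objective-scoring sketch of the kind mentioned above, assuming the open-source `pesq` and `pystoi` Python packages and 16 kHz mono recordings (the file names are hypothetical), is shown below; the scores follow the definitions of [66], [67].

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read("clean.wav")         # reference clean utterance (hypothetical file name)
enhanced, _ = sf.read("enhanced.wav")    # CITISEN-processed utterance (hypothetical file name)

# Align lengths before scoring.
length = min(len(clean), len(enhanced))
clean, enhanced = clean[:length], enhanced[:length]

pesq_score = pesq(fs, clean, enhanced, "wb")             # -0.5 to 4.5, higher means better quality
stoi_score = stoi(clean, enhanced, fs, extended=False)   # 0 to 1, higher means better intelligibility
print(f"PESQ: {pesq_score:.2f}, STOI: {stoi_score:.3f}")
```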
B. Experimental Results

1) The SE experiment: This subsection presents the performance of the SE function in CITISEN. Table I presents the STOI and PESQ scores (the first and second columns, respectively) of Noisy and of the enhanced speech processed using the MMSE, DDAE, and FCN methods. From the results, FCN provides the highest PESQ and STOI scores among the four methods, which is consistent with the findings presented in our previous study [51].

Table II presents the subjective listening test results for Noisy and the three SE methods. From the table, it can be observed that MMSE yields lower CCRs than Noisy for both 0 dB and 5 dB SNRs, which is consistent with the findings of previous research; in other words, although traditional SE methods effectively remove background noise, speech intelligibility may be degraded. Next, FCN outperforms DDAE and achieves CCRs that are comparable to Noisy. These results are consistent with the STOI results reported in Table I. We further conducted an independent t-test to verify the significance of the testing results. The independent t-test results confirm that the average CCRs of FCN are significantly better than those of Noisy and the other two SE methods (with p < .01) for both the 0 dB and 5 dB SNR conditions.

TABLE I. Average STOI and PESQ scores for Noisy and the three SE methods over the tested SNR conditions. Noisy denotes the results of the original noisy speech without performing SE.

TABLE II. Average speech recognition results (CCRs in %) for Noisy and the three SE methods at 0 dB and 5 dB SNR conditions.
2) The ASC experiment:
In this subsection, we present the evaluation results of the ASC function in CITISEN. As mentioned earlier, the ASC consists of two parts: SE, followed by mixing a new background noise with the enhanced speech. In our implementation, after performing SE, a particular noise type is added to the enhanced speech to generate new noisy speech with a converted background. To avoid the fatigue effect, we only tested the 5 dB SNR condition. In this way, each subject listened to 80 utterances, repeated what they heard, and was asked to indicate one out of six background scenes. Based on the three SE methods, namely MMSE, DDAE, and FCN, three sets of ASC speech were obtained, denoted as ASC(MMSE), ASC(DDAE), and ASC(FCN), respectively. The recruited participants then listened to these three sets of ASC speech and responded to the speech contents and the acoustic scene in the background. Table III lists the CCR (in %) and SIR (in %) results of these three setups.

From Table III, we first note that ASC(MMSE), ASC(DDAE), and ASC(FCN) give similar CCR scores. It is also noted that the CCRs are not significantly degraded by running the ASC, as compared to the CCR of Noisy reported in Table II. Furthermore, ASC(FCN) yields a higher SIR than both ASC(MMSE) and ASC(DDAE), suggesting that the FCN serves as a better SE model for the ASC function.

TABLE III. The scores of CCR (in %) and SIR (in %) based on the ASC function in CITISEN.
3) The MA+SE experiment:
Next, we investigated the effectiveness of the MA function. For this set of experiments, we used two other noise types (machine beeping and air flowing) from a real hospital scenario; these noise types are significantly different from those in the training set. Table IV presents the STOI and PESQ scores of MA(N)+SE, MA(S)+SE, and MA(N+S)+SE, where the FCN model is used as the SE model in this set of experiments. From Table IV, it can be seen that SE yields higher STOI and PESQ scores than Noisy, thereby confirming that the SE model used in CITISEN can improve speech quality and intelligibility over noisy speech even though the noise types are unknown and greatly different from those used in the training set. Next, as compared with SE (without MA), all three MA approaches are capable of achieving higher PESQ and STOI scores; MA(N)+SE, MA(S)+SE, and MA(N+S)+SE each yielded noticeable relative improvements over SE (FCN) alone in terms of both STOI and PESQ. The results obtained therefore confirm the effectiveness of the MA function. Moreover, MA(N)+SE gives higher scores than MA(S)+SE, suggesting that noise-type adaptation is more effective in improving SE performance. Finally, MA(N+S)+SE outperforms both MA(N)+SE and MA(S)+SE in terms of STOI, showing that intelligibility improvements can be attained by adapting the SE model based on both noise and speaker information.

TABLE IV. Average STOI and PESQ scores for different SE models over -2, 0, 2, and 5 dB SNR conditions. Noisy denotes the results of the original noisy speech without performing SE, and SE denotes the FCN-based SE results. MA(N)+SE, MA(S)+SE, and MA(N+S)+SE denote the results of SE with the SE model adapted using recorded noise, speaker, and noise + speaker audio files, respectively.
4) Qualitative Analyses:
Finally, we present examples of CITISEN-processed speech in Fig. 12. Figs. 12(a), (b), (c), and (d) depict the spectrogram and waveform plots of the clean, noisy, enhanced, and ASC speech signals, respectively. For each sub-figure in Fig. 12, the left column depicts the spectrogram, while the right column depicts the associated waveform. In this example, car noise was used to contaminate the clean speech to produce the noisy speech, and a new train background noise was used as the converted noise for the enhanced speech to perform the ASC.

The enhanced spectrogram illustrated in Fig. 12(c) preserves several harmonic structures of the clean speech when compared with those presented in Fig. 12(a). In addition, when comparing the waveforms in Figs. 12(a), (b), and (c), the enhanced waveform presented in Fig. 12(c) contains only small residual noise components. Both observations demonstrate the effectiveness of CITISEN in reducing the noise from the noisy input while preserving detailed speech structures. In contrast, the spectra presented in Fig. 12(d) clearly illustrate different noise patterns compared with those presented in Fig. 12(b). This result qualitatively confirms that CITISEN is capable of effectively performing the ASC task.

V. CONCLUSION
In this paper, we presented a speech signal-processing mobile application called CITISEN, comprising three main functions: SE, ASC, and MA. CITISEN allows users to run SE and ASC on input speech and immediately obtain enhanced and converted speech, respectively. Experimental results first confirmed that the SE function provides improved STOI and PESQ scores. Next, the effectiveness of the ASC function was verified based on listening tests. Finally, the MA function was confirmed to provide notable STOI and PESQ improvements as compared with the results without MA. To the best of our knowledge, this study is the first attempt at an ASC function based on an SE-plus-added-noise strategy, and it is worthy of further investigation. Moreover, we confirmed the effectiveness of the MA function using recorded noise and speaker audio files online. In this study, we only reported the results of DDAE and FCN for the SE function. In fact, CITISEN can incorporate other SE models with novel architectures, such as transformers [68], [69], and advanced objective functions, such as those based on the STOI or PESQ metrics [70]. Users can choose suitable SE models based on the use scenarios. The experimental results confirm the feasibility of implementing SE and several extended functions on mobile devices. Moreover, it is verified that CITISEN can be suitably used as an effective front-end processor for various speech-related applications.
REFERENCES

[1] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
[2] A. L. Giraud, S. Garnier, C. Micheyl, G. Lina, A. Chays, and S. Chéry-Croze, "Auditory efferents involved in speech-in-noise intelligibility," Neuroreport, vol. 8, no. 7, pp. 1779–1783, 1997.
[3] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. WASPAA, pp. 1–4, 2013.
[4] R. Beutelmann and T. Brand, "Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners," The Journal of the Acoustical Society of America, vol. 120, no. 1, pp. 331–342, 2006.
[5] H. J. Steeneken and T. Houtgast, "A physical method for measuring speech-transmission quality," The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
[6] A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 3, pp. 190–202, 1996.
[7] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[8] J. Benesty, S. Makino, and J. Chen, Speech Enhancement. Springer Science & Business Media, 2005.
[9] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press, 2015.
[10] A. El-Solh, A. Cuhadar, and R. A. Goubran, "Evaluation of speech enhancement techniques for speaker identification in noisy environments," in Proc. ISM, pp. 235–239, 2007.
[11] J. Li, L. Yang, J. Zhang, Y. Yan, Y. Hu, M. Akagi, and P. C. Loizou, "Comparative intelligibility investigation of single-channel noise-reduction algorithms for Chinese, Japanese, and English," The Journal of the Acoustical Society of America, vol. 129, no. 5, pp. 3291–3301, 2011.
[12] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, "Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication," Speech Communication, vol. 53, no. 5, pp. 677–689, 2011.
[13] T. Venema, "Compression for clinicians, chapter 7," The Many Faces of Compression. Thomson Delmar Learning, 2006.
[14] H. Levit, "Noise reduction in hearing aids: An overview," J. Rehabil. Res. Develop., vol. 38, no. 1, pp. 111–121, 2001.
[15] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568–1578, 2016.
[16] F. Chen, Y. Hu, and M. Yuan, "Evaluation of noise reduction methods for sentence recognition by Mandarin-speaking cochlear implant listeners," Ear and Hearing, vol. 36, no. 1, pp. 61–71, 2015.
[17] P. Scalart et al., "Speech enhancement based on a priori signal to noise estimation," in Proc. ICASSP, vol. 2, pp. 629–632, 1996.
[18] E. Hänsler and G. Schmidt, Topics in Acoustic Echo and Noise Control: Selected Methods for the Cancellation of Acoustical Echoes, the Reduction of Background Noise, and Speech Processing. Springer Science & Business Media, 2006.
[19] J. Chen, J. Benesty, Y. A. Huang, and E. J. Diethorn, "Springer handbook of speech processing," pp. 843–872, Springer, 2008.
[20] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[21] T. F. Quatieri and R. J. McAulay, "Shape invariant time-scale and pitch modification of speech," IEEE Transactions on Signal Processing, vol. 40, no. 3, pp. 497–510, 1992.
[22] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[23] S. Suhadi, C. Last, and T. Fingscheidt, "A data-driven approach to a priori SNR estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 186–195, 2010.
[24] T. Lotter and P. Vary, "Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model," EURASIP Journal on Advances in Signal Processing, vol. 2005, no. 7, p. 354850, 2005.
[25] U. Kjems and J. Jensen, "Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement," in Proc. EUSIPCO, pp. 295–299, 2012.
[26] R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.
[27] R. Frazier, S. Samsam, L. Braida, and A. Oppenheim, "Enhancement of speech by adaptive filtering," in Proc. ICASSP, vol. 1, pp. 251–253, 1976.
[28] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
[29] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 3, pp. 247–254, 1979.
[30] L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[31] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001.
[32] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, "Speech denoising using nonnegative matrix factorization with priors," in Proc. ICASSP, 2008.
[33] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[34] J.-C. Wang, Y.-S. Lee, C.-H. Lin, S.-F. Wang, C.-H. Shih, and C.-H. Wu, "Compressive sensing-based speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2122–2131, 2016.
[35] J. Eggert and E. Korner, "Sparse coding and NMF," in Proc. IJCNN, 2004.
[36] Y.-H. Chin, J.-C. Wang, C.-L. Huang, K.-Y. Wang, and C.-H. Wu, "Speaker identification using discriminative features and sparse representation," IEEE Transactions on Information Forensics and Security, vol. 12, pp. 1979–1987, 2017.
[37] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
[38] S. Tamura, "An analysis of a noise reduction neural network," in Proc. ICASSP, pp. 2001–2004, 1989.
[39] F. Xie and D. Van Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. ICASSP, vol. 2, pp. II-53, 1994.
[40] E. A. Wan and A. T. Nelson, "Networks for speech enhancement," Handbook of Neural Networks for Speech Processing. Artech House, Boston, USA, vol. 139, p. 1, 1999.
[41] J. Tchorz and B. Kollmeier, "SNR estimation based on amplitude modulation analysis with applications to noise suppression," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 184–192, 2003.
[42] A. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," 2012.
[43] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. ICASSP, pp. 6822–6826, 2013.
[44] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in Proc. APSIPA ASC, pp. 006–012, 2017.
[45] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[46] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[47] ITU-T Recommendation, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[48] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[49] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in Proc. INTERSPEECH, pp. 3768–3772, 2016.
[50] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH, pp. 436–440, 2013.
[51] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[52] S. Gong, Z. Wang, T. Sun, Y. Zhang, C. D. Smith, L. Xu, and J. Liu, "Dilated FCN: Listening longer to hear better," in Proc. WASPAA, pp. 254–258, 2019.
[53] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. 12, 2010.
[54] S. Chopra, S. Balakrishnan, and R. Gopalan, "DLID: Deep learning for domain adaptation by interpolating between domains," in Proc. ICML, vol. 2, 2013.
[55] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," arXiv preprint arXiv:1703.03400, 2017.
[56] R. Laroche and M. Barlier, "Transfer reinforcement learning with shared dynamics," in Proc. AAAI, 2017.
[57] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," J. Mach. Learn. Res., vol. 17, pp. 2096–2030, 2016.
[58] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Proc. NeurIPS, 2014.
[59] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[60] S. Wang, W. Li, S. M. Siniscalchi, and C.-H. Lee, "A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers," in Proc. ICASSP, pp. 6219–6223, 2020.
[61] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, "Noise adaptive speech enhancement using domain adversarial training," in Proc. INTERSPEECH, 2019.
[62] C.-C. Lee, Y.-C. Lin, H.-T. Lin, H.-M. Wang, and Y. Tsao, "SERIL: Noise adaptive speech enhancement using regularization-based incremental learning," arXiv preprint arXiv:2005.11760, 2020.
[63] M. Seki, H. Fujiwara, and K. Sumi, "A robust background subtraction method for changing background," in Proc. WACV, pp. 207–213, 2000.
[64] M. Huang, "Development of Taiwan Mandarin hearing in noise test," Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Science, 2005.
[65] G. Hu, "100 nonspeech environmental sounds," The Ohio State University, Department of Computer Science and Engineering, 2004.
[66] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, vol. 2, pp. 749–752, 2001.
[67] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[68] J. Kim, M. El-Khamy, and J. Lee, "Transformer with Gaussian weighted self-attention for speech enhancement," arXiv preprint arXiv:1910.06762, 2019.
[69] S.-W. Fu, C.-F. Liao, T.-A. Hsieh, K.-H. Hung, S.-S. Wang, C. Yu, H.-C. Kuo, R. E. Zezario, Y.-J. Li, S.-Y. Chuang, et al., "Boosting objective scores of speech enhancement model through MetricGAN post-processing," arXiv preprint arXiv:2006.10296, 2020.
[70] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," arXiv preprint arXiv:1905.04874, 2019.