Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Amirhossein Hajavi∗, Ali Etemad
Department of ECE and Ingenuity Labs, Queen's University, Kingston, Canada
{a.hajavi, ali.etemad}@queensu.ca

∗ The authors would like to thank IMRSV Data Labs for their support of this work. The authors would also like to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research (grant number: CRDPJ 533919-18).

Abstract
Deep learning techniques have considerably improved speech processing in recent years. Speech representations extracted by deep learning models are being used in a wide range of tasks such as speech recognition, speaker recognition, and speech emotion recognition. Attention models play an important role in improving deep learning models. However, current attention mechanisms are unable to attend to fine-grained information items. In this paper we propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals. This model is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition. Two widely used public datasets, VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on top of several prominent deep models as backbone networks to evaluate its impact on performance compared to the original networks and other related work. Our experiments show that by adding FEFA to different CNN architectures, performance is consistently improved by substantial margins, even setting a new state-of-the-art for the speaker recognition task. We also tested our model against different levels of added noise, showing improvements in robustness and less sensitivity compared to the backbone networks.
Introduction

Deep speech representation learning has been the subject of a large number of past works. Many techniques have been developed and employed for extracting representations from speech for related tasks such as speaker recognition (SR) and speech emotion recognition (SER) using deep learning. A significant number of these deep learning models have been based on Convolutional Neural Networks (CNN) for SR (Hajavi and Etemad 2019; Okabe, Koshinaka, and Shinoda 2018; Xie et al. 2019a; Chung, Nagrani, and Zisserman 2018; Nagrani, Chung, and Zisserman 2017) and SER (Albanie et al. 2018; Gideon, McInnis, and Provost 2019; Wang et al. 2020; Ghosh et al. 2016). The most common approach to training CNN models for speech-related tasks is to use time-frequency inputs such as spectrograms derived from raw audio signals. Given sufficient data, such deep learning models enable the extraction of better speech representations compared to other methods such as i-Vectors (Nagrani, Chung, and Zisserman 2017; Ghosh et al. 2016).

Attention mechanisms have been shown to have a positive impact on extracting effective deep representations from input data, for instance speech signals. Considerable improvements in the accuracy of emotion recognition models (Tarantino, Garner, and Lazaridis 2019; Wang et al. 2020) and speaker recognition models (Zeinali et al. 2019; Bian, Chen, and Xu 2019; Okabe, Koshinaka, and Shinoda 2018) are some of the examples that demonstrate the potential benefits of using attention mechanisms for representation learning.

Attention models uphold a memory-query paradigm, where the memory is a set of information items such as CNN embeddings of a region of the spectral representation in speech-related tasks (Bian, Chen, and Xu 2019; Bhattacharya, Alam, and Kenny 2017), or a part of the utterance embedded by a recurrent cell in a recurrent neural network (RNN) (Zhang et al. 2019; Wang et al. 2020). The query is derived from a hidden state of the model from either the same modality or a different one (Xu et al. 2015; Bahdanau, Cho, and Bengio 2015). The majority of attention models used in speech-related tasks use features extracted from utterances by a deep neural network as the information items or memory, and the last hidden layer of the model as the query (Xu et al. 2015). The general purpose of an attention model in generating deep representations of speech signals is to focus on each information item individually.

The information items considered in an attention model define the granularity of what the model can focus on. The spectral representation of an utterance enables deep learning models to consider fine-grained features such as frequency bins in very short time-frames. However, typical attention models used on audio signals utilize an embedding obtained from a CNN model as the memory and the final embedding of the model as the query. Using embeddings obtained from CNNs limits the granularity of the attention models to large regions of the spectral representation. On the other hand, improving the granularity of CNN embeddings of an utterance leads to very large attention models which are harder to train and prone to over-fitting.
While there have been a number of studies investigating various attention models using CNN embeddings of utterances (Bhattacharya, Alam, and Kenny 2017; Bian, Chen, and Xu 2019; You et al. 2019; Safari and Hernando 2019), only a limited number of studies aim to use more fine-grained attention models on the spectral representation of the utterance.

In this paper, we address the challenge of improving the granularity of attention models by introducing a fine-grained attention mechanism for audio signals. This mechanism enables deep learning models to focus on individual frequency bins of a spectrogram without the drawbacks of very complex models that typically involve a large number of parameters. The aim of this model is to attend to each frequency bin in the spectrogram representation in order to boost the contribution of the most salient bins. This mechanism also helps reduce the importance of bins with no useful information, leading to more accurate representations, which can also lead to more robustness with respect to existing noise in the input audio. The performance of the proposed attention mechanism has been tested using a select set of prominent CNN architectures on the two tasks of SR and SER. The experimental results show that deploying the fine-grained frequency attention mechanism improves the performance of all the benchmark networks substantially while being less impacted by added noise.

Our contributions in this paper are as follows:
• We introduce a novel attention mechanism for speech representation learning.
• We test our method on two different speech-related problem domains, namely speaker recognition and affective computing, using two large and widely used datasets, demonstrating considerable performance gains for both tasks.
• By simply adding our fine-grained frequency attention method to the existing state-of-the-art model for speaker recognition, we set a new state-of-the-art for speaker recognition in the wild.
• By testing our model against different levels of synthetic noise, we show an improvement in robustness compared to other models.

The rest of this paper is organized as follows. First, we discuss the related work in the area of speech representation learning, followed by particular approaches that have used attention mechanisms for this purpose. Next, we present the proposed attention mechanism. In the following section, we discuss the experiments along with implementation details. Next, we provide the results of our work. Finally, we summarize and conclude the paper.
Related Work

Speech Representation Learning: Speech representation, or utterance embedding, has been an area of research for decades. Classical signal processing techniques such as Gaussian Mixture Models, Hidden Markov Models, and Universal Background Models were used in many speech-related tasks to obtain a proper representation of utterances. Comprehensive reviews of prior work that have used such conventional methods for SR and SER can be found in (Hansen and Hasan 2015; El Ayadi, Kamel, and Karray 2011).

Solutions based on artificial neural networks (ANN) have been widely used in speech-related tasks. In some of the earlier work in this area, speech representations extracted from audio signals using ANNs were fed to conventional classifiers for SR (Farrell, Mammone, and Assaleh 1994) and SER (Nicholson, Takahashi, and Nakatsu 2000).

More recently, deep neural networks (DNN) have been used for learning effective representations of utterances (Variani et al. 2014; Bhattacharya, Alam, and Kenny 2017). Most recent works on extracting deep speech representations for SR have explored the impacts of different deep learning architectures on the quality of these representations. The most prominent works include using CNN architectures such as ResNets for speech representation learning prior to identification (Xie et al. 2019a; Bian, Chen, and Xu 2019; Hajavi and Etemad 2019). In other speech-related tasks such as SER, DNN models have also been very successful for speech representation learning. Most recent studies of SER focus on improving the accuracy of deep learning models by modifying and combining different architectures. Some of the considerable attempts include using combinations of CNNs and RNNs such as long short-term memory (LSTM) networks (Xie et al. 2019b; Latif et al. 2019; Wang et al. 2020).
Attention Mechanisms in Speech: The performance of deep learning models has been significantly improved by attention models in many cases (Bahdanau, Cho, and Bengio 2015; Xu et al. 2015). A number of studies using attention mechanisms for SR and SER have shown substantial improvements compared to baseline models. Attention mechanisms in SR and SER have been utilized to focus on features extracted from utterances using various deep learning models, including CNNs (Bhattacharya, Alam, and Kenny 2017; Bian, Chen, and Xu 2019; You et al. 2019; Safari and Hernando 2019; Zhao et al. 2020), RNNs (Zhang et al. 2019; Wang et al. 2020; Tarantino, Garner, and Lazaridis 2019), and time-delay neural networks (TDNN) (Okabe, Koshinaka, and Shinoda 2018; Zhu et al. 2018). In the following paragraphs we briefly describe some examples.

The model proposed in (Bhattacharya, Alam, and Kenny 2017) utilized self-attention to focus on features obtained from a CNN model inspired by VGGNet (Simonyan and Zisserman 2014). The study done in (Bian, Chen, and Xu 2019) used CNN-based self-attention models to attend to features extracted from a deep learning model with an architecture similar to ResNet (He et al. 2016; Zhao et al. 2020). A novel gated attention model was proposed in (You et al. 2019) to attend to features extracted by a modified version of CNN, namely gated-CNN. The models proposed in (Zhang et al. 2019; Wang et al. 2020; Tarantino, Garner, and Lazaridis 2019) utilized attention models to focus on differences between two sets of features extracted from the enrollment utterance and the questioned utterance using RNNs. In the common approach taken in these studies, the attention models were added to the end of deep learning pipelines. The addition of attention models in this way has been shown to improve the accuracy of baseline models on in-the-wild datasets in each of these studies.

A different approach was taken in (Okabe, Koshinaka, and Shinoda 2018) and (Zhu et al. 2018). The attention models used in these studies replaced the statistical pooling layer of an X-Vector model. The proposed models utilized TDNNs to extract frame-level features from utterances. Attention models were then used to aggregate the features into an utterance-level embedding. The model proposed in (Zhu et al. 2018) was evaluated on the NIST SRE16 evaluation set (National Institute of Standards and Technology 2016) and the model proposed in (Okabe, Koshinaka, and Shinoda 2018) was evaluated on the VoxCeleb1 test set (Nagrani, Chung, and Zisserman 2017). Both models showed substantial improvements compared to their baseline models.

The majority of the aforementioned studies have used the features obtained from DNNs as the memory component of the attention model. The queries of the attention models also originated from the last hidden layer of the model from which the utterance-level embeddings are retrieved. Generally, DNNs learn to extract a low-dimensional latent representation from the input data without necessarily preserving localization with respect to the input information items. Thus, while the use of the last hidden layer of a DNN for extracting the query of an attention mechanism can be advantageous due to its reduced number of parameters, high levels of granularity and a localized relationship with respect to the input may not be achieved.

Compared to the methods proposed in previous studies, the fine-grained attention model proposed in this paper does not require embeddings obtained from DNN models, and can operate directly on spectrograms extracted from raw audio signals.
Hence, the granularity of the attention model can be improved to attend to frequency-level features. While different attention mechanisms depend on specific architectures and models, our proposed fine-grained frequency attention mechanism can be used with various models and architectures. As shown in the experiments section, by adding the frequency attention to multiple CNN-based architectures, a substantial improvement is achieved on both tasks of SER and SR.
Proposed Method

The fundamental paradigm of a general attention mechanism is the memory-query system. Considering audio signals, the memory typically consists of a set of information items, namely DNN embeddings, and the query is acquired from the hidden state of the overall model. The memory is saved in the form of key-value tuples (key_i, value_i). The first element of the tuple, key_i, helps with the calculation of the probability factor p_i, which indicates the impact of the item over the query:

$$p_i(key_i, Query) = \frac{\exp(key_i \times W)}{\sum_{j=1}^{|M|} \exp(key_j \times W)} \quad (1)$$

Equation 1 represents a general attention in which a multi-layer perceptron (MLP) is used to determine the probability p_i. The matrix W is a set of trainable weights integrated by an MLP that carries the impact of Query in determining the probability of the item. The final output of the attention mechanism with respect to a query is the expected value of the items with regards to the variable value_i (see Equation 2):

$$O^{M}_{Query} = \sum_{i=1}^{|M|} p_i(key_i, Query) \times value_i \quad (2)$$

While typical attention mechanisms may allow the information items to be as fine-grained as possible, the complexity of the attention model itself grows considerably with improving the granularity of the memory set. Through our proposed Fine-grained Early Frequency Attention (FEFA) model, we tackle this issue by changing the source of the query to any hidden layer of the deep model. This is in contrast to other attention mechanisms where the last hidden layer is used as the source of the query. We also change the structure of the memory to contain the frequency bins provided by the spectrogram representation as the information items.

The spectrogram representation of the speech signal is the most commonly used feature set among deep learning models that exploit CNN architectures. While the number of frequency bins may vary between studies, the overall approaches to calculating and using spectrogram representations are quite similar. The spectrogram representation of an utterance is obtained by using the Short-time Fourier transform (STFT) (see Equation 3). The symbol x(t) denotes the signal amplitude at a given time t, W(t − τ) is the window function applied over the signal to enforce the time window of the STFT as well as to extract the phase information of the signal, and ω represents the frequency band around which the STFT is performed:

$$STFT\{x(t)\}(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, W(t - \tau)\, e^{-i\omega t}\, dt \quad (3)$$

After calculating the STFT of the signal for a given frequency bin ω, the squared magnitude of the result is then used as the spectrogram representation (see Equation 4). Since a typical fixed value is used for the time window in speech-related tasks, we can drop the variable τ at its default value from the formula to simplify the equation as follows:

$$Spec(x(t), \omega_i) = |STFT\{x(t)\}(\omega_i)|^2 \quad (4)$$

Figure 1: (a) Overview of the FEFA model. The model uses the spectrogram representation of the utterance as the memory set and the feature set associated with the early layers of the DNN model as the modality from which to extract the query. (b) The modules inside the FEFA model consist of a squeeze function and an MLP module.

The final spectrogram of the speech signal is obtained by repeating the process for a select number of frequency bands.
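As a concrete illustration of Equations 3 and 4, the sketch below computes a squared-magnitude spectrogram with PyTorch; the window type, window length, and hop size are illustrative assumptions rather than values specified in this paper.

```python
import torch

def spectrogram(x: torch.Tensor, n_fft: int = 512, hop: int = 160) -> torch.Tensor:
    """x: waveform of shape (batch, samples) -> (batch, n_fft // 2 + 1, frames)."""
    window = torch.hamming_window(n_fft)                 # W(t - tau) in Equation 3
    stft = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return stft.abs() ** 2                               # squared magnitude, Equation 4
```

With n_fft = 512 this yields 257 frequency bins per frame, matching the 257-bin spectrograms used later in the experiments section.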
The selection of frequency bands is given as a hyper-parameter in the form of a set of filters called a filterbank. Each value acquired by the function Spec(x(t), ω_i) represents the frequency information of the signal with regards to the filter ω_i at a given time within a time-window. With frequency bins as the building blocks of the spectrogram representation, every individual bin can be considered the smallest item carrying information. For the FEFA model we utilize the frequency bins as information items to serve as the memory component of the attention mechanism.

One of the main challenges that prevents attention mechanisms from increasing the granularity of their memory set is the source of the query. In typical attention mechanisms the query originates from the last hidden layer of the deep learning model. The complexity of such a model increases considerably with improvements in the granularity of the attention mechanism. With spectrograms serving as the memory component, the resulting complexity would make the model very hard to train and generalize.

Each layer of a given deep learning model operates over a feature set. The feature set associated with each hidden layer of the DNN is capable of serving as the target modality for extracting queries for the attention model. In the proposed FEFA model (illustrated in Figure 1 (a)), we utilize the hidden layers earlier in the DNN model as the new source of the query.

The internal architecture of the FEFA module is shown in Figure 1 (b). The spectrogram representation of the utterance is squeezed into a single vector using an average pooling operation. Then, an MLP module is utilized as the kernel of the proposed FEFA model to calculate the probability of each frequency bin in the enhanced spectral representation of the utterance (see Equation 5). Accordingly, the index of each frequency bin in the initial feature space (spectrogram representation) serves as the key for the information item:

$$p_i = \frac{\exp(index(Spec(x(t), \omega_i), F) \times W)}{\sum_{j=1}^{|M|} \exp(index(Spec(x(t), \omega_j), F) \times W)} \quad (5)$$

An attention map is then created by calculating the expected value of each frequency bin using the probability obtained through the MLP module (see Equation 6). The attention map acquired from the attention module is then multiplied by the original spectrogram representation of the utterance, resulting in an enhanced representation of the utterance to be used by the DNN:

$$AttentionMap = \sum_{i=1}^{|M|} p_i \times Spec(x(t), \omega_i) \quad (6)$$

The FEFA model does not require any pre-processing or feature extraction in addition to the STFT calculation. Hence the model is compatible with various deep learning architectures that use the spectrogram representation of utterances as input. Later on in the experiments section, we demonstrate that by adding the FEFA model to various architectures such as ResNet, VGG, and SEResNet, considerable performance improvements are achieved.

The memory and computational complexities of the FEFA model are respectively linear and quadratic with regards to the number of frequency bins (nfft) used in calculating the spectrogram representation (see Equations 7 and 8). Hence adding multiple layers of FEFA throughout the pipeline of the DNN does not increase the computational complexity of the model drastically.
$$Complexity(FEFA) \in \Theta(nfft^2) \quad (7)$$

$$Memory(FEFA) \in \Theta(nfft) \quad (8)$$

The memory set of the FEFA model is not limited to the spectrogram representation of the utterance. Considering the flexible mechanism of the attention module, the embeddings of each hidden layer of the DNN can be utilized as the memory set. In order to achieve this, the embeddings of the hidden layer are first passed through a channel-wise average pool to imitate a single-channel spectrogram image. The resulting matrix is then passed through the FEFA module following the same procedure. The employed channel-wise average pooling mechanism, along with the query extraction mechanism, enables the FEFA module to be used between any two layers throughout the DNN pipeline.
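A minimal sketch of this channel-wise squeeze is shown below, assuming the usual PyTorch layout of (batch, channels, frequency, time) for hidden feature maps:

```python
import torch

def squeeze_channels(features: torch.Tensor) -> torch.Tensor:
    # (batch, channels, freq, time) -> (batch, freq, time): average across
    # channels so the result imitates a single-channel spectrogram that a
    # FEFA module can process with the same procedure as the input spectrogram.
    return features.mean(dim=1)
```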
Experiments

The proposed FEFA model has been evaluated on the two tasks of SR and SER. The FEFA model can be used with different DNN architectures as backbone networks that take the spectrogram representation of the utterances as input. Hence, we have used a select number of prominent CNN architectures commonly used in these tasks as our benchmarks. In the following subsections we introduce the datasets used in our experiments, the implementation details regarding FEFA, as well as the details of the backbone networks onto which our attention mechanism is added.
Datasets

We utilize two widely used datasets for experiments in two different speech representation learning areas (SR and SER), namely VoxCeleb and IEMOCAP.
VoxCeleb:
For the SR task we perform our evaluations using the large and widely used in-the-wild VoxCeleb dataset (Chung, Nagrani, and Zisserman 2018). The VoxCeleb dataset includes voices from more than 6,000 individuals. The utterances are captured from uncontrolled conditions such as interviews published in open-source media. The VoxCeleb dataset is available in two versions: VoxCeleb1, which is more commonly used for evaluation, and VoxCeleb2, which is used solely for training purposes. In this experiment we follow the common practice and use the VoxCeleb2 dataset with nearly 1.2 million utterances for training our model and the VoxCeleb1 test set for evaluation.
IEMOCAP:
We also evaluate the FEFA model using the IEMOCAP dataset (Busso et al. 2008) for the task of SER. The IEMOCAP dataset is a multi-modal emotion recognition dataset including speech recordings, videos, and motion capture. The dataset contains 12 hours of prompted and improvised dialogue performed by 10 actors. The audio recordings of the dataset are divided into short utterances, each containing one sentence. Each utterance is then scored by several people to determine the category of emotion conveyed by the utterance. In our experiments we have selected the 4 emotion categories of Sadness, Happiness, Angry, and Neutral, for a total of 6 thousand utterances. The selection of these 4 emotion categories complies with the common practice of SER established by the majority of studies using this dataset.
Data Preparation:
For data preparation, we extract spectrogram representations of the utterances, resulting in spectrogram images of size 257 × T. We use 257 frequency bins to be able to better compare our results to the state-of-the-art in SR. We follow the same practice in the SER task to maintain consistency throughout the experiments.

FEFA Details:
We utilize a single-layer locally connected MLP as the kernel of our attention model. We chose a simple kernel to minimize the impact of the latent scores learnt by more complex networks, and instead focus on the impact of using early attention over frequency bands on speech representation learning. The number of nodes used in this kernel was set equal to the number of frequency bins in the spectral representations (257 in our case). The kernel is trained using the Adam optimizer and back-propagation.
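A minimal PyTorch sketch of this module is given below, following one reading of Figure 1 (b) and Equations 5 and 6: time-axis average pooling as the squeeze function and a single linear layer as the kernel. The use of a standard fully connected layer in place of the locally connected kernel, and the exact way the attention map re-weights the spectrogram, are our assumptions.

```python
import torch
import torch.nn as nn

class FEFA(nn.Module):
    """Fine-grained Early Frequency Attention: re-weights individual frequency bins."""

    def __init__(self, n_bins: int = 257):
        super().__init__()
        # Single-layer MLP kernel with one node per frequency bin (257 here).
        self.kernel = nn.Linear(n_bins, n_bins)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        """spec: (batch, freq_bins, time) -> enhanced spectrogram of the same shape."""
        squeezed = spec.mean(dim=-1)                         # squeeze: average pool over time
        probs = torch.softmax(self.kernel(squeezed), dim=-1) # p_i as in Equation 5
        attention_map = probs * squeezed                     # per-bin expected value (Equation 6)
        return spec * attention_map.unsqueeze(-1)            # boost/suppress each frequency bin
```

In the single-layer configuration this module would sit directly between the spectrogram and the first layer of the backbone; in the multi-layer configuration, the channel-wise squeeze sketched earlier would precede each additional FEFA instance.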
Backbone Networks:
In order to assess the impact of the FEFA model on different deep learning networks, we use two of the latest state-of-the-art models, which are based on VGGNet (Nagrani, Chung, and Zisserman 2017) and ResNet (Xie et al. 2019a). We have also implemented a novel thin-SEResNet model by combining the state-of-the-art ResNet model with the SE blocks proposed in (Hu, Shen, and Sun 2018). For each of the three backbone networks, three versions were implemented:
• the model without any FEFA enhancement;
• the model with one layer of FEFA enhancement;
• the model with multiple layers of FEFA enhancement, distributed among the layers of the DNN pipeline where the dimensions of the hidden representation change.

The first model used in our experiments is the VGG-based model proposed in (Nagrani, Chung, and Zisserman 2017), which consists of 5 convolution layers accompanied by 3 max-pooling layers. The utterance-level aggregation is done using global average pooling, and the final embedding is acquired using a fully connected layer with ReLU activation.

The second network that we use in this experiment is the ResNet-based model proposed in (Xie et al. 2019a). This model consists of 35 convolution layers used in the form of residual blocks. The shortcuts integrated in the residual blocks help the model convey the learning gradients throughout the pipeline of the model more easily, which in turn aids the model to learn faster and more efficiently. This also enables the model to provide better queries for the FEFA module. Complete details about the hyper-parameters and implementation can be found in (Xie et al. 2019a).

The final network used in our experiments is SEResNet. Similar to the ResNet model, the SEResNet consists of residual blocks. The formation of blocks and the number of parameters used in the SEResNet are similar to a ResNet with the addition of a Squeeze-and-Excitation (SE) module (Hu, Shen, and Sun 2018). The SE module uses a global pooling layer to extract channel information inside the residual blocks. The channel information is then projected onto a latent space using 2 fully connected layers, a ReLU activation function, and a Sigmoid activation function. The resulting representation is then multiplied across the channels of the identity block, as sketched below.
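For reference, a sketch of the SE module just described follows; the reduction ratio r is an assumption (16 is the default suggested in Hu, Shen, and Sun 2018).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling, two FC layers, channel re-scaling."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        s = x.mean(dim=(2, 3))                                 # global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # FC -> ReLU -> FC -> Sigmoid
        return x * s[:, :, None, None]                         # multiply across channels
```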
Training:
For training the backbone networks with the added FEFA, we used a recent technique to adjust the learning rate throughout the process. The cyclical learning rate proposed in (Smith 2017) helps the model achieve better convergence by changing the learning rate periodically, preventing the model from getting trapped in local minima.
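PyTorch ships a scheduler implementing this policy; the sketch below shows how it could be wired up, with the learning-rate bounds and cycle length as illustrative assumptions rather than the settings used in our experiments.

```python
import torch

model = torch.nn.Linear(257, 128)   # placeholder for a backbone network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=2000, mode="triangular")

# In the training loop, call scheduler.step() after each optimizer.step()
# so the learning rate sweeps cyclically between base_lr and max_lr.
```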
Results

For evaluating the networks in the SR domain, we use 2 commonly used metrics, namely equal error rate (EER) and identification accuracy (Acc). The EER is the error rate of the model at the threshold for which the number of false positive errors is equal to the number of false negative errors. Table 1 presents the results, as well as the performance gain achieved by using our proposed FEFA model. The first section of the table is dedicated to the typical attention models of self-attention and soft-attention. The results show that FEFA models outperform the typical attention models by a large margin.

Table 1: Speaker Recognition results. (∗ The result of identification accuracy was not published and is replicated using the trained models provided by the authors.)
| Model | FEFA Layers | EER (%) | ΔEER (%) | Acc. (%) | ΔAcc. (%) |
|---|---|---|---|---|---|
| ResNet + Self-Attention (Bian, Chen, and Xu 2019) | None | 5.4 | N/A | N/A | N/A |
| CNN (unspecified) + Soft-Attention (Okabe, Koshinaka, and Shinoda 2018) | None | 3.8 | N/A | N/A | N/A |
| VGG (Nagrani, Chung, and Zisserman 2017) | None | 7.8 | N/A | 80.5 | N/A |
| VGG + FEFA | Single-layer | 7.4 | +5.1 | 84.7 | +5.2 |
| VGG + FEFA | Multi-layer | 7.6 | +2.5 | 82.4 | +2.3 |
| ResNet34 (Chung, Nagrani, and Zisserman 2018) | None | 4.83 | N/A | N/A | N/A |
| ResNet50 (Chung, Nagrani, and Zisserman 2018) | None | 3.95 | N/A | N/A | N/A |
| Thin-ResNet + GhostVLAD (Xie et al. 2019a) | None | 3.22 | N/A | 86.5∗ | N/A |
| Thin-ResNet + FEFA | Single-layer | 3.12 | +3.1 | 93.6 | +8.2 |
| Thin-ResNet + FEFA | Multi-layer | 3.18 | +1.2 | 91.7 | +6.0 |
| SE-ResNet | None | 4.81 | N/A | 90.5 | N/A |
| SE-ResNet + FEFA | Single-layer | 3.68 | +19.0 | 93.8 | +3.6 |
| SE-ResNet + FEFA | Multi-layer | 4.58 | +4.7 | 91.5 | +1.1 |
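As a side note on the EER metric reported in Table 1, the sketch below shows a standard recipe for computing it from verification scores using scikit-learn (this is illustrative, not code from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 for same-speaker trials, 0 otherwise; scores: similarity scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2        # EER as the error rate at that threshold
```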
Table 2: Speech Emotion Recognition results.
| Model | FEFA Layers | Acc. (%) | ΔAcc. (%) |
|---|---|---|---|
| Thin-ResNet | None | 59.72 | N/A |
| Thin-ResNet + FEFA | Single-layer | 62.32 | +4.35 |
| Thin-ResNet + FEFA | Multi-layer | 61.57 | +3.09 |
| VGG | None | 52.48 | N/A |
| VGG + FEFA | Single-layer | 56.70 | +8.21 |
| VGG + FEFA | Multi-layer | 55.36 | +5.48 |
| SE-ResNet | None | 59.82 | N/A |
| SE-ResNet + FEFA | Single-layer | 62.28 | +4.11 |
| SE-ResNet + FEFA | Multi-layer | 61.63 | +3.02 |
The FEFA model also surpasses the state-of-the-art values, improving the performance by 3.1% on EER and 6.0% on Acc., achieving a new state-of-the-art for SR. For the backbone models of VGG and Thin-ResNet, we refer to the reported values in the reference papers. As there are no published studies of SEResNet architectures for SR, the implementation and evaluation of this backbone model is done within the scope of this experiment. The results of all the backbone models show the positive impact of the FEFA module. The consistent improvement of all the backbone networks also demonstrates the compatibility of the FEFA module with different architectures.

As discussed, to evaluate the generalizability of our FEFA model, we also perform experiments on SER. In these experiments, we employ the commonly used classification accuracy as the evaluation metric. To comply with the common practice in using the IEMOCAP speech emotion dataset, we perform k-fold cross-validation for evaluating our models. Given that in many recent works for emotion recognition from speech, VGG- and ResNet-based architectures have frequently been used for speech representation learning (Yenigalla et al. 2018; Kim et al. 2017), we utilize the same backbone networks for evaluating the impact of our proposed FEFA approach. This also enables us to compare our results to the SR task performed earlier and provide a more consistent analysis of the results. Table 2 shows the results of evaluating the FEFA model for the task of SER. It is evident from the results that adding the FEFA module has a very positive impact on the performance of the backbone models in predicting emotion classes.

Interestingly, while using multiple layers of the FEFA module in the deep models considerably improves the performance compared to the plain backbone networks (no FEFA added), the improvement is consistently less prominent than using a single layer of FEFA. A possible reason could be that due to the 2D shape of convolutional kernels, some features from the time axis are convolved with the frequency axis. Given that temporal information has already been considered while performing the average pooling inside the FEFA module, including these features in the frequency axis may have reduced the contribution of the frequency information in the final attention map.
Robustness to Noise

Given the inherent function of FEFA to focus on the most salient frequency bins prior to processing by the model, we anticipate that DNN+FEFA architectures will be more robust to noise compared to their backbone DNN counterparts. In order to test this hypothesis, we evaluate the performance of our solution against different levels of noise. To do so, a controlled level of synthetic noise is added to the test utterances for the speaker recognition task. The model with the best performance (the previous state-of-the-art), Thin-ResNet + GhostVLAD, is selected and tested with the noisy utterances with and without FEFA.

The added noise is sampled from Gaussian (Figure 2 (a)) and uniform (Figure 2 (b)) distributions. The second column in Figure 2 depicts the effect of noise added to the spectral representation of utterances, comparing it to the clean utterance. The last column depicts the areas attended to by the FEFA model. The highlighted areas in the spectral representation show the individual frequency bins with the highest contribution. Hence, by focusing on these frequency bins, the FEFA model decreases the effect of artifact noise in other areas of the spectrogram on the final learned representation.

Figure 2: Robustness test by adding synthetic noise to the utterances. (a) Noise sampled from a Gaussian distribution. (b) Noise sampled from a uniform distribution.

The results of this test are reported in Table 3. The test is performed using synthetic noise resulting in 3 signal-to-noise ratios (SNR) of 20dB, 50dB, and 100dB. As shown by the results, while the performance of the backbone network is considerably affected by the added noise, the model with the FEFA mechanism stays relatively more stable.
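A sketch of how such noisy test utterances could be generated is given below; the exact noise-scaling procedure is not specified in the paper, so the SNR-based scaling here is an assumption.

```python
import numpy as np

def add_noise(signal: np.ndarray, snr_db: float, dist: str = "gaussian",
              seed: int = 0) -> np.ndarray:
    """Add Gaussian or uniform noise to a waveform at a target SNR (in dB)."""
    rng = np.random.default_rng(seed)
    if dist == "gaussian":
        noise = rng.standard_normal(signal.shape)
    else:
        noise = rng.uniform(-1.0, 1.0, signal.shape)
    # Scale the noise so that 10 * log10(P_signal / P_noise) equals snr_db.
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```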
Discussion

As shown in the first and second rows of Table 1, CNN models with general forms of attention such as self-attention (Bian, Chen, and Xu 2019) and soft-attention (Okabe, Koshinaka, and Shinoda 2018) do not perform as well as our FEFA model integrated into similar backbone networks. Our approach shows a clear performance enhancement over the classical attention mechanisms, as such attention models attend to parts of the latent representation that correspond to large areas in the input utterance spectrogram. Hence they fail to focus on very small frequency-level features that are often crucial in speech-related tasks.

While a number of attempts have been made to achieve different levels of granularity with attention mechanisms, existing attention models do not achieve a fine-grained solution. The area attention model proposed in (Li et al. 2019) achieves varying degrees of granularity by creating different combinations of neighboring information items. However, the information items used in the combinations are embeddings already extracted by the DNN model, limiting the level of granularity to the resolution of the latent representation achieved by the DNN.

Another attempt at a frequency-based attention model was proposed by (Yadav and Rai 2020). Their attention model, adopted from image recognition, can utilize any hidden layer of a deep network as the source of the query. However, the attention model proposed in their work uses the latent representations obtained from different layers of the CNN as the memory set. This rules out the possibility of a localized attention map with respect to the input. Their model also uses a shared-weight CNN layer as the kernel of the attention model. In this approach, and others that similarly employ CNN layers for the kernel of the attention model, information items go through non-linear operations, preventing the model from maintaining a one-to-one relation between the attention map and the information items. Therefore, while this approach may be successful for some applications, it fails in others where the contribution of each separate information item is important.

Table 3: Robustness test results for the SR task. The comparison is performed with the state-of-the-art model GhostVLAD (Xie et al. 2019a) with and without FEFA.
| Noise | Model | SNR | EER (%) | ΔEER (%) |
|---|---|---|---|---|
| Normal | w/o FEFA | 20dB | 3.40 | -5.5 |
| Normal | w/o FEFA | 50dB | 3.85 | -19.5 |
| Normal | w/o FEFA | 100dB | 4.82 | -49.6 |
| Normal | + FEFA | 20dB | 3.12 | 0 |
| Normal | + FEFA | 50dB | 3.15 | -0.9 |
| Normal | + FEFA | 100dB | 3.44 | -10.2 |
| Uniform | w/o FEFA | 20dB | 3.32 | -3.1 |
| Uniform | w/o FEFA | 50dB | 3.48 | -8.0 |
| Uniform | w/o FEFA | 100dB | 3.96 | -22.9 |
| Uniform | + FEFA | 20dB | 3.12 | 0 |
| Uniform | + FEFA | 50dB | 3.14 | -0.6 |
| Uniform | + FEFA | 100dB | 3.41 | -9.4 |

Generally, the intuition behind many attention models (in speech-related tasks or otherwise) is to focus on different parts of some latent representation of the input to inform better classification. In these models, the representations are generally learned irrespective of important known information items in the input. Speech depends on frequency content to convey information. In fact, humans have evolved to understand different facts about the source of speech (e.g. identity, intent, emotions, etc.) based on factors such as tone, pitch, and others (Hansen and Hasan 2015). By learning to exploit specific frequency bins in the input that may contain effective task-related information, DNNs can learn to pay more attention to those particular bins to achieve better performance.
Conclusion

In this paper, a novel attention mechanism is proposed that allows deep learning models to focus on fine-grained information items, namely frequency bins, without increasing the complexity of the model. The proposed FEFA model uses the spectrogram representation of the utterance as the input and provides a better representation of the spectrogram by attending to each frequency bin individually. We evaluated our attention mechanism on the two tasks of speaker recognition and speech emotion recognition. The comparison between models enhanced by FEFA and the original backbone networks shows consistent improvement in the performance of deep learning models in both tasks.

Our analysis shows that using multiple layers of the FEFA module does not have as much positive impact as a single layer. A possible future route is to study the factors contributing to this effect. The intended outcome of such a study would be to design a solution that benefits from both the frequency-axis and time-axis features of the latent layers.

The current version of the FEFA model utilizes simple average pooling mechanisms and an MLP as the internal components of the attention mechanism. Another possible future route is to improve the internal architecture of the FEFA module using more complex neural networks and different temporal pooling operations.

References
Albanie, S.; Nagrani, A.; Vedaldi, A.; and Zisserman, A. 2018. Emotion recognition in speech using cross-modal transfer in the wild. In The 26th ACM International Conference on Multimedia.
Bhattacharya, G.; Alam, M. J.; and Kenny, P. 2017. Deep Speaker Embeddings for Short-Duration Speaker Verification. In INTERSPEECH.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Smith, L. N. 2017. Cyclical learning rates for training neural networks. In IEEE Winter Conference on Applications of Computer Vision (WACV).