A novel policy for pre-trained Deep Reinforcement Learning for Speech Emotion Recognition
Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Björn W. Schuller, Jiajun Liu
Thejan Rajapakshe∗†, Rajib Rana†, Sara Khalifa‡, Björn W. Schuller§‖, Jiajun Liu‡
† University of Southern Queensland, Australia
‡ Distributed Sensing Systems Group, Data61, CSIRO, Australia
§ GLAM – Group on Language, Audio & Music, Imperial College London, UK
‖ ZD.B Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany
∗ [email protected]
Abstract – Reinforcement Learning (RL) is a semi-supervised learning paradigm where an agent learns by interacting with an environment. Deep learning in combination with RL provides an efficient method to learn how to interact with the environment, called Deep Reinforcement Learning (deep RL). Deep RL has gained tremendous success in gaming, such as AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Deep RL used for SER can potentially improve the performance of an automated call centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is no current RL policy tailored for SER. In addition, an extended learning period is a general challenge for deep RL, which can impact the speed of learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored for SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to discover the feasibility of pre-training the RL agent with a similar dataset in a scenario where real environmental data is not available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the problem being to recognise the four emotions happy, sad, angry, and neutral in the utterances provided. The experimental results show that the proposed "Zeta policy" performs better than existing policies. They also support that pre-training can reduce the training time and is robust to a cross-corpus scenario.
Index Terms – Machine Learning, Deep Reinforcement Learning, Speech Emotion Recognition
INTRODUCTION
Reinforcement Learning (RL), a semi-supervised machine learning technique, allows an agent to take actions and interact with an environment to maximise the total rewards. RL's popularity mainly stems from its super-human performance in solving some games like AlphaGo [1] and AlphaStar [2]. RL has also been employed for audio-based applications, showing its potential for audio enhancement [3], [4], automatic speech recognition [5], and spoken dialogue systems [6], [7]. The potential of using RL for speech emotion recognition has recently been demonstrated for robot applications where the robot can detect an unsafe situation earlier given a human utterance [8]. An emotion detection agent was trained to achieve an accuracy-latency trade-off by punishing wrong classifications as well as too late predictions through the reward function. Motivated by this recent study, we employ RL for speech emotion recognition, where a potential application of the proposed system could be an intelligent call centre agent, learning over time how to communicate with human customers in an emotionally intelligent way. We consider the speech utterance as the "state" and the classified emotion as the "action". A correct classification yields a positive reward, and an incorrect one a negative reward.

RL widely uses Q-learning, a simple yet quite powerful algorithm that creates a state-action mapping, namely a Q-table, for the agent. This, however, is intractable for a large or continuous state and/or action space. First, the amount of memory required to save and update that table would increase as the number of states increases. Second, the amount of time required to explore each state to create the required Q-table would be unrealistic. To address these issues, the deep Q-learning algorithm uses a neural network to approximate a Q-value function. Deep Q-learning is an important algorithm enabling deep Reinforcement Learning (deep RL) [9].

The standard deep Q-learning algorithm employs a stochastic exploration strategy called ε-greedy, which follows a greedy policy according to the current Q-value estimate and chooses a random action with probability ε. Since the application of RL in speech emotion recognition (SER) is mostly unexplored, there is not enough evidence whether the ε-greedy policy is best suited for SER. In this article, we investigate the feasibility of a tailored policy for SER. We propose a Zeta-policy (code: https://github.com/jayaneetha/Zeta-Policy) and provide analysis supporting its superior performance compared to ε-greedy and some other popular policies.

A major challenge of deep RL is that it often requires
a prohibitively large amount of training time and data to reach a reasonable accuracy, making it inapplicable in real-world settings [10]. Leveraging humans to provide demonstrations (known as learning from demonstration, LfD) in RL has recently gained traction as a possible way of speeding up deep RL [11], [12], [13]. In LfD, the actions demonstrated by the human are considered the ground truth labels for a given input game/image frame. An agent closely simulates the demonstrator's policy at the start and, later on, learns to surpass the demonstrator [10]. However, LfD holds a distinct challenge, in the sense that it often requires the agent to acquire skills from only a few demonstrations and interactions due to the time and expense of acquiring them [14]. Therefore, LfDs are generally not scalable, especially for high-dimensional problems.

We propose the technique of pre-training the underlying deep neural networks to speed up training in deep RL. It enables the RL agent to learn better features, leading to better performance without changing the policy learning strategies [10]. In supervised methods, pre-training helps regularisation and enables faster convergence compared to randomly initialised networks [15]. Various studies (e.g., [16], [17]) have explored pre-training in speech recognition and achieved improved results. However, pre-training in deep RL is hardly explored in the area of speech emotion recognition. In this paper, we present an analysis showing that pre-training can reduce the training time. In our envisioned scenario, the agent might be trained with one corpus but might need to interact with other corpora. To test the performance in those scenarios, we also analyse performance for cross-corpus pre-training.

RELATED WORK AND BACKGROUND
Reinforcement learning has not been widely explored for speech emotion recognition. The closest match to our work is EmoRL, where the authors use deep RL to determine the best position to split an utterance before sending it to the emotion recognition model [8]. The RL agent decides on an "action": whether to wait for more audio data or to terminate and trigger prediction. Once the terminate action is selected, the agent stops processing the audio stream and starts classifying the emotion. The authors aim to achieve a trade-off between accuracy and latency by penalising wrong classifications as well as delayed predictions through rewards. In contrast, our focus is on developing a new policy tailored for SER and on applying pre-training to achieve a faster learning rate.

Another related study used RL to develop a music recommendation system based on the mood of the listener. The users select their current mood by selecting "pleasure" and "energy"; then, the application selects a song from its repository. The users can provide feedback on the recommended audio by answering the question "Does this audio match the mood you set?" [18]. Here, the key focus is to learn the mapping of a song to the selected mood; in this article, in contrast, we focus on the automatic determination of the emotion. Yu and Yang introduced an emotion-based target reward function [19], which again did not have a focus on SER.

A group of studies used RL for audio enhancement. Shen et al. showed that using RL to enhance audio utterances can reduce the testing-phase character error rate of automatic speech recognition with noise by about 7 % [3]. Fakoor et al. also used RL for speech enhancement, considering the noise suppression as a black box and only taking the feedback from the output as the reward; they achieved an increase of 42 % in the signal-to-noise ratio of the output [4].

On the topic of proposing new policies, Lagoudakis and Parr used a modification of approximate policy iteration for pendulum balancing and bicycle riding domains [20]. We propose a policy that is tailored for SER. Many studies can be found in the literature using deep learning to capture emotion and related features from the speech signal; however, none of them focused on RL [21], [22], [23].

An important aspect of our study is incorporating pre-training in RL to achieve a faster learning rate in SER. de la Cruz et al. used human demonstrations of playing Atari games to pre-train a deep RL agent and improve its training time [10]. Hester et al., in their introduction of Deep Q-Learning from Demonstrations, presented prior recorded demonstrations to a deep RL system playing 42 games; 41 of them showed improved performance and 11 of them achieved state-of-the-art performance [24]. An evident advance in performance and accuracy is observed in the results of the case study with the Speech Commands Dataset. None of these studies focuses on using pre-training in RL for SER, which is our focus.
An RL architecture mainly consists of two major components, the "Environment" and the "Agent", with three major signals passing between them: the current state, the reward, and the next state. An agent interacts with an unknown environment and observes a state s_t ∈ S at every time step t ∈ [0, T], where S is the state space and T is the terminal time. The agent selects an action a ∈ A, where A is the action space; the environment then changes to s_{t+1}, and the agent receives a scalar reward r_{t+1}, which represents feedback on how good the selected action was for the environment. The agent learns a policy π that maps actions a to given states s. The objective of RL is to learn the optimal policy π*, which maximises the cumulative reward by mapping the most suitable action to each state s.
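To make this notation concrete, the following toy sketch (our illustration, not the paper's environment) mirrors the SER setting used later, where an action matching the state's label earns a positive reward:

```python
import random

# Toy environment for the notation above (illustrative only): each step
# presents a state, and the action that matches the state earns +1,
# any other action earns -1.
class ToyEnv:
    def __init__(self, n_states=4, n_actions=4):
        self.n_states, self.n_actions = n_states, n_actions
    def reset(self):
        self.state = random.randrange(self.n_states)   # s_0
        return self.state
    def step(self, action):
        reward = 1 if action == self.state else -1     # r_{t+1}
        self.state = random.randrange(self.n_states)   # s_{t+1}
        return self.state, reward, False               # no terminal state here

env = ToyEnv()
state = env.reset()
total_reward = 0
for t in range(100):                                   # t in [0, T]
    action = random.randrange(env.n_actions)           # placeholder policy pi
    state, reward, done = env.step(action)
    total_reward += reward                             # cumulative reward to maximise
```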
Before introducing Q-Learning, an introduction to a number of key terms is necessary.

Model-Free and Model-Based Policy:
In model-based algorithms, the model learns the transition probability from the current state to the next state for a given action. Although this is straightforward, model-based algorithms become impractical as the state space and action space grow. In contrast, model-free algorithms rely on trial and error to update their knowledge and do not require storing all the combinations of states and actions.
On-Policy learning and Off-policy learning:
These two modes can be described in terms of a target policy and a behaviour policy. The target policy is the policy that an agent tries to learn by learning a value function. The behaviour policy is used by an agent for action selection, or in other words, to interact with the environment. In on-policy learning, the target and behaviour policies are the same, but they differ in off-policy learning. Off-policy learning enables continuous exploration, resulting in learning an optimal policy, whereas on-policy learning can only offer learning a sub-optimal policy.

In this paper, we consider the widely used Q-Learning, which represents model-free off-policy learning. Q-Learning maintains a Q-table, which contains Q-values for each state and action combination; Table 1 shows its fundamental structure. This Q-table is updated each time the agent receives a reward from the environment. Q-values from the Q-table are used for the action selection process: a policy takes Q-values as input and outputs the action to be selected by the agent. Some such policies are the Epsilon-Greedy policy [25], the Boltzmann Q Policy, the Max-Boltzmann Q Policy [26], and the Boltzmann-Gumbel exploration policy [27].

TABLE 1
Fundamental structure of the Q-table

state    action    Q(state, action)
...      ...       ...

The main disadvantage of Q-Learning is that the state space must be discrete: the Q-table cannot store Q-values for a continuous state space. Another disadvantage is that the Q-table grows with increasing state space and, at a certain point, becomes unmanageable. This is known as the "curse of dimensionality" [26], and the complexity of evaluating a policy scales with O(n), where n is the number of states in a problem. Deep Q-Learning offers a solution to these challenges.

A neural network can be used to approximate the Q-values based on the state as input,

Q = neural_network.predict(state),   (1)

which is more tractable than storing every possible state in a table like Table 1. When the problems and states become more complex, the neural network may need to be "deep", meaning a few hidden layers may not suffice to capture all the intricate details of that knowledge, hence the use of Deep Neural Networks (DNNs). Studies have incorporated DNNs into Q-Learning by replacing the Q-table with a DNN, which is known as deep Q-Learning; the involvement of deep learning in reinforcement learning is now known as Deep Reinforcement Learning.

The deep Q-Learning process uses two neural network models: the inferring model and the target network. These networks have the same architecture but different weights. Every N steps, the weights from the inferring network are copied to the target network. Using both of these networks leads to more stability in the learning process and helps the algorithm learn more effectively.

Since the RL agent learns only by interacting with the environment and the gained reward, it needs a set of experiences to start training from the experience replay. A parameter nb_warm_up defines the number of warm-up steps performed before initiating RL training. During this period, actions are selected totally at random for a given state, and no Q-value is involved. The state, action, reward, and next state are stored in the memory buffer and are used to sample the experience replay.
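As a rough sketch (a toy illustration, not the authors' implementation), the warm-up phase amounts to filling a replay buffer under a purely random policy before any Q-value is consulted:

```python
import random
from collections import deque

# Sketch of the warm-up phase described above. `ToyEnv` is the toy
# environment sketched earlier; nb_warm_up = 1000 and the buffer size
# are illustrative values for the hyper-parameters.
env = ToyEnv()
nb_warm_up = 1000
memory = deque(maxlen=50_000)              # experience replay buffer

state = env.reset()
for step in range(nb_warm_up):
    action = random.randrange(env.n_actions)           # random, no Q-values
    next_state, reward, done = env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

# RL training then samples mini-batches from this memory (experience replay):
batch = random.sample(memory, k=32)
```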
Q-Learning policies are responsible for deciding the best action to select based on the Q-values as their input. Exploration and exploitation are used to improve the experience of the experience replay by involving randomness in the action selection. The Epsilon-Greedy policy, the Max-Boltzmann Q-Policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy are some of the Q-Learning policies used today [28], [29].

The Epsilon-Greedy policy adopts a greedy algorithm to select an action from the Q-values. The ε value of this policy determines the exploitation and exploration ratio: 1 − ε is the probability of choosing exploitation in action selection. A random action is drawn from a uniform random distribution for exploration, and the action with the maximum Q-value is selected for exploitation. The Linear Annealed wrapper on the Epsilon-Greedy policy changes the ε value of the Epsilon-Greedy policy at each step: given an ε value range and a number of steps as parameters, the wrapper linearly updates the ε value at each step.

The Max-Boltzmann policy also uses ε as a parameter, considered when determining exploitation and exploration. Exploration in the Max-Boltzmann policy is similar to the Epsilon-Greedy policy. For exploitation, instead of selecting the action with the maximum Q-value as in the Epsilon-Greedy policy, the Max-Boltzmann policy randomly selects an action from a distribution shaped like the Q-values. This introduces more randomness while still using the Q-values in the action selection process.

Emotion recognition from speech has gained a fair amount of attention over the past years among machine learning researchers, and many studies have been carried out to improve the performance of SER in both the feature extraction and emotion classification stages [30]. Hidden Markov Models, Support Vector Machines, Gaussian Mixture Models, and Artificial Neural Networks are some of the classifiers used for SER in the literature [31], [32], [33], [34]. With the tremendous success of DNN architectures, numerous studies have successfully used them and achieved good performance, e.g., [35], [36], [37], [38].

Using DNNs, supervised and guided unsupervised/self-supervised techniques are being dominantly developed for SER; however, there is still a gap in the literature for dynamically updating SER systems. Although some studies, e.g., Learn++ [39] and Bagging++ [40], use incremental/online learning, there is a major difference between online learning and RL. Online learning is usually used for a constant stream of data, where an example is discarded after being used once. RL, in contrast, constitutes a series of state-action pairs that draw either a positive or a negative reward: if a positive reward is drawn, the entire string of actions leading up to that reward is reinforced, but if a negative reward is drawn, the actions are penalised.
Fig. 1. Flow of the Zeta Policy and the connection with related components in the RL architecture.
METHODOLOGY
Figure 1 shows our proposed architecture and process flow. It focuses predominantly on the process flow of the "Zeta Policy", while also showing the interconnection between the inferring and target models and the interaction between the agent and the environment through actions and rewards. Throughout this section, we gradually elaborate on the different components.

The inferring model is used to approximate the Q-values used in action selection for a given state from the environment. After the agent receives a reward signal for an action executed in the environment, the attributes (state, reward, action, next state, and terminal signal) are stored in the experience memory (History DB in Figure 1). Every N_update steps, a batch of samples from the experience memory is used to update the parameters of the inferring model. The target model is used to approximate the Q_target values; the usage of the Q_target value is explained in Section 3.4.4. The parameters of the target model are updated every N_copy steps. N_update and N_copy are hyper-parameters.

A novel RL policy, the "Zeta Policy", is introduced in this study, taking inspiration from the Boltzmann Q Policy and the Max-Boltzmann Q Policy. This policy uses the current step (cs) of the RL training cycle (see Figure 1) to decide how the action selection is performed. Figure 1 shows how the action selection process is performed by the Zeta Policy and its connection to the other components in the RL architecture. Zeta_nb_step (ζ) is a hyper-parameter of the policy that routes the action selection process. If cs < ζ, the policy follows an exploitation and exploration process, where exploitation selects the action with the highest Q-value and exploration selects a random action from a discrete uniform random distribution, to include uniform randomness in the experiences. The parameter ε is compared with a random number n drawn from a uniform distribution to determine the exploration or exploitation route. If cs > ζ, a random action is picked from a distribution similar to the Q-value distribution. Experiments were carried out to find the effect of the parameters ζ and ε on the performance of RL.

In the SER context, a state s is defined as an utterance in the dataset, and an action a is the label (emotion) classified for the state s. A reward r is returned by the environment after comparing the ground truth with the action a. If the action (classified emotion) and the ground truth are similar, i.e., the agent has inferred the correct label for the given state, the reward is positive; otherwise, the reward is negative. The reward is accumulated throughout the episode, and the mean reward is calculated. The higher the mean reward, the better the performance of the RL agent. The standard deviation of the reward is also calculated, since this value indicates how robust the RL predictions are.

Pre-training allows the RL agent's inferring DNN to optimise its parameters and learn the features required for a similar problem. To use a pre-trained DNN in RL, we replace the softmax layer with a Q-value output layer. As the inferring model is optimised with learnt features, there is no necessity for a warm-up period to collect the experience replay. In fact, extended training time is a key shortcoming of RL [10]. One key contribution of the present paper is to use pre-training to reduce the training time by reducing the warm-up period.
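Returning to the Zeta Policy's action-selection rule described above, it can be sketched as follows. This is a minimal reading of the description (the released code linked in the introduction is authoritative); in particular, the softmax used to turn the Q-values into "a distribution similar to the Q-value distribution" is an assumption:

```python
import numpy as np

def zeta_policy_select(q_values, current_step, zeta_nb_step, eps,
                       rng=np.random.default_rng()):
    """Sketch of Zeta-policy action selection as described above.

    q_values      -- Q-value estimates from the inferring model
    current_step  -- cs, the current step of the RL training cycle
    zeta_nb_step  -- zeta, the routing hyper-parameter
    eps           -- probability of exploration while cs < zeta
    """
    q_values = np.asarray(q_values, dtype=float)
    if current_step < zeta_nb_step:
        if rng.random() < eps:                          # exploration
            return int(rng.integers(len(q_values)))    # uniform random action
        return int(np.argmax(q_values))                 # exploitation: highest Q-value
    # After zeta steps: pick from a distribution shaped like the Q-values.
    # A softmax is one way to read this; the exact transform is an assumption.
    probs = np.exp(q_values - np.max(q_values))
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```

With a small ε and a large ζ (e.g., ε = 0.0125 and ζ = 500 000, values explored in the evaluation), the policy behaves almost greedily early in training and switches to Boltzmann-style sampling afterwards.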
In this section, we explain the experimental setup, including feature extraction and model configuration. Figure 2 shows the experimental setup architecture of the Environment, Agent, Deep Neural Network, and Policy used in this study.

Fig. 2. Deep Reinforcement Learning Architecture

TABLE 2
Distribution of utterances in the two considered datasets by emotion

Emotion    IEMOCAP    SAVEE
Happy      895        60
Sad        1200       60
Angry      1200       60
Neutral    1200       120
This study uses two popular datasets in SER: IEMOCAP [41] and SAVEE [42]. The IEMOCAP dataset features five sessions; each session includes speech segments from two speakers and is labelled with nine emotional categories. However, we use happiness, sadness, anger, and neutral for consistency with the literature. The dataset was collected from ten speakers (five male and five female). We took a maximum of 1 200 segments from each emotion for an equal distribution of emotions within the dataset.

Note that the SAVEE dataset is relatively small compared to IEMOCAP. It was collected from four male speakers and has eight emotion labels, which we filtered, keeping the happiness, sadness, anger, and neutral segments for alignment with IEMOCAP and the literature.

Table 2 shows the distribution of utterances over the two datasets, with 30 % of each dataset being used as a subset for pre-training and the rest being used for the RL execution. 20 % of the RL execution data subset is used in the testing phase. The RL testing phase executes the whole RL pipeline but does not update the model parameters. RL, being a different machine learning paradigm than supervised learning, does not need a testing dataset, as it continuously interacts with an environment. However, since this specific study addresses a classification problem, we included a testing phase by providing a dataset that the RL agent has not seen before.

We remove the silent sections of the audio segments and process only the initial two seconds of the audio. For segments less than two seconds in length, we use zero padding. We use two seconds as the segment length because, if the segment length were increased, there would be more zero-padded segments, and the padding would become an identifiable feature within the dataset during model training.
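A sketch of this preprocessing step follows; the sampling rate and the silence threshold are assumptions, as the paper does not state them:

```python
import librosa
import numpy as np

def preprocess(path, sr=22050, seconds=2.0):
    """Load an utterance, trim silence, and fix its length to two seconds.

    The sampling rate and trim threshold are assumed values. Trimming here
    removes leading/trailing silence only.
    """
    y, sr = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y)              # strip silent edges
    target = int(sr * seconds)
    if len(y) < target:
        y = np.pad(y, (0, target - len(y)))     # zero-pad short segments
    return y[:target]                           # keep the initial 2 s
```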
We use Mel Frequency Cepstral Coefficients (MFCCs) to represent the speech utterances. MFCCs are widely used in speech audio analysis [35], [43]. We extract 40 MFCCs from the Mel-spectrograms with a frame length of 2 048 and a hop length of 512 using Librosa [44].
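With these parameters, the extraction reduces to a single Librosa call. This is a sketch: the sampling rate and the file name are assumptions, but at 22 050 Hz a two-second clip yields exactly the 87 frames used by the model described next:

```python
import librosa

# 40 MFCCs with frame length 2048 and hop length 512, as stated above.
# At an assumed sampling rate of 22 050 Hz, a two-second clip gives
# 87 frames, matching the 40 x 87 input shape of the inferring model.
y = preprocess("utterance.wav")    # preprocessing sketch from above; path is hypothetical
mfcc = librosa.feature.mfcc(y=y, sr=22050, n_mfcc=40,
                            n_fft=2048, hop_length=512)
print(mfcc.shape)                  # (40, 87)
```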
A supervised learning approach is followed to identify the best-suited DNN model as the inferring model. Even though the last layer of a supervised learning model outputs a probability vector, RL also learns the representations in the hidden layers by a similar mechanism. The DNN architecture of the supervised learning model is similar to the DNN architecture of the inferring model except for the output layer: the activation function of the output layer is Softmax in supervised learning, whereas it is a linear activation in RL. Different architectures containing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs), and dense layers were evaluated with a similar dataset, and the DNN model architecture with the highest testing accuracy was selected as the inferring DNN architecture.

We use the popular deep learning API Keras [45] with TensorFlow [46] as the backend for modelling and training purposes in this study. We model the RL agent's DNN with a combination of CNNs and an LSTM. The use of a combined CNN-LSTM model is motivated by the ability to learn temporal and frequency components in the speech signal [47]. We stack the LSTM on a CNN layer and pass it on to a set of fully connected layers to learn discriminative features [35].

As shown in the Deep Neural Network component in Figure 2, the initial layers of the inferring DNN model are two 2D-convolutional layers with filter sizes 5 and 3, respectively. Batch normalisation is applied after the first 2D convolutional layer. The output of the second 2D convolutional layer is passed on to an LSTM layer of 16 cells, then to a fully connected layer of 265 units. A dropout layer with rate 0.3 is applied before the dense layer, which outputs the Q-values. The number of outputs of the dense layer is equal to the number of actions, which is the number of classes of the classification. The activation function of the last layer is kept as a linear function, since the output Q-values should not be normalised. The input shape of the model is 40 × 87, where 40 is the number of MFCCs and 87 is the number of frames in the MFCC spectrum. An Adam optimiser with a learning rate of 0.00025 is used when compiling the model.

The output layer of the neural network model returns an array of Q-values corresponding to each action for a given state as input. All the Q-learning policies used in this paper (Zeta, Epsilon-Greedy, Max-Boltzmann Q Policy, and the Linear Annealed wrapper on Epsilon-Greedy) then use these output Q-values for the corresponding action selection. We set the number of steps between inferring-model updates (N_update) to 4 and the number of steps between parameter copies from the inferring model to the target model (N_copy) to 10 000.

Supervised learning is used to pre-train the inferring model. A DNN with the same architecture as the inferring model, but with a Softmax activation function at the output layer, is trained with the pre-training data subset before starting the RL execution. The model is trained for 64 epochs with a batch size of 128. Once the pre-training is completed, the parameters of the pre-trained DNN model are copied to the inferring and target models in the RL agent.
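Under this description, the inferring model can be sketched in Keras as follows. The filter counts (32, 64), the max-pooling placement (the Figure 2 label mentions "Convo 2d + Max Pooling"), the CNN-to-LSTM reshape, and the mean-squared-error loss are assumptions not stated in the text; the kernel sizes and the remaining layers follow the paper:

```python
from tensorflow.keras import layers, models, optimizers

# Sketch of the inferring DNN described above (assumed details flagged inline).
n_mfcc, n_frames, n_actions = 40, 87, 4

inputs = layers.Input(shape=(n_mfcc, n_frames, 1))
x = layers.Conv2D(32, kernel_size=5, activation="relu")(inputs)  # filter size 5
x = layers.BatchNormalization()(x)            # after the first conv layer
x = layers.MaxPooling2D(pool_size=2)(x)       # pooling placement assumed
x = layers.Conv2D(64, kernel_size=3, activation="relu")(x)       # filter size 3
# Collapse the spatial output into a sequence for the LSTM (assumed reshape).
x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
x = layers.LSTM(16)(x)                        # LSTM layer of 16 cells
x = layers.Dense(265, activation="relu")(x)   # fully connected, 265 units
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(n_actions, activation="linear")(x)  # unnormalised Q-values

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(learning_rate=0.00025), loss="mse")
```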
Fig. 3. Comparison across the policies Zeta policy (Zeta), Epsilon-Greedy policy (EpsG.), Max-Boltzmann Q-policy (MaxB.), and Linear Annealed wrapper on Epsilon-Greedy policy (LA) for the two datasets IEMOCAP (IEM) and SAVEE (SAV).
Fig. 4. Comparison of the mean reward and the standard deviation of the reward when changing the ε value, for the Zeta policy (Zeta), the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q-policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA).

The mathematical formulation of Q-Learning is based on the popular Bellman equation for the state-value function. In (2), v(s) is the value at state s, R_t is the reward at time t, and γ is the discount factor:

v(s) = E[R_{t+1} + γ·v(S_{t+1}) | S_t = s].   (2)

Equation (2) can be rewritten to obtain Bellman's state-action value function, known as the Q-function:

q_π(s, a) = E_π[R_{t+1} + γ·q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a].   (3)
Fig. 5. Comparison of the mean reward and the standard deviation of the reward when changing the number of steps of the Zeta Policy (Zeta) and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA), with ε = 0.0125.

Here, q_π(s, a) is the Q-value of the state s following action a under the policy π. We use a DNN to approximate the Q-values, and q_π(s, a) is considered the target Q-value, giving the loss function

L = Σ(Q_target − Q),   (4)

Q_target = R(s_{t+1}, a_{t+1}) + γ·Q(s_{t+1}, a_{t+1}).   (5)

Combining Equations (4) and (5), the loss function can be written as

L = Σ(R(s_{t+1}, a_{t+1}) + γ·Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)).   (6)

Minimising the loss value L is the optimisation problem solved in the training phase. Q_target is obtained by inferring the state s from the target network via the function (1). The output layer of the DNN model is a dense layer with a linear activation function. Following the feed-forward pass of a deep neural network to the last layer l, the Q-value for an action a can be obtained by Equation (7),

Q_a = W_l · x_{l−1} + b_l,   (7)

where x_{l−1} is the input to layer l. The backpropagation pass of parameter optimisation updates the W_l and b_l values in each layer l to minimise the loss value L.
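One way to realise Equations (4)–(7) in training code is sketched below, using the Keras models from above. The squared-error loss (via the compiled model), the max over next-state actions, and γ = 0.99 are assumptions; the paper states only the loss forms above:

```python
import numpy as np

gamma = 0.99    # discount factor; an assumed value, not stated in the paper

def q_learning_targets(batch, inferring_model, target_model):
    """Build training targets per Eqs. (1) and (5) for a batch of
    (state, action, reward, next_state, done) transitions."""
    states = np.stack([b[0] for b in batch])
    actions = np.array([b[1] for b in batch])
    rewards = np.array([b[2] for b in batch], dtype=float)
    next_states = np.stack([b[3] for b in batch])
    dones = np.array([b[4] for b in batch], dtype=float)

    q = inferring_model.predict(states, verbose=0)          # Eq. (1)
    q_next = target_model.predict(next_states, verbose=0)   # target network
    targets = q.copy()
    # Q_target = R + gamma * Q(s', a'); taking the max over next actions is
    # the standard deep Q-learning choice, and only the taken action's
    # Q-value is updated.
    targets[np.arange(len(batch)), actions] = (
        rewards + gamma * q_next.max(axis=1) * (1.0 - dones))
    return states, targets

# One parameter update minimising the loss of Eq. (6):
# states, targets = q_learning_targets(batch, model, target_model)
# model.train_on_batch(states, targets)
```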
EVALUATION

Experiments were carried out to evaluate the efficiency of the newly proposed "Zeta Policy" and the improvement that can be gained by introducing the pre-training aspect to the RL paradigm.
The key contribution presented in this article is the Zeta Policy. Therefore, we first compare its performance with that of three commonly used policies: the Epsilon-Greedy policy (EpsG.), the Max-Boltzmann Q Policy (MaxB.), and the Linear Annealed wrapper on the Epsilon-Greedy policy (LA). Figure 3 presents the results of these comparisons: the standard deviation and the mean of the reward are plotted against the step number.

We notice from Figure 3 that the Zeta policy has a lower standard deviation of the reward, which suggests that the robustness of the selected actions (i.e., classified labels/emotions) of the Zeta policy is higher than that of the compared policies.

The Zeta policy outperforms the other compared policies with a higher mean reward. The mean reward of the Zeta policy converges to a value around 0.78 for the IEMOCAP dataset and 0.7 for the SAVEE dataset. These values are higher than those of the other compared policies, which means that the RL agent selects the actions more correctly than with the other policies. Since the inferring DNN model of the RL agent is the same for all experiments, we can infer that the Zeta policy plays a major role in this out-performance.

ε value
All the compared policies (the Zeta Policy, the Epsilon-Greedy policy, the Max-Boltzmann Q policy, and the Linear Annealed wrapper on the Epsilon-Greedy policy) use the parameter ε in their action selection process. ε is used to decide between exploration and exploitation within the policy: 1 − ε is the probability of selecting exploitation over exploration. Hence, ε ranges between 0 and 1.

Several experiments were carried out with four different values of ε. Figure 4 shows the effect of changing the ε value on the standard deviation of the reward and the mean reward for all policies. The Zeta policy shows a noticeable change in the standard deviation of the reward and the mean reward for the lowest ε value tested. The reason the Zeta policy performs well in the lower-ε scenarios is the following: when cs < ζ, the Zeta policy picks actions with the exploration and exploitation strategy, and ε is the probability of exploration. A random action from a uniform random distribution is picked in the exploration strategy. A lower ε means lower randomness in the period cs < ζ, which gives a higher probability of selecting an action based on the Q-values, which in turn leads the RL algorithm to correct false predictions.
Fig. 6. Effect of the warm-up period and pre-training (p/t) on the performance (IEMOCAP, Zeta policy; warm-up of 50 000 vs. 10 000 steps).
Fig. 7. Performance impact of pre-training (p/t) and cross-dataset pre-training with the IEMOCAP (IEM) and SAVEE (SAV) datasets (Zeta policy).
The Zeta policy uses the parameter Zeta_nb_step (ζ) to determine the action-selection route, and the Linear Annealed wrapper uses the parameter nb_step to determine the gradient of the ε value used in its Epsilon-Greedy component. Experiments were defined to examine the behaviour of the RL agent's performance when changing the above parameters to the values 500 000, 250 000, 100 000, and 50 000. Figure 5 shows the results of these experiments. Looking at the curve of the standard deviation of the Zeta policy, the robustness of the RL agent increases with the number of steps. The graph shows that the most robust curve is observed for the step size 500 000.

The warm-up period is executed in the RL algorithm to collect experiences for the RL agent to sample the experience replay. With a pre-trained inferring DNN, however, the RL agent has already learnt the relevant features. This leads the RL agent to produce better Q-values than an RL agent whose inferring DNN starts from randomly initialised parameters. An experiment was executed to identify the possibility of reducing the warm-up period after pre-training while keeping the RL agent's performance unchanged; Figure 6 shows the generated results. Observing both the standard deviation of the reward and the mean reward in Figure 6, pre-training has improved the robustness and performance of the prediction. The time taken to achieve the highest performance is reduced, since the warm-up period is reduced. This lowers the RL training time as well as the time taken to optimise the RL agent.

The speed-up of the training period due to pre-training was calculated by considering the number of steps needed to reach a mean reward value of 0.6. The mean number of steps taken to reach a mean reward of 0.6 without pre-training was 77 126, whilst with pre-training (warm-up 10 000) it was 48 623. The speed-up of the training period by pre-training and reducing warm-up steps was thus 1.63x.

TABLE 3
Testing accuracy values (%) of the two datasets IEMOCAP and SAVEE under each policy after 700 000 steps of RL training.
Policy \ Dataset     IEMOCAP       SAVEE
Zeta                 54.29 ±       ±
Max-Boltzmann        51.92 ±       ±
Epsilon-Greedy       ±             ±
Linear Annealed      ±             ±

In our envisioned scenario, an agent, although pre-trained with one corpus, is expected to be robust to other corpora/dialects. In order to examine the behaviour of cross-dataset pre-training on RL, we pre-trained the RL agent for IEMOCAP with the SAVEE pre-training data subset and plotted the reward curve and the standard deviation of the reward curve in Figure 7. The graph shows that pre-training always improved the performance of the RL agent, and cross-dataset pre-training did not degrade the performance drastically. This practice can be used in real-world RL applications where only a small amount of training data is available: the RL agent can be pre-trained with a dataset that is aligned with the problem and then deployed to act in the real environment.
The accuracy of a machine learning model is a popular attribute used to benchmark it against other, parallel models. Since RL is a method of dynamic programming, it does not inherently comprise an accuracy attribute. However, as this specific study is focused on a classification problem, we calculated the accuracy of the RL agent from the logs of the environment. Equation (8) is used to calculate the accuracy value of an episode. The accuracy of the testing phase after an RL execution of 700 000 steps was calculated and is tabulated in Table 3:

accuracy = (No. of correct inferences / No. of utterances) × 100.   (8)

Studying Table 3, it is observed that the Zeta policy outperforms the other compared policies on both datasets. These results can also be compared with the results of supervised learning methods, even though they are diverse machine learning paradigms.

CONCLUSION
This study was carried out to discover the feasibility of using a novel reinforcement learning policy, named the Zeta policy, for speech emotion recognition problems. Pre-training the RL agent was also studied to reduce the training time and minimise the warm-up period. The evaluation results show that the proposed Zeta policy performs better than the existing policies. We also provided an analysis of the relevant parameters, epsilon and the number of steps, which shows the operating range of these two parameters. The results also confirm that pre-training can reduce the training time to reach maximum performance by reducing the warm-up period. We show that the proposed Zeta Policy with pre-training is robust to a cross-corpus scenario. In the future, one should study a cross-language scenario and explore the feasibility of using the novel Zeta policy with other RL algorithms.

REFERENCES

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search,"
Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: https://doi.org/10.1038/nature16961

[2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, pp. 350–354, 2019.

[3] Y.-L. Shen et al., "Reinforcement learning based speech enhancement for robust speech recognition," in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6750–6754.

[4] R. Fakoor, X. He, I. Tashev, and S. Zarar, "Reinforcement learning to adapt speech enhancement to instantaneous input signal quality," arXiv:1711.10791 [cs], 2018. [Online]. Available: http://arxiv.org/abs/1711.10791

[5] H. Chung, H. B. Jeon, and J. G. Park, "Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning," in Proceedings of the International Joint Conference on Neural Networks. Institute of Electrical and Electronics Engineers Inc., 7 2020.

[6] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, "Reinforcement learning for spoken dialogue systems," in NIPS, 1999, pp. 956–962.

[7] T. Paek, "Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment," in Proc. Dialog-on-Dialog Workshop, Interspeech, 2006.

[8] E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter, "EmoRL: Continuous acoustic emotion classification using deep reinforcement learning," Proceedings – IEEE International Conference on Robotics and Automation, pp. 4445–4450, 4 2018. [Online]. Available: http://arxiv.org/abs/1804.04053

[9] J. Fan, Z. Wang, Y. Xie, and Z. Yang, "A theoretical analysis of deep Q-learning," arXiv, 1 2019. [Online]. Available: http://arxiv.org/abs/1901.00137

[10] G. V. de la Cruz Jr, Y. Du, and M. E. Taylor, "Pre-training neural networks with human demonstrations for deep reinforcement learning," in Adaptive Learning Agents (ALA), 2019. [Online]. Available: http://arxiv.org/abs/1709.04083

[11] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser et al., "StarCraft II: A new challenge for reinforcement learning," arXiv, no. 1708.04782, 2017.

[12] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., "Deep Q-learning from demonstrations," in Proceedings AAAI, 2018.

[13] V. Kurin, S. Nowozin, K. Hofmann, L. Beyer, and B. Leibe, "The Atari grand challenge dataset," arXiv, no. 1705.10998, 2017.

[14] S. Calinon, "Learning from demonstration (programming by demonstration)," Encyclopedia of Robotics, pp. 1–8, 2018.

[15] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

[16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, "Deep neural network features and semi-supervised training for low resource speech recognition," IEEE, 2013, pp. 6704–6708.

[17] Y. Liu and K. Kirchhoff, "Graph-based semi-supervised acoustic modeling in DNN-based speech recognition," IEEE, 2014, pp. 177–182.

[18] J. Stockholm and P. Pasquier, "Reinforcement learning of listener response for mood classification of audio," vol. 4, 2009, pp. 849–853.

[19] H. Yu and P. Yang, "An emotion-based approach to reinforcement learning reward design," 2019, pp. 346–351.

[20] M. G. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers," ser. ICML'03. AAAI Press, 2003, pp. 424–431.

[21] H. Han, K. Byun, and H. G. Kang, "A deep learning-based stress detection algorithm with speech signal," in AVSU 2018 – Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, co-located with MM 2018. New York, NY, USA: Association for Computing Machinery, 10 2018, pp. 11–15. [Online]. Available: https://dl.acm.org/doi/10.1145/3264869.3264875

[22] S. Latif, R. Rana, and J. Qadir, "Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness," arXiv, 11 2018. [Online]. Available: http://arxiv.org/abs/1811.11402

[23] R. Rana, S. Latif, R. Gururajan, A. Gray, G. Mackenzie, G. Humphris, and J. Dunn, "Automated screening for distress: A perspective for the future," European Journal of Cancer Care, vol. 28, no. 4, 7 2019. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/30883964/

[24] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, "Deep Q-learning from demonstrations," arXiv:1704.03732 [cs], 2017. [Online]. Available: http://arxiv.org/abs/1704.03732

[25] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Cambridge, UK, 1989.

[26] M. Wiering, "Explorations in efficient reinforcement learning," Ph.D. dissertation, 1999. [Online]. Available: https://dare.uva.nl/search?identifier=6ac07651-85ee-4c7b-9cab-86eea5b818f4

[27] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu, "Boltzmann exploration done right," ser. NIPS'17. Curran Associates Inc., 2017, pp. 6287–6296.

[28] L. Pan, Q. Cai, Q. Meng, W. Chen, and L. Huang, "Reinforcement learning with dynamic Boltzmann softmax updates," in IJCAI International Joint Conference on Artificial Intelligence.

[29] arXiv, 9 2018. [Online]. Available: http://arxiv.org/abs/1809.01906

[30] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.

[31] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162–1181, 2006.

[32] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition," vol. 2, 2003, pp. II–1.

[33] D. A. Cairns and J. H. L. Hansen, "Nonlinear analysis and classification of speech under stressed conditions," The Journal of the Acoustical Society of America.

[36] Interspeech 2014, 9 2014.

[37] S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, "Cross corpus speech emotion classification – An effective transfer learning technique," 2018.

[38] K. Y. Huang, C. H. Wu, Q. B. Hong, M. H. Su, and Y. H. Chen, "Speech emotion recognition using deep neural network considering verbal and nonverbal speech sounds," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, vol. 2019-May. Institute of Electrical and Electronics Engineers Inc., 5 2019, pp. 5866–5870.

[39] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, "Learn++: An incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 4, pp. 497–508, 2001.

[40] Q. L. Zhao, Y. H. Jiang, and M. Xu, "Incremental learning by heterogeneous bagging ensemble," in International Conference on Advanced Data Mining and Applications. Springer, 2010, pp. 1–12.

[41] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008. [Online]. Available: https://doi.org/10.1007/s10579-008-9076-6

[42] S. Haq, P. J. B. Jackson, and J. Edge, "Audio-visual feature selection and reduction for emotion classification," 2008.

[43] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

[44] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," vol. 8, 2015.

[45] F. Chollet and others, Keras, 2015. [Online]. Available: https://keras.io

[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.

[47] arXiv:2001.00378 [cs, eess], 2020. [Online]. Available: http://arxiv.org/abs/2001.00378
Thejan Rajapakshe received his bachelor's degree in Applied Sciences from Rajarata University of Sri Lanka, Mihintale, and his Bachelor of Information Technology from the University of Colombo School of Computing in 2016. He is currently a research scholar at the University of Southern Queensland (USQ). He is also an Associate Technical Team Lead at CodeGen International – Research & Development. His research interests include reinforcement learning, speech processing, and deep learning.

Rajib Rana is an experimental computer scientist, Advance Queensland Research Fellow, and a Senior Lecturer at the University of Southern Queensland. He is also the Director of the IoT Health research program at the University of Southern Queensland. He is the recipient of the prestigious Young Tall Poppy QLD Award 2018 as one of Queensland's most outstanding scientists for achievements in the area of scientific research and communication. Rana's research work aims to capitalise on advancements in technology, along with sophisticated information and data processing, to better understand disease progression in chronic health conditions and to develop predictive algorithms for chronic diseases, such as mental illness and cancer. His current research focus is on unsupervised representation learning. He received his B.Sc. degree in Computer Science and Engineering from Khulna University, Bangladesh, with the Prime Minister and President's Gold Medal for outstanding achievements, and his Ph.D. in Computer Science and Engineering from the University of New South Wales, Sydney, Australia, in 2011. He received his postdoctoral training at the Autonomous Systems Laboratory, CSIRO, before joining the University of Southern Queensland as Faculty in 2015.
Sara Khalifa is currently a senior research scientist in the Distributed Sensing Systems research group, Data61, CSIRO. She is also an honorary adjunct lecturer at the University of Queensland and a conjoint lecturer at the University of New South Wales. Her research interests rotate around the broad aspects of mobile and ubiquitous computing, mobile sensing, and the Internet of Things (IoT). She obtained a PhD in Computer Science and Engineering from UNSW (Sydney, Australia). Her PhD dissertation received the 2017 John Makepeace Bennett Award, which is awarded by CORE (Computing Research and Education Association of Australasia) to the best PhD dissertation of the year within Australia and New Zealand in the field of Computer Science. Her research has been recognised by multiple iAwards, including 2017 NSW Mobility Innovation of the Year, 2017 NSW R&D Innovation of the Year, National Merit R&D Innovation of the Year, and the Merit R&D award at the Asia Pacific ICT Alliance (APICTA) Awards, commonly known as the "Oscar" of the ICT industry in the Asia Pacific, among others.
Björn W. Schuller received his diploma in 1999, his doctoral degree for his study on Automatic Speech and Emotion Recognition in 2006, and his habilitation and Adjunct Teaching Professorship in the subject area of Signal Processing and Machine Intelligence in 2012, all in electrical engineering and information technology from TUM in Munich, Germany. He is Professor of Artificial Intelligence in the Department of Computing at Imperial College London, UK, where he heads GLAM, the Group on Language, Audio & Music; Full Professor and head of the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany; and CEO of audEERING. He was previously full professor and head of the Chair of Complex and Intelligent Systems at the University of Passau, Germany. Professor Schuller is a Fellow of the IEEE, Golden Core Member of the IEEE Computer Society, Senior Member of the ACM, President-emeritus of the Association for the Advancement of Affective Computing (AAAC), and was an elected member of the IEEE Speech and Language Processing Technical Committee. He (co-)authored 5 books and more than 800 publications in peer-reviewed books, journals, and conference proceedings, leading to more than 25 000 citations overall (h-index = 73). Schuller is general chair of ACII 2019, co-Program Chair of Interspeech 2019 and ICMI 2019, repeated Area Chair of ICASSP, and former Editor-in-Chief of the IEEE Transactions on Affective Computing, next to a multitude of further Associate and Guest Editor roles and functions in Technical and Organisational Committees.