
Publication


Featured research published by Dario Bertero.


Empirical Methods in Natural Language Processing | 2016

Real-Time Speech Emotion and Sentiment Recognition for Interactive Dialogue Systems

Dario Bertero; Farhad Bin Siddique; Chien-Sheng Wu; Yan Wan; Ricky Ho Yin Chan; Pascale Fung

In this paper, we describe our approach to enabling an interactive dialogue system to recognize user emotion and sentiment in real time. These modules allow otherwise conventional dialogue systems to have “empathy” and respond to the user while being aware of their emotion and intent. Emotion recognition from speech has previously consisted of feature engineering followed by machine learning, where the first stage causes delay at decoding time. We describe a CNN model that extracts emotion from raw speech input without feature engineering. This approach achieves an average accuracy of 65.7% over six emotion categories, a 4.5% improvement over conventional feature-based SVM classification. A separate, CNN-based sentiment analysis module recognizes sentiment from speech recognition results, with an 82.5 F-measure on human-machine dialogues when trained with out-of-domain data.
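The raw-waveform pipeline described above (convolution directly over audio samples, pooling, then a softmax over six emotion categories) can be sketched roughly as follows. This is an illustrative numpy toy with random, untrained weights and made-up dimensions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(signal, filters, stride):
    """Valid 1-D convolution of a raw waveform with a bank of filters."""
    n_filters, width = filters.shape
    n_steps = (len(signal) - width) // stride + 1
    out = np.empty((n_filters, n_steps))
    for t in range(n_steps):
        window = signal[t * stride : t * stride + width]
        out[:, t] = filters @ window
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_emotion(waveform, filters, weights, stride=160):
    """Raw waveform -> conv -> ReLU -> global max pool -> softmax."""
    feature_maps = np.maximum(conv1d(waveform, filters, stride), 0.0)  # ReLU
    pooled = feature_maps.max(axis=1)       # one activation per filter
    return softmax(weights @ pooled)        # distribution over emotion classes

# Untrained toy run: 1 second of 16 kHz audio, 6 emotion classes.
waveform = rng.standard_normal(16000)
filters = rng.standard_normal((8, 400)) * 0.01   # 8 filters, each 25 ms wide
weights = rng.standard_normal((6, 8)) * 0.1
probs = predict_emotion(waveform, filters, weights)
```

The point of the sketch is the absence of a feature-engineering stage: the convolution consumes raw samples, which is what removes the decoding-time delay the abstract mentions.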


North American Chapter of the Association for Computational Linguistics | 2015

HLTC-HKUST: A Neural Network Paraphrase Classifier using Translation Metrics, Semantic Roles and Lexical Similarity Features

Dario Bertero; Pascale Fung

This paper describes the system developed by our team (HLTC-HKUST) for task 1 of the SemEval 2015 workshop, on paraphrase classification and semantic similarity in Twitter. We trained a neural network classifier over a range of features that includes translation metrics, lexical and syntactic similarity scores, and semantic features based on semantic roles. The objective function of the neural network took into account the six similarity levels provided in the corpus, so as to output a more fine-grained estimate of the similarity of the two sentences, as required by subtask 2. With an F-score of 0.651 in the binary paraphrase classification subtask 1, and a Pearson coefficient of 0.697 for the sentence similarity subtask 2, we achieved 6th place and 3rd place respectively, both above the average of the other contestants.


North American Chapter of the Association for Computational Linguistics | 2016

A Long Short-Term Memory Framework for Predicting Humor in Dialogues

Dario Bertero; Pascale Fung

We propose a first attempt to employ a Long Short-Term Memory based framework to predict humor in dialogues. We analyze data from a popular TV sitcom, whose canned laughter gives an indication of when the audience would react. We model the setup-punchline relation of conversational humor with a Long Short-Term Memory network, with utterance encodings obtained from a Convolutional Neural Network. Our neural network framework improves the F-score by 8% over a Conditional Random Field baseline. We show how the LSTM effectively models the setup-punchline relation, reducing the number of false positives and increasing recall. We aim to employ our humor prediction model to build effective empathetic machines able to understand jokes.
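The setup-punchline idea (an LSTM running over per-utterance encodings and emitting a punchline probability after each utterance, so each prediction can depend on the preceding setup) can be sketched as follows. The LSTM cell is written out by hand in numpy with random, untrained weights, purely to illustrate the data flow, not to reproduce the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_punchline_scores(utterances, W, U, b, w_out, b_out):
    """Run an LSTM over utterance encodings; emit a punchline probability
    after each utterance, conditioned on the setup utterances before it."""
    hidden = W.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    scores = []
    for x in utterances:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        scores.append(sigmoid(w_out @ h + b_out))
    return np.array(scores)

# Toy dialogue: 5 utterances, each a 16-dim encoding (standing in for the
# CNN utterance encoder), with random untrained weights.
dim, hidden = 16, 8
utts = rng.standard_normal((5, dim))
W = rng.standard_normal((4 * hidden, dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
w_out = rng.standard_normal(hidden) * 0.1
scores = lstm_punchline_scores(utts, W, U, b, w_out, 0.0)
```

In contrast to a per-utterance classifier, the recurrent state `h` carries the setup context forward, which is what lets the model treat a punchline differently from the same sentence in isolation.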


International Conference on Acoustics, Speech, and Signal Processing | 2017

A first look into a Convolutional Neural Network for speech emotion detection

Dario Bertero; Pascale Fung

We propose a real-time Convolutional Neural Network model for speech emotion detection. Our model is trained from raw audio on a small dataset of TED talk speech data, manually annotated into three emotion classes: “Angry”, “Happy” and “Sad”. It achieves an average accuracy of 66.1%, 5% higher than a feature-based SVM baseline, with an evaluation time of a few hundred milliseconds. We also provide an in-depth model visualization and analysis. We show how our neural network effectively activates during the speech sections of the waveform regardless of the emotion, ignoring the silent parts, which carry no information. In the frequency domain the CNN filters are distributed throughout the spectrum, with a higher concentration around the average pitch range associated with each emotion. Each filter also activates at multiple frequency intervals, presumably due to the additional contribution of amplitude-related feature learning. Our work will allow faster and more accurate emotion detection modules for human-machine empathetic dialog systems and other related applications.


Conference of the Oriental Chapter of the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) | 2016

Towards a corpus of speech emotion for interactive dialog systems

Dario Bertero; Farhad Bin Siddique; Pascale Fung

We present and discuss an ongoing data collection and annotation effort to build a large corpus for speech emotion detection. We collected 207 hours of public speech data from TED talks and first labeled it automatically with an emotion recognition API; we highlight the expected relation between the API's output emotion distribution and the common features of a TED talk. We then employed manual annotators to improve the quality of the annotations, building a two-level annotation process in which a main emotion label is complemented by multiple secondary emotion descriptors. Additional annotations were obtained through crowdsourcing, and we provide an analysis of the agreement between multiple annotators. Comparing the automatic and manual labeling, the automatic multiclass annotation API obtained an average accuracy of 28.4%, while automatic classification with a CNN reached more than 60% average accuracy in various experimental settings. We also discuss various active learning methods we are using to select the samples to be annotated, in order to obtain more relevant data at a faster pace. A large speech emotion detection corpus will enable more accurate emotion detection systems, which can then be integrated into dialog systems to recognize and react to the user's emotion.
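The abstract does not specify which active learning methods are used, but one common strategy consistent with the stated goal is uncertainty sampling: send to annotators the samples whose predicted emotion distribution has the highest entropy. The sketch below is a generic illustration of that idea, with made-up toy probabilities:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each row of a matrix of class probabilities."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def select_for_annotation(class_probs, budget):
    """Pick the `budget` samples whose predicted emotion distribution is
    most uncertain (highest entropy): the ones where a manual label is
    expected to add the most information."""
    scores = entropy(class_probs)
    return np.argsort(scores)[::-1][:budget]

# Toy pool of 4 unlabeled samples over 3 emotion classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low priority
    [0.34, 0.33, 0.33],   # nearly uniform -> highest priority
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
])
picked = select_for_annotation(probs, budget=2)  # -> indices [1, 2]
```

Selecting by uncertainty rather than at random is what lets the annotation budget go toward the "more relevant data" the abstract mentions.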


Spoken Language Technology Workshop | 2016

Multimodal deep neural nets for detecting humor in TV sitcoms

Dario Bertero; Pascale Fung

We propose a novel approach to combining acoustic and language features to predict humor in dialogues with a deep neural network. We analyze data from three popular TV sitcoms whose canned laughter gives an indication of when the audience would react. We model the setup-punchline sequential relation of conversational humor with a Long Short-Term Memory network, with utterance encodings obtained from two Convolutional Neural Networks: one to model word-level language features and the other to model frame-level acoustic and prosodic features. Our neural network framework improves the F-score by over 5% over a Conditional Random Field baseline trained on a similar acoustic and language feature combination, achieving a much higher recall. It is also more effective than a language-features-only setting, with an F-score 10% higher. It also generalizes well, reaching precision values of over 70% in most cases when trained and tested on different sitcoms.
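The two-channel utterance encoding described here (one network over word-level language features, one over frame-level acoustic features, fused into a single vector for the LSTM) can be illustrated with a minimal early-fusion sketch. The pooling-plus-projection encoders below merely stand in for the paper's CNNs, and all weights and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_utterance(word_features, frame_features, W_lang, W_acou):
    """Early fusion of two modalities: a language encoding (pooled over
    words) and an acoustic/prosodic encoding (pooled over frames) are
    computed separately, then concatenated into one utterance vector
    for the downstream sequence model."""
    lang = np.maximum(W_lang @ word_features.max(axis=0), 0.0)   # language channel
    acou = np.maximum(W_acou @ frame_features.max(axis=0), 0.0)  # acoustic channel
    return np.concatenate([lang, acou])

# Toy utterance: 7 word embeddings (50-d) and 40 acoustic frames (13-d),
# each channel projected to a 16-d encoding before concatenation.
words = rng.standard_normal((7, 50))
frames = rng.standard_normal((40, 13))
W_lang = rng.standard_normal((16, 50)) * 0.1
W_acou = rng.standard_normal((16, 13)) * 0.1
utt_vec = encode_utterance(words, frames, W_lang, W_acou)  # shape (32,)
```

Keeping the two channels separate until the concatenation step is one simple way to let each encoder specialize, which is the design choice the higher recall over the language-only setting suggests.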


Language Resources and Evaluation | 2016

Deep Learning of Audio and Language Features for Humor Prediction

Dario Bertero; Pascale Fung


International Conference on Acoustics, Speech, and Signal Processing | 2016

Predicting humor response in dialogues from TV sitcoms

Dario Bertero; Pascale Fung


Conference on Intelligent Text Processing and Computational Linguistics | 2016

Towards Empathetic Human-Robot Interactions

Pascale Fung; Dario Bertero; Yan Wan; Anik Dey; Ricky Ho Yin Chan; Farhad Bin Siddique; Yang Yang; Chien-Sheng Wu; Ruixi Lin


International Conference on Computational Linguistics | 2016

Zara: A Virtual Interactive Dialogue System Incorporating Emotion, Sentiment and Personality Recognition

Pascale Fung; Anik Dey; Farhad Bin Siddique; Ruixi Lin; Yang Yang; Dario Bertero; Yan Wan; Ricky Ho Yin Chan; Chien-Sheng Wu

Collaboration


Dive into Dario Bertero's collaborations.

Top Co-Authors

Pascale Fung, Hong Kong University of Science and Technology
Farhad Bin Siddique, Hong Kong University of Science and Technology
Anik Dey, Hong Kong University of Science and Technology
Chien-Sheng Wu, Hong Kong University of Science and Technology
Ricky Ho Yin Chan, Hong Kong University of Science and Technology
Yan Wan, Hong Kong University of Science and Technology
Ruixi Lin, Hong Kong University of Science and Technology
Yang Yang, Hong Kong University of Science and Technology
Onno Kampman, Hong Kong University of Science and Technology
Ho Yin Chan, University of Science and Technology