Srinivas Parthasarathy
University of Texas at Dallas
Publication
Featured research published by Srinivas Parthasarathy.
IEEE Transactions on Affective Computing | 2017
Carlos Busso; Srinivas Parthasarathy; Alec Burmania; Mohammed Abdelwahab; Najmeh Sadoughi; Emily Mower Provost
We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and emotion while also promoting naturalness in the recordings. Studies on emotion perception often require stimuli with fixed lexical content that nevertheless convey different emotions. These stimuli can also serve as an instrument to understand how emotion modulates speech at the phoneme level, in a manner that controls for coarticulation. Such audiovisual data are not easily available from natural recordings. A common solution is to record actors reading sentences that portray different emotions, which may not produce natural behaviors. We propose an alternative approach in which we define hypothetical scenarios for each sentence that are carefully designed to elicit a particular emotion. Two actors improvise these emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions. We describe the context in which this corpus was recorded, the key features of the corpus, the areas in which this corpus can be useful, and the emotional content of the recordings. The paper also reports the performance of speech and facial emotion classifiers. The analysis includes novel classification evaluations that study performance in terms of inter-evaluator agreement and naturalness perception, leveraging the large size of the audiovisual database.
Conference of the International Speech Communication Association | 2016
Srinivas Parthasarathy; Carlos Busso
Conventional emotion classification methods focus on predefined segments such as sentences or speaking turns that are labeled and classified at the segment level. However, the emotional state fluctuates dynamically during human interactions, so not all segments are equally relevant. We are interested in detecting regions within the interaction where the emotions are particularly salient, which we refer to as emotional hotspots. A system with this capability can have real applications in many domains. A key step towards building such a system is to define reliable hotspot labels, which will dictate the performance of machine learning algorithms. Creating ground-truth labels from scratch is both expensive and time-consuming. This paper also demonstrates that defining those emotionally salient segments through perceptual evaluation is a hard problem that results in low inter-evaluator agreement. Instead, we propose to define emotionally salient regions by leveraging existing time-continuous emotional labels. The proposed approach relies on the qualitative agreement (QA) method, which dynamically captures increasing or decreasing trends across emotional traces provided by multiple evaluators. The proposed method is more reliable than simply averaging traces across evaluators, and it provides the flexibility to define hotspots at various reliability levels without having to collect new perceptual evaluations.
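To make the trend-agreement idea behind the QA method concrete, below is a minimal sketch: a frame is flagged as a candidate hotspot when a sufficient fraction of the evaluators' time-continuous traces share a clear rising or falling trend over a local window. The function name qa_hotspots, the window length, the slope threshold, and the agreement threshold are illustrative assumptions, not the exact procedure from the paper.

```python
import numpy as np

def qa_hotspots(traces, win=50, eps=0.05, min_agreement=1.0):
    """Flag frames where evaluators' traces share a clear rising or falling trend.

    traces: (num_evaluators, num_frames) array of time-continuous annotations.
    win: window length (in frames) used to estimate the local trend.
    eps: minimum change over the window to count as a trend.
    min_agreement: fraction of evaluators that must agree on the trend direction.
    """
    traces = np.asarray(traces, dtype=float)
    num_frames = traces.shape[1]
    hot = np.zeros(num_frames, dtype=bool)
    for t in range(win, num_frames):
        deltas = traces[:, t] - traces[:, t - win]   # local change per evaluator
        rising = np.mean(deltas > eps)
        falling = np.mean(deltas < -eps)
        hot[t] = max(rising, falling) >= min_agreement
    return hot

# Toy example: three noisy evaluators annotating a trace that rises mid-recording.
rng = np.random.default_rng(0)
base = np.concatenate([np.zeros(200), np.linspace(0.0, 1.0, 100), np.ones(200)])
traces = base + 0.05 * rng.standard_normal((3, base.size))
print(qa_hotspots(traces).sum(), "frames flagged as emotionally salient")
```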
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Srinivas Parthasarathy; Roddy Cowie; Carlos Busso
Automatic emotion recognition in realistic domains is a challenging task given the subtle expressive behaviors that occur during human interactions. The challenges start with noisy emotional descriptors provided by multiple evaluators, which are characterized by low inter-evaluator agreement. Studies have suggested that evaluators are more consistent in detecting qualitative relations between episodes (i.e., emotional contrasts) than absolute scores (i.e., the actual emotion). Based on these observations, this study explores the use of relative labels to train machine learning algorithms that can rank expressive behaviors. Instead of deriving relative labels from expensive and time-consuming subjective evaluations, the labels are extracted from existing time-continuous evaluations of expressive attributes annotated with FEELTRACE. We rely on the qualitative agreement (QA) analysis to estimate relative labels, which are used to train rank-based classifiers (rankers). The experimental evaluation on the SEMAINE database demonstrates the benefits of the proposed approach. The ranking performance using the QA-based labels compares favorably against preference learning rankers trained with relative labels obtained by simply aggregating the absolute values of the emotional traces across evaluators, which is the common approach used in other studies.
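A common way to build such a ranker is the pairwise transform: each relative label "segment i is ranked above segment j" becomes a training example formed from the feature difference, and a linear classifier on those differences yields a ranking direction. The sketch below illustrates this RankSVM-style baseline with scikit-learn's LinearSVC; the helper name train_pairwise_ranker and the toy data are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_pairwise_ranker(features, pairs):
    """Learn a linear ranking direction w from relative labels.

    features: (num_samples, dim) feature matrix.
    pairs: iterable of (i, j) meaning sample i should be ranked above sample j.
    Returns w such that w @ x gives the ranking score of a sample x.
    """
    diffs, labels = [], []
    for i, j in pairs:
        diffs.append(features[i] - features[j])
        labels.append(1)
        diffs.append(features[j] - features[i])  # mirror pair so both classes exist
        labels.append(-1)
    svm = LinearSVC(C=1.0, max_iter=20000)
    svm.fit(np.vstack(diffs), labels)
    return svm.coef_.ravel()

# Toy usage: preferences generated by a hidden linear "attribute" score.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
hidden = X @ rng.standard_normal(10)
pairs = [(i, j) for i in range(0, 100, 7) for j in range(3, 100, 11) if hidden[i] > hidden[j]]
w = train_pairwise_ranker(X, pairs)
print("correlation with hidden score:", round(float(np.corrcoef(X @ w, hidden)[0, 1]), 3))
```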
International Conference on Acoustics, Speech, and Signal Processing | 2017
Srinivas Parthasarathy; Chunlei Zhang; John H. L. Hansen; Carlos Busso
Expressive speech introduces variations in the acoustic features that affect the performance of speech technology such as speaker verification systems. It is important to identify the range of emotions over which speaker verification can be performed reliably. This paper studies the performance of a speaker verification system as a function of emotions. Instead of categorical classes such as happiness or anger, which have important intra-class variability, we use the continuous attributes arousal, valence, and dominance, which facilitate the analysis. We evaluate a speaker verification system trained with the i-vector framework with a probabilistic linear discriminant analysis (PLDA) back-end. The study relies on a subset of the MSP-PODCAST corpus, which has naturalistic recordings from 40 speakers. We train the system with neutral speech, creating mismatches on the testing set. The results show that speaker verification errors increase as the values of the emotional attributes increase. For neutral or moderate values of arousal, valence, and dominance, the speaker verification performance is reliable. These results are also observed when we artificially force the sentences to have the same duration.
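One way to organize this kind of analysis is to compute the equal error rate (EER) separately for trials grouped by the emotional attribute of the test segment. The sketch below does exactly that; the bin edges (assuming a 1-7 attribute scale), the synthetic data, and the helper names eer and eer_by_attribute are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(scores, labels):
    """Equal error rate from verification scores; labels are 1 for target trials, 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return 0.5 * (fpr[idx] + fnr[idx])

def eer_by_attribute(scores, labels, attribute, edges=(1.0, 3.0, 5.0, 7.0)):
    """EER computed separately for trials binned by an emotional attribute (e.g., arousal)."""
    scores, labels, attribute = map(np.asarray, (scores, labels, attribute))
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (attribute >= lo) & (attribute < hi)
        if mask.any() and np.unique(labels[mask]).size == 2:  # need target and impostor trials
            results[f"attribute in [{lo}, {hi})"] = eer(scores[mask], labels[mask])
    return results

# Toy usage with synthetic trial scores that get noisier at higher attribute values.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 2000)
attr = rng.uniform(1, 7, 2000)
scores = labels + (0.5 + 0.2 * attr) * rng.standard_normal(2000)
print(eer_by_attribute(scores, labels, attr))
```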
International Conference on Acoustics, Speech, and Signal Processing | 2017
Srinivas Parthasarathy; Reza Lotfian; Carlos Busso
Studies have shown that ranking emotional attributes through preference learning methods has significant advantages over conventional emotional classification/regression frameworks. Preference learning is particularly appealing for retrieval tasks, where the goal is to identify speech conveying target emotional behaviors (e.g., positive samples with low arousal). With recent advances in deep neural networks (DNNs), this study explores whether a preference learning framework relying on deep learning can outperform conventional ranking algorithms. We use a deep learning ranker implemented with the RankNet algorithm to evaluate preference between emotional sentences in terms of dimensional attributes (arousal, valence, and dominance). The results show improved performance over ranking algorithms trained with support vector machines (i.e., RankSVM). The results are significantly better than the performance reported in previous work, demonstrating the potential of RankNet to retrieve speech with target emotional behaviors.
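The core of RankNet is a logistic pairwise loss: the model assigns a score to each item and is trained so that sigmoid(s_i - s_j) approximates the probability that item i is preferred over item j. The sketch below implements that loss with a single linear scorer rather than the deep network used in the paper; the function name ranknet_train, the hyperparameters, and the toy data are assumptions for illustration.

```python
import numpy as np

def ranknet_train(features, pairs, lr=0.01, epochs=200):
    """Train a linear RankNet-style scorer from preference pairs.

    features: (num_samples, dim) array; pairs: (i, j) with sample i preferred over j.
    Minimizes the RankNet pairwise loss -log sigmoid(s_i - s_j) by gradient steps,
    using a single linear layer in place of a deep network.
    """
    features = np.asarray(features, dtype=float)
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        for i, j in pairs:
            diff = features[i] - features[j]
            p = 1.0 / (1.0 + np.exp(-(w @ diff)))  # modeled P(i ranked above j)
            w += lr * (1.0 - p) * diff             # gradient step on the log-likelihood
    return w

def rank_scores(features, w):
    """Score samples for retrieval; higher score means stronger target attribute."""
    return np.asarray(features) @ w

# Toy usage: recover a ranking induced by a hidden attribute.
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8))
hidden = X @ rng.standard_normal(8)
pairs = [(i, j) for i in range(0, 50, 2) for j in range(0, 50, 3) if hidden[i] > hidden[j] + 0.5]
w = ranknet_train(X, pairs)
print("top-ranked sample:", int(np.argmax(rank_scores(X, w))), "vs true top:", int(np.argmax(hidden)))
```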
International Conference on Acoustics, Speech, and Signal Processing | 2016
Taufiq Hasan; Mohammed Abdelwahab; Srinivas Parthasarathy; Carlos Busso; Yang Liu
Research on automatic speech summarization typically focuses on optimizing objective evaluation criteria, such as the ROUGE metric, which depend on word and phrase overlaps between automatic and manually generated summary documents. However, the actual quality of the speech summarizer largely depends on how the end-users perceive the audio output. This work focuses on the task of composing summarized audio streams with the aim of improving the quality and interest perceived by the end-user. First, using crowd-sourced summary annotations on a broadcast news corpus, we train a rank-SVM classifier to learn the relative importance of each sentence in a news story. Acoustic, lexical, and structural features are used for training. In addition, we investigate the perceived emotion level in each sentence to help the summarizer select interesting sentences, yielding an emotion-aware summarizer. Next, we propose several methods to combine these sentences to generate a compressed audio stream. Subjective evaluations are performed to assess the quality of the generated summaries on the following criteria: interest, abruptness, informativeness, attractiveness, and overall quality. The results indicate that users are most sensitive to the linguistic coherence and continuity of the audio stream.
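As a rough illustration of how ranked importance and emotion scores could be combined when composing the audio stream, the sketch below greedily selects high-scoring sentences under a duration budget and keeps the original story order; the weighting scheme, the budget, and the function name select_summary are assumptions, not the combination methods compared in the paper.

```python
def select_summary(sentences, importance, emotion, durations, budget_s=60.0, alpha=0.7):
    """Greedily pick sentences for an audio summary under a duration budget.

    importance: ranker scores (e.g., from a rank-SVM); emotion: perceived emotion level
    per sentence; alpha weights importance against emotion (illustrative value).
    Selected sentences are returned in their original story order to preserve coherence.
    """
    combined = [alpha * imp + (1.0 - alpha) * emo for imp, emo in zip(importance, emotion)]
    order = sorted(range(len(sentences)), key=lambda k: combined[k], reverse=True)
    chosen, used = [], 0.0
    for k in order:
        if used + durations[k] <= budget_s:
            chosen.append(k)
            used += durations[k]
    return [sentences[k] for k in sorted(chosen)]

# Toy usage with four sentences.
sents = ["s1", "s2", "s3", "s4"]
print(select_summary(sents, importance=[0.9, 0.2, 0.6, 0.4],
                     emotion=[0.1, 0.8, 0.7, 0.2],
                     durations=[20.0, 25.0, 30.0, 15.0], budget_s=60.0))
```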
IEEE Transactions on Affective Computing | 2016
Alec Burmania; Srinivas Parthasarathy; Carlos Busso
Conference of the International Speech Communication Association | 2017
Srinivas Parthasarathy; Carlos Busso
Conference of the International Speech Communication Association | 2018
Kusha Sridhar; Srinivas Parthasarathy; Carlos Busso
Conference of the International Speech Communication Association | 2018
Srinivas Parthasarathy; Carlos Busso