
Publication


Featured research published by Fillipe D. M. de Souza.


Computer Vision and Pattern Recognition | 2014

Rate-Invariant Analysis of Trajectories on Riemannian Manifolds with Application in Visual Speech Recognition

Jingyong Su; Anuj Srivastava; Fillipe D. M. de Souza; Sudeep Sarkar

In statistical analysis of video sequences for speech recognition, and more generally for activity recognition, it is natural to treat temporal evolutions of features as trajectories on Riemannian manifolds. However, different evolution patterns result in arbitrary parameterizations of these trajectories. We investigate a recent framework from the statistics literature that handles this nuisance variability using a cost function/distance for temporal registration and for statistical summarization and modeling of trajectories. It is based on a mathematical representation of trajectories, termed the transported square-root vector field (TSRVF), and the L2 norm on the space of TSRVFs. We apply this framework to the problem of speech recognition using both audio and visual components. In each case, we extract features, form trajectories on the corresponding manifolds, and compute parameterization-invariant distances using TSRVFs for speech classification. On the OuluVS database, classification performance under this metric increases significantly, by nearly 100%, for both modalities and for all choices of features. We obtained speaker-dependent classification rates of 70% and 96% for the visual and audio components, respectively.
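
For readers unfamiliar with the representation, here is a minimal sketch of the TSRVF construction and the resulting rate-invariant distance as they are typically defined in this line of work (our notation, not reproduced from the paper). For a trajectory \alpha : [0,1] \to M on a Riemannian manifold M with a fixed reference point c \in M,

h_\alpha(t) = \frac{(\dot{\alpha}(t))_{\alpha(t) \to c}}{\sqrt{|\dot{\alpha}(t)|}},

where (\cdot)_{\alpha(t) \to c} denotes parallel transport of the velocity vector from \alpha(t) to c. The distance between two trajectories is then

d(\alpha_1, \alpha_2) = \inf_{\gamma \in \Gamma} \left( \int_0^1 \left| h_{\alpha_1}(t) - h_{\alpha_2}(\gamma(t))\,\sqrt{\dot{\gamma}(t)} \right|^2 dt \right)^{1/2},

with \Gamma the group of orientation-preserving reparameterizations of [0,1]; taking the infimum over \gamma is what removes the nuisance parameterization variability described above.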


Computer Vision and Pattern Recognition | 2015

Temporally coherent interpretations for long videos using pattern theory

Fillipe D. M. de Souza; Sudeep Sarkar; Anuj Srivastava; Jingyong Su

Graph-theoretical methods have successfully provided semantic and structural interpretations of images and videos. A recent paper introduced a pattern-theoretic approach that allows construction of flexible graphs for representing interactions of actors with objects; inference is accomplished by an efficient annealing algorithm. Actions and objects are termed generators and their interactions are termed bonds; together they form high-probability configurations, or interpretations, of observed scenes. This work and other structural methods have generally been limited to analyzing short videos involving isolated actions. Here we provide an extension that uses additional temporal bonds across individual actions to enable semantic interpretations of longer videos. Longer temporal connections improve scene interpretations because they help discard (temporally) local solutions in favor of globally superior ones. Using this extension, we demonstrate improvements in understanding longer videos, compared to individual interpretations of non-overlapping time segments. We verified the success of our approach by generating interpretations for more than 700 video segments from the YouCook dataset, whose intricate videos exhibit cluttered backgrounds, occlusions, viewpoint variations, and changing illumination. Interpretations for long video segments yielded performance increases of about 70% and proved more robust to severe classification errors.
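
As a rough illustration of the temporal-bond idea (not the authors' code; segment_energy and temporal_bond_energy are hypothetical placeholders), the global energy of a long-video interpretation can be sketched in Python as a sum of per-segment energies plus coupling terms between consecutive segments:

def total_energy(segment_interpretations, segment_energy, temporal_bond_energy):
    """Sum per-segment energies plus the energies of temporal bonds that link
    generators in consecutive segments, so that a low-energy (high-probability)
    interpretation must also be temporally coherent across the whole video."""
    energy = sum(segment_energy(s) for s in segment_interpretations)
    for prev, curr in zip(segment_interpretations, segment_interpretations[1:]):
        energy += temporal_bond_energy(prev, curr)
    return energy

Minimizing this coupled energy is what allows globally consistent interpretations to override (temporally) local ones.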


International Conference on Pattern Recognition | 2014

Pattern Theory-Based Interpretation of Activities

Fillipe D. M. de Souza; Sudeep Sarkar; Anuj Srivastava; Jingyong Su

We present a novel framework, based on Grenander's pattern-theoretic concepts, for high-level interpretation of video activities. This framework allows us to elegantly integrate ontological constraints and machine learning classifiers in one formalism to construct high-level semantic interpretations that describe video activity. The unit of analysis is a generator, which can represent either an ontological label or a group of features from a video. These generators are linked using bonds with different constraints. An interpretation of a video is a configuration of these connected generators, which results in a graph structure that is richer than the conventional graphs used in computer vision. The quality of an interpretation is quantified by an energy function that is optimized using Markov chain Monte Carlo based simulated annealing. We demonstrate the superiority of our approach over a purely machine learning based approach (SVM) using more than 650 video shots from the YouCook dataset. This dataset is very challenging in terms of background complexity, camera motion, object occlusion, clutter, and actor variability. We find significantly improved performance in nearly all cases. Our results show that the pattern theory inference process is able to construct the correct interpretation by leveraging the ontological constraints even when the machine learning classifier is poor and its most confident labels are wrong.
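
The inference loop described above is a standard Markov chain Monte Carlo simulated-annealing scheme; the following Python sketch shows the general pattern, assuming hypothetical energy and propose_move functions (it is not the paper's implementation):

import math
import random

def anneal(initial_config, energy, propose_move, n_iters=10000, t0=1.0, cooling=0.999):
    """Generic simulated annealing over interpretation configurations.
    energy(config)       -> scalar combining classifier scores and ontological constraints
    propose_move(config) -> a slightly modified candidate configuration
    Lower energy corresponds to a higher-probability interpretation."""
    config, e = initial_config, energy(initial_config)
    best, best_e = config, e
    temperature = t0
    for _ in range(n_iters):
        candidate = propose_move(config)
        e_new = energy(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e or random.random() < math.exp((e - e_new) / max(temperature, 1e-12)):
            config, e = candidate, e_new
            if e < best_e:
                best, best_e = config, e
        temperature *= cooling
    return best, best_e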


International Conference on Pattern Recognition | 2016

Building semantic understanding beyond deep learning from sound and vision

Fillipe D. M. de Souza; Sudeep Sarkar; Guillermo Cámara-Chávez

Deep learning-based models have recently been widely successful at outperforming traditional approaches in several computer vision applications such as image classification, object recognition, and action recognition. However, those models are not naturally designed to learn structural information that can be important to tasks such as human pose estimation and structured semantic interpretation of video events. In this paper, we demonstrate how to build structured semantic understanding of audio-video events by reasoning over the multiple-label decisions of deep visual models and auditory models, using Grenander's structures to impose semantic consistency. The proposed structured model does not require joint training of the structural semantic dependencies and the deep models; instead, they are independent components linked by Grenander's structures. Furthermore, we exploit Grenander's structures as a means to facilitate and enrich the model with fusion of multimodal sensory data, in particular auditory features with visual features. Overall, we observed improvements in the quality of semantic interpretations when using deep models and auditory features in combination with Grenander's structures, reflected in numerical improvements of up to 11.5% in precision and 12.3% in recall.
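
One simple way to realize the score-level fusion described above is sketched below; the equal weighting and log-probability form are illustrative assumptions, not the paper's exact formulation:

import math

def fused_data_energy(visual_scores, audio_scores, weight=0.5, eps=1e-9):
    """Combine per-label probabilities from a deep visual model and an auditory
    model into a single data-energy term per label (lower = better supported),
    which a Grenander-style structural model can then reconcile with
    ontological constraints."""
    labels = set(visual_scores) | set(audio_scores)
    return {
        label: -(weight * math.log(visual_scores.get(label, eps))
                 + (1.0 - weight) * math.log(audio_scores.get(label, eps)))
        for label in labels
    }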


Pattern Recognition Letters | 2016

Pattern theory for representation and inference of semantic structures in videos

Fillipe D. M. de Souza; Sudeep Sarkar; Anuj Srivastava; Jingyong Su

Highlights: the framework is robust to severe scenarios of feature classification errors; the flexible representation scheme captures a large range of structural variations; interpretation performance improves on the state of the art in challenging scenarios.

We develop a combinatorial approach to represent and infer semantic interpretations of video contents using tools from Grenander's pattern theory. Semantic structures for video interpretation are formed using generators and bonds, the fundamental units of representation in pattern theory. Generators represent features and ontological items, such as actions and objects, whereas bonds are threads used to connect generators while respecting appropriate constraints. The resulting configurations of partially-connected generators are termed scene interpretations. Our goal is to parse a given video data set into high-probability configurations. The probabilistic models are imposed using energies that have contributions from both data (classification scores) and prior information (ontological constraints, co-occurrence frequencies, etc.). The search for optimal configurations is based on an MCMC, simulated-annealing algorithm that uses simple moves to propose configuration changes and to accept or reject them according to the posterior energy. In contrast to current graphical methods, this framework does not preselect a neighborhood structure but tries to infer it from the data. The proposed framework obtains 20% higher classification rates than a purely machine learning-based baseline, despite artificial insertion of low-level processing errors, and in an uncontrolled scenario its video interpretation performance is double that of the baseline.
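
In equation form, the posterior energy of a configuration c of generators g_i connected through a bond set B(c) can be sketched as (our notation, following the decomposition described in the abstract rather than the paper's exact formula)

E(c) = \sum_{g_i \in c} E_{\mathrm{data}}(g_i) + \sum_{(g_i, g_j) \in B(c)} E_{\mathrm{prior}}(g_i, g_j),

where E_{\mathrm{data}} is derived from classification scores and E_{\mathrm{prior}} encodes ontological compatibility and co-occurrence statistics; a proposed configuration c' is accepted with probability \min\{1, \exp(-(E(c') - E(c))/T)\} at annealing temperature T.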


Computer Vision and Pattern Recognition | 2013

Rate-invariant comparisons of covariance paths for visual speech recognition

Jingyong Su; Anuj Srivastava; Fillipe D. M. de Souza; Sudeep Sarkar

An important problem in speech recognition, and in activity recognition more generally, is to develop analyses that are invariant to execution rate. We introduce a theoretical framework that provides a parameterization-invariant metric for comparing parameterized paths on Riemannian manifolds. Treating instances of activities as parameterized paths on the Riemannian manifold of covariance matrices, we apply this framework to the problem of visual speech recognition from image sequences. We represent each sequence as a path on the space of covariance matrices, with each covariance matrix capturing the spatial variability of visual features in a frame, and perform simultaneous pairwise temporal alignment and comparison of paths. This removes the temporal variability and helps provide a robust metric for visual speech classification. We evaluated this idea on the OuluVS database; the rank-1 nearest-neighbor classification rate improves from 32% to 57% due to temporal alignment.
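
A small illustrative sketch of comparing two covariance-matrix sequences with temporal alignment is given below; it uses a log-Euclidean distance between symmetric positive-definite matrices and a discrete dynamic-programming alignment as stand-ins for the paper's Riemannian metric and continuous warping:

import numpy as np

def spd_log(A):
    """Matrix logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def log_euclidean_dist(A, B):
    """Frobenius distance between matrix logs of two SPD covariance matrices."""
    return np.linalg.norm(spd_log(A) - spd_log(B))

def aligned_path_distance(path1, path2):
    """Dynamic-time-warping alignment of two covariance paths (lists of SPD
    matrices), returning the accumulated distance after temporal alignment."""
    n, m = len(path1), len(path2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = log_euclidean_dist(path1[i - 1], path2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]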


International Journal of Computer Vision | 2017

Spatially Coherent Interpretations of Videos Using Pattern Theory

Fillipe D. M. de Souza; Sudeep Sarkar; Anuj Srivastava; Jingyong Su


National Conference on Artificial Intelligence | 2018

An Inherently Explainable Model for Video Activity Interpretation

Sathyanarayanan N. Aakur; Fillipe D. M. de Souza; Sudeep Sarkar


Canadian Conference on Computer and Robot Vision | 2017

Towards a Knowledge-Based Approach for Generating Video Descriptions

Sathyanarayanan N. Aakur; Fillipe D. M. de Souza; Sudeep Sarkar


arXiv: Computer Vision and Pattern Recognition | 2017

Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos

Sathyanarayanan N. Aakur; Fillipe D. M. de Souza; Sudeep Sarkar

Collaboration


Dive into Fillipe D. M. de Souza's collaborations.

Top Co-Authors

Sudeep Sarkar
University of South Florida

Guillermo Cámara-Chávez
Universidade Federal de Ouro Preto