Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mohamed R. Amer is active.

Publication


Featured research published by Mohamed R. Amer.


Computer Vision and Pattern Recognition (CVPR) | 2011

Multiobject tracking as maximum weight independent set

William Brendel; Mohamed R. Amer; Sinisa Todorovic

This paper addresses the problem of simultaneous tracking of multiple targets in a video. We first apply object detectors to every video frame. Pairs of detection responses from every two consecutive frames are then used to build a graph of tracklets. The graph helps transitively link the best matching tracklets that do not violate hard and soft contextual constraints between the resulting tracks. We prove that this data association problem can be formulated as finding the maximum-weight independent set (MWIS) of the graph. We present a new, polynomial-time MWIS algorithm, and prove that it converges to an optimum. Similarity and contextual constraints between object detections, used for data association, are learned online from object appearance and motion properties. Long-term occlusions are addressed by iteratively repeating MWIS to hierarchically merge smaller tracks into longer ones. Our results demonstrate advantages of simultaneously accounting for soft and hard contextual constraints in multitarget tracking. We outperform the state of the art on the benchmark datasets.
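The data-association step can be illustrated with a small stand-in. Below is a minimal Python sketch of a greedy maximum-weight independent set heuristic over a tracklet conflict graph; this is not the paper's polynomial-time algorithm (which comes with a convergence proof), and the input format is an assumption made purely for illustration.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for a maximum-weight independent set.

    weights: dict mapping tracklet id -> positive association score
    edges:   iterable of (u, v) pairs of conflicting tracklets
    Returns a set of mutually non-conflicting tracklet ids.
    """
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(weights)
    chosen = set()
    while remaining:
        # Favor heavy tracklets that conflict with few remaining ones.
        v = max(remaining, key=lambda n: weights[n] / (1 + len(adj[n] & remaining)))
        chosen.add(v)
        remaining -= adj[v] | {v}
    return chosen
```

For example, greedy_mwis({'a': 2.0, 'b': 1.5, 'c': 1.0}, [('a', 'b')]) keeps 'a' and 'c', dropping the conflicting, lower-weight tracklet 'b'.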


European Conference on Computer Vision (ECCV) | 2012

Cost-sensitive top-down/bottom-up inference for multiscale activity recognition

Mohamed R. Amer; Dan Xie; Mingtian Zhao; Sinisa Todorovic; Song-Chun Zhu

This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) for examining fine details (or coarser scales), as needed for recognition. The key challenge is how to avoid running a multitude of detectors at all spatiotemporal scales, and yet arrive at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors --- called α process; 2) bottom-up inference based on detecting activity parts --- called β process; and 3) top-down inference based on detecting activity context --- called γ process. The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus.
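To make the scheduling idea concrete, here is a toy Python sketch of a cost-sensitive scheduler that greedily runs whichever of the α, β, or γ processes currently offers the best estimated log-posterior gain per unit cost. The interface (cost/run-function pairs and a compute budget) is an assumption for illustration, not the paper's explore-exploit formulation.

```python
def schedule_inference(processes, budget):
    """Greedy cost-sensitive scheduling over inference processes.

    processes: dict name -> (cost, run_fn), e.g. keys 'alpha', 'beta',
               'gamma'; run_fn() returns the log-posterior improvement
               obtained by running that process once.
    budget:    total computational budget.
    """
    avg_gain = {name: 1.0 for name in processes}  # optimistic initialization
    runs = {name: 0 for name in processes}
    log_posterior = 0.0
    while True:
        affordable = [n for n in processes if processes[n][0] <= budget]
        if not affordable:
            break
        # Exploit the process with the best estimated gain per unit cost.
        name = max(affordable, key=lambda n: avg_gain[n] / processes[n][0])
        cost, run_fn = processes[name]
        gain = run_fn()
        log_posterior += gain
        budget -= cost
        runs[name] += 1
        avg_gain[name] += (gain - avg_gain[name]) / runs[name]  # running mean
    return log_posterior
```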


Computer Vision and Pattern Recognition (CVPR) | 2012

Sum-product networks for modeling activities with stochastic structure

Mohamed R. Amer; Sinisa Todorovic

This paper addresses recognition of human activities with stochastic structure, characterized by variable space-time arrangements of primitive actions, and conducted by a variable number of actors. We demonstrate that modeling aggregate counts of visual words is surprisingly expressive enough for such a challenging recognition task. An activity is represented by a sum-product network (SPN). The SPN is a mixture of bags-of-words (BoWs) with exponentially many mixture components, where subcomponents are reused by larger ones. The SPN consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using the EM algorithm. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video in terms of activity detection and localization. SPN inference has linear complexity in the number of nodes, under fairly general conditions, enabling fast and scalable recognition. A new Volleyball dataset is compiled and annotated for evaluation. Our classification accuracy and localization precision and recall are superior to those of the state of the art on the benchmark and our Volleyball datasets.
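The MPE parsing mentioned above has a compact recursive form. Below is a minimal Python sketch of MPE evaluation in an SPN, following standard max-product semantics (in the log domain, products become sums and sums become maxima over weighted children); the three node classes are illustrative assumptions, not the paper's implementation.

```python
import math

class Leaf:
    """Terminal node: precomputed log-likelihood of a bag-of-words."""
    def __init__(self, loglik):
        self.loglik = loglik
    def mpe(self):
        return self.loglik

class Product:
    """Product node: one particular configuration of primitive actions."""
    def __init__(self, children):
        self.children = children
    def mpe(self):
        return sum(c.mpe() for c in self.children)  # product = sum of logs

class Sum:
    """Sum node: alternative configurations; MPE keeps the best branch."""
    def __init__(self, weighted_children):
        self.weighted = weighted_children  # list of (weight, child) pairs
    def mpe(self):
        return max(math.log(w) + c.mpe() for w, c in self.weighted)
```

For instance, Sum([(0.7, Leaf(-1.0)), (0.3, Leaf(-0.2))]).mpe() returns the best weighted branch, log(0.7) - 1.0 ≈ -1.36.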


International Conference on Computer Vision (ICCV) | 2011

A chains model for localizing participants of group activities in videos

Mohamed R. Amer; Sinisa Todorovic

Given a video, we would like to recognize group activities, localize video parts where these activities occur, and detect actors involved in them. This advances prior work that typically focuses only on video classification. We make a number of contributions. First, we specify a new, mid-level, video feature aimed at summarizing local visual cues into bags of the right detections (BORDs). BORDs seek to identify the right people who participate in a target group activity among many noisy people detections. Second, we formulate a new, generative, chains model of group activities. Inference of the chains model identifies a subset of BORDs in the video that belong to occurrences of the activity, and organizes them in an ensemble of temporal chains. The chains extend over, and thus localize, the time intervals occupied by the activity. We formulate a new MAP inference algorithm that iterates two steps: i) Warps the chains of BORDs in space and time to their expected locations, so the transformed BORDs can better summarize local visual cues; and ii) Maximizes the posterior probability of the chains. We outperform the state of the art on benchmark UT-Human Interaction and Collective Activities datasets, under reasonable running times.
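The two-step MAP iteration can be written as a short skeleton. In the Python sketch below, warp, candidates, and posterior are assumed callables standing in for the paper's components; only the alternation structure is taken from the description above.

```python
def chains_map_inference(bords, chain, warp, candidates, posterior, iters=10):
    """Alternation skeleton for the two-step MAP inference.

    warp(bord, chain) -> BORD moved to its expected space-time location
    candidates(bords) -> iterable of candidate temporal chains of BORDs
    posterior(chain)  -> posterior probability of a chain
    """
    for _ in range(iters):
        bords = [warp(b, chain) for b in bords]        # step (i): warp BORDs
        chain = max(candidates(bords), key=posterior)  # step (ii): maximize posterior
    return chain
```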


International Conference on Computer Vision (ICCV) | 2013

Monte Carlo Tree Search for Scheduling Activity Recognition

Mohamed R. Amer; Sinisa Todorovic; Alan Fern; Song-Chun Zhu

This paper addresses recognition of human activities with stochastic structure, characterized by variable space-time arrangements of primitive actions, and conducted by a variable number of actors. Our approach classifies the activity of interest and identifies the relevant foreground in the video. Each activity is represented as a mixture distribution of bags-of-words (BoWs) captured by a Sum-Product Network (SPN). In our approach, the SPN represents a linear mixture of many BoWs, where each BoW represents an important foreground part of the activity. This mixture distribution is efficiently computed by organizing the BoWs in a hierarchy, where children BoWs are nested within parent BoWs. The SPN allows us to model this mixture since it consists of terminal nodes representing BoWs, and product and sum nodes organized in a number of layers. The products are aimed at encoding particular configurations of primitive actions, and the sums serve to capture their alternative configurations. SPN inference amounts to parsing the SPN graph, which yields the most probable explanation (MPE) of the video foreground. SPN inference has linear complexity in the number of nodes, under fairly general conditions, enabling fast and scalable recognition. The connectivity of the SPN and the parameters of the BoW distributions are learned under weak supervision using a variational EM algorithm. For our evaluation, we have compiled and annotated a new Volleyball dataset. Our classification accuracy and localization results are superior to those of the state of the art on current benchmarks as well as our Volleyball dataset.
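The Monte Carlo Tree Search component in the title rests on an explore-exploit selection rule. Below is the generic UCB1 rule commonly used in MCTS, sketched in Python; the dict-based node format is an assumption, and this is not the paper's exact scheduler.

```python
import math

def ucb_select(children, c=1.4):
    """UCB1 selection: each child is a dict with 'visits' and 'value'
    (the running sum of rewards collected through that child)."""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always try unvisited children first
        exploit = ch["value"] / ch["visits"]               # average reward
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)
```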


European Conference on Computer Vision (ECCV) | 2014

HiRF: Hierarchical Random Field for Collective Activity Recognition in Videos

Mohamed R. Amer; Peng Lei; Sinisa Todorovic

This paper addresses the problem of recognizing and localizing coherent activities of a group of people, called collective activities, in video. Related work has argued the benefits of capturing long-range and higher-order dependencies among video features for robust recognition. To this end, we formulate a new deep model, called Hierarchical Random Field (HiRF). HiRF models only hierarchical dependencies between model variables. This effectively amounts to modeling higher-order temporal dependencies of video features. We specify an efficient inference procedure for HiRF that iteratively solves a linear program to estimate the latent variables. Learning of HiRF parameters is specified within the max-margin framework. Our evaluation on the benchmark New Collective Activity and Collective Activity datasets demonstrates that HiRF yields superior recognition and localization as compared to the state of the art.
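To illustrate the shape of the iterative inference, here is a toy Python sketch that alternates closed-form coordinate updates between frame-level latent labels and a video-level label. The paper instead solves a linear program at each step, so this is a simplified stand-in with assumed score matrices.

```python
import numpy as np

def hirf_style_inference(unary, pairwise, iters=20):
    """Alternate between frame-level labels and a video-level label.

    unary:    (n_frames, n_labels) per-frame activity scores
    pairwise: (n_labels, n_labels) frame-label / video-label compatibility
    Returns (frame_labels, video_label).
    """
    frame_labels = unary.argmax(axis=1)
    video_label = 0
    for _ in range(iters):
        # Update the video label given the current frame labels.
        video_label = int(pairwise[frame_labels].sum(axis=0).argmax())
        # Update each frame label given the video label.
        new_labels = (unary + pairwise[:, video_label]).argmax(axis=1)
        if (new_labels == frame_labels).all():
            break  # converged
        frame_labels = new_labels
    return frame_labels, video_label
```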


Workshop on Applications of Computer Vision (WACV) | 2014

Multimodal fusion using dynamic hybrid models

Mohamed R. Amer; Behjat Siddiquie; Saad M. Khan; Ajay Divakaran; Harpreet S. Sawhney

We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a shared representation across multiple modalities with time-varying data. The temporal generative model takes into account short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a temporal discriminative model for event detection and classification, which enables modeling long-range temporal dynamics. We evaluate our approach on audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state of the art.
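The staged recipe, learning a shared representation without labels and then training a discriminative model on top, can be sketched in a few lines. In the Python sketch below, PCA stands in for the deep temporal generative model and logistic regression for the temporal discriminative model; both substitutions are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def staged_hybrid(audio, video, labels, dim=32):
    """Stage 1: unsupervised shared representation over fused modalities.
    Stage 2: discriminative classifier trained on that representation."""
    joint = np.hstack([audio, video])           # simple feature-level fusion
    encoder = PCA(n_components=dim).fit(joint)  # generative-stage stand-in
    features = encoder.transform(joint)
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return encoder, clf
```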


Computer Vision and Pattern Recognition (CVPR) | 2016

Facial Attributes Classification Using Multi-task Representation Learning

Max Ehrlich; Timothy J. Shields; Timur Almaev; Mohamed R. Amer

This paper presents a new approach for facial attribute classification using multi-task learning. Unlike other approaches that use hand-engineered features, our model learns a shared feature representation that is well-suited for multiple attribute classification. Learning a joint feature representation enables interaction between different tasks. For learning this shared feature representation we use a Restricted Boltzmann Machine (RBM) based model, enhanced with a factored multi-task component to become a Multi-Task Restricted Boltzmann Machine (MT-RBM). Our approach operates directly on faces and facial landmark points to learn a joint feature representation over all the available attributes. We use an iterative learning approach consisting of a bottom-up/top-down pass to learn the shared representation of our multi-task model, and at inference we use a bottom-up pass to predict the different tasks. Our approach is not restricted to any particular type of attribute; in this paper, however, we focus on facial attributes. We evaluate our approach on three publicly available datasets: Celebrity Faces (CelebA), Multi-task Facial Landmarks (MTFL), and the ChaLearn challenge dataset. We show superior classification performance over the state of the art.
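The bottom-up inference pass amounts to computing a shared hidden representation and reading each attribute off its own head. The Python sketch below shows that flow with a plain sigmoid layer; the shapes and the per-task heads are illustrative assumptions, not the MT-RBM's actual energy-based formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottom_up_predict(x, W_shared, task_heads):
    """One bottom-up pass: a shared hidden layer feeds a separate
    linear head per attribute task.

    x:          (d,) face/landmark feature vector
    W_shared:   (d, h) weights of the shared representation
    task_heads: dict attribute name -> (h,) head weights
    """
    h = sigmoid(x @ W_shared)  # shared multi-task representation
    return {task: float(sigmoid(h @ w)) for task, w in task_heads.items()}
```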


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

Sum Product Networks for Activity Recognition

Mohamed R. Amer; Sinisa Todorovic

This paper addresses detection and localization of human activities in videos. We focus on activities that may have variable spatiotemporal arrangements of parts and variable numbers of actors. Such activities are represented by a sum-product network (SPN). A product node in the SPN represents a particular arrangement of parts, and a sum node represents alternative arrangements. The sums and products are hierarchically organized, and grounded onto space-time windows covering the video. The windows provide evidence about the activity classes based on the Counting Grid (CG) model of visual words. This evidence is propagated bottom-up and top-down to parse the SPN graph for the explanation of the video. The node connectivity and model parameters of the SPN and CG are jointly learned under two settings, weakly supervised and supervised. For evaluation, we use our new Volleyball dataset, along with the benchmark datasets VIRAT, UT-Interactions, KTH, and TRECVID MED 2011. Our video classification and activity localization are superior to those of the state of the art on these datasets.
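For the evidence computation, the published Counting Grid model scores a bag of words in a window by averaging the word distributions that the window covers on the grid. A minimal Python sketch of that window log-likelihood follows; the array layout is an assumption, and this is the generic CG definition rather than this paper's code.

```python
import numpy as np

def counting_grid_loglik(counts, grid, window):
    """Log-likelihood of a bag-of-words under one Counting Grid window.

    counts: (V,) visual-word counts in a space-time window of the video
    grid:   (H, W, V) nonnegative word distributions on the grid
    window: (i, j, h, w) top-left corner and extent of the CG window
    """
    i, j, h, w = window
    theta = grid[i:i + h, j:j + w].mean(axis=(0, 1))  # average the covered cells
    theta = theta / theta.sum()                       # renormalize over words
    return float(counts @ np.log(theta + 1e-12))
```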


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Emotion detection in speech using deep networks

Mohamed R. Amer; Behjat Siddiquie; Colleen Richey; Ajay Divakaran

We propose a novel staged hybrid model for emotion detection in speech. Hybrid models exploit the strength of discriminative classifiers along with the representational power of generative models. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn rich, informative representations. Our proposed hybrid model consists of a generative model, which is used for unsupervised representation learning of short-term temporal phenomena, and a discriminative model, which is used for event detection and classification of long-range temporal dynamics. We evaluate our approach on multiple audio-visual datasets (AVEC, VAM, and SPD) and demonstrate its superiority compared to the state of the art.
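The short-term temporal phenomena that the generative stage models are typically extracted from overlapping speech frames. Below is a standard Python framing routine given as an assumed preprocessing step; the paper does not specify this exact pipeline.

```python
import numpy as np

def short_term_frames(signal, sr, win_ms=25, hop_ms=10):
    """Slice a waveform (length >= one window) into overlapping
    short-term frames and apply a Hann taper to each."""
    win = int(sr * win_ms / 1000)   # samples per frame
    hop = int(sr * hop_ms / 1000)   # samples between frame starts
    n = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n)])
    return frames * np.hanning(win)
```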

Collaboration


Dive into Mohamed R. Amer's collaborations.
