Saad M. Khan
University of Central Florida
Publications
Featured research published by Saad M. Khan.
European Conference on Computer Vision | 2006
Saad M. Khan; Mubarak Shah
Occlusion and lack of visibility in dense crowded scenes make it very difficult to track individual people correctly and consistently. This problem is particularly hard to tackle in single-camera systems. We present a multi-view approach to tracking people in crowded scenes where people may be partially or completely occluding each other. Our approach uses multiple views in synergy, so that information from all views is combined to detect objects. To achieve this we present a novel planar homography constraint to resolve occlusions and robustly determine locations on the ground plane corresponding to the feet of the people. To find tracks, we obtain feet regions over a window of frames and stack them, creating a space-time volume. Feet regions belonging to the same person form contiguous spatio-temporal regions that are clustered using a graph-cuts segmentation approach. Each cluster is the track of a person, and a slice in time of this cluster gives the tracked location. Experimental results are shown in scenes of dense crowds where severe occlusions are quite common. The algorithm is able to accurately track people in all views while maintaining correct correspondences across views. Our algorithm is ideally suited for conditions in which occlusions between people would seriously hamper tracking performance or there simply are not enough features to distinguish between different people.
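The following is a minimal sketch of the ground-plane fusion idea described in this abstract: per-view foreground masks are warped onto a reference ground plane with planar homographies and combined so that only points supported by every view survive. Function names, array shapes, and the synthetic data are illustrative assumptions, not the authors' code.

import numpy as np
import cv2

def fuse_ground_plane(foreground_masks, homographies, out_size):
    # Warp each view's foreground likelihood onto the reference ground plane and
    # combine multiplicatively: a ground-plane point must be foreground in every
    # view to keep a high score (feet regions).
    fused = np.ones(out_size, dtype=np.float32)
    for mask, H in zip(foreground_masks, homographies):
        warped = cv2.warpPerspective(mask, H, out_size[::-1])
        fused *= warped
    return fused

# Synthetic example: two 240x320 views with identity homographies.
masks = [(np.random.rand(240, 320) > 0.5).astype(np.float32) for _ in range(2)]
Hs = [np.eye(3, dtype=np.float32) for _ in range(2)]
occupancy = fuse_ground_plane(masks, Hs, (240, 320))
print(occupancy.shape)  # (240, 320)

Stacking such per-frame maps over a temporal window yields the kind of space-time volume that the abstract describes clustering with graph cuts.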
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2009
Saad M. Khan; Mubarak Shah
Occlusion and lack of visibility in crowded and cluttered scenes make it difficult to track individual people correctly and consistently, particularly in a single view. We present a multi-view approach to solving this problem. In our approach we neither detect nor track objects from any single camera or camera pair; rather, evidence is gathered from all the cameras into a synergistic framework, and detection and tracking results are propagated back to each view. Unlike other multi-view approaches that require fully calibrated views, our approach is purely image-based and uses only 2D constructs. To this end we develop a planar homographic occupancy constraint that fuses foreground likelihood information from multiple views to resolve occlusions and localize people on a reference scene plane. For greater robustness, this process is extended to multiple planes parallel to the reference plane in the framework of plane-to-plane homologies. Our fusion methodology also models scene clutter using the Schmieder and Weathersby clutter measure, which acts as a confidence prior to assign higher fusion weight to views with less clutter. Detection and tracking are performed simultaneously by graph-cuts segmentation of tracks in the space-time occupancy likelihood data. Experimental results, with detailed qualitative and quantitative analysis, are demonstrated in challenging multi-view crowded scenes.
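As a companion to the clutter-weighted fusion mentioned above, here is a minimal sketch in which per-view foreground likelihoods (already warped to the reference plane) are combined in log space with weights derived from per-view clutter scores. The simple inverse-clutter weighting is an illustrative stand-in for the Schmieder and Weathersby measure, and all names and data are assumptions.

import numpy as np

def weighted_occupancy(warped_likelihoods, clutter_scores, eps=1e-6):
    # Views with lower clutter receive higher weight in the fused occupancy map.
    weights = 1.0 / (np.asarray(clutter_scores, dtype=np.float64) + eps)
    weights /= weights.sum()
    log_occ = np.zeros_like(warped_likelihoods[0], dtype=np.float64)
    for w, like in zip(weights, warped_likelihoods):
        log_occ += w * np.log(np.clip(like, eps, 1.0))
    return np.exp(log_occ)

# Synthetic example: three registered likelihood maps with different clutter levels.
views = [np.random.rand(240, 320) for _ in range(3)]
occ = weighted_occupancy(views, clutter_scores=[0.8, 0.3, 0.5])
print(occ.shape)  # (240, 320)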
Computer Vision and Pattern Recognition | 2008
Pingkun Yan; Saad M. Khan; Mubarak Shah
In this paper we present a novel approach using a 4D (x, y, z, t) action feature model (4D-AFM) for recognizing actions from arbitrary views. The 4D-AFM elegantly encodes the shape and motion of actors observed from multiple views. The modeling process starts with reconstructing 3D visual hulls of actors at each time instant. Spatio-temporal action features are then computed in each view by analyzing the differential geometric properties of spatio-temporal volumes (3D STVs) generated by concatenating the actor's silhouette over the course of the action (x, y, t). These features are mapped to the sequence of 3D visual hulls over time (4D) to build the initial 4D-AFM. Actions are recognized based on the scores of matching action features from the input videos to the model points of 4D-AFMs by exploiting pairwise interactions of features. Promising recognition results are demonstrated on the multi-view IXMAS dataset using both single-view and multi-view input videos.
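A minimal sketch of the first representational step this abstract relies on: building a 3D (x, y, t) spatio-temporal volume by stacking an actor's silhouettes over time. Silhouette extraction, the differential geometric features, and the mapping onto the 4D visual-hull model are not reproduced here, and the data are synthetic.

import numpy as np

def build_stv(silhouettes):
    # Concatenate per-frame binary silhouettes (H x W) into an H x W x T volume.
    return np.stack(silhouettes, axis=-1).astype(np.uint8)

# Synthetic example: 30 frames of a 64x64 silhouette.
frames = [(np.random.rand(64, 64) > 0.7).astype(np.uint8) for _ in range(30)]
stv = build_stv(frames)
print(stv.shape)  # (64, 64, 30)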
International Conference on Computer Vision | 2007
Pingkun Yan; Saad M. Khan; Mubarak Shah
In this paper, a novel object class detection method based on 3D object modeling is presented. Instead of using a complicated mechanism to relate multiple 2D training views, the proposed method establishes spatial connections between these views by mapping them directly to the surface of a 3D model. The 3D shape of an object is reconstructed using a homographic framework from a set of model views around the object and is represented by a volume consisting of binary slices. Features are computed in each 2D model view and mapped to the 3D shape model using the same homographic framework. To generalize the model for object class detection, features from supplemental views are also considered. A codebook is constructed from all of these features, and a 3D feature model is then built. Given a 2D test image, correspondences between the 3D feature model and the test view are identified by matching the detected features. Based on the 3D locations of the corresponding features, several viewing-plane hypotheses can be made. The one with the highest confidence is then used to detect the object using feature location matching. The performance of the proposed method has been evaluated on the PASCAL VOC challenge dataset, and promising results are demonstrated.
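A minimal sketch of the codebook-construction step mentioned above: local feature descriptors pooled from the model and supplemental views are clustered into visual words. Descriptor extraction and the mapping of words onto the 3D shape model are assumed to happen elsewhere; the descriptor dimensionality and word count are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, num_words=64):
    # Cluster descriptors (N x D) into a visual-word codebook (num_words x D).
    km = KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(descriptors)
    return km.cluster_centers_

descs = np.random.rand(5000, 128).astype(np.float32)  # e.g. 128-D SIFT-like descriptors
codebook = build_codebook(descs)
print(codebook.shape)  # (64, 128)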
International Conference on Computer Vision | 2007
Saad M. Khan; Pingkun Yan; Mubarak Shah
This paper presents a purely image-based approach to fusing foreground silhouette information from multiple arbitrary views. Our approach does not require 3D constructs like camera calibration to carve out 3D voxels or project visual cones in 3D space. Using planar homographies and foreground likelihood information from a set of arbitrary views, we show that visual hull intersection can be performed in the image plane without going into 3D space. This process delivers a 2D grid of object occupancy likelihoods representing a cross-sectional slice of the object. Subsequent slices of the object are obtained by extending the process to planes parallel to a reference plane in a direction along the body of the object. We show that the homographies of these new planes between views can be computed in the framework of plane-to-plane homologies, using the homography induced by a reference plane and the vanishing point of the reference direction. Occupancy grids are stacked on top of each other, creating a three-dimensional data structure that encapsulates the object shape and location. Object structure is finally segmented out by minimizing an energy functional over the surface of the object in a level-sets formulation. We show the application of our method on complicated object shapes as well as cluttered environments containing multiple objects.
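A minimal sketch of sweeping the image-plane fusion across planes parallel to the reference plane. Under the homology construction the abstract refers to, the homography of a parallel plane is assumed here to differ from the reference-plane homography only in its last column, shifted along the vanishing point of the reference direction by a per-slice scalar gamma; the exact parameterization, the data, and all names are illustrative assumptions.

import numpy as np
import cv2

def parallel_plane_homography(H_ref, vanishing_point, gamma):
    # Homography induced by a plane parallel to the reference plane; vanishing_point
    # is a length-3 homogeneous image point of the reference (normal) direction.
    H = H_ref.astype(np.float64).copy()
    H[:, 2] += gamma * vanishing_point
    return H

def occupancy_volume(masks, H_refs, v_points, gammas, out_size):
    # One fused occupancy slice per gamma, stacked into an (S x H x W) volume.
    slices = []
    for g in gammas:
        fused = np.ones(out_size, dtype=np.float32)
        for mask, H0, v in zip(masks, H_refs, v_points):
            H = parallel_plane_homography(H0, v, g)
            fused *= cv2.warpPerspective(mask, H, out_size[::-1])
        slices.append(fused)
    return np.stack(slices, axis=0)

# Synthetic example: two views, identity reference homographies, five slices.
masks = [np.random.rand(240, 320).astype(np.float32) for _ in range(2)]
Hs = [np.eye(3) for _ in range(2)]
vs = [np.array([0.0, 0.0, 1.0]) for _ in range(2)]
vol = occupancy_volume(masks, Hs, vs, np.linspace(0.0, 1.0, 5), (240, 320))
print(vol.shape)  # (5, 240, 320)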
Proceedings of SPIE, the International Society for Optical Engineering | 2006
Fahd Rafi; Saad M. Khan; Khurram Shafiq; Mubarak Shah
In this paper we present an algorithm for the autonomous navigation of an unmanned aerial vehicle (UAV) following a moving target. The UAV under consideration is a fixed-wing aircraft that has physical constraints on airspeed and maneuverability. The target, however, is not constrained and can move in any general pattern. We show a single circular-pattern navigation algorithm that works for targets moving at any speed and in any pattern, whereas other methods switch between different navigation strategies in different scenarios. The simulations take into account that the aircraft also needs to visually track the target using a mounted camera. The camera is likewise controlled by the algorithm according to the position and orientation of the aircraft and the position of the target. Experiments show that the algorithm successfully tracks and follows moving targets.
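For intuition only, here is a generic standoff-orbit guidance step of the kind this abstract is concerned with: the aircraft is steered to circulate on a fixed-radius circle centered on the (possibly moving) target. This is an illustrative textbook-style law, not the paper's navigation algorithm, and every name and constant is an assumption.

import numpy as np

def orbit_heading(uav_pos, target_pos, orbit_radius, clockwise=True):
    # Desired heading (radians) that blends circulation around the target
    # with attraction toward the standoff circle of radius orbit_radius.
    rel = np.asarray(uav_pos, dtype=float) - np.asarray(target_pos, dtype=float)
    dist = np.linalg.norm(rel) + 1e-9
    radial = rel / dist                        # unit vector from target to UAV
    tangent = np.array([-radial[1], radial[0]])
    if clockwise:
        tangent = -tangent
    inward = -radial * np.tanh((dist - orbit_radius) / orbit_radius)
    desired = tangent + inward
    return np.arctan2(desired[1], desired[0])

hdg = orbit_heading(uav_pos=(120.0, 40.0), target_pos=(0.0, 0.0), orbit_radius=80.0)
print(np.degrees(hdg))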
ACM Multimedia | 2005
Saad M. Khan; Mubarak Shah
Most work in human activity recognition is limited to relatively simple behaviors like sitting down, standing up, or other dramatic posture changes. Very little has been achieved in detecting more complicated behaviors, especially those characterized by the collective participation of several individuals. In this work we present a novel approach to recognizing the class of activities characterized by rigidity of formation, for example parades of people, airplane flight formations, or herds of animals. The central idea is to model the entire group as a collective rather than focusing on each individual separately. We model the formation as a 3D polygon with each corner representing a participating entity. Tracks from the entities are treated as tracks of feature points on the 3D polygon. Based on the rank of the track matrix, we can determine whether the 3D polygon under consideration behaves rigidly or undergoes non-rigid deformation. Our method is invariant to camera motion and does not require an a priori model or a training phase.
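A minimal sketch of the rank-based rigidity test described above: the 2D tracks of all participating entities are stacked into a measurement matrix and its effective rank is inspected via SVD. The rank threshold (at most 4, in the spirit of affine-camera factorization) and the tolerance are illustrative choices rather than the paper's exact criterion.

import numpy as np

def track_matrix(tracks):
    # tracks: list of (T, 2) arrays, one per entity -> (2T, N) measurement matrix.
    cols = [np.asarray(t, dtype=float).reshape(-1) for t in tracks]
    return np.stack(cols, axis=1)

def is_rigid_formation(tracks, max_rank=4, rel_tol=1e-2):
    W = track_matrix(tracks)
    s = np.linalg.svd(W, compute_uv=False)
    effective_rank = int(np.sum(s > rel_tol * s[0]))
    return effective_rank <= max_rank

# Synthetic example: three entities translating together (a rigid formation).
T = 50
base = np.cumsum(np.random.randn(T, 2) * 0.1, axis=0)
tracks = [base + np.array(offset) for offset in ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
print(is_rigid_formation(tracks))  # True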
Computer Vision and Pattern Recognition | 2010
Saad M. Khan; Hui Cheng; Dennis Lee Matthies; Harpreet S. Sawhney
We present an approach that uses detailed 3D models to detect and classify objects into fine-grained vehicle categories. Unlike other approaches that use silhouette information to fit a 3D model, our approach uses the complete appearance from the image. Each 3D model has a set of salient location markers that are determined a priori. These salient locations represent a sub-sampling of the 3D locations that make up the model. Scene conditions are simulated in the rendering of the 3D models, and the salient locations are used to bootstrap a HoG-based feature classifier. HoG features are computed in both rendered and real scenes, and a novel object match score, the 'Salient Feature Match Distribution Matrix', is computed. For each 3D model we also learn the patterns of misalignment with other vehicle types and use them as an additional cue for classification. Results are presented on a challenging aerial video dataset consisting of vehicle imagery from various viewpoints and environmental conditions.
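A minimal sketch of comparing gradient-orientation descriptors at pre-defined salient locations between a rendered model view and a real image. The simple 8-bin histogram stands in for the HoG features, and the cosine-similarity matrix stands in for the 'Salient Feature Match Distribution Matrix'; all names, sizes, and data are illustrative.

import numpy as np

def orientation_histogram(img, center, half=8, bins=8):
    # Gradient-orientation histogram over a (2*half)^2 patch around center=(row, col).
    y, x = center
    patch = img[y - half:y + half, x - half:x + half].astype(np.float64)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientations in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-9)

def match_matrix(rendered, real, salient_points):
    # Cosine similarities between descriptors at the salient points in both images.
    d_r = np.array([orientation_histogram(rendered, p) for p in salient_points])
    d_i = np.array([orientation_histogram(real, p) for p in salient_points])
    return d_r @ d_i.T

rendered = np.random.rand(128, 128)
real = np.random.rand(128, 128)
pts = [(32, 32), (64, 64), (96, 96)]
print(match_matrix(rendered, real, pts).shape)  # (3, 3)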
Workshop on Applications of Computer Vision | 2014
Mohamed R. Amer; Behjat Siddiquie; Saad M. Khan; Ajay Divakaran; Harpreet S. Sawhney
We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich, informative space that allows for data generation and joint feature representation, which discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a shared representation across multiple modalities with time-varying data. The temporal generative model takes into account short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a temporal discriminative model for event detection and classification, which enables modeling of long-range temporal dynamics. We evaluate our approach on audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state-of-the-art.
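As a drastically simplified, illustrative stand-in for the hybrid pipeline: an unsupervised shared representation is learned over concatenated audio and visual features (PCA here plays the role of the deep temporal generative model), and a discriminative classifier is then trained on that shared code. The temporal modeling and missing-data generation of the actual model are not reproduced; the data are synthetic.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
audio = rng.normal(size=(500, 40))       # per-frame audio features (synthetic)
video = rng.normal(size=(500, 60))       # per-frame visual features (synthetic)
labels = rng.integers(0, 2, size=500)    # event / no-event labels (synthetic)

joint = np.hstack([audio, video])                    # concatenate modalities
shared = PCA(n_components=20).fit_transform(joint)   # unsupervised shared representation
clf = LogisticRegression(max_iter=1000).fit(shared, labels)
print(clf.score(shared, labels))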
International Conference on Multimedia and Expo | 2013
Behjat Siddiquie; Saad M. Khan; Ajay Divakaran; Harpreet S. Sawhney
We present a novel approach for multi-modal affect analysis in human interactions that is capable of integrating data from multiple modalities while also taking into account temporal dynamics. Our fusion approach, Joint Hidden Conditional Random Fields (JHCRFs), combines the advantages of purely feature-level (early) fusion with late fusion (CRFs on individual modalities) to simultaneously learn the correlations between features from multiple modalities as well as their temporal dynamics. Our approach addresses major shortcomings of other fusion approaches, such as the domination of other modalities by a single modality with early fusion and the loss of cross-modal information with late fusion. Extensive results on the AVEC 2011 dataset show that we outperform the state-of-the-art on the Audio Sub-Challenge, while achieving competitive performance on the Video Sub-Challenge and the Audiovisual Sub-Challenge.
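For context on the two baselines this abstract contrasts, here is a minimal sketch of early fusion (one classifier over concatenated modality features) versus late fusion (independent per-modality classifiers fused at the probability level). The JHCRF model itself is not reproduced; the features, labels, and classifier choice are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
audio = rng.normal(size=(400, 30))
video = rng.normal(size=(400, 50))
y = rng.integers(0, 2, size=400)

# Early fusion: a single classifier over concatenated features.
early = LogisticRegression(max_iter=1000).fit(np.hstack([audio, video]), y)

# Late fusion: independent classifiers whose probabilities are averaged.
clf_a = LogisticRegression(max_iter=1000).fit(audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(video, y)
late_prob = 0.5 * (clf_a.predict_proba(audio)[:, 1] + clf_v.predict_proba(video)[:, 1])

print(early.score(np.hstack([audio, video]), y), ((late_prob > 0.5) == y).mean())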