Publication


Featured research published by Mun Wai Lee.


Proceedings of the IEEE | 2010

I2T: Image Parsing to Text Description

Benjamin Z. Yao; Xiong Yang; Liang Lin; Mun Wai Lee; Song-Chun Zhu

In this paper, we present an image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding. The proposed I2T framework follows three steps: 1) input images (or video frames) are decomposed into their constituent visual patterns by an image parsing engine, in a spirit similar to parsing sentences in natural language; 2) the image parsing results are converted into a semantic representation in the Web Ontology Language (OWL), which enables seamless integration with general knowledge bases; and 3) a text generation engine converts the results from the previous steps into semantically meaningful, human-readable, and query-able text reports. The centerpiece of the I2T framework is an and-or graph (AoG) visual knowledge representation, which provides a graphical representation serving as prior knowledge for representing diverse visual patterns and provides top-down hypotheses during image parsing. The AoG embodies vocabularies of visual elements including primitives, parts, objects, and scenes, as well as a stochastic image grammar that specifies syntactic relations (i.e., compositional) and semantic relations (e.g., categorical, spatial, temporal, and functional) between these visual elements. Therefore, the AoG is a unified model of both categorical and symbolic representations of visual knowledge. The proposed I2T framework has two objectives. First, we use a semiautomatic method to parse images from the Internet in order to build an AoG for visual knowledge representation. Our goal is to make the parsing process increasingly automatic using the learned AoG model. Second, we use automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications. In the case studies at the end of this paper, we demonstrate two automatic I2T systems: a maritime and urban scene video surveillance system and a real-time automatic driving scene understanding system.
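
As a rough illustration of step 2 of this pipeline, the sketch below maps a toy parse-graph result into OWL/RDF-style triples. The node classes, property names, and namespace prefix are hypothetical, invented for the example, not the paper's actual ontology.

```python
# Illustrative sketch only: how parse-engine output might be mapped to
# OWL/RDF-style triples before text generation. The categories and
# predicates ("Boat", "isMoving", "locatedOn") are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ParseNode:
    """One node of the parse graph produced by the image parsing engine."""
    node_id: str
    category: str                      # e.g. "Boat", "Water", "Sky"
    attributes: dict = field(default_factory=dict)
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (predicate, target id)


def to_triples(nodes: List[ParseNode], ns: str = "ex:") -> List[Tuple[str, str, str]]:
    """Flatten a parse graph into subject-predicate-object triples."""
    triples = []
    for n in nodes:
        triples.append((ns + n.node_id, "rdf:type", ns + n.category))
        for key, value in n.attributes.items():
            triples.append((ns + n.node_id, ns + key, str(value)))
        for predicate, target in n.relations:
            triples.append((ns + n.node_id, ns + predicate, ns + target))
    return triples


if __name__ == "__main__":
    scene = [
        ParseNode("obj1", "Boat", {"isMoving": True}, [("locatedOn", "obj2")]),
        ParseNode("obj2", "Water"),
    ]
    for s, p, o in to_triples(scene):
        print(s, p, o, ".")
```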


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2009

Human Pose Tracking in Monocular Sequence Using Multilevel Structured Models

Mun Wai Lee; Ramakant Nevatia

Tracking human body poses in monocular video has many important applications. The problem is challenging in realistic scenes due to background clutter, variation in human appearance, and self-occlusion. The complexity of pose tracking is further increased when there are multiple people whose bodies may inter-occlude. We propose a three-stage approach with a multi-level state representation that enables a hierarchical estimation of 3D body poses. Our method addresses various issues including automatic initialization, data association, and self- and inter-occlusion. In the first stage, humans are tracked as foreground blobs and their positions and sizes are coarsely estimated. In the second stage, parts such as the face, shoulders, and limbs are detected using various cues, and the results are combined by a grid-based belief propagation algorithm to infer 2D joint positions. The derived belief maps are used as proposal functions in the third stage to infer the 3D pose using data-driven Markov chain Monte Carlo. Experimental results on several realistic indoor video sequences show that the method is able to track multiple persons through complex movements, including sitting and turning, with self- and inter-occlusion.
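
The belief maps mentioned in the second stage can be read as sampling distributions for joint locations. The following is a minimal sketch of that idea under simplified assumptions (a single joint on a plain 2D grid); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of how a 2D belief map over image
# locations could serve as a proposal function: joint-position candidates
# are sampled in proportion to the belief at each grid cell.

import numpy as np


def sample_from_belief_map(belief: np.ndarray, n_samples: int, rng=None):
    """Draw (row, col) proposals with probability proportional to belief."""
    rng = rng or np.random.default_rng()
    p = belief.ravel().astype(float)
    p /= p.sum()
    idx = rng.choice(p.size, size=n_samples, p=p)
    rows, cols = np.unravel_index(idx, belief.shape)
    return np.stack([rows, cols], axis=1)


if __name__ == "__main__":
    # Toy belief map for one joint (e.g. left elbow) on a 48x64 grid.
    belief = np.ones((48, 64))
    belief[20:25, 30:35] = 50.0          # a strong detection response
    print(sample_from_belief_map(belief, n_samples=5))
```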


European Conference on Computer Vision | 2004

Human Upper Body Pose Estimation in Static Images

Mun Wai Lee; Isaac Cohen

Estimating human pose in static images is challenging due to the high-dimensional state space, the presence of image clutter, and ambiguities of image observations. We present an MCMC framework for estimating 3D human upper body pose. A generative model, comprising the human articulated structure, shape, and clothing models, is used to formulate likelihood measures for evaluating solution candidates. We adopt a data-driven proposal mechanism for searching the solution space efficiently. We introduce proposal maps, an efficient way of implementing inference proposals derived from multiple types of image cues. Qualitative and quantitative results show that the technique is effective in estimating 3D body pose over a variety of images.
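
As a loose illustration of how a generative model can turn image cues into a likelihood for a pose candidate, here is a toy scoring function. The part names, cue maps, and scoring rule are assumptions made for the example, not the paper's actual shape and clothing models.

```python
# Hedged illustration, not the paper's implementation: a crude image
# likelihood for a candidate upper-body pose, scored by reading cue-map
# responses (e.g. skin colour, edges) at the 2D locations to which the
# articulated model projects each part.

import numpy as np


def log_likelihood(part_locations, cue_maps, eps=1e-6):
    """Sum log cue responses over projected part locations.

    part_locations: dict part name -> (row, col) after projecting the 3D pose
    cue_maps:       dict part name -> 2D array of per-pixel responses in [0, 1]
    """
    score = 0.0
    for part, (r, c) in part_locations.items():
        response = cue_maps[part][int(r), int(c)]
        score += np.log(response + eps)
    return score


if __name__ == "__main__":
    h, w = 120, 160
    cue_maps = {"head": np.full((h, w), 0.1), "torso": np.full((h, w), 0.1)}
    cue_maps["head"][30, 80] = 0.9        # toy face-detector response
    cue_maps["torso"][70, 80] = 0.8
    good = {"head": (30, 80), "torso": (70, 80)}
    bad = {"head": (10, 10), "torso": (100, 150)}
    print(log_likelihood(good, cue_maps), ">", log_likelihood(bad, cue_maps))
```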


IEEE Workshop on Motion and Video Computing | 2002

Particle filter with analytical inference for human body tracking

Mun Wai Lee; Isaac Cohen; Soon Ki Jung

The paper introduces a framework that integrates analytical inference into the particle filtering scheme for human body tracking. The analytical inference is provided by body part detection, and is used to update subsets of state parameters representing the human pose. This reduces the degree of randomness and decreases the required number of particles. This new technique is a significant improvement over standard particle filtering, with the advantages of performing automatic track initialization, recovering from tracking failures, and reducing the computational load.
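
A simplified sketch of the idea follows: a plain particle filter in which a part detection, when available, analytically overwrites the corresponding subset of each particle's state dimensions. The state dimensions, noise levels, and likelihood below are toy choices for illustration only.

```python
# Simplified sketch (not the authors' code): a particle filter in which an
# external part detection analytically overwrites a subset of each particle's
# state dimensions, so less random search is needed in those dimensions.

import numpy as np

rng = np.random.default_rng(0)


def step(particles, weights, likelihood_fn, detection=None, dims=None, noise=0.05):
    """One predict/update cycle for particles of shape (N, D)."""
    # Predict: random-walk diffusion of the full state.
    particles = particles + rng.normal(scale=noise, size=particles.shape)

    # Analytical inference: if a detector located some parts, write those
    # values directly into the corresponding state dimensions.
    if detection is not None:
        particles[:, dims] = detection + rng.normal(scale=noise / 5,
                                                    size=(len(particles), len(dims)))

    # Update and normalize weights, then resample.
    weights = weights * np.array([likelihood_fn(p) for p in particles])
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))


if __name__ == "__main__":
    true_state = np.array([0.5, -0.2, 1.0])     # toy 3-D "pose"
    lik = lambda p: np.exp(-10 * np.sum((p - true_state) ** 2))
    parts = rng.normal(size=(200, 3))
    w = np.full(200, 1.0 / 200)
    for _ in range(20):
        parts, w = step(parts, w, lik, detection=true_state[:2], dims=[0, 1])
    print("estimate:", parts.mean(axis=0))
```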


IEEE Workshop on Motion and Video Computing | 2007

Body Part Detection for Human Pose Estimation and Tracking

Mun Wai Lee; Ram Nevatia

Accurate 3-D human body pose tracking from a monocular video stream is important for a number of applications. We describe a novel hierarchical approach for tracking human pose that uses edge-based features during the coarse stage and additional features later for global optimization. First, humans are detected by motion and tracked by fitting an ellipse in the image. Then, body components are found using edge features and used to estimate the 2D positions of the body joints accurately. This helps to bootstrap the estimation of 3D pose using a sampling-based search method in the last stage. We present experimental results with sequences of different realistic scenes to illustrate the performance of the method.
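
The first stage's ellipse fitting can be illustrated with a generic moments-based fit to a foreground mask; the sketch below is such a fit under that assumption, not the paper's exact procedure.

```python
# Illustrative only: fitting an ellipse to a foreground blob via second-order
# moments, the kind of coarse human localisation described as the first stage
# above. This is a generic moments-based fit, not the paper's exact code.

import numpy as np


def fit_ellipse(mask: np.ndarray):
    """Return centre (row, col), axis lengths, and orientation of a binary blob."""
    rows, cols = np.nonzero(mask)
    cy, cx = rows.mean(), cols.mean()
    cov = np.cov(np.stack([rows - cy, cols - cx]))
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = 2.0 * np.sqrt(eigvals)               # ~1-sigma ellipse axes
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])
    return (cy, cx), axes, angle


if __name__ == "__main__":
    mask = np.zeros((120, 80), dtype=bool)
    mask[20:100, 30:50] = True                   # toy upright "person" blob
    centre, axes, angle = fit_ellipse(mask)
    print(centre, axes, np.degrees(angle))
```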


IEEE MultiMedia | 2014

Joint Video and Text Parsing for Understanding Events and Answering Queries

Kewei Tu; Meng Meng; Mun Wai Lee; Tae Eun Choe; Song-Chun Zhu

This article proposes a multimedia analysis framework to process video and text jointly for understanding events and answering user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probability distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond the traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the form of who, what, when, where, and why. The authors empirically evaluated the system based on comparison against ground truth as well as the accuracy of query answering, and obtained satisfactory results.
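
As a very reduced sketch of the joint inference module's matching step, the following pairs entities from a video parse graph and a text parse graph by category label. The data structures and the greedy matching rule are assumptions for illustration; the published system performs matching, deduction, and revision over much richer graphs.

```python
# Very reduced sketch, not the published system: pairing entities that appear
# in both the video parse graph and the text parse graph by category label,
# as a first step toward building a single joint parse graph.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    entity_id: str
    category: str          # e.g. "person", "car", "enter-event"
    source: str            # "video" or "text"


def match_entities(video: List[Entity], text: List[Entity]) -> List[Tuple[str, str]]:
    """Greedy one-to-one matching of video and text entities by category."""
    matches, used = [], set()
    for t in text:
        for v in video:
            if v.entity_id not in used and v.category == t.category:
                matches.append((v.entity_id, t.entity_id))
                used.add(v.entity_id)
                break
    return matches


if __name__ == "__main__":
    video_pg = [Entity("v1", "person", "video"), Entity("v2", "car", "video")]
    text_pg = [Entity("t1", "person", "text")]
    print(match_entities(video_pg, text_pg))     # [('v1', 't1')]
```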


Workshop on Applications of Computer Vision | 2005

Dynamic Human Pose Estimation using Markov Chain Monte Carlo Approach

Mun Wai Lee; Ramakant Nevatia

This paper addresses the problem of tracking human body pose in monocular video, including automatic pose initialization and re-initialization after tracking failures caused by partial occlusion or unreliable observations. We propose a method based on data-driven Markov chain Monte Carlo (DD-MCMC) that uses bottom-up techniques to generate state proposals for pose estimation and initialization. This method allows us to exploit different image cues and consolidate the inferences using a representation known as proposal maps. We present experimental results with an indoor video sequence.
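
A minimal DD-MCMC sketch is given below under strong simplifications (per-dimension discrete proposal maps over a toy 2D state rather than full 3D body poses). It shows the core mechanism the abstract describes: bottom-up proposal maps drive the moves, and the Metropolis-Hastings ratio corrects for the non-symmetric proposal.

```python
# Minimal data-driven MCMC sketch under simplifying assumptions; not the
# paper's implementation. Proposals come from bottom-up "proposal maps"
# rather than blind random walks.

import numpy as np

rng = np.random.default_rng(1)


def dd_mcmc(log_post, proposal_maps, bin_values, x0, n_iter=2000):
    """proposal_maps[d] is a normalized histogram over bin_values[d] for dim d."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        d = rng.integers(len(x))
        q = proposal_maps[d]
        j = rng.choice(len(q), p=q)                 # sample a bin from the map
        x_new = x.copy()
        x_new[d] = bin_values[d][j]
        j_old = int(np.argmin(np.abs(bin_values[d] - x[d])))
        # Metropolis-Hastings ratio for an independence proposal in dim d.
        log_ratio = (log_post(x_new) - log_post(x)
                     + np.log(q[j_old] + 1e-12) - np.log(q[j] + 1e-12))
        if np.log(rng.random()) < log_ratio:
            x = x_new
    return x


if __name__ == "__main__":
    target = np.array([0.3, -0.7])
    log_post = lambda x: -20 * np.sum((x - target) ** 2)
    bins = [np.linspace(-1, 1, 41), np.linspace(-1, 1, 41)]
    # Toy proposal maps peaked near where a detector "thinks" each value is.
    maps = [np.exp(-30 * (b - t) ** 2) for b, t in zip(bins, target)]
    maps = [m / m.sum() for m in maps]
    print(dd_mcmc(log_post, maps, bins, x0=[0.0, 0.0]))
```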


International Conference on Pattern Recognition | 2006

Optimal Global Mosaic Generation from Retinal Images

Tae Eun Choe; Isaac Cohen; Mun Wai Lee; Gérard G. Medioni

We present a method to construct a mosaic from multiple color and fluorescein retinal images. A set of images taken from different views at different times is difficult to register sequentially due to variations in color and intensity across images. We propose a method to register images globally in order to minimize the registration error and to find optimal registration pairs. The reference frame that gives the minimum registration error is found by the Floyd-Warshall all-pairs shortest-path algorithm, and all other images are registered to this reference frame using an affine transformation model. We present experimental results to validate the proposed method.
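
Assuming a matrix of pairwise registration errors, the reference-frame selection can be sketched as below: Floyd-Warshall propagates costs through chains of registrations, and the frame with the smallest total cost to all others is chosen. This is only a schematic reconstruction of the selection step, not the authors' code.

```python
# Schematic sketch of reference-frame selection from assumed pairwise
# registration errors; not the authors' implementation.

import numpy as np


def floyd_warshall(cost: np.ndarray) -> np.ndarray:
    """All-pairs shortest path over a dense cost matrix (inf = no direct edge)."""
    d = cost.copy()
    n = len(d)
    for k in range(n):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d


def pick_reference(cost: np.ndarray) -> int:
    """Index of the frame minimizing total registration cost to all others."""
    d = floyd_warshall(cost)
    return int(np.argmin(d.sum(axis=1)))


if __name__ == "__main__":
    inf = np.inf
    # Toy pairwise registration errors between 4 retinal images.
    cost = np.array([[0.0, 1.0, inf, 4.0],
                     [1.0, 0.0, 2.0, inf],
                     [inf, 2.0, 0.0, 1.5],
                     [4.0, inf, 1.5, 0.0]])
    print("reference frame:", pick_reference(cost))
```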


Computer Vision and Pattern Recognition | 2008

SAVE: A framework for semantic annotation of visual events

Mun Wai Lee; Asaad Hakeem; Niels Haering; Song-Chun Zhu

In this paper, we propose a framework that performs automatic semantic annotation of visual events (SAVE). This is an enabling technology for content-based video annotation, query, and retrieval, with applications in Internet video search and video data mining. The method involves identifying objects in the scene, describing their inter-relations, detecting events of interest, and representing them semantically in a human-readable and query-able format. The SAVE framework is composed of three main components. The first component is an image parsing engine that performs scene content extraction using bottom-up image analysis and a stochastic attribute image grammar, where we define a visual vocabulary from pixels, primitives, parts, objects, and scenes, and specify their spatio-temporal or compositional relations; a combined bottom-up and top-down strategy is used for inference. The second component is an event inference engine, where the Video Event Markup Language (VEML) is adopted for semantic representation, and a grammar-based approach is used for event analysis and detection. The third component is the text generation engine, which generates text reports using Head-Driven Phrase Structure Grammar (HPSG). The main contribution of this paper is a framework for an end-to-end system that infers visual events and annotates a large collection of videos. Experiments with maritime and urban scenes indicate the feasibility of the proposed approach.
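
To make the annotation output concrete, here is a hypothetical VEML-style XML snippet produced with standard tooling. The element and attribute names are invented for the example and are not the actual VEML schema.

```python
# Hypothetical illustration of a machine-readable event annotation of the kind
# the SAVE pipeline emits. Element and attribute names are made up for the
# example and do not reproduce the real VEML schema.

import xml.etree.ElementTree as ET


def annotate_event(event_type, agent, location, start_frame, end_frame):
    event = ET.Element("event", attrib={"type": event_type})
    ET.SubElement(event, "agent").text = agent
    ET.SubElement(event, "location").text = location
    interval = ET.SubElement(event, "interval")
    interval.set("start", str(start_frame))
    interval.set("end", str(end_frame))
    return event


if __name__ == "__main__":
    e = annotate_event("boat-approaching-dock", "boat-03", "harbor-east", 1240, 1510)
    print(ET.tostring(e, encoding="unicode"))
```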


International Conference on Computer Vision | 2005

3D human action recognition using spatio-temporal motion templates

Fengjun Lv; Ramakant Nevatia; Mun Wai Lee

Our goal is automatic recognition of basic human actions, such as stand, sit, and wave hands, to aid in natural communication between a human and a computer. Human actions are inferred from human body joint motions, but such data is high-dimensional, and large spatial and temporal variations may occur in executing the same action. We present a learning-based approach for the representation and recognition of 3D human action. Each action is represented by a template consisting of a set of channels with weights. Each channel corresponds to the evolution of one 3D joint coordinate, and its weight is learned according to the Neyman-Pearson criterion. We use the learned templates to recognize actions based on a χ² error measure. Results of recognizing 22 actions on a large set of motion capture sequences, as well as several annotated and automatically tracked sequences, show the effectiveness of the proposed algorithm.
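
A loose sketch of the recognition step follows: a weighted χ² distance between per-channel features of an observation and each action template, with the nearest template winning. The features, weights, and normalization are toy values; the paper's Neyman-Pearson weight learning is not reproduced here.

```python
# Loose sketch of template-based recognition with a weighted chi-square
# distance. Channel weights and features are toy values, not the paper's.

import numpy as np


def weighted_chi2(template, observed, weights, eps=1e-12):
    """Per-channel chi-square distances combined with learned channel weights.

    template, observed: arrays of shape (n_channels, n_bins), non-negative
    weights:            array of shape (n_channels,)
    """
    num = (template - observed) ** 2
    den = template + observed + eps
    per_channel = (num / den).sum(axis=1)
    return float(np.dot(weights, per_channel))


def classify(observed, templates, weights):
    """Return the action label whose template is closest to the observation."""
    return min(templates, key=lambda a: weighted_chi2(templates[a], observed, weights))


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    templates = {"sit": rng.random((5, 8)), "wave": rng.random((5, 8))}
    weights = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
    observed = templates["wave"] + 0.05 * rng.random((5, 8))
    print(classify(observed, templates, weights))     # expected output: wave
```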

Collaboration


Dive into Mun Wai Lee's collaborations.

Top Co-Authors

Isaac Cohen (University of Southern California)
Ramakant Nevatia (University of Southern California)
Song-Chun Zhu (University of California)
Li Yu (University of Calgary)
Asaad Hakeem (University of Central Florida)
Zeeshan Rasheed (University of Central Florida)