Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kris M. Kitani is active.

Publication


Featured research published by Kris M. Kitani.


European Conference on Computer Vision | 2012

Activity forecasting

Kris M. Kitani; Brian D. Ziebart; James Andrew Bagnell; Martial Hebert

We address the task of inferring the future actions of people from noisy visual input. We denote this task activity forecasting. To achieve accurate activity forecasting, our approach models the effect of the physical environment on the choice of human actions. This is accomplished by the use of state-of-the-art semantic scene understanding combined with ideas from optimal control theory. Our unified model also integrates several other key elements of activity analysis, namely, destination forecasting, sequence smoothing and transfer learning. As proof-of-concept, we focus on the domain of trajectory-based activity analysis from visual input. Experimental results demonstrate that our model accurately predicts distributions over future actions of individuals. We show how the same techniques can improve the results of tracking algorithms by leveraging information about likely goals and trajectories.
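The optimal-control component can be illustrated with a small, hedged sketch: soft (maximum-entropy) value iteration over a 2-D grid, where a hand-specified reward map stands in for the learned semantic scene features and the goal state is assumed known. This is a toy stand-in, not the paper's implementation.

```python
# Minimal sketch of goal-conditioned soft value iteration on a 2-D grid,
# in the spirit of maximum-entropy inverse optimal control. The reward map,
# grid size, and goal below are hypothetical placeholders.
import numpy as np

def soft_value_iteration(reward, goal, n_iters=200):
    """Soft (max-ent) value iteration with an absorbing goal state."""
    H, W = reward.shape
    V = np.full((H, W), -1e6)            # log-space soft values
    V[goal] = 0.0
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n_iters):
        Q = np.empty((len(moves), H, W))
        for a, (dy, dx) in enumerate(moves):
            ys = np.clip(np.arange(H) + dy, 0, H - 1)
            xs = np.clip(np.arange(W) + dx, 0, W - 1)
            Q[a] = reward + V[np.ix_(ys, xs)]        # Q(s, a) = r(s) + V(s')
        V = np.logaddexp.reduce(Q, axis=0)           # soft maximum over actions
        V[goal] = 0.0                                # keep the goal absorbing
    policy = np.exp(Q - np.logaddexp.reduce(Q, axis=0))  # P(action | state)
    return V, policy

# Toy usage: uniform step cost, goal in the bottom-right corner.
reward = -0.1 * np.ones((20, 20))
V, policy = soft_value_iteration(reward, goal=(19, 19))
```

The resulting stochastic policy gives a distribution over next moves at every cell, which can be rolled forward to produce a distribution over future trajectories.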


Computer Vision and Pattern Recognition | 2011

Fast unsupervised ego-action learning for first-person sports videos

Kris M. Kitani; Takahiro Okabe; Yoichi Sato; Akihiro Sugimoto

Portable high-quality sports cameras (e.g. head or helmet mounted) built for recording dynamic first-person video footage are becoming a common item among many sports enthusiasts. We address the novel task of discovering first-person action categories (which we call ego-actions), which can be useful for tasks such as video indexing and retrieval. In order to learn ego-action categories, we investigate the use of motion-based histograms and unsupervised learning algorithms to quickly cluster video content. Our approach assumes a completely unsupervised scenario, where labeled training videos are not available, videos are not pre-segmented, and the number of ego-action categories is unknown. In our proposed framework we show that a stacked Dirichlet process mixture model can be used to automatically learn a motion histogram codebook and the set of ego-action categories. We quantitatively evaluate our approach on both in-house and public YouTube videos and demonstrate robust ego-action categorization across several sports genres. Comparative analysis shows that our approach outperforms other state-of-the-art topic models with respect to both classification accuracy and computational speed. Preliminary results indicate that, on average, the categorical content of a 10-minute video sequence can be indexed in under 5 seconds.
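As a rough illustration of the unsupervised clustering step, the sketch below clusters synthetic per-frame optical-flow histograms with a single-level Dirichlet-process mixture from scikit-learn; the paper's stacked DPMM and learned codebook are not reproduced here, and the histogram data is made up.

```python
# Minimal sketch of unsupervised ego-action discovery: cluster per-frame
# flow-direction histograms with a Dirichlet-process mixture. A single-level
# sklearn model stands in for the paper's stacked DPMM; features are synthetic.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Pretend we extracted an 8-bin flow-direction histogram for 500 frames.
flow_histograms = rng.dirichlet(alpha=np.ones(8), size=500)

dpmm = BayesianGaussianMixture(
    n_components=20,                                # truncation level
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
)
labels = dpmm.fit_predict(flow_histograms)          # one ego-action label per frame
print("discovered ego-action categories:", np.unique(labels).size)
```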


Computer Vision and Pattern Recognition | 2013

Pixel-Level Hand Detection in Ego-centric Videos

Cheng Li; Kris M. Kitani

We address the task of pixel-level hand detection in the context of ego-centric cameras. Extracting hand regions in ego-centric videos is a critical step for understanding hand-object manipulation and analyzing hand-eye coordination. However, in contrast to traditional applications of hand detection, such as gesture interfaces or sign-language recognition, ego-centric videos present new challenges such as rapid changes in illumination, significant camera motion, and complex hand-object manipulations. To quantify the challenges and performance in this new domain, we present a fully labeled indoor/outdoor ego-centric hand detection benchmark dataset containing over 200 million labeled pixels, with hand images taken under various illumination conditions. Using both our dataset and a publicly available ego-centric indoor dataset, we give an extensive analysis of detection performance using a wide range of local appearance features. Our analysis highlights the effectiveness of sparse features and the importance of modeling global illumination. We propose a modeling strategy based on our findings and show that our model outperforms several baseline approaches.
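A minimal sketch of the per-pixel detection setup, assuming only per-pixel LAB color as the local appearance feature and a single random forest rather than a pool of illumination-specific models; the helper names here are illustrative, not the authors' code.

```python
# Minimal per-pixel hand detection sketch: classify every pixel from local
# color features with a random forest. Assumes binary 0/1 hand masks.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

def pixel_features(bgr_image):
    """Per-pixel LAB color features, flattened to (num_pixels, 3)."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    return lab.reshape(-1, 3).astype(np.float32)

def train_hand_detector(images, masks):
    """images: list of BGR frames; masks: binary hand masks of the same size."""
    X = np.vstack([pixel_features(img) for img in images])
    y = np.concatenate([m.reshape(-1) for m in masks])
    clf = RandomForestClassifier(n_estimators=50, max_depth=12, n_jobs=-1)
    clf.fit(X, y)
    return clf

def detect_hands(clf, bgr_image):
    """Return a per-pixel hand-probability map for one frame."""
    probs = clf.predict_proba(pixel_features(bgr_image))[:, 1]
    return probs.reshape(bgr_image.shape[:2])
```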


International Conference on Computer Vision | 2009

Using individuality to track individuals: Clustering individual trajectories in crowds using local appearance and frequency trait

Daisuke Sugimura; Kris M. Kitani; Takahiro Okabe; Yoichi Sato; Akihiro Sugimoto

In this work, we propose a method for tracking individuals in crowds. Our method is based on a trajectory-based clustering approach that groups trajectories of image features that belong to the same person. The key novelty of our method is to make use of a person's individuality, that is, their gait features and the temporal consistency of their local appearance, to track each individual in a crowd. Gait features in the frequency domain have been shown to be an effective biometric cue in discriminating between individuals, and our method uses such features for tracking people in crowds for the first time. Unlike existing trajectory-based tracking methods, our method evaluates the dissimilarity of trajectories with respect to a group of three adjacent trajectories. In this way, we incorporate the temporal consistency of local patch appearance to differentiate trajectories of multiple people moving in close proximity. Our experiments show that the use of gait features and the temporal consistency of local appearance contribute to a significant performance improvement in tracking people in crowded scenes.
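The frequency-domain gait idea can be sketched roughly as follows: take the FFT magnitude of each feature trajectory's vertical motion and group trajectories with similar spectra by hierarchical clustering. The triplet-based appearance-consistency term is omitted, and all names and thresholds here are illustrative.

```python
# Rough sketch: group feature-point trajectories by frequency-domain
# (gait-like) traits of their vertical motion.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def gait_spectrum(trajectory, n_freqs=16):
    """trajectory: (T, 2) array of (x, y) positions for one tracked point."""
    y = trajectory[:, 1] - trajectory[:, 1].mean()      # detrended vertical motion
    spectrum = np.abs(np.fft.rfft(y, n=64))[:n_freqs]   # zero-pad so lengths match
    return spectrum / (np.linalg.norm(spectrum) + 1e-8)

def cluster_trajectories(trajectories, distance_threshold=0.5):
    """Group trajectories assumed to belong to the same person."""
    feats = np.array([gait_spectrum(t) for t in trajectories])
    Z = linkage(feats, method="average", metric="euclidean")
    return fcluster(Z, t=distance_threshold, criterion="distance")
```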


European Conference on Computer Vision | 2014

Action-Reaction: Forecasting the Dynamics of Human Interaction

De-An Huang; Kris M. Kitani

Forecasting human activities from visual evidence is an emerging area of research which aims to allow computational systems to make predictions about unseen human actions. We explore the task of activity forecasting in the context of dual-agent interactions to understand how the actions of one person can be used to predict the actions of another. We model dual-agent interactions as an optimal control problem, where the actions of the initiating agent induce a cost topology over the space of reactive poses – a space in which the reactive agent plans an optimal pose trajectory. The technique developed in this work employs a kernel-based reinforcement learning approximation of the soft maximum value function to deal with the high-dimensional nature of human motion and applies a mean-shift procedure over a continuous cost function to infer a smooth reaction sequence. Experimental results show that our proposed method is able to properly model human interactions in a high dimensional space of human poses. When compared to several baseline models, results show that our method is able to generate highly plausible simulations of human interaction.
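A very rough sketch of the value-function machinery, under strong simplifications: a kernel ridge regressor stands in for the kernel-based soft value approximation, and a single weighted-average update stands in for the mean-shift refinement. The pose data, dimensions, and values are synthetic placeholders.

```python
# Toy sketch: approximate a value over reactive poses with kernel regression,
# then refine a reaction estimate with one mean-shift-style weighted average.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
D = 30                                       # pose dimensionality (hypothetical)
train_poses = rng.normal(size=(200, D))      # reactive poses seen in training
train_values = rng.normal(size=200)          # their soft values (placeholder numbers)

value_fn = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0)
value_fn.fit(train_poses, train_values)

def refine_reaction(candidate_poses, temperature=1.0):
    """One mean-shift-style update: average candidates weighted by exp(value)."""
    w = np.exp(value_fn.predict(candidate_poses) / temperature)
    return (w[:, None] * candidate_poses).sum(axis=0) / w.sum()

reaction_pose = refine_reaction(rng.normal(size=(50, D)))
```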


Computer Vision and Pattern Recognition | 2016

Going Deeper into First-Person Activity Recognition

Minghuang Ma; Haoqi Fan; Kris M. Kitani

We bring together ideas from recent work on feature design for egocentric action recognition under one framework by exploring the use of deep convolutional neural networks (CNN). Recent work has shown that features such as hand appearance, object attributes, local hand motion and camera ego-motion are important for characterizing first-person actions. To integrate these ideas under one framework, we propose a twin stream network architecture, where one stream analyzes appearance information and the other stream analyzes motion information. Our appearance stream encodes prior knowledge of the egocentric paradigm by explicitly training the network to segment hands and localize objects. By visualizing certain neuron activations of our network, we show that our proposed architecture naturally learns features that capture object attributes and hand-object configurations. Our extensive experiments on benchmark egocentric action datasets show that our deep architecture enables recognition rates that significantly outperform state-of-the-art techniques: an average 6.6% increase in accuracy over all datasets. Furthermore, by learning to recognize objects, actions and activities jointly, the performance of the individual recognition tasks also increases by 30% (actions) and 14% (objects). We also include the results of an extensive ablative analysis to highlight the importance of network design decisions.
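A minimal sketch of the two-stream idea in PyTorch: one small convolutional stream for the RGB frame, one for stacked optical flow, fused by concatenation before a linear classifier. Layer sizes, the number of classes, and the flow-stack depth are illustrative, not the paper's configuration.

```python
# Toy two-stream network: appearance (RGB) and motion (stacked flow) streams
# fused by feature concatenation.
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=20, flow_channels=10):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.appearance = stream(3)             # RGB frame
        self.motion = stream(flow_channels)     # stacked flow fields
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb, flow):
        feats = torch.cat([self.appearance(rgb), self.motion(flow)], dim=1)
        return self.classifier(feats)

# Toy forward pass on random tensors.
net = TwoStreamNet()
logits = net(torch.randn(2, 3, 128, 128), torch.randn(2, 10, 128, 128))
```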


International Conference on Computer Vision | 2013

Model Recommendation with Virtual Probes for Egocentric Hand Detection

Cheng Li; Kris M. Kitani

Egocentric cameras can be used to benefit such tasks as analyzing fine motor skills, recognizing gestures and learning about hand-object manipulation. To enable such technology, we believe that the hands must be detected at the pixel level to gain important information about the shape of the hands and fingers. We show that the problem of pixel-wise hand detection can be effectively solved by posing the problem as a model recommendation task. As such, the goal of a recommendation system is to recommend the n-best hand detectors based on the probe set, a small amount of labeled data from the test distribution. This requirement of a probe set is a serious limitation in many applications, such as ego-centric hand detection, where the test distribution may be continually changing. To address this limitation, we propose the use of virtual probes which can be automatically extracted from the test distribution. The key idea is that many features, such as the color distribution or relative performance between two detectors, can be used as a proxy for the probe set. In our experiments we show that the recommendation paradigm is well-equipped to handle complex changes in the appearance of the hands in first-person vision. In particular, we show how our system is able to generalize to new scenarios by testing our model across multiple users.
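A hedged sketch of the recommendation step: a cheap proxy feature of the unlabeled test frames (here a global color histogram) is matched to the most similar training scene, whose best-performing detectors are then recommended. The proxy choice, data structures, and helper names are assumptions for illustration.

```python
# Sketch of recommending detectors from an unlabeled "virtual probe":
# match the test frames' proxy feature to the nearest known scene.
import numpy as np

def color_histogram(frames, bins=16):
    """Proxy virtual probe: average per-channel color histogram over frames."""
    hists = []
    for f in frames:                                   # f: (H, W, 3) uint8 image
        h = [np.histogram(f[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
        hists.append(np.concatenate(h))
    h = np.mean(hists, axis=0).astype(np.float64)
    return h / h.sum()

def recommend_detectors(test_frames, scene_proxies, scene_rankings, n_best=3):
    """scene_proxies: {scene: proxy vector}; scene_rankings: {scene: ranked detectors}."""
    probe = color_histogram(test_frames)
    nearest = min(scene_proxies, key=lambda s: np.linalg.norm(scene_proxies[s] - probe))
    return scene_rankings[nearest][:n_best]
```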


Computer Vision and Pattern Recognition | 2012

Coupling eye-motion and ego-motion features for first-person activity recognition

Keisuke Ogaki; Kris M. Kitani; Yusuke Sugano; Yoichi Sato

We focus on the use of first-person eye movement and ego-motion as a means of understanding and recognizing indoor activities from an “inside-out” camera system. We show that when eye movement captured by an inside-looking camera is used in tandem with ego-motion features extracted from an outside-looking camera, the classification accuracy of first-person actions can be improved. We also present a dataset of over two hours of realistic indoor desktop actions, including both eye-tracking information and high-quality outside-camera video. We run experiments and show that our joint feature is effective and robust across multiple users.
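A minimal sketch of the feature coupling, with synthetic stand-ins for the eye-motion and ego-motion descriptors and an off-the-shelf SVM in place of the paper's classifier; all dimensions and labels are made up.

```python
# Toy joint feature: concatenate eye-motion and ego-motion descriptors per clip
# and train a standard classifier on the combined vector.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips = 120
eye_feats = rng.normal(size=(n_clips, 16))    # e.g. saccade/fixation statistics
ego_feats = rng.normal(size=(n_clips, 24))    # e.g. global optical-flow statistics
labels = rng.integers(0, 5, size=n_clips)     # five hypothetical desktop activities

joint = np.hstack([eye_feats, ego_feats])     # the coupled feature vector
clf = SVC(kernel="rbf").fit(joint, labels)
print("training accuracy:", clf.score(joint, labels))
```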


Human-Computer Interaction with Mobile Devices and Services | 2016

NavCog: a navigational cognitive assistant for the blind

Cole Gleason; Chengxiong Ruan; Kris M. Kitani; Hironobu Takagi; Chieko Asakawa

Turn-by-turn navigation is a useful paradigm for assisting people with visual impairments during mobility as it reduces the cognitive load of having to simultaneously sense, localize and plan. To realize such a system, it is necessary to localize the user automatically with sufficient accuracy, provide timely and efficient instructions, and be able to easily deploy the system to new spaces. We propose a smartphone-based system that provides turn-by-turn navigation assistance based on accurate real-time localization over large spaces. In addition to basic navigation capabilities, our system also informs the user about nearby points of interest (POI) and accessibility issues (e.g., stairs ahead). After deploying the system on a university campus across several indoor and outdoor areas, we evaluated it with six blind subjects and showed that our system is capable of guiding visually impaired users in complex and unfamiliar environments.
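As a toy illustration of the turn-by-turn layer only (localization and routing are assumed solved elsewhere), the sketch below emits the next instruction and any accessibility note for a planned waypoint sequence; all distances, thresholds, and messages are hypothetical.

```python
# Toy turn-by-turn instruction generator over a planned waypoint sequence.
import math

def next_instruction(position, waypoints, notes, arrive_radius=2.0):
    """position: (x, y) in meters; waypoints: ordered list of (x, y); notes: {index: str}."""
    for i, (wx, wy) in enumerate(waypoints):
        dist = math.hypot(wx - position[0], wy - position[1])
        if dist > arrive_radius:                     # first waypoint not yet reached
            msg = f"Proceed about {dist:.0f} meters to waypoint {i + 1}."
            if i in notes:                           # accessibility note, e.g. stairs
                msg += f" Caution: {notes[i]}."
            return msg
    return "You have arrived at your destination."

print(next_instruction((0.0, 0.0), [(5.0, 0.0), (5.0, 10.0)], {1: "stairs ahead"}))
```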


Computer Vision and Pattern Recognition | 2015

Learning scene-specific pedestrian detectors without real data

Hironori Hattori; Vishnu Naresh Boddeti; Kris M. Kitani; Takeo Kanade

We consider the problem of designing a scene-specific pedestrian detector in a scenario where we have zero instances of real pedestrian data (i.e., no labeled real data or unsupervised real data). This scenario may arise when a new surveillance system is installed in a novel location and a scene-specific pedestrian detector must be trained prior to any observations of pedestrians. The key idea of our approach is to infer the potential appearance of pedestrians using geometric scene data and a customizable database of virtual simulations of pedestrian motion. We propose an efficient discriminative learning method that generates a spatially-varying pedestrian appearance model that takes into account the perspective geometry of the scene. As a result, our method is able to learn a unique pedestrian classifier customized for every possible location in the scene. Our experimental results show that our proposed approach outperforms classical pedestrian detection models and hybrid synthetic-real models. They also yield a surprising result: when real data is limited, our method using purely synthetic data is able to outperform models trained on real scene-specific data.
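A reduced sketch of the core idea: for each scene location, train a location-specific classifier on synthetic pedestrian patches rendered for that location's geometry plus background patches. `render_pedestrian` is a hypothetical stub for the virtual-simulation step, and HOG with a linear SVM stands in for the paper's learning method.

```python
# Sketch: one pedestrian classifier per scene location, trained purely on
# synthetic renders plus background patches (all same patch size assumed).
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_feat(patch):
    return hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_location_detector(render_pedestrian, background_patches, location, n_pos=100):
    """render_pedestrian(location) -> grayscale patch of a virtual pedestrian there."""
    pos = [hog_feat(render_pedestrian(location)) for _ in range(n_pos)]
    neg = [hog_feat(p) for p in background_patches]
    X = np.vstack(pos + neg)
    y = np.array([1] * len(pos) + [0] * len(neg))
    return LinearSVC().fit(X, y)

# One detector per grid cell of the scene, each tuned to that cell's geometry:
# detectors = {loc: train_location_detector(render_pedestrian, bg[loc], loc)
#              for loc in scene_grid}
```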

Collaboration


Dive into Kris M. Kitani's collaborations.

Top Co-Authors

Chieko Asakawa (Carnegie Mellon University)

Akihiro Sugimoto (National Institute of Informatics)

Hideki Koike (Tokyo Institute of Technology)

Eshed Ohn-Bar (Carnegie Mellon University)