
Publication


Featured research published by Andrei Barbu.


Computer Vision and Pattern Recognition | 2013

Recognize Human Activities from Partially Observed Videos

Yu Cao; Daniel Paul Barrett; Andrei Barbu; Siddharth Narayanaswamy; Haonan Yu; Aaron Michaux; Yuewei Lin; Sven J. Dickinson; Jeffrey Mark Siskind; Song Wang

Recognizing human activities in partially observed videos is a challenging problem with many practical applications. When the unobserved subsequence is at the end of the video, the problem reduces to activity prediction from an unfinished activity stream, which has been studied by many researchers. However, in the general case, an unobserved subsequence may occur at any time, yielding a temporal gap in the video. In this paper, we propose a new method that can recognize human activities from partially observed videos in the general case. Specifically, we formulate the problem in a probabilistic framework: 1) dividing each activity into multiple ordered temporal segments, 2) using spatiotemporal features of the training video samples in each segment as bases and applying sparse coding (SC) to derive the activity likelihood of the test video sample at each segment, and 3) combining the likelihoods across segments into a global posterior over the activities. We further extend the proposed method to include additional bases that correspond to a mixture of segments with different temporal lengths (MSSC), which can better represent activities with large intra-class variations. We evaluate the proposed methods (SC and MSSC) on various real videos. We also evaluate them on two special cases: 1) activity prediction, where the unobserved subsequence is at the end of the video, and 2) human activity recognition on fully observed videos. Experimental results show that the proposed methods outperform existing state-of-the-art methods.
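A minimal numpy sketch of the likelihood-combination step described above, assuming the per-segment sparse-coding reconstruction errors have already been computed; the function and variable names are illustrative and not taken from the paper's code:

import numpy as np

def combine_segment_likelihoods(recon_errors, beta=1.0):
    # recon_errors: (num_segments, num_classes) array, where entry (k, c)
    # is the sparse-coding reconstruction error of test segment k under
    # the bases of activity class c. Rows of NaN mark unobserved segments.
    errors = np.asarray(recon_errors, dtype=float)
    log_post = np.zeros(errors.shape[1])
    for segment in errors:
        if np.all(np.isnan(segment)):
            continue  # segment falls inside the unobserved gap
        log_post += -beta * segment  # lower error -> higher likelihood
    posterior = np.exp(log_post - log_post.max())
    return posterior / posterior.sum()

# Toy example: three segments (the middle one unobserved), four activities.
errors = np.array([[0.2, 0.9, 1.1, 0.8],
                   [np.nan, np.nan, np.nan, np.nan],
                   [0.3, 1.0, 0.7, 0.9]])
print(combine_segment_likelihoods(errors))  # most mass on activity 0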


Computer Vision and Pattern Recognition | 2014

Seeing What You're Told: Sentence-Guided Activity Recognition in Video

Narayanaswamy Siddharth; Andrei Barbu; Jeffrey Mark Siskind

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different ways.


International Conference on Robotics and Automation | 2010

Learning physically-instantiated game play through visual observation

Andrei Barbu; Siddharth Narayanaswamy; Jeffrey Mark Siskind

We present an integrated vision and robotic system that plays, and learns to play, simple physically-instantiated board games that are variants of TIC TAC TOE and HEXAPAWN. We employ novel custom vision and robotic hardware designed specifically for this learning task. The game rules can be parametrically specified. Two independent computational agents alternate playing the two opponents with the shared vision and robotic hardware, using pre-specified rule sets. A third independent computational agent, sharing the same hardware, learns the game rules solely by observing the physical play, without access to the pre-specified rule set, using inductive logic programming with minimal background knowledge possessed by human children. The vision component of our integrated system reliably detects the position of the board in the image and reconstructs the game state after every move, from a single image. The robotic component reliably moves pieces both between board positions and to and from off-board positions as needed by an arbitrary parametrically-specified legal-move generator. Thus the rules of games learned solely by observing physical play can drive further physical play. We demonstrate our system learning to play six different games.


Journal of Artificial Intelligence Research | 2015

A compositional framework for grounding language inference, generation, and acquisition in video

Haonan Yu; N. Siddharth; Andrei Barbu; Jeffrey Mark Siskind

We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) affect the meaning of a sentence and how it is grounded in video. We exploit this methodology in three ways. In the first, a video clip along with a sentence are taken as input and the participants in the event described by the sentence are highlighted, even when the clip depicts multiple similar simultaneous events. In the second, a video clip is taken as input without a sentence and a sentence is generated that describes an event in that clip. In the third, a corpus of video clips is paired with sentences which describe some of the events in those clips and the meanings of the words in those sentences are learned. We learn these meanings without needing to specify which attribute of the video clips each word in a given sentence refers to. The learned meaning representations are shown to be intelligible to humans.


European Conference on Computer Vision | 2014

Seeing is worse than believing: Reading people's minds better than computer-vision methods recognize actions

Andrei Barbu; Daniel Paul Barrett; Wei Chen; Narayanaswamy Siddharth; Caiming Xiong; Jason J. Corso; Christiane Fellbaum; Catherine Hanson; Stephen José Hanson; Sébastien Hélie; Evguenia Malaia; Barak A. Pearlmutter; Jeffrey Mark Siskind; Thomas M. Talavage; Ronnie B. Wilbur

We had human subjects perform a one-out-of-six-class action recognition task from video stimuli while undergoing functional magnetic resonance imaging (fMRI). Support-vector machines (SVMs) were trained on the recovered brain scans to classify the actions observed during imaging, yielding an average classification accuracy of 69.73% when tested on scans from the same subject and of 34.80% when tested on scans from different subjects. An apples-to-apples comparison was performed with all publicly available software that implements state-of-the-art action recognition, on the same video corpus, with the same cross-validation regimen and the same partitioning into training and test sets, yielding classification accuracies between 31.25% and 52.34%. This indicates that one can read people's minds better than state-of-the-art computer-vision methods can perform action recognition.
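A minimal scikit-learn sketch of the within-subject setting, assuming the fMRI scans have already been preprocessed into fixed-length feature vectors; the random data below is a synthetic stand-in, so the printed accuracy is near chance rather than the 69.73% reported above:

import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one row per scan, one of six action labels each.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5000))    # 120 scans x 5000 voxel features
y = rng.integers(0, 6, size=120)    # six-way action labels

# Linear SVM with feature standardization, evaluated by cross-validation.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"within-subject cross-validated accuracy: {scores.mean():.2%}")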


Empirical Methods in Natural Language Processing | 2015

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Yevgeni Berzak; Andrei Barbu; Daniel Harari; Boris Katz; Shimon Ullman

Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations of each sentence. We address this task by extending a vision model which determines whether a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing it to disambiguate sentences in a unified fashion across the different ambiguity types.


International Joint Conference on Artificial Intelligence | 2017

Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context

Rohan Paul; Andrei Barbu; Sue Felshin; Boris Katz; Nicholas Roy

A robot's ability to understand or ground natural language instructions is fundamentally tied to its knowledge about the surrounding world. We present an approach to grounding natural language utterances in the context of factual information gathered through natural-language interactions and past visual observations. A probabilistic model estimates, from a natural language utterance, the objects, relations, and actions that the utterance refers to and the objectives for future robotic actions it implies, and generates a plan to execute those actions while updating a state representation to include newly acquired knowledge from the visual-linguistic context. Grounding a command necessitates a representation of past observations and interactions; however, maintaining the full context consisting of all possible observed objects, attributes, spatial relations, actions, etc., over time is intractable. Instead, our model, Temporal Grounding Graphs, maintains a learned state representation for a belief over factual groundings, those derived from natural-language interactions, and lazily infers new groundings from visual observations using the context implied by the utterance. This work significantly expands the range of language that a robot can understand by incorporating factual knowledge and observations of its workspace into its inference about the meaning and grounding of natural-language utterances.
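A toy Python sketch of the general idea of accruing factual groundings from dialogue and resolving a later referring expression against them lazily; the class and method names are hypothetical and far simpler than the paper's Temporal Grounding Graphs:

from dataclasses import dataclass, field

@dataclass
class GroundingContext:
    # Facts asserted in past natural-language interactions,
    # e.g. {"box_1": {"color": "red"}}.
    facts: dict = field(default_factory=dict)

    def assert_fact(self, entity, attribute, value):
        # Record a fact stated in dialogue ("The red box is Rohan's.").
        self.facts.setdefault(entity, {})[attribute] = value

    def ground(self, attribute, value):
        # Lazily resolve which known entities a referring expression
        # such as "the red box" could denote.
        return [entity for entity, attrs in self.facts.items()
                if attrs.get(attribute) == value]

context = GroundingContext()
context.assert_fact("box_1", "color", "red")
context.assert_fact("box_2", "color", "blue")
print(context.ground("color", "red"))  # -> ['box_1']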


International Conference on Robotics and Automation | 2011

A visual language model for estimating object pose and structure in a generative visual domain

Siddharth Narayanaswamy; Andrei Barbu; Jeffrey Mark Siskind

We present a generative domain of visual objects by analogy to the generative nature of human language. Just as small inventories of phonemes and words combine in a grammatical fashion to yield myriad valid words and utterances, a small inventory of physical parts combines in a grammatical fashion to yield myriad valid assemblies. We apply the notion of a language model from speech recognition to this visual domain to similarly improve the performance of the recognition process over what would be possible by only applying recognizers to the components. Unlike the context-free models for human language, our visual language models are context sensitive and formulated as stochastic constraint-satisfaction problems. And unlike the situation for human language, where all components are observable, our methods deal with occlusion, successfully recovering object structure despite unobservable components. We demonstrate our system with an integrated robotic system for disassembling structures that performs whole-scene reconstruction consistent with a language model in the presence of noisy feature detectors.
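A brute-force Python sketch of treating structure recovery as constraint satisfaction: each part has a few candidate poses with detector scores, and pairwise constraints discard inconsistent combinations. All names are illustrative; the paper's formulation is stochastic and handles occlusion, which this toy version does not:

from itertools import product

def best_consistent_assembly(candidates, score, constraints):
    # candidates: {part: [candidate poses]}; score(part, pose) is a
    # detector confidence; constraints are predicates over a full
    # assignment that encode the "grammar" of valid assemblies.
    parts = list(candidates)
    best, best_total = None, float("-inf")
    for poses in product(*(candidates[p] for p in parts)):
        assignment = dict(zip(parts, poses))
        if all(ok(assignment) for ok in constraints):
            total = sum(score(p, assignment[p]) for p in parts)
            if total > best_total:
                best, best_total = assignment, total
    return best

# Toy usage: two parts, and a constraint that they occupy different slots.
candidates = {"peg": [0, 1], "block": [0, 1]}
score = lambda part, pose: 1.0 if pose == 1 else 0.5
constraints = [lambda a: a["peg"] != a["block"]]
print(best_consistent_assembly(candidates, score, constraints))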


Empirical Methods in Natural Language Processing | 2016

Anchoring and Agreement in Syntactic Annotations

Yevgeni Berzak; Yan Huang; Andrei Barbu; Anna Korhonen; Boris Katz

We present a study of two key characteristics of human syntactic annotations: anchoring and agreement. Anchoring is a well-known cognitive bias in human decision making, where judgments are drawn towards pre-existing values. We study the influence of anchoring on a standard approach to the creation of syntactic resources, in which syntactic annotations are obtained via human editing of tagger and parser output. Our experiments demonstrate a clear anchoring effect and reveal unwanted consequences, including overestimation of parsing performance and lower quality of annotations in comparison with human-based annotations. Using sentences from the Penn Treebank WSJ, we also report systematically obtained inter-annotator agreement estimates for English dependency parsing. Our agreement results control for parser bias, and are consequential in that they are on par with state-of-the-art parsing performance for English newswire. We discuss the impact of our findings on strategies for future annotation efforts and parser evaluations.
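A small Python sketch of the kind of agreement measure involved, assuming each annotation is a list of head indices per token (unlabeled dependency attachment); this illustrates the metric family, not the paper's exact protocol:

def head_agreement(annotation_a, annotation_b):
    # Fraction of tokens for which two annotators chose the same head
    # (0 denotes the root). Both annotations cover the same sentence.
    assert len(annotation_a) == len(annotation_b)
    matches = sum(a == b for a, b in zip(annotation_a, annotation_b))
    return matches / len(annotation_a)

# Toy example: the annotators agree on 4 of 5 attachments.
print(head_agreement([2, 0, 2, 5, 2], [2, 0, 2, 5, 3]))  # 0.8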


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

Saying What You're Looking For: Linguistics Meets Video Search

Daniel Paul Barrett; Andrei Barbu; N. Siddharth; Jeffrey Mark Siskind

We present an approach to searching large video corpora for clips which depict a natural-language query in the form of a sentence. Compositional semantics is used to encode subtle meaning differences lost in other approaches, such as the difference between two sentences which have identical words but entirely different meanings: "The person rode the horse" versus "The horse rode the person." Given a sentential query and a natural-language parser, we produce, for each clip in a corpus, a score indicating how well the clip depicts that sentence, and return a ranked list of clips. Two fundamental problems are addressed simultaneously: detecting and tracking objects, and recognizing whether those tracks depict the query. Because both tracking and object detection are unreliable, our approach uses the sentential query to focus the tracker on the relevant participants and ensures that the resulting tracks are described by the sentential query. While most earlier work was limited to single-word queries which correspond to either verbs or nouns, we search for complex queries which contain multiple phrases, such as prepositional phrases, and modifiers, such as adverbs. We demonstrate this approach by searching for 2,627 naturally elicited sentential queries in 10 Hollywood movies.
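A minimal Python sketch of the retrieval loop, assuming a depiction_score(clip, sentence) callable that stands in for the model described above, which jointly tracks participants and checks that the tracks satisfy the sentence:

def rank_clips(corpus, query_sentence, depiction_score, top_k=10):
    # Score every clip against the sentential query and return the
    # best-scoring clips, most relevant first.
    ranked = sorted(corpus,
                    key=lambda clip: depiction_score(clip, query_sentence),
                    reverse=True)
    return ranked[:top_k]

# Toy usage with a dummy scorer that prefers longer clip names.
clips = ["clip_a", "clip_bb", "clip_ccc"]
dummy_score = lambda clip, sentence: len(clip)
print(rank_clips(clips, "the person rode the horse", dummy_score, top_k=2))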

Collaboration


Dive into Andrei Barbu's collaborations.

Top Co-Authors

Boris Katz
Massachusetts Institute of Technology

Yevgeni Berzak
Massachusetts Institute of Technology

Evguenia Malaia
University of Texas at Arlington