Benjamin Van Durme
Johns Hopkins University
Publication
Featured research published by Benjamin Van Durme.
meeting of the association for computational linguistics | 2014
Xuchen Yao; Benjamin Van Durme
Answering natural language questions using the Freebase knowledge base has recently been explored as a platform for advancing the state of the art in open domain semantic parsing. Those efforts map questions to sophisticated meaning representations, which they then attempt to match against viable answer candidates in the knowledge base. Here we show that relatively modest information extraction techniques, when paired with a web-scale corpus, can outperform these sophisticated approaches with a relative gain of roughly 34%.
meeting of the association for computational linguistics | 2014
Svitlana Volkova; Glen Coppersmith; Benjamin Van Durme
Existing models for social media personal analytics assume access to thousands of messages per user, even though most users author content only sporadically over time. Given this sparsity, we: (i) leverage content from the local neighborhood of a user; (ii) evaluate batch models as a function of size and the number of messages in various types of neighborhoods; and (iii) estimate the amount of time and tweets required for a dynamic model to predict user preferences. We show that even when limited or no self-authored data is available, language from friend, retweet and user mention communications provides sufficient evidence for prediction. When updating models over time based on Twitter, we find that political preference can often be predicted using roughly 100 tweets, depending on the context of user selection, where this could mean hours, or weeks, based on the author’s tweeting frequency.
ieee automatic speech recognition and understanding workshop | 2011
Aren Jansen; Benjamin Van Durme
Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n^2) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero resource setting.
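The abstract does not name the randomized algorithms involved, so the following is only an illustrative sketch of how a sparse approximate similarity matrix can be built in O(n log n) time. It assumes random-hyperplane locality-sensitive hashing followed by a sort over bit signatures; the function names, the `window` parameter, and the toy frames are all hypothetical and not taken from the paper.

```python
import random


def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on the non-negative side."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))


def sparse_neighbors(frames, n_bits=16, window=2, seed=0):
    """Return {(i, j): hamming distance} for frames that land close together
    after sorting by signature. Only `window` comparisons are made per frame,
    so the output holds O(n) entries and the sort dominates at O(n log n)."""
    planes = random_hyperplanes(len(frames[0]), n_bits, seed)
    sigs = [signature(f, planes) for f in frames]
    order = sorted(range(len(frames)), key=lambda i: sigs[i])
    pairs = {}
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            pairs[(min(i, j), max(i, j))] = hamming(sigs[i], sigs[j])
    return pairs


# Toy usage: frames 0 and 2 are identical, frame 1 points the opposite way.
toy_frames = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
toy_pairs = sparse_neighbors(toy_frames, n_bits=16, window=1)
```

A small Hamming distance between signatures suggests a small angle between the underlying feature vectors, so the retained pairs act as the non-zero entries of an approximate similarity matrix. A single sort misses some true neighbors; practical systems typically repeat the sort under several bit permutations, which this sketch omits.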
Semantics in Text Processing. STEP 2008 Conference Proceedings | 2008
Benjamin Van Durme; Lenhart K. Schubert
We present results for a system designed to perform Open Knowledge Extraction, based on a tradition of compositional language processing, as applied to a large collection of text derived from the Web. Evaluation through manual assessment shows that well-formed propositions of reasonable quality, representing general world knowledge, given in a logical form potentially usable for inference, may be extracted in high volume from arbitrary input sentences. We compare these results with those obtained in recent work on Open Information Extraction, indicating with some examples the quite different kinds of output obtained by the two approaches. Finally, we observe that portions of the extracted knowledge are comparable to results of recent work on class attribute extraction.
north american chapter of the association for computational linguistics | 2015
Pushpendre Rastogi; Benjamin Van Durme; Raman Arora
Multiview LSA (MVLSA) is a generalization of Latent Semantic Analysis (LSA) that supports the fusion of arbitrary views of data and relies on Generalized Canonical Correlation Analysis (GCCA). We present an algorithm for fast approximate computation of GCCA, which when coupled with methods for handling missing values, is general enough to approximate some recent algorithms for inducing vector representations of words. Experiments across a comprehensive collection of test-sets show our approach to be competitive with the state of the art.
international joint conference on artificial intelligence | 2011
Shane Bergsma; Benjamin Van Durme
Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.
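For readers unfamiliar with the string edit distance baseline mentioned above, here is a minimal sketch of ranking candidate translations by Levenshtein distance. The normalization by the longer word's length, the helper names, and the example words are assumptions made for illustration; the abstract does not specify how the baseline was configured.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def string_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_translation(word, candidates):
    """Pick the candidate whose spelling is closest to the source word."""
    return max(candidates, key=lambda c: string_similarity(word, c))


# Hypothetical English-to-Spanish example.
choice = best_translation("tomato", ["perro", "tomate", "mesa"])
```

A baseline of this kind only works for cognates and loanwords with similar spellings, which is why a signal that does not depend on spelling, such as visual similarity between associated images, can add information.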
meeting of the association for computational linguistics | 2009
Benjamin Van Durme; Phillip Michalak; Lenhart K. Schubert
Existing work in the extraction of commonsense knowledge from text has been primarily restricted to factoids that serve as statements about what may possibly obtain in the world. We present an approach to deriving stronger, more general claims by abstracting over large sets of factoids. Our goal is to coalesce the observed nominals for a given predicate argument into a few predominant types, obtained as WordNet synsets. The results can be construed as generically quantified sentences restricting the semantic type of an argument position of a predicate.
meeting of the association for computational linguistics | 2014
Xuchen Yao; Jonathan Berant; Benjamin Van Durme
We contrast two seemingly distinct approaches to the task of question answering (QA) using Freebase: one based on information extraction techniques, the other on semantic parsing. Results over the same test-set were collected from two state-of-the-art, open-source systems, then analyzed in consultation with those systems’ creators. We conclude that the differences between these technologies, both in task performance and in how they get there, are not significant. This suggests that the semantic parsing community should target answering more compositional open-domain questions that are beyond the reach of more direct information extraction methods.
conference on information and knowledge management | 2007
Marius Pasca; Benjamin Van Durme; Nikesh Garera
Challenging the implicit reliance on document collections, this paper discusses the pros and cons of using query logs rather than document collections, as self-contained sources of data in textual information extraction. The differences are quantified as part of a large-scale study on extracting prominent attributes or quantifiable properties of classes (e.g., top speed, price and fuel consumption for CarModel) from unstructured text. In a head-to-head qualitative comparison, a lightweight extraction method produces class attributes that are 45% more accurate on average, when acquired from query logs rather than Web documents.
international conference on acoustics, speech, and signal processing | 2015
Keith Levin; Aren Jansen; Benjamin Van Durme
The task of zero resource query-by-example keyword search has received much attention in recent years as the speech technology needs of the developing world grow. These systems traditionally rely upon dynamic time warping (DTW) based retrieval algorithms with runtimes that are linear in the size of the search collection. As a result, their scalability substantially lags that of their supervised counterparts, which take advantage of efficient word-based indices. In this paper, we present a novel audio indexing approach called Segmental Randomized Acoustic Indexing and Logarithmic-time Search (S-RAILS). S-RAILS generalizes the original frame-based RAILS methodology to word-scale segments by exploiting a recently proposed acoustic segment embedding technique. By indexing word-scale segments directly, we avoid the higher-cost frame-based processing of RAILS while taking advantage of the improved lexical discrimination of the embeddings. Using the same conversational telephone speech benchmark, we demonstrate major improvements in both speed and accuracy over the original RAILS system.
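To make the scalability concern concrete, the sketch below shows textbook dynamic time warping and a query-by-example search that scans every item in the collection, which is where the linear runtime comes from. This illustrates the traditional baseline described above, not S-RAILS itself; the function names and toy feature sequences are hypothetical.

```python
import math


def dtw_distance(x, y):
    """Classic DTW cost between two sequences of feature vectors,
    filled in with an O(len(x) * len(y)) dynamic program."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(x[i - 1], y[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch y
                                     cost[i][j - 1],      # stretch x
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def linear_scan(query, collection):
    """Return the index of the best match; one DTW call per stored item."""
    return min(range(len(collection)),
               key=lambda k: dtw_distance(query, collection[k]))


# Toy one-dimensional "acoustic" sequences.
query = [(0.0,), (1.0,), (2.0,)]
collection = [
    [(5.0,), (6.0,), (7.0,)],
    [(0.0,), (1.0,), (1.0,), (2.0,)],  # same shape as the query, one frame repeated
]
best = linear_scan(query, collection)
```

Because every query touches every stored item, doubling the collection doubles the search time. An index that supports logarithmic-time lookup avoids this scan.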