Kapil Thadani
Columbia University
Publications
Featured research published by Kapil Thadani.
European Conference on Machine Learning | 2007
Tony Jebara; Yingbo Song; Kapil Thadani
Clustering has recently enjoyed progress via spectral methods which group data using only pairwise affinities and avoid parametric assumptions. While spectral clustering of vector inputs is straightforward, extensions to structured data or time-series data remain less explored. This paper proposes a clustering method for time-series data that couples non-parametric spectral clustering with parametric hidden Markov models (HMMs). HMMs add some beneficial structural and parametric assumptions such as Markov properties and hidden state variables which are useful for clustering. This paper shows that using probabilistic pairwise kernel estimates between parametric models provides improved experimental results for unsupervised clustering and visualization of real and synthetic datasets. Results are compared with a fully parametric baseline method (a mixture of hidden Markov models) and a non-parametric baseline method (spectral clustering with non-parametric time-series kernels).
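The paper computes its affinities between fitted HMMs; as a simplified illustration of the idea, the sketch below builds a pairwise affinity matrix from a probability product kernel between discrete distributions (toy stand-ins for full HMMs — the distributions and the `rho=0.5` Bhattacharyya setting are illustrative assumptions, not the paper's exact kernels):

```python
def probability_product_kernel(p, q, rho=0.5):
    """Probability product kernel between two discrete distributions.
    rho = 0.5 gives the Bhattacharyya kernel, a natural affinity."""
    return sum((pi ** rho) * (qi ** rho) for pi, qi in zip(p, q))

def affinity_matrix(models):
    """Pairwise kernel matrix usable as input to spectral clustering."""
    n = len(models)
    return [[probability_product_kernel(models[i], models[j])
             for j in range(n)] for i in range(n)]

# Toy "models": emission distributions standing in for fitted HMMs.
models = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
K = affinity_matrix(models)
```

Models with similar distributions get high affinity (the first two rows), so a spectral method run on K would group them together.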
Meeting of the Association for Computational Linguistics | 2011
Yves Petinot; Kathleen R. McKeown; Kapil Thadani
We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
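The first, information-theoretic approach selects vocabulary per category node relative to its parent. A minimal sketch of that intuition, scoring terms by their pointwise KL contribution (the toy counts and the exact scoring form are illustrative assumptions, not the paper's algorithm):

```python
import math

def salient_terms(child_counts, parent_counts, k=2):
    """Rank terms by pointwise KL contribution: terms whose probability
    in the child category most exceeds the parent distribution."""
    ct = sum(child_counts.values())
    pt = sum(parent_counts.values())
    scores = {}
    for w in parent_counts:
        pc = child_counts.get(w, 0) / ct
        pp = parent_counts[w] / pt
        if pc > 0:
            scores[w] = pc * math.log(pc / pp)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy unigram counts for a parent category and one of its children.
parent = {"the": 50, "linux": 5, "music": 5, "kernel": 2}
child = {"the": 20, "linux": 15, "kernel": 10}
top = salient_terms(child, parent)
```

Function words like "the" score low because their child probability does not exceed the parent's, so the child node keeps only its distinguishing vocabulary.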
North American Chapter of the Association for Computational Linguistics | 2010
Mukund Jha; Jacob Andreas; Kapil Thadani; Sara Rosenthal; Kathleen R. McKeown
This paper explores the task of building an accurate prepositional phrase attachment corpus for new genres while avoiding a large investment of time and money by crowdsourcing judgments. We develop and present a system to extract prepositional phrases and their potential attachments from ungrammatical and informal sentences and pose the subsequent disambiguation tasks as multiple-choice questions to workers from Amazon's Mechanical Turk service. Our analysis shows that this two-step approach is capable of producing reliable annotations on informal and potentially noisy blog text, and this semi-automated strategy holds promise for similar annotation projects in new genres.
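Aggregating multiple-choice crowd judgments into a single reliable label typically comes down to majority voting with an agreement threshold. A minimal sketch of that step (the threshold value is an illustrative assumption, not the paper's setting):

```python
from collections import Counter

def aggregate(judgments, min_agreement=0.5):
    """Aggregate multiple-choice worker judgments for one question.
    Returns the majority answer, or None when agreement is too low
    for the annotation to be trusted."""
    counts = Counter(judgments)
    answer, votes = counts.most_common(1)[0]
    return answer if votes / len(judgments) > min_agreement else None
```

For example, three workers answering ("verb", "verb", "noun") yield "verb", while an even split is rejected rather than guessed.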
Association for Information Science and Technology | 2016
Kathy McKeown; Hal Daumé; Snigdha Chaturvedi; John Paparrizos; Kapil Thadani; Pablo Barrio; Or Biran; Suvarna Bothe; Michael Collins; Kenneth R. Fleischmann; Luis Gravano; Rahul Jha; Ben King; Kevin McInerney; Taesun Moon; Arvind Neelakantan; Diarmuid O'Seaghdha; Dragomir R. Radev; Clay Templeton; Simone Teufel
New scientific concepts, interpreted broadly, are continuously introduced in the literature, but relatively few concepts have a long-term impact on society. The identification of such concepts is a challenging prediction task that would help multiple parties—including researchers and the general public—focus their attention within the vast scientific literature. In this paper we present a system that predicts the future impact of a scientific concept, represented as a technical term, based on the information available from recently published research articles. We analyze the usefulness of rich features derived from the full text of the articles through a variety of approaches, including rhetorical sentence analysis, information extraction, and time-series analysis. The results from two large-scale experiments with 3.8 million full-text articles and 48 million metadata records support the conclusion that full-text features are significantly more useful for prediction than metadata-only features and that the most accurate predictions result from combining the metadata and full-text features. Surprisingly, these results hold even when the metadata features are available for a much larger number of documents than are available for the full-text features.
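One elementary time-series feature for such a predictor is the trend of a term's yearly usage counts. A minimal sketch (the least-squares slope here is a generic illustration, not a feature claimed by the paper):

```python
def trend_slope(counts):
    """Least-squares slope of a term's yearly usage counts: a simple
    signal for whether a concept is gaining or losing traction."""
    n = len(counts)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(counts) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, counts))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A steadily rising count series gives a positive slope, a declining one a negative slope; richer features (rhetorical role, extraction context) would be layered on top.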
Web Search and Data Mining | 2017
Aasish Pappu; Roi Blanco; Yashar Mehdad; Amanda Stent; Kapil Thadani
Text analytics systems often rely heavily on detecting and linking entity mentions in documents to knowledge bases for downstream applications such as sentiment analysis, question answering and recommender systems. A major challenge for this task is to be able to accurately detect entities in new languages with limited labeled resources. In this paper we present an accurate and lightweight, multilingual named entity recognition (NER) and linking (NEL) system. The contributions of this paper are three-fold: 1) lightweight named entity recognition with competitive accuracy; 2) candidate entity retrieval that uses search click-log data and entity embeddings to achieve high precision with a low memory footprint; and 3) efficient entity disambiguation. Our system achieves state-of-the-art performance on TAC KBP 2013 multilingual data and on English AIDA CoNLL data.
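The retrieval-then-disambiguation pipeline can be sketched as an alias table distilled from click logs combined with embedding similarity to the mention context. Everything below — the alias table, the two-dimensional embeddings, and the prior-times-similarity scoring — is an illustrative assumption, not the system's actual model:

```python
import math

# Hypothetical alias table from search click logs:
# surface form -> candidate entities with click counts.
ALIASES = {"jaguar": {"Jaguar_Cars": 70, "Jaguar_(animal)": 30}}

# Hypothetical low-dimensional entity embeddings.
EMBED = {"Jaguar_Cars": [0.9, 0.1], "Jaguar_(animal)": [0.1, 0.9]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def link(mention, context_vec):
    """Retrieve candidates from the alias table, then disambiguate by
    combining the click-log prior with similarity to the context."""
    candidates = ALIASES.get(mention.lower(), {})
    total = sum(candidates.values())
    scored = {e: (c / total) * cosine(EMBED[e], context_vec)
              for e, c in candidates.items()}
    return max(scored, key=scored.get) if scored else None
```

A context vector near the animal's embedding overrides the car maker's higher click prior, which is the point of combining the two signals.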
International Conference on Computational Linguistics | 2012
Kapil Thadani; Scott Martin; Michael White
Monolingual alignment is frequently required for natural language tasks that involve similar or comparable sentences. We present a new model for monolingual alignment in which the score of an alignment decomposes over both the set of aligned phrases as well as a set of aligned dependency arcs. Optimal alignments under this scoring function are decoded using integer linear programming while model parameters are learned using standard structured prediction approaches. We evaluate our joint aligner on the Edinburgh paraphrase corpus and show significant gains over a Meteor baseline and a state-of-the-art phrase-based aligner.
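The key modeling idea — a score that decomposes over aligned phrase pairs plus aligned dependency arcs — can be shown on a toy instance. The weight tables and phrases below are hypothetical; in the paper the weights are learned and decoding is done by ILP:

```python
# Hypothetical learned weights for candidate phrase pairs and for a
# pair of dependency arcs (head, dependent) in each sentence.
PHRASE_W = {("the cat", "a cat"): 2.0, ("sat", "rested"): 1.5,
            ("the cat", "rested"): -1.0, ("sat", "a cat"): -1.0}
ARC_W = {(("sat", "the cat"), ("rested", "a cat")): 1.0}

def score(pairs):
    """Alignment score = sum of aligned phrase-pair weights, plus a
    bonus for each dependency-arc pair whose endpoints are aligned."""
    s = sum(PHRASE_W.get(p, 0.0) for p in pairs)
    aligned = dict(pairs)
    for (h1, d1), (h2, d2) in ARC_W:
        if aligned.get(h1) == h2 and aligned.get(d1) == d2:
            s += ARC_W[((h1, d1), (h2, d2))]
    return s
```

The correct alignment earns the arc bonus on top of its phrase scores, so syntactic consistency is rewarded rather than enforced.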
North American Chapter of the Association for Computational Linguistics | 2010
Kathleen R. McKeown; Sara Rosenthal; Kapil Thadani; Coleman Moore
Sentence fusion enables summarization and question-answering systems to produce output by combining fully formed phrases from different sentences. Yet there is little data that can be used to develop and evaluate fusion techniques. In this paper, we present a methodology for collecting fusions of similar sentence pairs using Amazon's Mechanical Turk, selecting the input pairs in a semi-automated fashion. We evaluate the results using a novel technique for automatically selecting a representative sentence from multiple responses. Our approach allows for rapid construction of a high accuracy fusion corpus.
Meeting of the Association for Computational Linguistics | 2011
Kapil Thadani; Kathleen R. McKeown
The task of aligning corresponding phrases across two related sentences is an important component of approaches for natural language problems such as textual inference, paraphrase detection and text-to-text generation. In this work, we examine a state-of-the-art structured prediction model for the alignment task which uses a phrase-based representation and is forced to decode alignments using an approximate search approach. We propose instead a straightforward exact decoding technique based on integer linear programming that yields order-of-magnitude improvements in decoding speed. This ILP-based decoding strategy permits us to consider syntactically-informed constraints on alignments which significantly increase the precision of the model.
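Exact decoding under a one-to-one phrase alignment constraint can be illustrated by exhaustive search on a tiny instance — the ILP solver in the paper finds the same optimum at realistic scale. The weight table is hypothetical:

```python
from itertools import product

def exact_decode(src, tgt, w):
    """Search all alignments where each source phrase maps to at most
    one target phrase and no target phrase is reused, returning the
    highest-scoring one. Brute force stands in for the ILP here."""
    best, best_score = {}, 0.0
    for choice in product([None] + tgt, repeat=len(src)):
        used = [t for t in choice if t is not None]
        if len(used) != len(set(used)):  # one-to-one constraint
            continue
        a = {s: t for s, t in zip(src, choice) if t is not None}
        sc = sum(w.get((s, t), 0.0) for s, t in a.items())
        if sc > best_score:
            best, best_score = a, sc
    return best

w = {("big", "large"): 1.0, ("dog", "dog"): 2.0, ("big", "dog"): 0.5}
best = exact_decode(["big", "dog"], ["large", "dog"], w)
```

Because the search is exact, adding further linear constraints (e.g. syntactically-informed ones) only prunes the feasible set, mirroring how constraints are added to the ILP.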
International Conference on Computational Linguistics | 2008
Kapil Thadani; Kathleen R. McKeown
The task of identifying redundant information in documents that are generated from multiple sources provides a significant challenge for summarization and QA systems. Traditional clustering techniques detect redundancy at the sentential level and do not guarantee the preservation of all information within the document. We discuss an algorithm that generates a novel graph-based representation for a document and then utilizes a set cover approximation algorithm to remove redundant text from it. Our experiments show that this approach offers a significant performance advantage over clustering when evaluated over an annotated dataset.
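The standard greedy approximation for set cover — repeatedly keeping the text unit that covers the most not-yet-covered information — is the workhorse behind such redundancy removal. A minimal sketch (the sentence/fact mapping below is a toy example, not the paper's graph representation):

```python
def greedy_set_cover(universe, subsets):
    """Greedy ln(n)-approximation to set cover: keep the unit covering
    the most uncovered elements, repeat until everything is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(subsets[s] & uncovered))
        if not subsets[best] & uncovered:
            break
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# Toy instance: facts 1-4 spread redundantly across three sentences.
facts = {1, 2, 3, 4}
sentences = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4}}
kept = greedy_set_cover(facts, sentences)
```

The cover keeps every fact while dropping "s3", whose only fact is already covered — unlike sentence clustering, no information is lost.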
Annual Meeting of the Special Interest Group on Discourse and Dialogue | 2016
Junyi Jessy Li; Kapil Thadani; Amanda Stent
Although human-written summaries of documents tend to involve significant edits to the source text, most automated summarizers are extractive and select sentences verbatim. In this work we examine how elementary discourse units (EDUs) from Rhetorical Structure Theory can be used to extend extractive summarizers to produce a wider range of human-like summaries. Our analysis demonstrates that EDU segmentation is effective in preserving human-labeled summarization concepts within sentences and also aligns with near-extractive summaries constructed by news editors. Finally, we show that using EDUs as units of content selection instead of sentences leads to stronger summarization performance in near-extractive scenarios, especially under tight budgets.
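Why finer-grained units help under a tight budget can be seen with a simple budgeted greedy selector over scored units — the units, scores, and density-greedy heuristic below are illustrative assumptions, not the paper's content-selection model:

```python
def select_units(units, budget):
    """Greedily pick content units (e.g. EDUs) by score density until
    the word budget is exhausted; units finer than sentences waste
    less of a tight budget."""
    chosen, used = [], 0
    ranked = sorted(units, key=lambda u: u[1] / len(u[0].split()),
                    reverse=True)
    for text, score in ranked:
        length = len(text.split())
        if used + length <= budget:
            chosen.append(text)
            used += length
    return chosen

# Toy (unit, relevance score) pairs; the long unit is a full sentence,
# the short ones are EDU-like spans.
units = [("quake hits city", 3.0),
         ("officials said on Tuesday that aid will arrive", 2.0),
         ("aid arriving", 2.0)]
summary = select_units(units, budget=5)
```

With a five-word budget the two short, dense spans both fit, whereas selecting whole sentences would force taking one long unit or nothing.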