Publication


Featured research published by Alpa Jain.


Meeting of the Association for Computational Linguistics | 2006

Names and Similarities on the Web: Fact Extraction in the Fast Lane

Marius Pasca; Dekang Lin; Jeffrey P. Bigham; Andrei Lifchits; Alpa Jain

In a new approach to large-scale extraction of facts from unstructured text, distributional similarities become an integral part of both the iterative acquisition of high-coverage contextual extraction patterns, and the validation and ranking of candidate facts. The evaluation measures the quality and coverage of facts extracted from one hundred million Web documents, starting from ten seed facts and using no additional knowledge, lexicons or complex tools.


Web Search and Data Mining | 2011

Dynamic relationship and event discovery

Anish Das Sarma; Alpa Jain; Cong Yu

This paper studies the problem of dynamic relationship and event discovery. A large body of previous work on relation extraction focuses on discovering predefined and static relationships between entities. In contrast, we aim to identify temporally defined (e.g., co-bursting) relationships that are not predefined by an existing schema, and we identify the underlying time constrained events that lead to these relationships. The key challenges in identifying such events include discovering and verifying dynamic connections among entities, and consolidating binary dynamic connections into events consisting of a set of entities that are connected at a given time period. We formalize this problem and introduce an efficient end-to-end pipeline as a solution. In particular, we introduce two formal notions, global temporal constraint cluster and local temporal constraint cluster, for detecting dynamic events. We further design efficient algorithms for discovering such events from a large graph of dynamic relationships. Finally, detailed experiments on real data show the effectiveness of our proposed solution.


ACM Transactions on Database Systems | 2009

A quality-aware optimizer for information extraction

Alpa Jain; Panagiotis G. Ipeirotis

A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.


International Conference on Data Engineering | 2009

Join Optimization of Information Extraction Output: Quality Matters!

Alpa Jain; Panagiotis G. Ipeirotis; AnHai Doan; Luis Gravano

Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors, and predicts the output quality (and, of course, the execution time) of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.


Information Reuse and Integration | 2007

Acronym-Expansion Recognition and Ranking on the Web

Alpa Jain; Silviu Cucerzan; Saliha Azzam

The paper presents a study on large-scale automatic extraction of acronyms and associated expansions from Web data and from the user interactions with this data through Web search engines. We investigate three information sources for extracting and ranking acronym-expansion pairs, as provided by a large-scale search engine: the crawled web documents, the search engine logs, and the search results. We evaluate and compare the acronym-expansion pairs generated from these sources on three dimensions: (1) the precision and recall of each source; (2) the overlap and inclusion among the acronym-expansion sets; and (3) the rank-order correlation of the ordered expansion sets. Our results show that all three data sources play an important role in building a comprehensive up-to-date collection of acronym-expansion pairs.


International World Wide Web Conferences | 2011

Domain-independent entity extraction from web search query logs

Alpa Jain; Marco Pennacchiotti

Query logs of a Web search engine have been increasingly used as a vital source for data mining. This paper presents a study on large-scale domain-independent entity extraction from search query logs. We present a completely unsupervised method to extract entities by applying pattern-based heuristics and statistical measures. We compare against existing techniques that use Web documents as well as search logs, and show that we improve over the state of the art. We also provide an in-depth qualitative analysis outlining differences and commonalities between these methods.


International Conference on Data Engineering | 2009

Exploring a Few Good Tuples from Text Databases

Alpa Jain; Divesh Srivastava

Information extraction from text databases is a useful paradigm to populate relational tables and unlock the considerable value hidden in plain-text documents. However, information extraction can be expensive, due to various complex text processing steps necessary in uncovering the hidden data. There are a large number of text databases available, and not every text database is necessarily relevant to every relation. Hence, it is important to be able to quickly explore the utility of running an extractor for a specific relation over a given text database before carrying out the expensive extraction task. In this paper, we present a novel exploration methodology of finding a few good tuples for a relation that can be extracted from a database which allows for judging the relevance of the database for the relation. Specifically, we propose the notion of a good(k, ℓ) query as one that can return any k ...


International Conference on Management of Data | 2010

I4E: interactive investigation of iterative information extraction

Anish Das Sarma; Alpa Jain; Divesh Srivastava


Conference on Information and Knowledge Management | 2010

Organizing query completions for web search

Alpa Jain; Gilad Mishne


International Conference on Management of Data | 2009

Building query optimizers for information extraction: the SQoUT project

Alpa Jain; Panagiotis G. Ipeirotis; Luis Gravano

Collaboration


Dive into Alpa Jain's collaborations.

Top Co-Authors

AnHai Doan

University of Wisconsin-Madison

Panagiotis G. Ipeirotis

New York University
