Publication


Featured research published by Amit Madaan.


International Conference on Data Engineering | 2011

Web-scale information extraction with vertex

Pankaj Gulhane; Amit Madaan; Rupesh Rasiklal Mehta; Jeyashankher Ramamirtham; Rajeev Rastogi; Sandeepkumar Satpal; Srinivasan H. Sengamedu; Ashwin Tengli; Charu Tiwari

Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
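As a rough illustration of point (3), the sketch below contrasts an attribute-anchored XPath rule with a purely positional one. It is not Vertex's rule-learning algorithm: the two sample pages, the `RULES` dictionary, and the `extract` helper are hypothetical, and the rules are written by hand rather than learned. It uses only the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same template; PAGE_B has an extra banner div.
PAGE_A = (
    "<html><body>"
    "<div><h1>Acme Kettle</h1></div>"
    "<div><span class='price'>19.99</span></div>"
    "</body></html>"
)
PAGE_B = (
    "<html><body>"
    "<div>Sale!</div>"
    "<div><h1>Acme Toaster</h1></div>"
    "<div><span class='price'>24.50</span></div>"
    "</body></html>"
)

# Hand-written rules (an assumption for illustration): each field is located
# by tag name or attribute instead of by its position in the tree.
RULES = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}


def extract(page, rules):
    """Apply each XPath rule to the page and return one record as a dict."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        record[field] = node.text if node is not None else None
    return record


if __name__ == "__main__":
    print(extract(PAGE_A, RULES))
    print(extract(PAGE_B, RULES))
    # A positional rule matches PAGE_A but misses the price on PAGE_B,
    # because the extra banner shifts the div positions.
    print(extract(PAGE_B, {"price": "./body/div[2]/span"}))
```

The attribute-anchored rules return a full record for both pages, while the positional rule returns no price once the extra banner is present.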


Conference on Information and Knowledge Management | 2012

Matching product titles using web-based enrichment

Vishrawas Gopalakrishnan; Suresh Iyengar; Amit Madaan; Rajeev Rastogi; Srinivasan H. Sengamedu

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
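The final matching step described above, cosine similarity over tokens weighted by importance, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `weighted_cosine` function, the example titles, and the hand-set importance scores are all assumptions, and the search-engine-based enrichment and score computation are not reproduced here.

```python
import math
from collections import Counter


def weighted_cosine(tokens_a, tokens_b, importance):
    """Cosine similarity between two token lists.

    Each token's count is scaled by its importance score; tokens without a
    score default to a weight of 1.0 (an assumption made for this sketch).
    """
    vec_a = {t: c * importance.get(t, 1.0) for t, c in Counter(tokens_a).items()}
    vec_b = {t: c * importance.get(t, 1.0) for t, c in Counter(tokens_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical titles and hand-set importance scores, for illustration only.
    title_a = ["canon", "eos", "camera"]
    title_b = ["canon", "eos", "kit"]
    importance = {"canon": 2.0, "eos": 2.0, "camera": 0.5, "kit": 0.5}
    print(weighted_cosine(title_a, title_b, importance))  # weighted
    print(weighted_cosine(title_a, title_b, {}))          # unweighted baseline
```

With these illustrative scores, the shared brand and model tokens dominate the vectors, so the weighted similarity is higher than the unweighted one.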


International World Wide Web Conferences | 2011

A classification based framework for concept summarization

Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan

In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.


International World Wide Web Conferences | 2008

Web page sectioning using regex-based template

Rupesh R. Mehta; Amit Madaan

This work aims to provide a novel, site-specific web page segmentation and section importance detection algorithm, which leverages structural, content, and visual information. The structural and content information is leveraged via a template, a generalized regular expression learnt over a set of pages. The template, along with visual information, results in high sectioning accuracy. The experimental results demonstrate the effectiveness of the approach.


International Conference on Data Mining | 2012

A Classification Based Framework for Concept Summarization

Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan

In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. We demonstrate the efficacy of our method on seven broad query types: sports, news, celebrities, events, travel, country and abstract. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods in terms of satisfying important visual and semantic properties both qualitatively and quantitatively. We observe from editorial evaluation that around 78% of our summaries are of high enough quality to be shown directly to the web users with minimal or no modifications.


Archive | 2007

Techniques for inducing high quality structural templates for electronic documents

V. G. Vinod Vydiswaran; Rupesh R. Mehta; Amit Madaan


Archive | 2008

Site-specific information-type detection methods and systems

Rupesh R. Mehta; Amit Madaan


Archive | 2008

Structural clustering and template identification for electronic documents

Amit Madaan; V. G. Vinod Vydiswaran; Rupesh R. Mehta


Archive | 2009

High precision multi entity extraction

Amit Madaan; Charu Tiwari


Archive | 2009

Robust XPaths for web information extraction

Amit Madaan; Charu Tiwari; Rupesh R. Mehta
