Amit Madaan | Researchain

Publication

Featured research published by Amit Madaan.

International Conference on Data Engineering | 2011

Web-scale information extraction with Vertex

Pankaj Gulhane; Amit Madaan; Rupesh Rasiklal Mehta; Jeyashankher Ramamirtham; Rajeev Rastogi; Sandeepkumar Satpal; Srinivasan H. Sengamedu; Ashwin Tengli; Charu Tiwari

Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
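To make point (3) concrete, here is a minimal sketch of applying XPath-based extraction rules to two pages that share a template. It is not Vertex's rule learner: the sample pages, the field names, and the `extract_record` helper are assumptions made up for illustration, and the XPath expressions use only the subset supported by Python's standard `xml.etree.ElementTree`. The sketch only illustrates why a rule anchored on an attribute can survive a structural variation that breaks a positional rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages from the same site template; PAGE_B has an extra banner div.
PAGE_A = ("<html><body><div class='title'>Camera X</div>"
          "<div class='price'>$199</div></body></html>")
PAGE_B = ("<html><body><div class='banner'>Sale</div>"
          "<div class='title'>Lens Y</div>"
          "<div class='price'>$89</div></body></html>")

# Hypothetical extraction rules: field name -> XPath anchored on an attribute.
ROBUST_RULES = {
    "title": ".//div[@class='title']",
    "price": ".//div[@class='price']",
}

# A positional rule for comparison: "the second div under body".
POSITIONAL_RULES = {"price": "./body/div[2]"}


def extract_record(page_source, rules):
    """Apply each XPath rule to the page and return the extracted fields."""
    root = ET.fromstring(page_source)
    record = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(extract_record(page, ROBUST_RULES))
        print(extract_record(page, POSITIONAL_RULES))
```

On the second page the positional rule picks up the title instead of the price, while the attribute-anchored rules still return the intended fields.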

Conference on Information and Knowledge Management | 2012

Matching product titles using web-based enrichment

Vishrawas Gopalakrishnan; Suresh Iyengar; Amit Madaan; Rajeev Rastogi; Srinivasan H. Sengamedu

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations, with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
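The final scoring step described above, cosine similarity over tokens weighted by importance, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the enrichment step and the search-based importance computation are not reproduced, the `IMPORTANCE` values are hypothetical hand-picked numbers, and the default weight of 1.0 for unseen tokens is an arbitrary choice.

```python
import math

# Hypothetical importance scores; in the paper these come from search results.
IMPORTANCE = {"canon": 2.0, "eos": 3.0, "camera": 0.5, "dslr": 0.5}


def weighted_cosine(tokens_a, tokens_b, importance, default_weight=1.0):
    """Cosine similarity between two token sets, each token weighted by importance."""
    vec_a = {t: importance.get(t, default_weight) for t in set(tokens_a)}
    vec_b = {t: importance.get(t, default_weight) for t in set(tokens_b)}
    # Only shared tokens contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    title_a = ["canon", "eos", "camera"]
    title_b = ["canon", "eos", "dslr"]
    print(weighted_cosine(title_a, title_b, IMPORTANCE))
```

Because the shared tokens carry most of the weight in this toy example, the two titles score close to 1 even though one token differs.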

International World Wide Web Conferences | 2011

A classification based framework for concept summarization

Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan

In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.

International World Wide Web Conferences | 2008

Web page sectioning using regex-based template

Rupesh R. Mehta; Amit Madaan

This work presents a novel, site-specific algorithm for web page segmentation and section importance detection that leverages structural, content, and visual information. The structural and content information is captured by a template: a generalized regular expression learnt over a set of pages. Combining the template with visual information yields high sectioning accuracy. The experimental results demonstrate the effectiveness of the approach.
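As a rough illustration of how a regular-expression template can section a page, the sketch below matches a page's tag sequence against a hand-written pattern with named groups. Everything here is an assumption for illustration: the abstract does not describe how templates are learnt or represented, so `TEMPLATE`, the section names, and `section_page` are hypothetical, and visual information is ignored entirely.

```python
import re

# Hypothetical template over a space-joined tag sequence:
# one h1 title, one or more p tags as the body, then a closing div footer.
TEMPLATE = re.compile(r"^(?P<title>h1)(?P<body>(?: p)+)(?P<footer> div)$")


def section_page(tag_sequence):
    """Split a page's tag sequence into named sections, or return None if
    the page does not fit the template."""
    match = TEMPLATE.match(" ".join(tag_sequence))
    if match is None:
        return None
    return {name: text.split() for name, text in match.groupdict().items()}


if __name__ == "__main__":
    print(section_page(["h1", "p", "p", "p", "div"]))
    print(section_page(["p", "h1"]))
```

The repeated group `(?: p)+` is what makes the template a generalization: pages with different numbers of body paragraphs map to the same sections.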

International Conference on Data Mining | 2012

A Classification Based Framework for Concept Summarization

Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan

In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.

Archive | 2007

Techniques for inducing high quality structural templates for electronic documents

V. G. Vinod Vydiswaran; Rupesh R. Mehta; Amit Madaan

Archive | 2008

Site-specific information-type detection methods and systems

Rupesh R. Mehta; Amit Madaan

Archive | 2008

Structural clustering and template identification for electronic documents

Amit Madaan; V. G. Vinod Vydiswaran; Rupesh R. Mehta

Archive | 2009

High precision multi entity extraction

Amit Madaan; Charu Tiwari

Archive | 2009

Robust XPaths for web information extraction

Amit Madaan; Charu Tiwari; Rupesh R. Mehta

Collaboration

Amit Madaan's collaborations.

Top Co-Authors

Rupesh R. Mehta, Yahoo!
Charu Tiwari, Yahoo!
Srinivasan H. Sengamedu, Yahoo!
Dhruv Mahajan, Microsoft
Rajeev Rastogi, Yahoo!
Subhajit Sanyal, Yahoo!
Sundararajan Sellamanickam, Yahoo!
S. R. Jeyashankher, Yahoo!
Ashwin Tengli, Microsoft
