Amit Madaan
Yahoo!
Publication
Featured research published by Amit Madaan.
international conference on data engineering | 2011
Pankaj Gulhane; Amit Madaan; Rupesh Rasiklal Mehta; Jeyashankher Ramamirtham; Rajeev Rastogi; Sandeepkumar Satpal; Srinivasan H. Sengamedu; Ashwin Tengli; Charu Tiwari
Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.
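As a rough illustration of what an XPath-based extraction rule looks like, the sketch below applies two rules to a simplified template page. It is not Vertex's implementation: the page markup, the `RULES` dictionary, and the `extract_record` helper are all hypothetical, and anchoring rules on `class` attributes instead of element positions is only one assumed way of tolerating small structural variations.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-based page, simplified to well-formed XML.
PAGE = """
<html><body>
  <div class="promo">Sale!</div>
  <div class="product">
    <span class="title">Acme Kettle</span>
    <span class="price">19.99</span>
  </div>
</body></html>
"""

# Hypothetical rules: attribute-anchored paths rather than positional ones,
# so an extra <div> inserted above the record does not break them.
RULES = {
    "title": ".//span[@class='title']",
    "price": ".//span[@class='price']",
}


def extract_record(page_source, rules):
    """Apply each XPath rule to the page and collect one structured record."""
    root = ET.fromstring(page_source)
    record = {}
    for field, path in rules.items():
        node = root.find(path)  # first matching element, or None
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


record = extract_record(PAGE, RULES)
print(record)
```

The standard-library `xml.etree.ElementTree` module supports only a subset of XPath, which is enough for this sketch; real Web pages would need an HTML-tolerant parser.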
conference on information and knowledge management | 2012
Vishrawas Gopalakrishnan; Suresh Iyengar; Amit Madaan; Rajeev Rastogi; Srinivasan H. Sengamedu
Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations, with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
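The final scoring step, cosine similarity over importance-weighted tokens, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `weighted_cosine` function, the example titles, and the importance values are hypothetical, and the search-engine-based enrichment and importance scoring are assumed to have already produced the `importance` dictionary.

```python
import math


def weighted_cosine(tokens_a, tokens_b, importance):
    """Cosine similarity between two token sets, each token weighted by its
    importance score (tokens without a score get weight 0)."""
    a, b = set(tokens_a), set(tokens_b)

    def weight(token):
        return importance.get(token, 0.0)

    dot = sum(weight(t) ** 2 for t in a & b)
    norm_a = math.sqrt(sum(weight(t) ** 2 for t in a))
    norm_b = math.sqrt(sum(weight(t) ** 2 for t in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical enriched titles and importance scores.
title_a = ["acme", "kettle", "1.7l"]
title_b = ["acme", "kettle", "steel"]
importance = {"acme": 2.0, "kettle": 1.0, "1.7l": 0.5, "steel": 0.5}

score = weighted_cosine(title_a, title_b, importance)
print(round(score, 4))
```

Because the brand and product tokens carry most of the weight in this example, the two titles score close to 1 even though their specification tokens differ.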
international world wide web conferences | 2011
Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan
In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.
international world wide web conferences | 2008
Rupesh R. Mehta; Amit Madaan
This work aims to provide a novel, site-specific web page segmentation and section importance detection algorithm, which leverages structural, content, and visual information. The structural and content information is leveraged via a template, a generalized regular expression learnt over a set of pages. The template, along with visual information, results in high sectioning accuracy. The experimental results demonstrate the effectiveness of the approach.
international conference on data mining | 2012
Dhruv Mahajan; Sundararajan Sellamanickam; Subhajit Sanyal; Amit Madaan
In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both visual and semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. We demonstrate the efficacy of our method on seven broad query types: sports, news, celebrities, events, travel, country and abstract. Experimental results on these query types show that the proposed method performs better than state-of-the-art methods in terms of satisfying important visual and semantic properties, both qualitatively and quantitatively. We observe from editorial evaluation that around 78% of our summaries are of high enough quality to be shown directly to the web users with minimal or no modifications.
Archive | 2007
V. G. Vinod Vydiswaran; Rupesh R. Mehta; Amit Madaan
Archive | 2008
Rupesh R. Mehta; Amit Madaan
Archive | 2008
Amit Madaan; V. G. Vinod Vydiswaran; Rupesh R. Mehta
Archive | 2009
Amit Madaan; Charu Tiwari
Archive | 2009
Amit Madaan; Charu Tiwari; Rupesh R. Mehta