Srinivasan H. Sengamedu


Publication

Featured research published by Srinivasan H. Sengamedu.

knowledge discovery and data mining | 2011

Entity disambiguation with hierarchical topic models

Saurabh Kataria; Krishnan S. Kumar; Rajeev Rastogi; Prithviraj Sen; Srinivasan H. Sengamedu

Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.

international conference on data engineering | 2011

Web-scale information extraction with vertex

Pankaj Gulhane; Amit Madaan; Rupesh Rasiklal Mehta; Jeyashankher Ramamirtham; Rajeev Rastogi; Sandeepkumar Satpal; Srinivasan H. Sengamedu; Ashwin Tengli; Charu Tiwari

Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.

international world wide web conferences | 2010

Exploiting content redundancy for web information extraction

Pankaj Gulhane; Rajeev Rastogi; Srinivasan H. Sengamedu; Ashwin Tengli

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.

acm multimedia | 2007

vADeo: video advertising system

Srinivasan H. Sengamedu; Neela Sawant; Smita Wadhwa

vADeo is a next-generation video ad system that analyzes the video to find appropriate scene changes where ads can be inserted. The context of each ad insertion point is determined through high-level analysis of the surrounding video segment, thereby making ads contextual. Further, vADeo implements two novelties on the player side, ad book-marking and delayed interaction, which encourage ad clicks without disrupting the video viewing experience.

web search and data mining | 2012

Comment spam detection by sequence mining

Ravi Kant; Srinivasan H. Sengamedu; Krishnan S. Kumar

Comments are supported by several web sites to increase user participation. Users can usually comment on a variety of media types - photos, videos, news articles, blogs, etc. Comment spam is one of the biggest challenges facing this feature. The traditional approach to combat spam is to train classifiers using various machine learning techniques. Since the commonly used classifiers work on the entire comment text, it is easy to mislead them by embedding spam content in good content. In this paper, we make several contributions towards comment spam detection. (1) We propose a new framework for spam detection that is immune to embed attacks. We characterize spam by a set of frequently occurring sequential patterns. (2) We introduce a variant (called min-closed) of the frequent closed sequence mining problem that succinctly captures all the frequently occurring patterns. We prove as well as experimentally show that the set of min-closed sequences is an order of magnitude smaller than the set of closed sequences and yet has exactly the same coverage. (3) We describe MCPRISM, an extension of the recently published PRISM algorithm that effectively mines min-closed sequences, using prime encoding. In the process, we solve the open problem of using the prime-encoding technique to speed up traditional closed sequence mining. (4) We finally need to whittle down the set of frequent subsequences to a small set without sacrificing coverage. This problem is NP-Hard but we show that the coverage function is submodular and hence the greedy heuristic gives a fast algorithm that is close to optimal. We then describe the experiments that were carried out on a large real-world comment dataset and the publicly available Gazelle dataset. (1) We show that nearly 80% of spam on real-world data can be effectively captured by the mined sequences at very low false positive rates. (2) The sequences mined are highly discriminative. (3) On Gazelle data, the proposed algorithmic enhancements are faster by at least a factor and by an order of magnitude on the larger comment dataset.
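
The greedy selection step mentioned in contribution (4) can be illustrated with a generic maximum-coverage sketch. This is not the paper's implementation: the function name `greedy_cover`, the budget `k`, and the representation of each pattern as the set of comment ids it matches are all assumptions made for illustration.

```python
def greedy_cover(patterns, k):
    """Greedily pick up to k patterns that together cover the most items.

    patterns: dict mapping a pattern name to the set of item ids it covers
              (hypothetical representation, chosen for this sketch).
    Returns (chosen pattern names in pick order, set of covered ids).
    """
    covered = set()
    chosen = []
    remaining = dict(patterns)
    for _ in range(k):
        # Pick the pattern with the largest marginal gain in coverage.
        best = max(remaining, key=lambda p: len(remaining[p] - covered), default=None)
        # Stop early when no remaining pattern adds anything new.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


if __name__ == "__main__":
    demo = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    print(greedy_cover(demo, 2))
```

Because coverage is a monotone submodular function, this kind of greedy loop carries the standard (1 - 1/e) approximation guarantee for a fixed budget, which is the sense in which a greedy heuristic is "close to optimal".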

acm multimedia | 2007

LogoSeeker: a system for detecting and matching logos in natural images

Subhajit Sanyal; Srinivasan H. Sengamedu

The dominant advertising model on the Internet is based on matching search keywords or web page content to ads. The matching is based on text content. There is an explosion of media content on the Internet. Matching based on image content has not taken off on the Internet despite the huge popularity of sites like flickr.com. In this demo, we show we can adapt techniques from image matching to enable a logo-based advertisement matching system for photo sharing sites like Flickr. Logo detection is based on detection and matching of salient points.

conference on information and knowledge management | 2011

Supervised matching of comments with news article segments

Dyut Kumar Sil; Srinivasan H. Sengamedu; Chiranjib Bhattacharyya

Comments constitute an important part of Web 2.0. In this paper, we consider comments on news articles. To simplify the task of relating the comment content to the article content the comments are about, we propose the idea of showing comments alongside article segments and explore automatic mapping of comments to article segments. This task is challenging because of the vocabulary mismatch between the articles and the comments. We present supervised and unsupervised techniques for aligning comments to segments of the article the comments are about. More specifically, we provide a novel formulation of the supervised alignment problem using the framework of structured classification. Our experimental results show that the structured classification model performs better than unsupervised matching and the binary classification model.

conference on information and knowledge management | 2012

Matching product titles using web-based enrichment

Vishrawas Gopalakrishnan; Suresh Iyengar; Amit Madaan; Rajeev Rastogi; Srinivasan H. Sengamedu

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the Cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.
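
The final scoring step, cosine similarity over importance-weighted tokens, can be sketched as follows. This is only an illustration under stated assumptions: the function name `weighted_cosine`, the binary token-presence vectors, the default weight of 1.0 for unseen tokens, and the example importance scores are all hypothetical. The abstract's enrichment step and search-based importance scores are not reproduced here.

```python
import math


def weighted_cosine(tokens_a, tokens_b, importance):
    """Cosine similarity between two token sets, each token weighted by importance.

    importance: dict token -> weight (assumed given; tokens without an
                entry default to 1.0 in this sketch).
    """
    # Each title becomes a sparse vector: token -> importance weight.
    va = {t: importance.get(t, 1.0) for t in set(tokens_a)}
    vb = {t: importance.get(t, 1.0) for t in set(tokens_b)}
    # Only shared tokens contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    weights = {"canon": 2.0, "eos": 2.0, "camera": 1.0, "kit": 1.0}
    print(weighted_cosine(["canon", "eos", "camera"], ["canon", "eos", "kit"], weights))
```

With these made-up weights the shared brand and model tokens dominate the score, so the two titles come out as close matches even though their remaining tokens differ.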

international world wide web conferences | 2011

ReadAlong: reading articles and comments together

Dyut Kumar Sil; Srinivasan H. Sengamedu; Chiranjib Bhattacharyya

We propose a new paradigm for displaying comments: showing comments alongside parts of the article they correspond to. We evaluate the effectiveness of various approaches for this task and show that a combination of bag of words and topic models performs the best.

acm multimedia | 2011

Detection of pornographic content in internet images

Srinivasan H. Sengamedu; Subhajit Sanyal; Sriram Satish

Pornographic image detection is an important and challenging problem. Detection of pornography on the Internet is even more challenging because of the scale (billions of images) and diversity (small to very large images, graphic, grey scale images, etc.) of image content. The performance requirements (precision, recall, and speed) are also very stringent. Because of this, no single technique provides the required performance. In this paper, we describe a framework for detecting images with pornographic content. The framework combines various techniques based on object-level and pixel-level analysis of image content. To enable high-precision, we detect body parts (including faces) in images. For high-recall, low-level techniques like color and texture features are used. For adaptation to new datasets, we also support learning of appropriate color models from weakly-labeled datasets. In addition to image-based analysis, both text-based and site-level analysis are performed. Unlike many adult detection techniques, we explicitly leverage techniques like texture analysis and face detection for non-adult content identification. The multiple cues are combined in a systematic manner using ROC analysis and boosting. Evaluations on real world web data indicate that the system has the best performance among the systems compared.
