Kunal Punera | Researchain

Publication

Featured research published by Kunal Punera.

international world wide web conferences | 2005

The volume and evolution of web page templates

David Gibson; Kunal Punera; Andrew Tomkins

Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. We study the nature, evolution, and prevalence of these templates on the web. As part of this work, we develop new randomized algorithms for template extraction that perform approximately twenty times faster than existing approaches with similar quality. Our results show that 40-50% of the content on the web is template content. Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating. Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year. We discuss the deleterious implications of this growth for information retrieval and ranking, classification, and link analysis.

international world wide web conferences | 2007

Page-level template detection via isotonic smoothing

Deepayan Chakrabarti; Ravi Kumar; Kunal Punera

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
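
As a rough illustration of the isotonic smoothing idea mentioned in this abstract, the sketch below fits a non-decreasing sequence to noisy scores using the standard pool-adjacent-violators procedure. It is a simplification under stated assumptions: the abstract describes a regularized problem over the DOM nodes of a page, while this example is unregularized and works on a plain list. The function name `isotonic_fit` and the sample scores are hypothetical.

```python
from typing import List


def isotonic_fit(scores: List[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    Illustrative only: unregularized and one-dimensional, not the
    regularized per-page formulation the abstract refers to.
    """
    # Each block stores [sum of values, count]; its fitted value is sum / count.
    blocks: List[List[float]] = []
    for s in scores:
        blocks.append([s, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each block back into one fitted value per original position.
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Hypothetical classifier scores along one root-to-leaf path.
smoothed = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

Here the out-of-order pair 3.0, 2.0 is pooled into their mean, so the fitted sequence never decreases.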

international world wide web conferences | 2008

A graph-theoretic approach to webpage segmentation

Deepayan Chakrabarti; Ravi Kumar; Kunal Punera

We consider the problem of segmenting a webpage into visually and semantically cohesive pieces. Our approach is based on formulating an appropriate optimization problem on weighted graphs, where the weights capture if two nodes in the DOM tree should be placed together or apart in the segmentation; we present a learning framework to learn these weights from manually labeled data in a principled manner. Our work is a significant departure from previous heuristic and rule-based solutions to the segmentation problem. The results of our empirical analysis bring out interesting aspects of our framework, including variants of the optimization problem and the role of learning.

Applied Artificial Intelligence | 2008

CONSENSUS-BASED ENSEMBLES OF SOFT CLUSTERINGS

Kunal Punera; Joydeep Ghosh

The problem of obtaining a single “consensus” clustering solution from a multitude or ensemble of clusterings of a set of objects has attracted much interest recently because of its numerous practical applications. While a wide variety of approaches including graph partitioning, maximum likelihood, genetic algorithms, and voting-merging have been proposed so far to solve this problem, virtually all of them work on hard partitionings, i.e., where an object is a member of exactly one cluster in any individual solution. However, many clustering algorithms such as fuzzy c-means naturally output soft partitionings of data, and forcibly hardening these partitions before applying a consensus method potentially involves loss of valuable information. In this article we propose several consensus algorithms that can be applied directly to soft clusterings. Experimental results over a variety of real-life datasets are also provided to show that using soft clusterings as input does offer significant advantages, especially when dealing with vertically partitioned data.

international world wide web conferences | 2005

Automatically learning document taxonomies for hierarchical classification

Kunal Punera; Suju Rajan; Joydeep Ghosh

While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new technique that extracts a suitable hierarchical structure automatically from a corpus of labeled documents. We show that our technique groups similar classes closer together in the tree and discovers relationships among documents that are not encoded in the class labels. The learned taxonomy is then used along with binary SVMs for multi-class classification. We demonstrate the efficacy of our approach by testing it on the 20-Newsgroup dataset.

knowledge discovery and data mining | 2008

Generating succinct titles for web URLs

Deepayan Chakrabarti; Ravi Kumar; Kunal Punera

How can a search engine automatically provide the best and most appropriate title for a result URL (link-title) so that users will be persuaded to click on the URL? We consider the problem of automatically generating link-titles for URLs and propose a general statistical framework for solving this problem. The framework is based on using information from a diverse collection of sources, each of which can be thought of as contributing one or more candidate link-titles for the URL. It can also incorporate the context in which the link-title will be used, along with constraints on its length. Our framework is applicable to several scenarios: obtaining succinct titles for displaying quicklinks, obtaining titles for URLs that lack a good title, constructing succinct sitemaps, etc. Extensive experiments show that our method is very effective, producing results that are at least 20% better than non-trivial baselines.

web search and data mining | 2011

Scalable clustering of news search results

Choon Hui Teo; Suju Rajan; Kunal Punera; Byron Dom; Alexander J. Smola; Yi Chang; Zhaohui Zheng

In this paper, we present a system for clustering the search results of a news search engine. The news search interface includes the relevant news articles to a given query organized in terms of related news stories. Here each cluster corresponds to a news story and the news articles are clustered into stories. We present a system that clusters the search results of a news search system in a fast and scalable manner. The clustering system is organized into three components including offline clustering, incremental clustering and realtime clustering. We propose novel techniques for clustering the search results in realtime. The experimental results with large collections of news documents reveal that our system is both scalable and also achieves good accuracy in clustering the news search results.

knowledge discovery and data mining | 2006

Hierarchical topic segmentation of websites

Ravi Kumar; Kunal Punera; Andrew Tomkins

In this paper, we consider the problem of identifying and segmenting topically cohesive regions in the URL tree of a large website. Each page of the website is assumed to have a topic label or a distribution on topic labels generated using a standard classifier. We develop a set of cost measures characterizing the benefit accrued by introducing a segmentation of the site based on the topic labels. We propose a general framework to use these measures for describing the quality of a segmentation; we also provide an efficient algorithm to find the best segmentation in this framework. Extensive experiments on human-labeled data confirm the soundness of our framework and suggest that a judicious choice of cost measures allows the algorithm to perform surprisingly accurate topical segmentations.

international world wide web conferences | 2009

Quicklink selection for navigational query results

Deepayan Chakrabarti; Ravi Kumar; Kunal Punera

Quicklinks for a website are navigational shortcuts displayed below the website homepage on a search results page that let users jump directly to selected points inside the website. Since the real estate on a search results page is constrained and valuable, picking the best set of quicklinks to maximize the benefits for a majority of the users becomes an important problem for search engines. Using user browsing trails obtained from browser toolbars, and a simple probabilistic model, we formulate the quicklink selection problem as a combinatorial optimization problem. We first demonstrate the hardness of the objective, and then propose an algorithm that is provably within a factor of 1-1/e of the optimal. We also propose a different algorithm that works on trees and that can find the optimal solution; unlike the previous algorithm, this algorithm can incorporate natural constraints on the set of chosen quicklinks. The efficacy of our methods is demonstrated via empirical results on both a manually labeled set of websites and a set for which quicklink click-through rates for several webpages were obtained from a real-world search engine.
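
The 1-1/e factor mentioned in this abstract is the classic guarantee of greedy selection for monotone submodular objectives under a size limit. The sketch below shows that generic greedy pattern on a simple coverage objective. It is an assumption-laden illustration, not the paper's algorithm or probabilistic model: the function `greedy_select`, the coverage objective, and the toy trail data are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k candidates, each time taking the largest marginal gain.

    Generic greedy for a coverage objective (illustrative assumption);
    the abstract does not specify the paper's objective in this form.
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        best, best_gain = None, 0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = len(candidates[name] - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen


# Hypothetical toy data: each candidate link maps to the users it would serve.
trails = {
    "/news": {"u1", "u2", "u3"},
    "/sports": {"u3", "u4"},
    "/mail": {"u5", "u6"},
    "/weather": {"u1", "u2"},
}
picked = greedy_select(trails, 2)
```

With this toy data the first pick serves three users and the second adds two more, while a link whose users are already covered is never chosen.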

Knowledge and Information Systems | 2008

Effective and efficient classification on a search-engine model

Aris Anagnostopoulos; Andrei Z. Broder; Kunal Punera

Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of those techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that on our set-up the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
