Neeraj Agrawal | Researchain

Publication

Featured researches published by Neeraj Agrawal.

knowledge discovery and data mining | 2003

A bag of paths model for measuring structural similarity in Web documents

Sachindra Joshi; Neeraj Agrawal; Raghu Krishnapuram; Sumit Negi

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computation of structural similarity between documents based on the tree model is computationally expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages.
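As a rough illustration of the path-based idea in the abstract above, the following minimal sketch represents a page as a bag (multiset) of root-to-node tag paths and compares two bags. The names `PathCollector`, `bag_of_paths`, and `path_similarity` are hypothetical, and the weighted Jaccard overlap is an assumed stand-in, not necessarily one of the similarity measures defined in the paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML; they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Counts the root-to-node tag path of every element in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []         # tags currently open, root first
        self.paths = Counter()  # path string -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/" + "/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def bag_of_paths(html):
    """Return the bag of tag paths for an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(a, b):
    """Weighted Jaccard overlap of two path bags (an assumed measure)."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total
```

Comparing bags only requires counting and set operations over path strings, which is what makes this kind of measure cheap relative to tree-to-tree comparison. Pages with similar templates share most of their paths and score close to 1, while pages built from different layouts share only the top-level paths.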

international conference on data engineering | 2004

EShopMonitor: a Web content monitoring tool

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi

Data presented on commerce sites runs into thousands of pages, and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data such as $9.99 air fares, missing links, drastic changes in prices and addition of new products or promotions. We describe a system that monitors Web sites automatically and generates various types of reports so that the content of the site can be monitored and the quality maintained. The solution designed and implemented by us consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that can be configured by the user to answer specific queries. The tool can also be used for identifying price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.

Archive | 2003

Determining structural similarity in semi-structured documents

Neeraj Agrawal; Sachindra Joshi; Raghuram Krishnapuram; Sumit Negi

IBM Journal of Research and Development | 2004

The eShopmonitor: a comprehensive data extraction tool for monitoring web sites

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi

Typical commercial Web sites publish information from multiple back-end data sources; these data sources are also updated very frequently. Given the size of most commercial sites today, it becomes essential to have an automated means of checking for correctness and consistency of data. The eShopmonitor allows users to specify items of interest to be tracked, monitors these items on the Web pages, and reports on any changes observed. Our solution comprises a crawler, a miner, a reporter, and a user component that work together to achieve the above functionality. The miner learns to locate the items of interest on a class of pages based on just one sample supplied by the user, via the user interface (UI) provided. The learning algorithm is based on the XPaths of the Document Object Model (DOM) of the page.
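To make the one-sample, XPath-based idea in the eShopmonitor abstract more concrete, here is a minimal sketch that records the positional path of a user-selected item on one page and reuses it on another page of the same class. It assumes well-formed XHTML input and uses hypothetical helper names (`learn_path`, `extract`); it is not the eShopmonitor learning algorithm itself, which the abstract does not describe in detail.

```python
import xml.etree.ElementTree as ET


def learn_path(root, sample_text):
    """Return a positional path to the first element whose text equals sample_text.

    The path uses the limited XPath syntax supported by ElementTree,
    for example "./body[1]/div[1]/span[1]". Returns None if nothing matches.
    """

    def walk(node, path):
        if (node.text or "").strip() == sample_text:
            return path
        seen = {}  # tag -> how many siblings with that tag were visited so far
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, ".")


def extract(root, path):
    """Apply a learned path to another page and return the item's text."""
    node = root.find(path)
    return None if node is None else (node.text or "").strip()


# One sample page where the user marks the price as the item of interest.
sample_page = "<html><body><div><h1>Laptop A</h1><span>$999</span></div></body></html>"
# Another page generated from the same template.
other_page = "<html><body><div><h1>Laptop B</h1><span>$1,249</span></div></body></html>"

price_path = learn_path(ET.fromstring(sample_page), "$999")
tracked_price = extract(ET.fromstring(other_page), price_path)
```

A purely positional path like this breaks as soon as the template gains or loses a sibling element, so a practical miner would likely need a more robust way to generalize from the sample. Real-world HTML is also rarely well-formed XML, so a tolerant parser would be needed in place of `ET.fromstring`.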

Archive | 2008