Sachindra Joshi | Researchain

Publication

Featured research published by Sachindra Joshi.

knowledge discovery and data mining | 2003

A bag of paths model for measuring structural similarity in Web documents

Sachindra Joshi; Neeraj Agrawal; Raghu Krishnapuram; Sumit Negi

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents based on the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages.
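The path-based idea can be sketched in a few lines. This is only an illustration under stated assumptions: documents are given as hypothetical nested `(tag, children)` tuples, each document becomes a multiset of root-to-node tag paths, and similarity is a multiset Jaccard overlap. The function names and the choice of Jaccard are assumptions made here, not the measures defined in the paper.

```python
from collections import Counter

def bag_of_paths(tree, prefix=""):
    """Return a Counter of root-to-node tag paths for a (tag, children) tree."""
    tag, children = tree
    path = f"{prefix}/{tag}"
    bag = Counter([path])
    for child in children:
        # Counter.update adds the child's path counts to this bag.
        bag.update(bag_of_paths(child, path))
    return bag

def path_similarity(bag_a, bag_b):
    """Multiset Jaccard: shared path counts divided by combined path counts."""
    shared = sum((bag_a & bag_b).values())   # element-wise minimum of counts
    total = sum((bag_a | bag_b).values())    # element-wise maximum of counts
    return shared / total if total else 1.0

# Two toy pages (hypothetical): same skeleton, different table size and an extra <p>.
page_a = ("html", [("body", [("table", [("tr", []), ("tr", [])])])])
page_b = ("html", [("body", [("table", [("tr", [])]), ("p", [])])])

similarity = path_similarity(bag_of_paths(page_a), bag_of_paths(page_b))
```

Because each document is reduced to a bag of strings, comparing two documents costs time roughly linear in the number of paths, which is the kind of saving over tree matching that the abstract points to.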

inductive logic programming | 2007

Using ILP to construct features for information extraction from semi-structured text

Ganesh Ramakrishnan; Sachindra Joshi; Sreeram V. Balakrishnan; Ashwin Srinivasan

Machine-generated documents containing semistructured text are rapidly forming the bulk of data being stored in an organisation. Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction (IE). But how are the feature-definitions to be obtained in the first place? (We are referring here to the representation problem: selecting good features from the ones defined comes later.) So far, features have been defined manually or by using special-purpose programs; neither approach scales well to handle the heterogeneity of the data or new domain-specific information. We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is able to efficiently identify large numbers of good features. Typically, the time taken to identify the features is comparable to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP-features are better than the best reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in the form of features and their values) is better than using it to construct intensional models for the tasks (in the form of rules for information extraction).

siam international conference on data mining | 2014

Joint Author Sentiment Topic Model

Subhabrata Mukherjee; Gaurab Basu; Sachindra Joshi

Traditional works in sentiment analysis and aspect rating prediction do not take author preferences and writing style into account during rating prediction of reviews. In this work, we introduce Joint Author Sentiment Topic Model (JAST), a generative process of writing a review by an author. Authors have different topic preferences, ‘emotional’ attachment to topics, writing style based on the distribution of semantic (topic) and syntactic (background) words and their tendency to switch topics. JAST uses Latent Dirichlet Allocation to learn the distribution of author-specific topic preferences and emotional attachment to topics. It uses a Hidden Markov Model to capture short-range syntactic and long-range semantic dependencies in reviews, and through them the coherence in author writing style. JAST jointly discovers the topics in a review, author preferences for the topics, topic ratings as well as the overall review rating from the point of view of an author. To the best of our knowledge, this is the first work in Natural Language Processing to bring all these dimensions together to have an author-specific generative model of a review.

knowledge discovery and data mining | 2008

Structured entity identification and document categorization: two tasks with one joint model

Indrajit Bhattacharya; Shantanu Godbole; Sachindra Joshi

Traditionally, research in identifying structured entities in documents has proceeded independently of document categorization research. In this paper, we observe that these two tasks have much to gain from each other. Apart from direct references to entities in a database, such as names of person entities, documents often also contain words that are correlated with discriminative entity attributes, such as age-group and income-level of persons. This happens naturally in many enterprise domains such as CRM, Banking, etc. Then, entity identification, which is typically vulnerable to noise and incompleteness in direct references to entities in documents, can benefit from document categorization with respect to such attributes. In return, entity identification enables documents to be categorized according to different label-sets arising from entity attributes without requiring any supervision. In this paper, we propose a probabilistic generative model for joint entity identification and document categorization. We show how the parameters of the model can be estimated using an EM algorithm in an unsupervised fashion. Using extensive experiments over real and semi-synthetic data, we demonstrate that the two tasks can benefit immensely from each other when performed jointly using the proposed model.

international world wide web conferences | 2013

Incorporating author preference in sentiment rating prediction of reviews

Subhabrata Mukherjee; Gaurab Basu; Sachindra Joshi

Traditional works in sentiment analysis do not incorporate author preferences during sentiment classification of reviews. In this work, we show that the inclusion of author preferences in sentiment rating prediction of reviews improves the correlation with ground ratings, over a generic author independent rating prediction model. The overall sentiment rating prediction for a review has been shown to improve by capturing facet level rating. We show that this can be further developed by considering author preferences in predicting the facet level ratings, and hence the overall review rating. To the best of our knowledge, this is the first work to incorporate author preferences in rating prediction.

annual srii global conference | 2011

Data Cleansing Techniques for Large Enterprise Datasets

K. Hima Prasad; Tanveer A. Faruquie; Sachindra Joshi; Snigdha Chaturvedi; L. Venkata Subramaniam; Mukesh K. Mohania

Data quality improvement is an important aspect of enterprise data management. Data characteristics can change with customer, domain, and geography, making data quality improvement a challenging task. Data quality improvement is often an iterative process which mainly involves writing a set of data quality rules for standardization and elimination of duplicates that are present within the data. Existing data cleansing tools require a fair amount of customization whenever moving from one customer to another and from one domain to another. In this paper, we present a data quality improvement tool which helps the data quality practitioner by showing the characteristics of the entities present in the data. The tool identifies the variants and synonyms of a given entity present in the data, which is an important task for writing data quality rules for standardizing the data. We present a ripple down rule framework for maintaining data quality rules which helps in reducing the services effort for adding new rules. We also present a typical workflow of the data quality improvement process and show the usefulness of the tool at each step. Finally, we present some experimental results and a discussion of the usefulness of the tool for reducing services effort in data quality improvement.
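To make the ripple down rule idea concrete, here is a minimal sketch of a single-classification ripple down rule tree. Everything specific in it is an assumption for illustration: the `Rule` class, the `evaluate` function, and the toy token records are hypothetical and are not taken from the tool described in the paper. Each rule carries an exception branch (tried when its condition holds) and an alternative branch (tried when it fails), and the conclusion of the last rule that fired wins.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    condition: Callable[[dict], bool]
    conclusion: str
    except_rule: Optional["Rule"] = None  # refinement, tried when condition holds
    else_rule: Optional["Rule"] = None    # alternative, tried when condition fails

def evaluate(rule, record, default=None):
    """Walk the rule tree and return the conclusion of the last rule that fired."""
    result = default
    while rule is not None:
        if rule.condition(record):
            result = rule.conclusion
            rule = rule.except_rule
        else:
            rule = rule.else_rule
    return result

# Hypothetical standardization rules for abbreviations in address tokens.
# "last" marks whether the token is the final token of the address.
root = Rule(
    condition=lambda r: r["token"].lower() == "dr",
    conclusion="Doctor",
    except_rule=Rule(lambda r: r["last"], "Drive"),
    else_rule=Rule(
        condition=lambda r: r["token"].lower() == "st",
        conclusion="Saint",
        except_rule=Rule(lambda r: r["last"], "Street"),
    ),
)

expanded = evaluate(root, {"token": "Dr", "last": True})
```

A new rule is attached only as an exception or alternative to an existing one, so earlier rules never need to be rewritten. That locality is one plausible reading of why such a framework would lower the effort of adding rules.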

international conference on data mining | 2009

Cross-Guided Clustering: Transfer of Relevant Supervision across Domains for Improved Clustering

Indrajit Bhattacharya; Shantanu Godbole; Sachindra Joshi; Ashish Verma

Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred to a clustering task in a target domain, by providing a relevant supervised partitioning of a dataset from a different source domain. The target clustering is made more meaningful for the human user by trading off intrinsic clustering goodness on the target dataset for alignment with relevant supervised partitions in the source dataset, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-domain similarity measure that discovers hidden relationships across domains with potentially different vocabularies. Using multiple real-world datasets, we show that our approach improves clustering accuracy significantly over traditional k-means.

Neurocomputing | 2012

Letters: Using Sequential Unconstrained Minimization Techniques to simplify SVM solvers

Sachindra Joshi; Jayadeva; Ganesh Ramakrishnan; Suresh Chandra

In this paper, we apply Sequential Unconstrained Minimization Techniques (SUMTs) to the classical formulations of both the L1 norm SVM and the least squares SVM. We show that each can be solved as a sequence of unconstrained optimization problems with only box constraints. We propose relaxed SVM and relaxed LSSVM formulations that correspond to a single problem in the corresponding SUMT sequence. We also propose an SMO-like algorithm to solve the relaxed formulations that works by updating individual Lagrange multipliers. The methods yield comparable or better results on large benchmark datasets than classical SVM and LSSVM formulations, at substantially higher speeds.
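The general SUMT pattern can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's relaxed SVM: it minimizes (x - 2)^2 subject to x <= 1 by solving a sequence of unconstrained problems with a quadratic exterior penalty, warm-starting each from the previous solution. The function names, the penalty schedule, and the use of plain gradient descent are all choices made here for illustration.

```python
def minimize_penalized(mu, x_start, iters=50):
    """Gradient descent on (x - 2)^2 + mu * max(0, x - 1)^2 for a fixed penalty mu."""
    x = x_start
    # 2 * (1 + mu) bounds the curvature of the objective, so this step is stable.
    step = 1.0 / (2.0 * (1.0 + mu))
    for _ in range(iters):
        grad = 2.0 * (x - 2.0) + 2.0 * mu * max(0.0, x - 1.0)
        x -= step * grad
    return x

def sumt(penalties=(1.0, 10.0, 100.0, 1000.0), x_start=0.0):
    """Solve one unconstrained problem per penalty value, warm-starting each one."""
    x = x_start
    trace = []
    for mu in penalties:
        x = minimize_penalized(mu, x)
        trace.append((mu, x))
    return trace

trace = sumt()
```

For this toy problem the minimizer of each penalized objective is (2 + mu) / (1 + mu), so the iterates approach the constrained optimum x = 1 from the infeasible side as mu grows. The abstract's relaxed formulations correspond to stopping at a single problem in such a sequence.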

international conference on data engineering | 2004

EShopMonitor: a Web content monitoring tool

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi

Data presented on commerce sites runs into thousands of pages, and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data such as $9.99 air fares, missing links, drastic changes in prices and addition of new products or promotions. We describe a system that monitors Web sites automatically and generates various types of reports so that the content of the site can be monitored and the quality maintained. The solution designed and implemented by us consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that can be configured by the user to answer specific queries. The tool can also be used for identifying price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.

Proceedings of the 1st IKDD Conference on Data Sciences | 2014

Theme Based Clustering of Tweets

Rudra M. Tripathy; Shashank Sharma; Sachindra Joshi; Sameep Mehta; Amitabha Bagchi
