Sudeshna Sarkar

Indian Institute of Technology Kharagpur

Publication

Featured research published by Sudeshna Sarkar.

Analytics for Noisy Unstructured Text Data | 2007

Investigation and modeling of the structure of texting language

Monojit Choudhury; Rahul Saraf; Vijit Jain; Animesh Mukherjee; Sudeshna Sarkar; Anupam Basu

Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell checking techniques for TL words. In this work, we formally investigate the nature and type of compressions used in SMS texts, and develop a Hidden Markov Model based word-model for TL. The model parameters have been estimated through standard machine learning techniques from a word-aligned SMS and standard English parallel corpus. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% reduction in the relative word-level error rate.
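
As a rough illustration of how a bigram language model can help choose among candidate corrections, the following Python sketch runs a Viterbi search over per-token candidates. It is not the authors' implementation: the `decode` function, the candidate lists, and every probability are made-up assumptions, and the channel scores merely stand in for the output of a word model such as the HMM mentioned in the abstract.

```python
import math


def decode(tokens, candidates, bigram, floor=1e-6):
    """Pick the most probable standard-word sequence for a list of TL tokens.

    candidates: token -> {standard word: P(token | word)}  (hypothetical channel scores)
    bigram:     (previous word, word) -> P(word | previous word)
    floor:      probability assumed for unseen bigrams
    """
    # best[word] = (log score, path) for the best sequence ending in `word`
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        nxt = {}
        for word, p_channel in candidates[tok].items():
            # Best predecessor for this candidate under the bigram model
            score, path = max(
                (prev_score + math.log(bigram.get((prev, word), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[word] = (score + math.log(p_channel), path + [word])
        best = nxt
    return max(best.values())[1]


# Toy data, invented for illustration only
candidates = {
    "c": {"see": 0.6, "sea": 0.4},
    "u": {"you": 0.9, "yew": 0.1},
    "2moro": {"tomorrow": 1.0},
}
bigram = {
    ("<s>", "see"): 0.2,
    ("<s>", "sea"): 0.1,
    ("see", "you"): 0.5,
    ("you", "tomorrow"): 0.3,
}

print(decode(["c", "u", "2moro"], candidates, bigram))
# prints ['see', 'you', 'tomorrow']
```

The context supplied by the bigram scores is what separates "see you" from "sea you" here; a per-word model alone would have no basis for preferring one sequence over another beyond its channel scores.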

Journal of Biomedical Informatics | 2009

Feature selection techniques for maximum entropy based biomedical named entity recognition

Sujan Kumar Saha; Sudeshna Sarkar; Pabitra Mitra

Named entity recognition is an extremely important and fundamental task of biomedical text mining. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which often have complex structures, making such entities challenging to identify and classify. Machine learning methods like CRF, MEMM and SVM have been widely used for learning to recognize such entities from an annotated corpus. The identification of appropriate feature templates and the selection of the important feature values play a very important role in the success of these methods. In this paper, we provide a study on word clustering and selection based feature reduction approaches for named entity recognition using a maximum entropy classifier. The identification and selection of features are largely done automatically without using domain knowledge. The performance of the system is found to be superior to existing systems which do not use domain knowledge.

Meeting of the Association for Computational Linguistics | 2007

Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario

Sandipan Dandapat; Sudeshna Sarkar; Anupam Basu

This paper describes our work on building a Part-of-Speech (POS) tagger for Bengali. We have used Hidden Markov Model (HMM) and Maximum Entropy (ME) based stochastic taggers. Bengali is a morphologically rich language and our taggers make use of morphological and contextual information of the words. Since only a small labeled training set is available (45,000 words), a simple stochastic approach does not yield very good results. In this work, we have studied the effect of using a morphological analyzer to improve the performance of the tagger. We find that the use of morphology helps improve the accuracy of the tagger, especially when less tagged data is available.

Information & Software Technology | 2002

An efficient dynamic program slicing technique

G. B. Mund; Rajib Mall; Sudeshna Sarkar

An important application of the dynamic program slicing technique is program debugging. In applications such as interactive debugging, the dynamic slicing algorithm needs to be efficient. In this context, we propose a new dynamic program slicing technique that is more efficient than the related algorithms reported in the literature. We use the program dependence graph as an intermediate program representation, and modify it by introducing the concepts of stable and unstable edges. Our algorithm is based on marking and unmarking the unstable edges as and when the dependences arise and cease during run-time. We call this algorithm the edge-marking algorithm. After an execution of a node x, an unstable edge (x, y) is marked if the node x uses the value of the variable defined at node y. A marked unstable edge (x, y) is unmarked after an execution of a node z if the nodes y and z define the same variable var, and the value of var computed at the node y does not affect the present value of var defined at the node z. We show that our algorithm is more time and space efficient than the existing ones. The worst case space complexity of our algorithm is O(n^2), where n is the number of statements in the program. We also briefly discuss an implementation of our algorithm.
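
The idea that only dependences active in the current execution belong in a dynamic slice can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the edge-marking algorithm from the paper: it assumes straight-line code, tracks data dependences only (no control dependences and no program dependence graph), and the `dynamic_slices` function and the toy program are hypothetical.

```python
def dynamic_slices(trace, defs, uses):
    """Compute a data-dependence-only dynamic slice for each executed node.

    trace: node ids in execution order
    defs:  node -> variable it defines (or None)
    uses:  node -> set of variables it reads
    Returns node -> set of nodes its most recent execution depends on.
    """
    last_def = {}   # variable -> node that most recently defined it
    slice_of = {}   # node -> slice computed at its most recent execution
    for node in trace:
        current = {node}
        for var in uses.get(node, ()):
            src = last_def.get(var)
            if src is not None:
                # Only the reaching definition contributes; earlier,
                # overwritten definitions of `var` are ignored.
                current |= slice_of[src]
        slice_of[node] = current
        var = defs.get(node)
        if var is not None:
            last_def[var] = node
    return slice_of


# Toy program (invented for illustration):
#   1: x = 1
#   2: y = 2
#   3: x = 5
#   4: z = x + y
defs = {1: "x", 2: "y", 3: "x", 4: "z"}
uses = {4: {"x", "y"}}
slices = dynamic_slices([1, 2, 3, 4], defs, uses)
print(sorted(slices[4]))
# prints [2, 3, 4]
```

Node 1 is absent from the slice of node 4 because its definition of `x` was overwritten at node 3 before being used, which loosely mirrors the situation in which the abstract says a marked edge becomes unmarked.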

Information & Software Technology | 2003

Computation of intraprocedural dynamic program slices

G. B. Mund; Rajib Mall; Sudeshna Sarkar

Dynamic slicing algorithms are used in interactive applications such as program debugging and testing. Therefore, these algorithms need to be very efficient. In this context, we propose three intraprocedural dynamic slicing algorithms which are more space and time efficient than the existing algorithms. Two of the proposed algorithms compute precise dynamic slices of structured programs using the Program Dependence Graph as an intermediate representation. To compute precise dynamic slices of unstructured programs, we introduce the concepts of jump dependence and the Unstructured Program Dependence Graph. The third algorithm uses the Unstructured Program Dependence Graph as the intermediate program representation, and computes precise dynamic slices of unstructured programs. We show that each of our proposed algorithms is more space and time efficient than the existing algorithms.

POLIBITS | 2008

Named Entity Recognition in Hindi using Maximum Entropy and Transliteration

Sujan Kumar Saha; Partha Sarathi Ghosh; Sudeshna Sarkar; Pabitra Mitra

Named entities are perhaps the most important indexing elements in text for most information extraction and mining tasks. Construction of a Named Entity Recognition (NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used for the development of NER systems. In many resource-poor languages, gazetteer lists of proper size are not available, but sometimes relevant lists are available in English. Proper transliteration makes the English lists useful in the NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi. We have explored different features applicable to the Hindi NER task. We have incorporated some gazetteer lists in the system to increase its performance. These lists are collected from the web and are in English. To make these English lists useful in the Hindi NER task, we have proposed a two-phase transliteration methodology. A considerable performance improvement is observed after using the transliteration based gazetteer lists in the system. The proposed transliteration based gazetteer preparation methodology is also applicable to other languages. Apart from Hindi, we have applied the transliteration approach to the Bengali NER task and also achieved performance improvement.

International Conference on User Modeling, Adaptation, and Personalization | 2012

Preference relation based matrix factorization for recommender systems

Maunendra Sankar Desarkar; Roopam Saxena; Sudeshna Sarkar

Users in recommender systems often express their opinions about different items by rating the items on a fixed rating scale. The rating information provided by the users is used by the recommender systems to generate personalized recommendations for them. A few recent research works on rating based recommender systems advocate the use of preference relations instead of absolute ratings in order to produce better recommendations. The use of preference relations for neighborhood based collaborative recommendation has been examined in recent literature. On the other hand, Matrix Factorization algorithms have been shown to perform well for recommender systems, especially when the data is sparse. In this work, we propose a matrix factorization based collaborative recommendation algorithm that considers preference relations. Experimental results show that the proposed method is able to achieve better recommendation accuracy than the compared baseline methods.
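
To make the distinction between absolute ratings and preference relations concrete, here is a small Python sketch that derives pairwise preferences from a ratings table. It only illustrates the input representation, not the paper's factorization algorithm; the `preference_relations` function, the user and item names, and the ratings are all hypothetical.

```python
from itertools import combinations


def preference_relations(ratings):
    """Turn absolute ratings into pairwise preferences.

    ratings: user -> {item: rating}
    Returns user -> set of (preferred item, less preferred item) pairs.
    Ties produce no pair.
    """
    prefs = {}
    for user, items in ratings.items():
        pairs = set()
        for a, b in combinations(sorted(items), 2):
            if items[a] > items[b]:
                pairs.add((a, b))
            elif items[b] > items[a]:
                pairs.add((b, a))
        prefs[user] = pairs
    return prefs


# Two users who rate on different parts of the scale (invented data)
ratings = {
    "generous": {"A": 5, "B": 4, "C": 4},
    "strict": {"A": 3, "B": 1, "C": 1},
}
prefs = preference_relations(ratings)
print(prefs["generous"] == prefs["strict"])
# prints True
```

The two users disagree on every absolute rating yet produce identical preference relations, which is one intuition for why relations can be a more robust signal than raw scores.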

Adaptive Hypermedia and Adaptive Web-Based Systems | 2002

Web Site Personalization Using User Profile Information

Mohit Goel; Sudeshna Sarkar

In this paper we discuss a technique for web site personalization. Connectivity analysis has been shown to be useful in identifying high quality web pages within a topic or domain specific graph of hyperlinked documents. We have implemented a system that creates a view of a subset of a web site most relevant to a given user. This sort of personalization is useful for filtering a relevant subset of a site so that the user gets a low volume of quality information. The essence of our approach is to augment a previous connectivity analysis with content analysis. We present an agent which assists the user when he browses and distills a personalized subgraph of the website based on his user profile.

Pattern Recognition Letters | 2010

A composite kernel for named entity recognition

Sujan Kumar Saha; Shashi Narayan; Sudeshna Sarkar; Pabitra Mitra

In this paper, we propose a novel kernel function for support vector machines (SVM) that can be used for sequential labeling tasks like named entity recognition (NER). Machine learning methods like support vector machines, maximum entropy, hidden Markov models and conditional random fields are the most widely used methods for implementing NER systems. The features used in machine learning algorithms for NER are mostly string-based features. The proposed kernel is based on calculating a novel distance function between the string-based features. In tasks like NER, the similarity between the contexts as well as the semantic similarity between the words play an important role. The goal is to capture the context and semantic information in NER-like tasks. The proposed distance function makes use of certain statistics primarily derived from the training data and hierarchical clustering information. The kernel function is applied to the Hindi and biomedical NER tasks and the results are quite promising.

Pattern Recognition and Machine Intelligence | 2009

Learning Age and Gender of Blogger from Stylistic Variation

Mayur Rustagi; Rajendra Prasath; Sumit Goswami; Sudeshna Sarkar

We report results on stylistic differences in blogging across gender and age groups. The results are based on two mutually independent features. The first feature is the use of slang words, which is a new concept proposed by us for the stylistic study of bloggers. For the second feature, we have analyzed the variation in average sentence length across various age groups and genders. These features are augmented with results of previous studies reported in the literature on stylistic analysis. The combined feature list enhances the accuracy of age and gender prediction to a remarkable extent. These machine learning experiments were done on two separate demographically tagged blog corpora. Gender determination is more accurate than age group detection over data spread across all ages, but the accuracy of age prediction increases if we sample data with a remarkable age difference.
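
The two features named in the abstract, slang usage and average sentence length, are simple to compute. The Python sketch below is one possible reading, not the authors' feature extractor: the `stylistic_features` function, the tiny `SLANG` lexicon, and the regex-based tokenization are all assumptions made for illustration.

```python
import re

# Hypothetical slang lexicon; a real study would use a much larger list
SLANG = {"lol", "omg", "gonna", "wanna"}


def stylistic_features(text, slang=SLANG):
    """Return average sentence length (in words) and the share of slang words."""
    # Naive sentence split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: lowercase letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    avg_len = len(words) / len(sentences) if sentences else 0.0
    slang_ratio = sum(w in slang for w in words) / len(words) if words else 0.0
    return {"avg_sentence_length": avg_len, "slang_ratio": slang_ratio}


print(stylistic_features("omg that was great. I am gonna go again!"))
```

For this sample the text has nine words in two sentences, two of which are in the assumed lexicon, so the features come out as an average length of 4.5 words and a slang ratio of 2/9. Such values could then be fed, alongside other features, to any standard classifier.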
