Kamal Sarkar
Jadavpur University
Publication
Featured research published by Kamal Sarkar.
Journal of Information Processing Systems | 2012
Kamal Sarkar; Mita Nasipuri; Suranjan Ghose
The paper presents three machine learning based keyphrase extraction methods that use Decision Trees, Naive Bayes, and Artificial Neural Networks, respectively. We consider keyphrases to be phrases of one or more words that represent the important concepts in a text document. The three methods have been compared with a publicly available keyphrase extraction system called KEA. The experimental results show that the Neural Network based method outperforms the two methods that use the Decision Tree and Naive Bayes. The results also show that the Neural Network based method performs better than KEA.
advances in computing and communications | 2012
Kamal Sarkar
This paper describes a system that produces extractive summaries of Bengali news documents. The ultimate objective of the produced summaries is to help readers determine whether they would be interested in reading a particular document. To this end, the summary aims to give the reader an idea of the theme of a document without revealing the in-depth detail. The approach presented here has four major steps: (1) preprocessing, (2) extraction of candidate summary sentences, (3) ranking of the candidate summary sentences, and (4) summary generation. The proposed approach defines the TF*IDF, position, and sentence length features in a more effective way, which helps to improve summarization performance. The experimental results show that the proposed text summarization approach outperforms the lead baseline and a more sophisticated baseline that uses both TF*IDF and position features.
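The four steps above can be illustrated with a minimal sketch. It is not the system described in the abstract: the abstract does not give the exact feature definitions, so the TF*IDF averaging, the 1/(position+1) weight, the `min_len` length filter, and every function name here are assumptions made for illustration. The tokenizer is a simple regular expression suited to the English example; Bengali text would need a tokenizer that handles combining vowel signs.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Step 1 (preprocessing): naive split on ., !, ? and the danda (U+0964).
    return [s.strip() for s in re.split(r"[.!?\u0964]+", text) if s.strip()]


def tokenize(text):
    # Lowercased word tokens; adequate for the English example only.
    return re.findall(r"\w+", text.lower())


def summarize(document, background_docs, k=2, min_len=3):
    sentences = split_sentences(document)
    n_docs = len(background_docs) + 1

    # Document frequency over the background collection plus this document.
    df = Counter()
    for doc in background_docs + [document]:
        df.update(set(tokenize(doc)))
    tf = Counter(tokenize(document))

    scored = []
    for pos, sent in enumerate(sentences):
        words = tokenize(sent)
        # Step 2 (candidate extraction): drop very short sentences.
        if len(words) < min_len:
            continue
        # Step 3 (ranking): average TF*IDF plus a position weight.
        tfidf = sum(tf[w] * math.log(n_docs / df[w]) for w in words) / len(words)
        position = 1.0 / (pos + 1)
        scored.append((tfidf + position, pos, sent))

    # Step 4 (summary generation): keep the top k, restore document order.
    top = sorted(scored, reverse=True)[:k]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]
```

Restoring document order in the last step keeps the extract readable; the lead baseline mentioned in the abstract would correspond to simply returning the first k sentences.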
Journal of Information Processing Systems | 2013
Kamal Sarkar
Many previous research studies on extractive text summarization consider a subset of words in a document as keywords and use a sentence ranking function that ranks sentences based on their similarity to the list of extracted keywords. The use of key concepts in automatic text summarization, however, has received less attention in the summarization literature. The proposed work uses key concepts identified from a document to create a summary of the document. We view the single-word or multi-word keyphrases of a document as the important concepts that the document elaborates on. Our work is based on the hypothesis that an extract is an elaboration of the important concepts to some permissible extent, controlled by the given summary length restriction. In other words, our method of text summarization chooses a subset of sentences from a document that maximizes the important concepts in the final summary. To allow diverse information in the summary, for each important concept we select one sentence that is the best possible elaboration of the concept. Accordingly, the most important concept contributes to the summary first, then the second most important concept, and so on. To demonstrate the effectiveness of the proposed summarization method, we have compared it to some state-of-the-art summarization systems, and the results show that the proposed method outperforms the existing systems to which it is compared.
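The selection procedure described above (one sentence per concept, most important concept first, within a length limit) can be sketched as follows. The abstract does not define what makes a sentence the "best possible elaboration" of a concept, so this sketch assumes it is the sentence that mentions the concept and covers the largest number of concepts overall, with ties going to the earlier sentence. The function name and parameters are hypothetical, and `ranked_concepts` is assumed to be a list of lowercase keyphrases already sorted by importance.

```python
def select_by_concepts(sentences, ranked_concepts, max_words):
    chosen = []      # indices of selected sentences
    used_words = 0

    for concept in ranked_concepts:
        # Sentences that mention this concept and are not already selected.
        candidates = [
            i for i, s in enumerate(sentences)
            if i not in chosen and concept in s.lower()
        ]
        if not candidates:
            continue

        # Assumed notion of "best elaboration": most concepts covered,
        # earlier sentence preferred on ties.
        best = max(
            candidates,
            key=lambda i: (
                sum(c in sentences[i].lower() for c in ranked_concepts),
                -i,
            ),
        )

        # Respect the summary length restriction (counted in words).
        length = len(sentences[best].split())
        if used_words + length > max_words:
            continue
        chosen.append(best)
        used_words += length

    # Present the extract in original document order.
    return [sentences[i] for i in sorted(chosen)]
```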
International Journal of Computer Applications | 2013
Kamal Sarkar
Keyphrases are the phrases, consisting of one or more words, that represent the important concepts in the articles. Keyphrases are useful for a variety of tasks such as text summarization, automatic indexing, clustering/classification, and text mining. This paper presents a hybrid approach to keyphrase extraction from medical documents. The approach is an amalgamation of two methods: the first assigns weights to candidate keyphrases based on an effective combination of features such as position, term frequency, and inverse document frequency, and the second assigns weights to candidate keyphrases using knowledge about their similarity to the structure and characteristics of keyphrases available in memory (a stored list of keyphrases). An efficient candidate keyphrase identification method, which forms the first component of the proposed keyphrase extraction system, is also introduced in this paper. The experimental results show that the proposed hybrid approach performs better than some state-of-the-art keyphrase extraction approaches.
international conference on emerging applications of information technology | 2012
Kamal Sarkar; Vivekananda Gayen
This paper presents a practical part-of-speech (POS) tagger for Bengali, which accepts raw Bengali text (typed in a Bengali font) and produces POS-tagged output that can be used directly by other NLP applications. We have implemented a supervised Bengali trigram POS tagger from scratch using a statistical machine learning technique based on the second order Hidden Markov Model (HMM). We use a bigram POS tagger as the baseline against which our trigram POS tagger is compared.
pattern recognition and machine intelligence | 2009
Kamal Sarkar
Keyphrases provide semantic metadata that summarizes a document and enables the reader to quickly determine whether the given article is in the reader's fields of interest. This paper presents an automatic keyphrase extraction method based on naive Bayesian learning that exploits a number of domain-specific features to boost keyphrase extraction performance in the medical domain. The proposed method has been compared to a popular keyphrase extraction algorithm called Kea.
international conference on mining intelligence and knowledge exploration | 2015
Kamal Sarkar; Saikat Chakraborty
This paper reports on our work in the MIKE 2015 Shared Task on Sentiment Analysis in Indian Languages (SAIL) Tweets. We submitted runs for Hindi and Bengali. A multinomial Naive Bayes based model has been used to implement our system. The system has been trained and tested on the dataset released for SAIL TWEET CONTEST 2015. Our system obtains accuracies of 50.75%, 48.82%, 41.20%, and 40.20% for the Hindi constrained, Hindi unconstrained, Bengali constrained, and Bengali unconstrained runs, respectively.
Archive | 2013
Kamal Sarkar; Vivekananda Gayen
We present in this paper a trigram HMM-based (Hidden Markov Model) part-of-speech (POS) tagger for Indian languages, which accepts raw text in an Indian language (typed in the corresponding language font) and produces POS-tagged output. We implement the trigram POS tagger from scratch based on the second order Hidden Markov Model (HMM). For handling unknown words, we introduce a prefix analysis method and a word-type analysis method, which are combined with the well-known suffix analysis method for predicting the probable tags. Although our system has been tested on data for four Indian languages, namely Bengali, Hindi, Marathi, and Telugu, it can easily be ported to a new language simply by replacing the training file with POS-tagged data for the new language. Our trigram POS tagger has been compared to a bigram POS tagger defined as a baseline.
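The suffix analysis mentioned above for unknown words can be illustrated with a small sketch. It shows only a generic form of the idea (predict the tags seen most often with the longest matching word ending in the training data); the prefix and word-type analyses, and how the paper combines them with the HMM, are not specified in the abstract and are not reproduced here. The function names, the `max_suffix_len` limit, and the English toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_suffix_table(tagged_words, max_suffix_len=3):
    # Map each word ending (1..max_suffix_len characters) to tag counts.
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_suffix_len, len(word)) + 1):
            table[word[-n:]][tag] += 1
    return table


def probable_tags(word, table, max_suffix_len=3, top_n=2):
    # Back off from the longest suffix to the shortest one seen in training.
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in table:
            return [tag for tag, _ in table[suffix].most_common(top_n)]
    return []  # no suffix evidence for this word
```

In a tagger, the returned tags would restrict the set of states considered for an unknown word during decoding, rather than serve as a final answer.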
international conference on information systems | 2011
Kamal Sarkar
Keyphrases provide subject metadata that gives clues about the content of a document. In this paper, we present a new method for Bengali keyphrase extraction. The proposed method has several steps, such as extraction of n-grams, identification of candidate keyphrases, and assignment of scores to the candidate keyphrases. Since Bengali is a highly inflectional language, we have developed a lightweight stemmer for stemming the candidate keyphrases. The proposed method has been tested on a collection of Bengali documents selected from a Bengali corpus downloadable from the TDIL website.
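The three steps named above (n-gram extraction, candidate identification, scoring) can be sketched in a few lines. The abstract does not state the candidate filter or the scoring formula, so this sketch assumes a stopword-boundary filter and a score of frequency times phrase length; both are illustrative choices, as are the function names. The lightweight Bengali stemmer is not reproduced, and the example uses English tokens.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=3):
    # All contiguous n-grams for n = 1..max_n, as tuples of tokens.
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def candidate_keyphrases(tokens, stopwords, max_n=3):
    # Assumed filter: a candidate may not start or end with a stopword.
    counts = Counter()
    for gram in extract_ngrams(tokens, max_n):
        if gram[0] in stopwords or gram[-1] in stopwords:
            continue
        counts[" ".join(gram)] += 1
    return counts


def rank_candidates(counts):
    # Assumed score: frequency times number of words; ties broken alphabetically.
    return sorted(counts, key=lambda p: (-counts[p] * len(p.split()), p))
```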
international conference on computational linguistics | 2005
Kamal Sarkar; Sivaji Bandyopadhyay
This paper discusses an approach to generating a headline summary from a set of documents. A headline summary is a very short summary in the form of a headline. As the amount of online information increases, systems that can automatically summarize multiple documents are becoming increasingly desirable. In this situation, a headline summary is useful for users who only need information on the main topics in a set of documents. A headline summary of multiple documents is also useful in text mining applications for generating a meaningful label (a compact identifier that allows a person to quickly see what the topic is about) for a cluster of documents.