Sujan Kumar Saha

Birla Institute of Technology, Mesra

Publication

Featured research published by Sujan Kumar Saha.

Journal of Biomedical Informatics | 2009

Feature selection techniques for maximum entropy based biomedical named entity recognition

Sujan Kumar Saha; Sudeshna Sarkar; Pabitra Mitra

Named entity recognition is an important and fundamental task in biomedical text mining. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which often have complex structures that make them challenging to identify and classify. Machine learning methods like CRF, MEMM and SVM have been widely used for learning to recognize such entities from an annotated corpus. The identification of appropriate feature templates and the selection of the important feature values play a very important role in the success of these methods. In this paper, we provide a study on word clustering and selection based feature reduction approaches for named entity recognition using a maximum entropy classifier. The identification and selection of features are largely done automatically without using domain knowledge. The performance of the system is found to be superior to existing systems which do not use domain knowledge.

POLIBITS | 2008

Named Entity Recognition in Hindi using Maximum Entropy and Transliteration

Sujan Kumar Saha; Partha Sarathi Ghosh; Sudeshna Sarkar; Pabitra Mitra

Named entities are perhaps the most important indexing elements in text for most information extraction and mining tasks. Constructing a Named Entity Recognition (NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used in the development of NER systems. In many resource-poor languages, gazetteer lists of adequate size are not available, but relevant lists are sometimes available in English. Proper transliteration makes the English lists useful in NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi. We explore different features applicable to the Hindi NER task and incorporate some gazetteer lists to improve the performance of the system. These lists are collected from the web and are in English. To make these English lists useful in the Hindi NER task, we propose a two-phase transliteration methodology. A considerable performance improvement is observed after using the transliteration based gazetteer lists in the system. The proposed transliteration based gazetteer preparation methodology is also applicable to other languages. Apart from Hindi, we have applied the transliteration approach to the Bengali NER task and also achieved a performance improvement.

Pattern Recognition Letters | 2010

A composite kernel for named entity recognition

Sujan Kumar Saha; Shashi Narayan; Sudeshna Sarkar; Pabitra Mitra

In this paper, we propose a novel kernel function for support vector machines (SVM) that can be used for sequential labeling tasks like named entity recognition (NER). Machine learning methods like support vector machines, maximum entropy, hidden Markov models and conditional random fields are the most widely used methods for implementing NER systems. The features used in machine learning algorithms for NER are mostly string based features. The proposed kernel is based on calculating a novel distance function between the string based features. In tasks like NER, the similarity between the contexts as well as the semantic similarity between the words plays an important role. The goal is to capture the context and semantic information in NER-like tasks. The proposed distance function makes use of certain statistics primarily derived from the training data and hierarchical clustering information. The kernel function is applied to the Hindi and biomedical NER tasks and the results are quite promising.

Knowledge Based Systems | 2012

A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition

Sujan Kumar Saha; Pabitra Mitra; Sudeshna Sarkar

Features used for named entity recognition (NER) are often high dimensional in nature. This causes overfitting when training data is insufficient. Dimensionality reduction leads to performance enhancement in such situations. There are a number of approaches to dimensionality reduction based on feature selection and feature extraction. In this paper we perform a comprehensive and comparative study of different dimensionality reduction approaches applied to the NER task. To compare the performance of the various approaches we consider two Indian languages, namely Hindi and Bengali. NER accuracies achieved in these languages are comparatively poor as yet, primarily due to the scarcity of annotated corpora. For both languages, dimensionality reduction is found to improve the performance of the classifiers. A comparative study of the effectiveness of several dimensionality reduction techniques is presented in detail in this paper.

The Scientific World Journal | 2013

A Kernel-Based Approach for Biomedical Named Entity Recognition

Rakesh Patra; Sujan Kumar Saha

Support vector machine (SVM) is one of the popular machine learning techniques used in various text processing tasks including named entity recognition (NER). The performance of the SVM classifier largely depends on the appropriateness of the kernel function. In the last few years a number of task-specific kernel functions have been proposed and used in various text processing tasks, for example, the string kernel, graph kernel, tree kernel and so on. So far very little effort has been devoted to the development of NER-specific kernels. In the literature we found that the tree kernel has been used in the NER task only for entity boundary detection or reannotation. The conventional tree kernel is unable to execute the complete NER task on its own. In this paper we propose a kernel function, motivated by the tree kernel, which is able to perform the complete NER task. To examine the effectiveness of the proposed kernel, we have applied the kernel function to the openly available JNLPBA 2004 data. Our kernel executes the complete NER task and achieves reasonable accuracy.

Pattern Recognition and Machine Intelligence | 2013

Automatic Generation of Multiple Choice Questions Using Wikipedia

Arjun Singh Bhatia; Manas Kirti; Sujan Kumar Saha

In this paper we present a system for automatic generation of multiple choice test items using Wikipedia. We propose a methodology for selecting potential sentences with the help of existing test items on the web. The sentences are selected using a set of patterns extracted from the existing questions. We also propose a novel technique for generating named entity distractors. To generate quality named entity distractors, we extract certain additional attribute values of the key from the web and search Wikipedia for entities having similar attribute values. We run our experiments in the sports domain. The generated questions and distractors are evaluated by a set of human evaluators using a set of parameters. The evaluation results demonstrate that the system is reasonably accurate.

Meeting of the Association for Computational Linguistics | 2015

A System for Generating Multiple Choice Questions: With a Novel Approach for Sentence Selection

Mukta Majumder; Sujan Kumar Saha

Multiple Choice Questions (MCQs) play a major role in educational assessment as well as in active learning. In this paper we present a system that generates MCQs automatically using a sports domain text as input. Not all sentences in a text are suitable for generating MCQs, so the first step of the system is to select the informative sentences. We propose a novel technique to select informative sentences by using topic modeling and parse structure similarity. The parse structure similarity is computed between the parse structure of an input sentence and a set of reference parse structures. To compile the reference set we use a number of existing MCQs collected from the web. Keyword selection is based on the occurrence of domain-specific words and named entities in the sentence. Distractors are generated using a set of rules and a name dictionary. Experimental results demonstrate that the proposed technique is quite accurate.

Archive | 2018

Automatic Generation of Named Entity Distractors of Multiple Choice Questions Using Web Information

Rakesh Patra; Sujan Kumar Saha

This paper presents a novel technique for automatic generation of distractors for multiple choice questions. Distractors are the wrong choices given along with the correct answer (key) to befuddle the examinee. Various techniques have been proposed in the literature for automatic distractor generation. However, none of these approaches is suitable when the key is a named entity, and named entity keys and distractors dominate in many domains, including sports and entertainment. Here, we propose a technique for generating named entity distractors. To generate good named entity distractors, we first detect the class of the key and collect a set of attribute values, classified into generic and specific categories. Based on these attributes, we retrieve a set of candidate distractors from a few trusted Web sites like Wikipedia. Then, we compute the similarity between the key and each candidate distractor. The closest ones are chosen as the final set of distractors. A set of human evaluators assesses the distractors using a set of parameters. In our evaluation, we observe that the system-generated distractors are relevant and close to the key.
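
The final selection step described in the abstract above (score each candidate against the key, keep the closest ones) can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure the authors use, so Jaccard similarity over attribute sets is an assumption made here for illustration only. The names `jaccard`, `rank_distractors`, `key_attrs` and `candidates`, and all attribute values, are hypothetical placeholders, and the sketch ignores the generic/specific attribute distinction.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_distractors(key_attrs, candidates, k=3):
    """Return the k candidate names whose attributes are closest to the key."""
    scored = [(jaccard(key_attrs, attrs), name) for name, attrs in candidates.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical attribute values for the key and for the retrieved candidates.
key_attrs = {"class:cricketer", "country:X", "role:batsman", "era:2000s"}
candidates = {
    "candidate_a": {"class:cricketer", "country:X", "role:batsman", "era:1990s"},
    "candidate_b": {"class:cricketer", "country:Y", "role:bowler", "era:2000s"},
    "candidate_c": {"class:footballer", "country:Z", "role:forward", "era:2000s"},
    "candidate_d": {"class:cricketer", "country:X", "role:bowler", "era:1990s"},
}

print(rank_distractors(key_attrs, candidates, k=2))
```

A weighted score that treats generic and specific attributes differently would be closer to the two-category scheme the abstract mentions, but the weights would be a further assumption.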

Archive | 2017

Advances in Computational Intelligence

Sudip Kumar Sahana; Sujan Kumar Saha

Includes a special section containing interdisciplinary applications of computational intelligence. Includes keynote lectures delivered at the conference. Covers work in emerging areas in the field of computational intelligence.

Journal of Innovative Optical Health Sciences | 2013

A NOVEL TECHNIQUE FOR MULTIPLE FAULTS AND THEIR LOCATIONS DETECTION AND START ELECTRODE SELECTION IN MICROFLUIDIC DIGITAL BIOCHIP

Mukta Majumder; Nilanjana Das; Sujan Kumar Saha

A device used for biomedical operations or safety-critical applications such as point-of-care health assessment, massively parallel DNA analysis, automated drug discovery, air-quality monitoring and food-safety testing must be reliable, dependable and correct. Because biochips are used for these purposes, they must be fault free at all times, and so they must be tested thoroughly before use. We propose a novel technique that can detect multiple faults, locate the fault positions within the biochip, and calculate the traversal time if the biochip is fault free. The proposed technique also introduces a new way to select the appropriate base node or pseudo source (start electrode). The main idea is to form multiple loops with the neighboring electrode arrays and then test each loop by moving a test droplet along it to check whether there is any fault. If a fault is detected, the technique locates it by backtracking the test droplet. If no fault is detected, the biochip is fault free and the technique calculates the time to traverse the chip. The results suggest that the proposed technique is efficient and significantly improves the calculation of fault-free biochip traversal time over the existing method.
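
The loop test described in the abstract above can be sketched as a small Python simulation. This is not the authors' algorithm: it assumes that a droplet can only be observed at the end of the path it is sent along, that a faulty electrode simply blocks the droplet, and that each move takes a fixed `step_time`. The names `droplet_arrives`, `check_loop`, `loop` and `step_time` are hypothetical.

```python
def droplet_arrives(path, faulty):
    """Simulated observation: True if a test droplet can cross every electrode in path."""
    return all(cell not in faulty for cell in path)


def check_loop(loop, faulty, step_time=1.0):
    """Test one loop of electrodes.

    Returns ("ok", traversal_time) if the droplet completes the loop,
    otherwise ("fault", cell) for the first faulty electrode along the loop.
    """
    if droplet_arrives(loop, faulty):
        # One move between each pair of consecutive electrodes.
        return ("ok", step_time * (len(loop) - 1))
    # Backtrack: shorten the path from the far end until the droplet arrives;
    # the electrode just beyond the longest working prefix is the fault.
    for end in range(len(loop) - 1, 0, -1):
        if droplet_arrives(loop[:end], faulty):
            return ("fault", loop[end])
    return ("fault", loop[0])  # the start electrode itself is faulty


# A 2x2 loop of electrodes that starts and ends at (0, 0).
loop = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]

print(check_loop(loop, faulty=set()))
print(check_loop(loop, faulty={(1, 1)}))
```

In this sketch, only the first fault along a loop is reported. Detecting multiple faults, as the abstract describes, would require re-testing the remaining electrodes or covering the array with several loops.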
