Sujan Kumar Saha
Birla Institute of Technology, Mesra
Publication
Featured research published by Sujan Kumar Saha.
Journal of Biomedical Informatics | 2009
Sujan Kumar Saha; Sudeshna Sarkar; Pabitra Mitra
Named entity recognition is an important and fundamental task of biomedical text mining. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which often have complex structures, making such entities challenging to identify and classify. Machine learning methods like CRF, MEMM and SVM have been widely used for learning to recognize such entities from an annotated corpus. The identification of appropriate feature templates and the selection of the important feature values play a very important role in the success of these methods. In this paper, we provide a study on word-clustering-based and selection-based feature reduction approaches for named entity recognition using a maximum entropy classifier. The identification and selection of features are largely done automatically without using domain knowledge. The performance of the system is found to be superior to that of existing systems which do not use domain knowledge.
POLIBITS | 2008
Sujan Kumar Saha; Partha Sarathi Ghosh; Sudeshna Sarkar; Pabitra Mitra
Named entities are perhaps the most important indexing element in text for most information extraction and mining tasks. Construction of a Named Entity Recognition (NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used for the development of NER systems. In many resource-poor languages, gazetteer lists of adequate size are not available, but sometimes relevant lists are available in English. Proper transliteration makes the English lists useful in the NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi. We explore different features applicable to the Hindi NER task. We incorporate some gazetteer lists in the system to increase its performance. These lists are collected from the web and are in English. To make these English lists useful in the Hindi NER task, we propose a two-phase transliteration methodology. A considerable performance improvement is observed after using the transliteration-based gazetteer lists in the system. The proposed transliteration-based gazetteer preparation methodology is also applicable to other languages. Apart from Hindi, we have applied the transliteration approach to the Bengali NER task and also achieved a performance improvement.
Pattern Recognition Letters | 2010
Sujan Kumar Saha; Shashi Narayan; Sudeshna Sarkar; Pabitra Mitra
In this paper, we propose a novel kernel function for support vector machines (SVM) that can be used for sequential labeling tasks like named entity recognition (NER). Machine learning methods like support vector machines, maximum entropy, hidden Markov models and conditional random fields are the most widely used methods for implementing NER systems. The features used in machine learning algorithms for NER are mostly string-based features. The proposed kernel is based on calculating a novel distance function between the string-based features. In tasks like NER, the similarity between the contexts as well as the semantic similarity between the words plays an important role. The goal is to capture the context and semantic information in NER-like tasks. The proposed distance function makes use of certain statistics primarily derived from the training data and hierarchical clustering information. The kernel function is applied to the Hindi and biomedical NER tasks and the results are quite promising.
Knowledge Based Systems | 2012
Sujan Kumar Saha; Pabitra Mitra; Sudeshna Sarkar
Features used for named entity recognition (NER) are often high dimensional in nature. This causes overfitting when training data is not sufficient. Dimensionality reduction leads to performance enhancement in such situations. There are a number of approaches for dimensionality reduction based on feature selection and feature extraction. In this paper we perform a comprehensive and comparative study of different dimensionality reduction approaches applied to the NER task. To compare the performance of the various approaches we consider two Indian languages, namely Hindi and Bengali. NER accuracies achieved in these languages are comparatively poor as yet, primarily due to the scarcity of annotated corpora. For both languages, dimensionality reduction is found to improve the performance of the classifiers. A comparative study of the effectiveness of several dimensionality reduction techniques is presented in detail in this paper.
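As a rough illustration of the feature-selection family of dimensionality reduction mentioned in this abstract, the sketch below ranks binary string features by information gain and keeps the top k. It is not taken from the paper: information gain is only one common selection criterion, and the function names, feature strings and toy labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def information_gain(samples, labels, feature):
    """Reduction in label entropy obtained by splitting on a binary feature.

    Each sample is a set of active string features (hypothetical layout).
    """
    with_f = [y for x, y in zip(samples, labels) if feature in x]
    without_f = [y for x, y in zip(samples, labels) if feature not in x]
    n = len(labels)
    remainder = sum(
        len(part) / n * entropy(part) for part in (with_f, without_f) if part
    )
    return entropy(labels) - remainder


def select_features(samples, labels, k):
    """Return the k features with the highest information gain.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = sorted(set().union(*samples))
    ranked = sorted(
        vocab, key=lambda f: (-information_gain(samples, labels, f), f)
    )
    return ranked[:k]


# Toy data (invented for illustration): one feature set per token.
samples = [
    {"suffix=pur", "len>3"},
    {"suffix=pur", "prev=in"},
    {"prev=the"},
    {"prev=the", "len>3"},
]
labels = ["LOC", "LOC", "O", "O"]

print(select_features(samples, labels, 2))
```

Here the two features that separate the classes perfectly are kept, while "len>3", which is equally common in both classes, is discarded.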
The Scientific World Journal | 2013
Rakesh Patra; Sujan Kumar Saha
Support vector machine (SVM) is one of the popular machine learning techniques used in various text processing tasks including named entity recognition (NER). The performance of the SVM classifier largely depends on the appropriateness of the kernel function. In the last few years a number of task-specific kernel functions have been proposed and used in various text processing tasks, for example, the string kernel, graph kernel, tree kernel and so on. So far very little effort has been devoted to the development of NER-specific kernels. In the literature we found that the tree kernel has been used in the NER task only for entity boundary detection or reannotation. The conventional tree kernel is unable to execute the complete NER task on its own. In this paper we propose a kernel function, motivated by the tree kernel, which is able to perform the complete NER task. To examine the effectiveness of the proposed kernel, we have applied the kernel function to the openly available JNLPBA 2004 data. Our kernel executes the complete NER task and achieves reasonable accuracy.
Pattern Recognition and Machine Intelligence | 2013
Arjun Singh Bhatia; Manas Kirti; Sujan Kumar Saha
In this paper we present a system for automatic generation of multiple choice test items using Wikipedia. Here we propose a methodology for potential sentence selection with the help of existing test items on the web. The sentences are selected using a set of patterns extracted from the existing questions. We also propose a novel technique for generating named entity distractors. For generating quality named entity distractors we extract certain additional attribute values of the key from the web and search Wikipedia for entities having similar attribute values. We run our experiments in the sports domain. The generated questions and distractors are evaluated by a set of human evaluators using a set of parameters. The evaluation results demonstrate that the system is reasonably accurate.
Meeting of the Association for Computational Linguistics | 2015
Mukta Majumder; Sujan Kumar Saha
Multiple choice questions (MCQs) play a major role in educational assessment as well as in active learning. In this paper we present a system that generates MCQs automatically using a sports domain text as input. Not all sentences in a text are suitable for generating MCQs; the first step of the system is to select the informative sentences. We propose a novel technique to select informative sentences by using topic modeling and parse structure similarity. The parse structure similarity is computed between the parse structure of an input sentence and a set of reference parse structures. In order to compile the reference set we use a number of existing MCQs collected from the web. Keyword selection is based on the occurrence of domain-specific words and named entities in the sentence. Distractors are generated using a set of rules and a name dictionary. Experimental results demonstrate that the proposed technique is quite accurate.
Archive | 2018
Rakesh Patra; Sujan Kumar Saha
This paper presents a novel technique for automatic generation of distractors for multiple choice questions. Distractors are the wrong choices given along with the correct answer (key) to befuddle the examinee. Various techniques have been proposed in the literature for automatic distractor generation, but none of these approaches is suitable when the key is a named entity, and named entity keys and distractors dominate in many domains including sports and entertainment. Here, we propose a technique for generation of named entity distractors. For generating good named entity distractors, we first detect the class of the key and collect a set of attribute values, classified into generic and specific categories. Based on these attributes, we retrieve a set of candidate distractors from a few trusted Web sites like Wikipedia. Then, we compute the similarity between the key and each candidate distractor. The closest ones are chosen as the final set of distractors. A set of human evaluators assess the distractors by using a set of parameters. In our evaluation, we observe that the system-generated distractors are relevant and close to the key.
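The final ranking step described in this abstract (score each candidate by how closely its attributes match the key, then keep the closest ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' similarity measure: the weighted-overlap score, the `specific_weight` parameter, the dictionary layout and the placeholder entity names are all hypothetical, and the retrieval of candidates from Web sites is left out.

```python
def attribute_similarity(key_attrs, cand_attrs, specific_weight=2.0):
    """Weighted fraction of the key's attribute values shared by a candidate.

    Attributes are dicts with "generic" and "specific" sub-dicts
    (a hypothetical layout); specific attributes count more.
    """
    score = 0.0
    total = 0.0
    for category, weight in (("generic", 1.0), ("specific", specific_weight)):
        for name, value in key_attrs.get(category, {}).items():
            total += weight
            if cand_attrs.get(category, {}).get(name) == value:
                score += weight
    return score / total if total else 0.0


def pick_distractors(key, key_attrs, candidates, k=3):
    """Return the k candidates closest to the key, excluding the key itself."""
    ranked = sorted(
        (
            (attribute_similarity(key_attrs, attrs), name)
            for name, attrs in candidates.items()
            if name != key
        ),
        key=lambda pair: (-pair[0], pair[1]),  # best score first, then by name
    )
    return [name for _, name in ranked[:k]]


# Placeholder entities and attribute values, invented for illustration.
key_attrs = {
    "generic": {"sport": "cricket", "country": "India"},
    "specific": {"role": "batsman"},
}
candidates = {
    "Player B": {"generic": {"sport": "cricket", "country": "India"},
                 "specific": {"role": "batsman"}},
    "Player C": {"generic": {"sport": "cricket", "country": "Australia"},
                 "specific": {"role": "batsman"}},
    "Player D": {"generic": {"sport": "cricket", "country": "India"},
                 "specific": {"role": "bowler"}},
    "Player E": {"generic": {"sport": "football", "country": "Brazil"},
                 "specific": {"role": "striker"}},
}

print(pick_distractors("Player A", key_attrs, candidates, k=3))
```

Weighting specific attributes above generic ones is a design choice of this sketch; it makes a candidate that shares the key's role outrank one that only shares its country.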
Archive | 2017
Sudip Kumar Sahana; Sujan Kumar Saha
Includes a special section containing interdisciplinary applications of computational intelligence. Includes keynote lectures delivered at the conference. Covers work in emerging areas in the field of computational intelligence.
Journal of Innovative Optical Health Sciences | 2013
Mukta Majumder; Nilanjana Das; Sujan Kumar Saha
A device used for biomedical operations or safety-critical applications such as point-of-care health assessment, massively parallel DNA analysis, automated drug discovery, air-quality monitoring and food-safety testing must be reliable, dependable and correct. Because biochips are used for these purposes, they must be fault free at all times and must therefore be well tested before use. We propose a novel technique that can detect multiple faults, locate the fault positions within the biochip, and calculate the traversal time if the biochip is fault free. The proposed technique also introduces a new idea for selecting the appropriate base node or pseudo source (start electrode). The main idea of the proposed technique is to form multiple loops with the neighboring electrode arrays and then test each loop by traversing a test droplet to check whether there is any fault. If a fault is detected, the proposed technique locates it by backtracking the test droplet. If no fault is detected, the biochip is fault free and the proposed technique calculates the time to traverse the chip. The results suggest that the proposed technique is efficient and shows significant improvement in calculating the fault-free biochip traversal time over the existing method.
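The loop-based test described in this abstract can be pictured with a small simulation. The sketch below is a simplified model and not the authors' algorithm: it assumes a droplet stops in front of the first faulty electrode on its path, treats that electrode as the located fault instead of modelling the backtracking step, and uses a fixed `step_time` per move. The function names and the 2 x 4 example layout are hypothetical.

```python
def traverse_loop(loop, faulty, step_time=1.0):
    """Move a test droplet once around a loop of electrodes.

    `loop` is a list of (row, col) electrodes starting at the pseudo source;
    the path implicitly closes back on the source. Returns
    (fault_position, elapsed_time), with fault_position None if the
    loop is fault free.
    """
    elapsed = 0.0
    for cell in loop[1:] + loop[:1]:
        if cell in faulty:
            return cell, elapsed  # droplet cannot enter the faulty electrode
        elapsed += step_time
    return None, elapsed


def check_chip(loops, faulty, step_time=1.0):
    """Test every loop; return (located_faults, traversal_time).

    traversal_time is the summed loop time when the chip is fault free,
    otherwise None.
    """
    faults = []
    total = 0.0
    for loop in loops:
        fault, elapsed = traverse_loop(loop, faulty, step_time)
        if fault is not None:
            faults.append(fault)
        else:
            total += elapsed
    return faults, (total if not faults else None)


# A 2 x 4 electrode array split into two 4-electrode loops (invented layout).
loops = [
    [(0, 0), (0, 1), (1, 1), (1, 0)],
    [(0, 2), (0, 3), (1, 3), (1, 2)],
]

print(check_chip(loops, faulty={(1, 3)}))
print(check_chip(loops, faulty=set()))
```

Because each loop is tested independently, faults in different loops are reported together; a second fault inside the same loop would need a further pass, for example in the opposite direction.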