Navanath Saharia
Tezpur University
Publication
Featured research published by Navanath Saharia.
Meeting of the Association for Computational Linguistics | 2009
Navanath Saharia; Dhrubajyoti Das; Utpal Sharma; Jugal K. Kalita
Assamese is a morphologically rich, agglutinative and relatively free word order Indic language. Although spoken by nearly 30 million people, very little computational linguistic work has been done for this language. In this paper, we present our work on part of speech (POS) tagging for Assamese using the well-known Hidden Markov Model. Since no well-defined suitable tagset was available, we develop a tagset of 172 tags in consultation with experts in linguistics. For successful tagging, we examine relevant linguistic issues in Assamese. For unknown words, we perform simple morphological analysis to determine probable tags. Using a manually tagged corpus of about 10000 words for training, we obtain a tagging accuracy of nearly 87% for test inputs.
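The HMM tagging idea in the abstract above can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the function names (`train_hmm`, `log_prob`, `viterbi`), the three-tag English toy corpus, and the add-one smoothing are all assumptions made for illustration. In particular, the paper handles unknown words through morphological analysis, whereas this sketch simply reserves smoothed probability mass for them.

```python
from collections import Counter, defaultdict
import math


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from sentences of (word, tag) pairs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumption, not the paper's method)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence for a non-empty list of words."""
    tags = sorted(emit)
    vocab = {w for t in emit for w in emit[t]}
    n_words = len(vocab) + 1  # the extra outcome stands for any unknown word
    # Each column maps tag -> (best log score, backpointer to previous tag).
    columns = [{
        t: (log_prob(trans["<s>"], t, len(tags))
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (columns[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        columns.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical toy corpus (English, three tags) used only for illustration.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit = train_hmm(TOY_CORPUS)
print(viterbi(["the", "cat", "barks"], trans, emit))
```

A real tagger for a 172-tag tagset would need a far larger training corpus and a better unknown-word model than add-one smoothing.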
Advances in Computing and Communications | 2012
Navanath Saharia; Utpal Sharma; Jugal K. Kalita
Stemming is the process of automatically extracting the base form of a given word of a language. Assamese is a morphologically rich, relatively free word order, Indo-Aryan language spoken in the North-Eastern part of India that is written in the Assamese-Bengali script. As it is among the less computationally studied languages, our aim is to extract the stem from a given word. We adopt the suffix stripping approach along with a rule engine that generates all the possible suffix sequences. We obtain 82% accuracy with the suffix stripping approach after adding a root-word list of approximately 20,000 entries.
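A minimal sketch of suffix stripping with stacked suffixes and a root-word list, under stated assumptions: the suffix inventory and root list below are English placeholders chosen for readability, not the Assamese resources used in the paper, and the names `candidate_stems` and `stem` are hypothetical.

```python
def candidate_stems(word, suffixes, min_len=2):
    """Yield every stem reachable by stripping zero or more suffixes from the end."""
    yield word
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            yield from candidate_stems(word[: -len(suffix)], suffixes, min_len)


def stem(word, suffixes, roots, min_len=2):
    """Prefer the longest candidate found in the root list; else strip maximally."""
    candidates = list(dict.fromkeys(candidate_stems(word, suffixes, min_len)))
    known = [c for c in candidates if c in roots]
    if known:
        return max(known, key=len)   # a known root guards against over-stemming
    return min(candidates, key=len)  # no known root: remove as much as rules allow


# Hypothetical English placeholders, for illustration only.
SUFFIXES = ("ness", "less", "ing", "er", "s")
ROOTS = {"care", "help", "sing"}

for w in ("carelessness", "singers", "walking"):
    print(w, "->", stem(w, SUFFIXES, ROOTS))
```

The root list matters most for short words: without it, a word that merely ends in a suffix-like string would be stripped too far.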
International Conference on Computational Linguistics | 2013
Navanath Saharia; Kishori M. Konwar; Utpal Sharma; Jugal K. Kalita
Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High-quality stemming is difficult in highly inflectional Indic languages, and little research has been done on designing stemming algorithms for texts in these languages. In this study, we focus on the problem of stemming texts in Assamese, a low-resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. Therefore, we propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes and an HMM-based algorithm for predicting the single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
2015 International Conference on Speech Technology and Human-Computer Dialogue (SpeD) | 2015
Horia-Nicolai Teodorescu; Navanath Saharia
We present an approach to creating an annotated dictionary of Internet slang to help identify the level of specific attitudes and moods (specifically aggressiveness, distress, hatefulness and offensiveness) in social network and social media posts. The annotation refers to the attitudes, with the actual meaning having low importance. The annotated dictionary is intended for automatic use in the detection of harmful and help-requiring messages. The main contributions are the concept of a dictionary annotated for attitude and the method of building and annotating the dictionary, neither of which is typical for slang dictionaries.
ACM Transactions on Asian Language Information Processing | 2014
Navanath Saharia; Utpal Sharma; Jugal K. Kalita
Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages, the majority of suffixes are single letters, which creates problems during suffix stripping. As a result, we introduce an HMM-based hybrid approach to classify the mismatched last character. For each word, the stem is extracted by calculating the most probable path through four HMM states. At each step we measure the stemming accuracy for each language. Using the hybrid approach, we obtain 94% accuracy for Assamese and Bengali, and 87% and 82% for Bishnupriya Manipuri and Bodo, respectively. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
International Conference on Asian Language Processing | 2010
Navanath Saharia; Utpal Sharma; Jugal K. Kalita
Nouns and verbs pose the major challenge in part-of-speech tagging. In this paper we present a suffix-based noun and verb classifier for Assamese, an inflectional, relatively free word order Indic language. We use a small dictionary of frequent words to increase the accuracy. We obtain an F-score of around 85%.
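A suffix-based noun/verb classifier backed by a frequent-word dictionary could look roughly like the sketch below. The function name `classify`, the `OTHER` fallback label, the longest-suffix tie-breaking, and the English placeholder suffixes are all assumptions for illustration; the paper's actual suffix lists and decision rules are not given in the abstract.

```python
def classify(word, frequent, noun_suffixes, verb_suffixes):
    """Label a word NOUN, VERB or OTHER using a dictionary first, then suffixes."""
    # Frequent words are looked up directly, overriding any suffix evidence.
    if word in frequent:
        return frequent[word]
    rules = [(s, "NOUN") for s in noun_suffixes] + [(s, "VERB") for s in verb_suffixes]
    # Keep only suffixes that leave a non-empty stem, and prefer the longest match.
    matches = [
        (len(suffix), tag)
        for suffix, tag in rules
        if word.endswith(suffix) and len(word) > len(suffix)
    ]
    return max(matches)[1] if matches else "OTHER"


# Hypothetical English placeholders, for illustration only.
NOUN_SUFFIXES = ("ness", "tion", "ment")
VERB_SUFFIXES = ("ize", "ify", "ed")
FREQUENT = {"bed": "NOUN"}

for w in ("kindness", "simplify", "bed", "quickly"):
    print(w, "->", classify(w, FREQUENT, NOUN_SUFFIXES, VERB_SUFFIXES))
```

The dictionary lookup comes first because very frequent words are often exactly the ones whose endings mislead a suffix rule, as with "bed" here.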
SIRS | 2014
Himangshu Sarma; Navanath Saharia; Utpal Sharma
The exact pronunciation of a word cannot be recovered from the written form of a language. Phonetic transcription is a step towards speech processing for a language. For a language like Assamese it is especially important, because the language is spoken differently in different regions of the state. In this paper we report automatic transcription of Assamese speech using the Hidden Markov Model Toolkit (HTK). We obtain an accuracy of 65.26% in an experiment. We transcribed recorded speech files using IPA symbols and ASCII for automatic transcription. We used 34 phones for IPA transcription and 38 for ASCII transcription.
Archive | 2018
Nayan Jyoti Kalita; Navanath Saharia
Social media has become one of the most comfortable media for people to share their feelings spontaneously. With the increasing use of social media and the growth of informal language, language identification of code-mixed text has emerged as a new problem in linguistics. Most traditional language detectors fail to identify the language of such text. This paper describes a study on detecting languages at the word level in English-Assamese code-mixed text. For this work, we collected texts from Facebook groups and pages. We develop a system to evaluate the level of mixing between different languages in our corpus and to detect the languages. Our experiments were carried out using a linear-kernel support vector machine.
ACM Transactions on Asian and Low-Resource Language Information Processing | 2017
Himangshu Sarma; Navanath Saharia; Utpal Sharma
Language analysis is very important for native speakers to connect with the digital world. Assamese is a relatively unexplored language. In this report, we analyze different aspects of speech-to-text processing, starting from building a speech corpus, defining syllable rules, and finally developing a speech search engine for Assamese. We collected about 20 hours of speech in three modes (viz., read, extempore, and conversation) and transcribed it. We also discuss some issues and challenges faced during development of the corpus. We developed an automatic syllabification model with 11 rules for the Assamese language and obtained an accuracy of more than 95%. We found 12 different syllable patterns, of which 5 are the most frequent. The maximum syllable length found is four letters. With the help of the Hidden Markov Model Toolkit (HTK) 3.5, we used a deep-learning-based neural network for our speech recognition model, with which we obtained 78.05% accuracy for automatic transcription of Assamese speech.
International Conference on Computational Linguistics | 2014
Nayan Jyoti Kalita; Navanath Saharia; Smriti Kumar Sinha
In this work we present a morphological analysis of Bishnupriya Manipuri, an Indo-Aryan language spoken in north-eastern India. As of now, there is no computational work available for the language. Finite state morphology is one of the successful approaches that has been applied to a wide variety of languages over the years. Therefore, we adopt the finite state approach to analyse the morphology of Bishnupriya Manipuri.
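The finite state approach mentioned above can be illustrated with a very small analyzer built from continuation classes, where each lexicon entry names the state that may follow it. Everything in this sketch is an assumption: the `LEXICON` table, the tag names, the `analyze` function, and the English placeholder morphemes stand in for the language-specific lexicon, which the abstract does not describe.

```python
# Each state maps to (surface form, analysis string, next state) entries.
# "#" is the final state. English placeholders are used for illustration only.
LEXICON = {
    "Root": [("walk", "walk+V", "VerbInfl"), ("cat", "cat+N", "NounInfl")],
    "VerbInfl": [("", "", "#"), ("ed", "+Past", "#"), ("s", "+3Sg", "#")],
    "NounInfl": [("", "+Sg", "#"), ("s", "+Pl", "#")],
}


def analyze(surface, state="Root"):
    """Return every analysis that consumes the whole surface string."""
    if state == "#":
        return [""] if surface == "" else []
    analyses = []
    for form, tags, next_state in LEXICON[state]:
        if surface.startswith(form):
            rest = surface[len(form):]
            analyses += [tags + tail for tail in analyze(rest, next_state)]
    return analyses


for w in ("cats", "walked", "walk", "dogs"):
    print(w, "->", analyze(w))
```

An empty result means the word is not accepted by the lexicon, which is how such an analyzer signals unknown stems or disallowed morpheme orders.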