Niladri Sekhar Dash
Indian Statistical Institute
Publications
Featured research published by Niladri Sekhar Dash.
Literary and Linguistic Computing | 2004
Niladri Sekhar Dash
Empirical analysis of any natural language needs to be substantiated with statistical findings, because without adequate statistical knowledge a linguistic study can fall into the quicksand of mistaken data handling and false observation. The recent introduction of various sub-disciplines (computational linguistics, corpus linguistics, forensic linguistics, applied linguistics, lexicology, stylometrics, lexicography, language teaching, etc.) requires various statistical measures of language properties, both to understand the language and to design sophisticated tools and software for language technology. Keeping this in mind, we present here some simple frequency counts of characters found in the Bangla text corpus, and we empirically evaluate their functional behaviour in the language with close reference to the corpus. We verify previously made observations, as well as make some new observations required for various tasks of language technology in Bangla.
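The kind of character frequency count described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual procedure: the sample sentence is invented, and a real run would iterate over the whole Bangla corpus file by file.

```python
from collections import Counter

def char_frequencies(text):
    """Count characters in a text, ignoring whitespace, and return
    (character, count, relative frequency) tuples, most frequent first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [(ch, n, n / total) for ch, n in counts.most_common()]

# A short hypothetical Bangla sample; the actual study uses the full corpus.
sample = "আমার সোনার বাংলা আমি তোমায় ভালোবাসি"
for ch, n, rel in char_frequencies(sample)[:5]:
    print(ch, n, f"{rel:.3f}")
```

Relative frequencies of this kind are what allow observations about character behaviour to be compared across corpora of different sizes.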
Archive | 2018
Ankita Dhar; Niladri Sekhar Dash; Kaushik Roy
This paper explores the use of standard features and machine learning approaches for categorizing Bangla text documents drawn from an online Web corpus. The TF-IDF feature with a dimensionality reduction technique (40% of TF) is used here to improve the precision of lexical matching when identifying the domain category or class of a text document. This approach rests on the generic observation that text categorization, or text classification, is the task of automatically sorting a set of text documents into predefined categories. Although an ample range of methods has been applied to English texts for categorization, only limited studies have been carried out on Indian language texts, including Bangla. Hence, an attempt is made here to analyze the efficiency of the categorization method mentioned above for Bangla text documents. For verification and validation, Bangla text documents obtained from various online Web sources are normalized and used as inputs for the experiment. The experimental results show that the feature extraction method, along with the LIBLINEAR classification model, can deliver quite satisfactory performance even with high-dimensional feature sets and relatively noisy document feature vectors.
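The TF-IDF weighting underlying the feature extraction step can be sketched as follows. This is a generic illustration, not the paper's pipeline: the toy documents are invented, and the dimensionality reduction and LIBLINEAR training stages are omitted.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Hypothetical toy documents (already tokenized); real inputs would be
# the normalized Bangla Web documents described above.
docs = [["cricket", "match", "score"],
        ["market", "price", "score"],
        ["cricket", "cup", "final"]]
weights = tf_idf(docs)
# "match" occurs in only one document, so it is weighted more heavily
# than "cricket", which occurs in two.
```

Terms concentrated in few documents receive high weights, which is what makes TF-IDF useful for separating domain categories.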
IEEE International Conference on Recent Trends in Information Systems | 2015
Alok Ranjan Pal; Diganta Saha; Sudip Kumar Naskar; Niladri Sekhar Dash
In the proposed approach, an attempt was made to disambiguate ambiguous Bengali words using the Naïve Bayes classification algorithm. The whole task was divided into two modules, each executing a specific step. In the first module, the algorithm was applied to regular text collected from the Bengali text corpus developed under the TDIL project of the Govt. of India, and the accuracy of the disambiguation process was around 80%. In the second module, the training data and the test data were lemmatized, and applying the same algorithm yielded around 85% accuracy. The output was verified against a previously tagged output file generated with the help of a Bengali lexical dictionary. The implicational relevance of this study was attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
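A Naïve Bayes sense classifier of the kind described can be sketched as below. The training examples are invented for illustration (a hypothetical ambiguous word with "body-part" vs. "leader" senses); the paper's actual corpus, features, and lemmatization step are not reproduced.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train a Naive Bayes sense classifier from (context_words, sense) pairs."""
    sense_counts = Counter(sense for _, sense in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(context, model):
    """Pick the sense with the highest log-probability given the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_lp = None, float("-inf")
    for sense, sc in sense_counts.items():
        lp = math.log(sc / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            # Add-one (Laplace) smoothing for unseen context words.
            lp += math.log((word_counts[sense][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

# Hypothetical training data for an ambiguous Bengali word.
data = [(["ব্যথা", "ওষুধ"], "body-part"),
        (["গ্রাম", "নেতা"], "leader"),
        (["ব্যথা", "চিকিৎসা"], "body-part")]
model = train_nb(data)
print(classify(["ব্যথা"], model))  # → body-part
```

Lemmatizing the context words before training, as the second module does, collapses inflected forms into single counts and typically improves such a classifier on morphologically rich languages like Bengali.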
Archive | 2015
Bishwa Ranjan Das; Srikanta Patnaik; Sarada Baboo; Niladri Sekhar Dash
This paper presents a novel approach to recognizing named entities in an Odia corpus. The development of an NER system for Odia using a Support Vector Machine is a challenging task in intelligent computing. NER aims at classifying each word in a document into predefined target named entity classes, in both linear and non-linear fashion. Developing a baseline NER system requires a named-entity-annotated corpus and a set of features. Some language-specific rules are added to the system to recognize specific NE classes. Moreover, some gazetteers and context patterns are added to increase its performance, since it is observed that the identification of rules and context patterns requires language-based knowledge to make the system work better. We have used the required lexical databases to prepare rules and identify the context patterns for Odia. Experimental results show that our approach achieves higher accuracy than previous approaches.
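Feature extraction for SVM-based NER with gazetteer lookup, as described above, often takes roughly the following shape. The feature set and gazetteer entries here are illustrative assumptions, not the paper's actual feature inventory.

```python
def ner_features(tokens, i, gazetteer):
    """Extract simple per-token features of the kind fed to an SVM-based
    NER classifier. The feature set is illustrative, not the paper's."""
    tok = tokens[i]
    return {
        "word": tok,
        "prev": tokens[i - 1] if i > 0 else "<S>",          # left context
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</S>",  # right context
        "suffix2": tok[-2:],        # crude morphological cue
        "is_digit": tok.isdigit(),
        "in_gazetteer": tok in gazetteer,  # language-specific lexical knowledge
    }

# Hypothetical gazetteer of place names (romanized here for clarity).
place_gazetteer = {"Bhubaneswar", "Cuttack"}
feats = ner_features(["went", "to", "Bhubaneswar"], 2, place_gazetteer)
```

Each feature dictionary would then be vectorized and passed to the SVM, with the gazetteer and context-pattern features supplying the language-based knowledge the abstract emphasizes.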
Archive | 2015
Bishwa Ranjan Das; Srikanta Patnaik; Niladri Sekhar Dash
In this paper, we describe the strategies and methods we have adopted to design and develop a digital Odia corpus of newspaper texts. We also attempt to identify the scope of its utilization in different domains of Odia language technology and applied linguistics. The corpus is developed from sample news reports produced and published by some major Odia newspapers published from Bhubaneswar and neighboring places. We have addressed several issues relating to text corpus design, development, and management, such as the size of the corpus with regard to the number of sentences and words, coverage of domains and sub-domains of news texts, text representation, the question of nativity, determination of target users, selection of time span, selection of texts, the amount of sample for each text type, the method of data sampling, the manner of data input, corpus sanitation, corpus file management, and the problem of copyright. The digital corpus is in machine-readable format, so the text can be processed quickly. We presume that the corpus we have developed will be of great help in examining the present texture of the language as well as in retrieving various linguistic data and information required for writing a modern grammar of Odia with close reference to its empirical identity, usage, and status. The electronic Odia corpus that we have generated can also be used in various fields of research and development for Odia.
Archive | 2019
Niladri Sekhar Dash; L. Ramamoorthy
The humble goal of this chapter is to refer to some of the achievements in the area of Indian language corpora generation and lexical database compilation that have been made for a few Indian languages within the last two and a half decades. We shall also refer in this chapter to some works that are still in progress on the development of corpora and lexical databases for the Indian languages. What is most satisfying is the involvement and active participation of a large number of renowned institutions and individuals of the country in such works, due to which these have drawn a considerable amount of attention and approval across the globe. In essence, in this chapter, we shall present a short survey of the development of monolingual corpora and parallel translation corpora, developed through individual attempts or joint enterprises across the country.
Archive | 2019
Niladri Sekhar Dash; L. Ramamoorthy
The development of an exhaustive database of scientific and technical terms in a natural language carries tremendous importance in the areas of linguistic resource generation, translation, machine learning, knowledge representation, language standardization, information retrieval, dictionary compilation, language education, text composition, and language planning, as well as in many other domains of language technology and mass literacy. Keeping these utilities in mind, in this chapter we propose developing a large lexical database of scientific and technical terms in a natural language using a corpus. In this work, we propose to adopt an advanced method for the systematic collection of scientific and technical terms from a digital language corpus that is already developed and available for general access. Since most of the Indian languages are enriched with digital texts available on the Internet, it is not unfair to expect that a resource of this kind can be developed for most of the Indian languages. Following some of the stages of corpus processing discussed in this book (Chap. 5), we can develop an exhaustive database of scientific and technical terms in any of the Indian languages, which can be utilized in all possible linguistic activities.
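The chapter does not spell out its extraction method here, but one common corpus-based way to surface candidate technical terms is to compare a word's frequency in a domain corpus against a general reference corpus (a simple "weirdness" ratio). The sketch below, with invented token lists, shows that idea only as an assumption, not the chapter's procedure.

```python
from collections import Counter

def candidate_terms(domain_tokens, general_tokens, min_ratio=3.0):
    """Rank words by how much more frequent they are in a domain corpus
    than in a general reference corpus (a simple 'weirdness' ratio)."""
    d, g = Counter(domain_tokens), Counter(general_tokens)
    nd, ng = len(domain_tokens), len(general_tokens)
    scores = {}
    for word, count in d.items():
        # Add-one smoothing so terms absent from the general corpus still score.
        ratio = (count / nd) / ((g[word] + 1) / (ng + 1))
        if ratio >= min_ratio:
            scores[word] = ratio
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical token lists; a real run would pit a science-domain corpus
# against a general-language corpus in the same Indian language.
domain = ["photon", "photon", "photon", "the", "the"]
general = ["the"] * 10
print(candidate_terms(domain, general))  # "photon" surfaces as a candidate term
```

Ranked candidates of this kind would still need manual validation before entering a terminological database.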
Archive | 2019
Niladri Sekhar Dash; L. Ramamoorthy
Recent works show that a dictionary can be compiled to a certain level of satisfaction if it is made with data and information acquired from a widely representative and properly balanced language corpus. A language corpus provides an empirical basis for the selection of words and other lexical items, as well as supplying the most authentic information relating to pronunciation, usage, grammar, meaning, and illustration with which all the words and lexical items in a general reference dictionary are furnished. In the same manner, a language corpus supplies the most authentic information relating to compounds, idioms, phrases, proverbs, etc., which are also included in a general reference dictionary with equal attention and importance. In this chapter, we try to explain how linguistic data and information collected from a corpus can contribute toward the compilation of a more useful dictionary. Although a corpus has even greater functional utility in the development of electronic dictionaries, we concentrate here on the use of a corpus in the compilation of printed dictionaries. We shall occasionally refer to the TDIL corpora developed in the Indian languages and use linguistic data and information from them to substantiate our arguments and observations.
Archive | 2019
Niladri Sekhar Dash; L. Ramamoorthy
A language corpus is now accepted as one of the primary resources in several branches of application-oriented and description-based linguistics. In all these branches, a corpus is used, directly or indirectly, for the description, analysis, and application of various elements and properties of a language. This trend of using a corpus as a resource reflects an ideological shift in the approach of language investigators and practitioners in recent years. The ready availability of corpora has made us realize that we do not need to depend on our intuitive linguistic expertise to establish our claims; rather, there is great scope for extracting data and information from a corpus for the same purpose. This alternative method of language study has inspired us to depend on language data faithfully obtained from real-life situations rather than on our intuitive speculation. Keeping this phenomenon in view, in this chapter we shall try to show that the utility of a language corpus is no longer confined to a few areas of linguistics and language technology; rather, it is being used in many old and new branches of linguistics to make these fields more useful, informative, and insightful. To substantiate our argument, we shall describe the use of corpora in some important domains of linguistics, namely lexicology, lexical semantics, sociolinguistics, psycholinguistics, and stylistics.
Archive | 2019
Niladri Sekhar Dash; L. Ramamoorthy
In this chapter, we shall attempt to discuss some of the most common techniques often used for processing texts stored in a text corpus. From the early stage of corpus generation and processing, most of these techniques have been quite useful for compiling different types of information about a language as well as for formulating new approaches to corpus text analysis and investigation. Notably, most of these corpus processing techniques had never been used in language analysis before the advent of the corpus in digital form and the application of computers to language data collection and analysis. Another important aspect of this new trend is that most of the techniques have strong functional relevance in the present context of language analysis, since they give us much better ways to look at the properties of a language and utilize them in language analysis and application. We shall briefly discuss some of the most useful text processing techniques, such as frequency calculation of characters and words, lexical collocation, concordance of words, keyword in context, local word grouping, and lemmatization. We shall also try to show how information and data extracted through these text processing techniques are useful in language description, analysis, and application.
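Of the techniques listed, keyword in context (KWIC) is perhaps the simplest to sketch: every occurrence of a keyword is displayed with a window of surrounding words. The toy sentence below is ours; the same routine applies to any tokenized corpus text.

```python
def kwic(tokens, keyword, window=3):
    """Keyword-in-context: list each occurrence of `keyword` with up to
    `window` words of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # filter(None, ...) drops empty context at sentence edges.
            lines.append(" ".join(filter(None, [left, f"[{tok}]", right])))
    return lines

# Toy English example; corpus work would feed in tokenized Bangla or Odia text.
for line in kwic("the cat sat on the mat".split(), "the", window=2):
    print(line)
```

Concordance and collocation tools are essentially elaborations of this loop: sorting the KWIC lines by left or right context, or counting which words recur inside the window.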