Girish Nath Jha
Jawaharlal Nehru University
Publications
Featured research published by Girish Nath Jha.
Sanskrit Computational Linguistics | 2009
Girish Nath Jha; Muktanand Agrawal; Subash; Sudhir K. Mishra; Diwakar Mani; Diwakar Mishra; Manji Bhadra; Surjit Kumar Singh
The paper describes a Sanskrit morphological analyzer that identifies and analyzes inflected noun forms and verb forms in any given sandhi-free text. The system, which has been developed as a Java servlet with an RDBMS back-end, can be tested at http://sanskrit.jnu.ac.in (Language Processing Tools > Sanskrit Tinanta Analyzer/Subanta Analyzer) with Sanskrit data in Unicode text. Subsequently, the separate subanta and tinanta systems will be combined into a single sentence-analysis system with karaka interpretation. Currently, the system checks and labels each word with one of three basic POS categories - subanta, tinanta, and avyaya. Thereafter, each subanta is sent for subanta processing based on an example database and a rule database. The verbs are examined based on a database of verb roots and forms, as well as by reverse morphology based on Paninian techniques. Future enhancements include plugging the amarakosa (http://sanskrit.jnu.ac.in/amara) and other noun lexicons into the subanta system. The tinanta system will be enhanced by the kṛdanta analysis module being developed separately.
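The three-way labelling step described in the abstract can be sketched as a simple lookup cascade. This is a hypothetical illustration, not the JNU system: the word lists below are tiny stand-ins for the actual avyaya list and verb-form database, and real processing would then route each word to the subanta or tinanta analyzer.

```python
# Illustrative sketch of the three-way POS labelling step: words are checked
# against an avyaya (indeclinable) list and a verb-form database first;
# everything else defaults to subanta (inflected noun) for further processing.
# The sets below are toy stand-ins for the system's actual databases.

AVYAYA_LIST = {"ca", "eva", "api"}          # sample indeclinables
TINANTA_FORMS = {"gacchati", "pathati"}     # sample inflected verb forms

def label_word(word):
    """Assign one of the three basic POS categories used by the system."""
    if word in AVYAYA_LIST:
        return "avyaya"
    if word in TINANTA_FORMS:
        return "tinanta"
    return "subanta"   # remaining words go to subanta processing

def label_sentence(words):
    return [(w, label_word(w)) for w in words]

print(label_sentence(["ramah", "gacchati", "ca"]))
# → [('ramah', 'subanta'), ('gacchati', 'tinanta'), ('ca', 'avyaya')]
```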
language and technology conference | 2011
Narayan Choudhary; Girish Nath Jha
This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named the Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase has two distinct goals: creating a parallel sentence-aligned corpus, and parts-of-speech (POS) annotation of the corpora as per the recently evolved national standard of the Bureau of Indian Standards (BIS). This phase of the project finishes in April 2012, and the next phase, with newer domains and more national languages, is likely to take off in May 2012. The goal of the current phase is to create parallel, aligned, POS-tagged corpora in 12 major Indian languages (including English), with Hindi as the source language, in the health and tourism domains. Additional languages and domains will be added in the next phase. With a goal of 25 thousand sentences in each domain, the total number of words in each domain has reached up to 400 thousand, the largest parallel corpus for any pair of Indian languages. A careful attempt has been made to capture various types of texts: after analyzing the two domains, we divided them into sub-domains and then looked for source texts in those particular sub-domains. With a preferred structure of the corpora in mind, we present our experiences in selecting source texts and recount problems such as judging sub-domain representation in the corpora. The POS annotation framework used for this corpus creation has also seen new changes in the POS tagsets, and we give a brief account of the annotation framework applied in this endeavor.
international conference on intelligent computing | 2010
Ritesh Kumar; Girish Nath Jha
In this paper, we present a corpus-based study of politeness across two languages, English and Hindi. It examines politeness in a translated parallel corpus of Hindi and English and sees how politeness in a Hindi text is translated into English. We provide a detailed theoretical background in which the comparison is carried out, followed by a brief description of the translated data within this theoretical model. Since politeness may become one of the major sources of conflict and misunderstanding, it is a very important phenomenon to study and understand cross-culturally, particularly for purposes such as machine translation.
language and technology conference | 2009
Girish Nath Jha; Madhav Gopal
In this paper we present an experiment on using the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, 2008b), developed by Microsoft Research India (MSRI) for tagging Indian languages, to annotate a Sanskrit corpus. Sanskrit is a language with rich morphology and relatively free word order. The authors have included and excluded certain tags according to the requirements of the Sanskrit data. A revision of the IL-POSTS annotation guidelines is also presented. The authors also report an experiment in training the tagger at MSRI and document the results.
Sanskrit Computational Linguistics | 2009
Girish Nath Jha; Sudhir K. Mishra
Pāṇini's grammar is widely known for its formal treatment of the Sanskrit language. Many scholars (Jha 2004) have earlier taken a systemic view of Pāṇini and have argued that Pāṇini's system is easily implementable. However, on a closer look, several complications arise, especially in Pāṇini's recourse to semantics in many of the vidhi and saṁjñā rules. This seems to happen more in the kāraka prakaraṇa than in other components. The authors of this paper highlight the challenges in implementing some of the semantic aspects of Pāṇini's kāraka system. For example, the semantic conditions of being most desired by the agent (karturīpsitatamam) and of being most effective (sādhakatamam) are difficult to formalize in the rules specifying the semantic conditions for an object's being termed karman and karaṇa respectively. Likewise, the semantic conditions of being that which the agent approaches with the direct object (karmaṇā yam abhipraiti), and of being the one pleased in relation to verbs of pleasing (rucyarthānāṁ prīyamāṇaḥ), are difficult to formalize in the rules specifying semantic conditions for an object's being termed sampradāna. The paper also looks at possible strategies to handle such situations, and presents pseudo-code-like translations of some kāraka rules.
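A pseudo-code-like translation of a kāraka rule, in the spirit the abstract describes, might look like the sketch below. This is illustrative, not the paper's actual rendering: the predicate `is_most_desired_by_agent` is a placeholder for the semantic condition karturīpsitatamam, which is exactly the part the paper argues is hard to formalize without real semantic resources.

```python
# Hedged, illustrative translation of one kāraka rule into code.
# Aṣṭādhyāyī 1.4.49 (kartur īpsitatamaṁ karma): that which is most desired
# by the agent is termed karman. The semantic test is a stubbed callable,
# standing in for the hard-to-formalize condition the paper discusses.

def assign_karman(noun, verb, is_most_desired_by_agent):
    """Return the kāraka label 'karman' if the semantic condition holds."""
    if is_most_desired_by_agent(noun, verb):
        return "karman"
    return None  # some other kāraka rule would apply instead

# usage with a trivially stubbed semantic predicate
tag = assign_karman("gramam", "gacchati", lambda noun, verb: True)
print(tag)
# → karman
```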
Procedia Computer Science | 2016
Rajneesh Kumar Pandey; Girish Nath Jha
The paper presents a statistical Sanskrit-Hindi translator and analyzes the errors generated by the system. The system is trained on the Microsoft Translator Hub (MTHub) platform and is intended only for simple Sanskrit prose texts. The training set includes 24K parallel sentences and 25K monolingual sentences, with recent BLEU (Bilingual Evaluation Understudy) scores of 41 and above. The paper discusses the error analysis of the system and suggests possible solutions. Further, it also focuses on the evaluation of the MTHub system with the BLEU metric. For developing the MT system, the parallel Sanskrit-Hindi text corpora have been collected or developed manually from the literature, health, news and tourism domains. The paper also discusses issues and challenges in the development of translation systems for languages like Sanskrit.
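The BLEU metric used to evaluate the system can be sketched as modified n-gram precision combined with a brevity penalty. The minimal sentence-level version below (single reference, uniform weights, crude zero-handling) is only an illustration of the idea; real MT evaluations such as the one in the paper use corpus-level BLEU with proper smoothing.

```python
# Minimal sentence-level BLEU sketch: geometric mean of modified n-gram
# precisions (n = 1..4) times a brevity penalty. Illustrative only.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        # crude floor instead of proper smoothing, to avoid log(0)
        precisions.append(overlap / total if overlap else 1e-9)
    if len(candidate) >= len(reference):
        bp = 1.0                                      # no brevity penalty
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# a perfect match scores 1.0; any mismatch scores lower
print(round(bleu("sah gacchati iti vadati".split(),
                 "sah gacchati iti vadati".split()), 2))
# → 1.0
```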
language and technology conference | 2015
Pitambar Behera; Atul Kr. Ojha; Girish Nath Jha
Low-density languages are also known as lesser-known, poorly described, less-resourced, minority or less-computerized languages because they have fewer resources available. Collecting and annotating a voluminous corpus for NLP applications in these languages proves quite challenging. For the development of any NLP application for a low-density language, one needs an annotated corpus and a standard annotation scheme. Because of their non-standard usage in text and other linguistic nuances, such languages pose significant challenges that are linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers using SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121K words was collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) project. Both taggers are trained and tested with approximately 80K and 13K words respectively. The SVM tagger achieves 83% accuracy, while CRF++ achieves 71.56%, which is lower than the former.
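Both SVM and CRF++ taggers consume token-level feature vectors rather than raw words. The feature templates actually used for Sambalpuri are not given in the abstract, so the extractor below is a generic, hypothetical example of the kind of context and affix features such taggers are typically trained on.

```python
# Hypothetical token-level feature extraction of the kind fed to SVM or
# CRF++ POS taggers: surface form, short affixes, and neighbouring words.
# These templates are illustrative, not the paper's actual feature set.

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word": word,
        "prefix2": word[:2],                                # leading affix
        "suffix2": word[-2:],                               # trailing affix
        "prev": tokens[i - 1] if i > 0 else "<s>",          # left context
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
        "is_digit": word.isdigit(),
    }

sent = ["mui", "ghara", "jauchhen"]   # sample tokens, romanized for display
feats = [token_features(sent, i) for i in range(len(sent))]
print(feats[0]["next"], feats[0]["prev"])
# → ghara <s>
```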
advances in computing and communications | 2015
Srishti Singh; Girish Nath Jha
The authors present the first Support Vector Machine (SVM) based statistical part-of-speech (POS) tagger developed for Bhojpuri. Bhojpuri is a less-resourced Indo-Aryan language of the Asian continent, and the POS tagger presented here is a step towards developing language resources for it. SVMs have already been trained on other languages like Malayalam and Bengali with an accuracy of 86-90%. The present research achieved approximately 87.3-88.6% accuracy on the test datasets.
language and technology conference | 2011
Narayan Choudhary; Pramod Pandey; Girish Nath Jha
For the task of natural language understanding, the identification of tense, aspect and mood (TAM) features in a given text is important in itself. A closer look at the verb groups in a sentence can give the exact combination of TAM features each verb group carries. While a verb group consisting of one word can be easily interpreted for its TAM features, the TAM features of verb groups consisting of more than one word (as witnessed in many languages) can be identified exactly through a rule-based method. In this paper we present a rule-based method to capture the TAM features denoted by verb groups in Hindi.
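The rule-based idea can be sketched as a mapping from the auxiliary sequence of a Hindi verb group to its TAM features. The two toy patterns below (romanized) are a minimal illustration under that assumption, not the paper's actual rule set, which would cover the full range of Hindi auxiliary combinations.

```python
# Minimal sketch of rule-based TAM identification for multi-word Hindi verb
# groups: the trailing auxiliaries select the tense/aspect combination.
# Only two toy patterns are encoded; the real rule set is far larger.

TAM_RULES = {
    ("raha", "hai"): {"tense": "present", "aspect": "progressive"},   # e.g. "ja raha hai"
    ("chuka", "hai"): {"tense": "present", "aspect": "perfective"},   # e.g. "ja chuka hai"
}

def tam_of_verb_group(verb_group):
    """verb_group: list of tokens, e.g. ['ja', 'raha', 'hai']."""
    key = tuple(verb_group[-2:])   # match on the trailing auxiliary pair
    return TAM_RULES.get(key, {"tense": "unknown", "aspect": "unknown"})

print(tam_of_verb_group(["ja", "raha", "hai"]))
# → {'tense': 'present', 'aspect': 'progressive'}
```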
international conference on applied and theoretical computing and communication technology | 2017
Anupama Pandey; Srishti Singh; Atul Kr. Ojha; Girish Nath Jha
In this paper, the authors report the scope of domain adaptation for a multi-domain Hindi POS tagger to a new domain, the popular domain of cricket, through an initial experiment. The utility of adapting to the new domain is proposed and verified by testing the accuracy of the existing Hindi POS tagger on sports-domain (here, cricket) data, where the average accuracy drops to 87.77% from approximately 94% overall tagger accuracy. A manual validation method is followed for evaluating the test results and generating a correct error report for the sports-domain data. Alongside, inter-annotator agreement/disagreement among evaluators and some major tagger-based errors, such as unseen vocabulary and inconsistent performance, have been recorded, along with some suggestions for improvement, serving as the basis for introducing adaptation for the Hindi tagger.