S. Arulmozi
University of Hyderabad
Publication
Featured research published by S. Arulmozi.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
Understanding the concept of ‘corpus’ has been one of the challenging issues in corpus linguistics in recent times. Language users are often confused about the concept and, as a result, sometimes consider a language database of any form and content to be a corpus and treat it accordingly. This is not acceptable since the concept of the corpus is far more complex. What is important here is that one should have a clear idea about what a corpus is. Without a clear idea of how to define corpora, subsequent studies on corpus data and information are bound to be skewed and erroneous. Keeping this issue in mind, in this chapter we have attempted to provide some preliminary ideas about what a corpus is. We have first listed some popular definitions of ‘corpus’, referencing the definitions already available in dictionaries. Next, we have elaborated on the concept of the corpus in a scientific manner with a focus on its internal properties. Then we have explicated the acronym (the abbreviated form) in some detail; made distinctions between a corpus, a dataset and a database; elaborated on the formational principles of a digital corpus; determined the immediate benefits of a corpus; discussed the advantages of a corpus; and finally, we have argued for the generation of corpora in all major and minor languages.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
The history of the use of language corpora before the digital corpus was generated and used is shrouded in darkness. In this chapter, we have attempted to shed some light on this dark history. We have tried to study the unmarked history regarding the processes of the generation of handmade language corpora over the past 200 years. Tracing through the past, we have described how, in the earlier years, people designed, developed and utilized language corpora in various linguistic studies. First, we have tried to justify the relevance of the survey in the present context of corpus-based linguistic studies; then we have shown how language corpora are used to collect words and other lexical items for compiling general and special dictionaries, such as Johnson’s Dictionary (1755), The Oxford English Dictionary (1882), Supplementary Volumes of Oxford English Dictionary and the Dictionary of American English. In addition, we have described how good quotations are collected from handmade language corpora to substantiate the definitions of words provided in reference dictionaries; how handmade corpora are used in the lexical study of a language; and how data and information are extracted from handmade corpora for writing grammar books for primary and advanced language learners. Thus we have provided some rudimentary descriptions of the works of earlier scholars who manually designed and developed language corpora based on their personal design principles and utilized these in various ways to address several linguistic requirements.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
The history of speech corpus generation is short, slow and shady in comparison to that of text corpus generation. In fact, the diversity observed in text corpus generation is hardly noted in speech corpus generation. The number of speech corpora is small because of certain technical constraints that stand as barriers in speech corpus generation. Moreover, the inherent characteristics of spoken texts make the process of speech corpus generation a complex task. Furthermore, there are procedural hurdles that make the process of speech corpus generation a troublesome affair. In this chapter, we have referred to the hurdles in the generation of speech corpora; highlighted the relevance of this survey in general; discussed the speech part of the Survey of English Usage; described the form and content of the London–Lund Corpus of Spoken English; provided information on the composition of the Machine-Readable Corpus of Spoken English; referred to the Corpus of Spoken New Zealand English; presented the structure and content of the Michigan Corpus of Academic Speech; discussed the generation of the Corpus of London Teenage Language; and referred to some small-sized speech corpora developed so far in English and other languages.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
The generation of text corpora is not confined to a few widely privileged languages such as English, French, German or Spanish. Many lesser-known and under-privileged languages are also emerging with corpora of various types for various kinds of application. This makes it possible to discover corpora of various types in most advanced as well as less advanced languages. In essence, digital text corpora have already been developed in almost all languages, barring a few that are yet to have the opportunity to deploy the facilities of computer technology used by most others. As a continuation of the previous chapter (Chap. 11), in this chapter, we have briefly discussed the form and content of some widely known corpora developed in various languages of the world. In sequential order, we have briefly reported on the form and composition of the British National Corpus (BNC); discussed the BNC-Baby; referred to the structure and content of the American National Corpus (ANC); presented a short sketch of the Bank of English; reported on the Croatian National Corpus; highlighted the composition of the English–Norwegian Parallel Corpus; and, finally, presented short reports on a few small-sized text corpora that are widely known for their applicational relevance.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
The classification of the corpus is not confined to the genre and nature of texts. It spreads far beyond this. In this chapter, we have tried to show that a corpus can also be classified based on the type of text and the purpose of the corpus design. Based on the type of text, a corpus can be termed a ‘monolingual corpus’, which contains text samples from a single language or a dialect variety; a ‘bilingual corpus’, which carries proportional amounts of texts taken from two languages or dialect varieties (which may or may not be genealogically, typologically or geographically related); or a ‘multilingual corpus’, which stores a good amount of language data with proportional distribution across text types from more than two languages. On the other hand, based on the purpose of design, a corpus can be termed an ‘unannotated corpus’, where text samples are kept in their raw form without the addition of metadata or annotation of any kind; or an ‘annotated corpus’, where texts are annotated or tagged with various intralingual and extralingual data and information. Furthermore, we have also described the ‘maxims of corpus annotation’ proposed by earlier scholars; analyzed the issues involved in the act of corpus annotation; referred to the challenges directly and indirectly linked with corpus annotation; and finally, have referred to the state of the art in corpus annotation in English and other languages across the world.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
The World Wide Web is viewed as a useful linguistic resource since it is a unique linguistic world that is full of surprising linguistic data and information. It is the largest store of texts in existence that is freely available for all kinds of work. It covers a wide range of domains, and it is constantly added to and updated with new kinds of text by one and all. In the present world of corpus linguistics, the web has become a highly enriched source of texts. It is therefore necessary to understand the form and content of web texts in order to specify their position and importance in corpus linguistics. To serve this purpose, in this chapter, we have defined the concept of a web text corpus (WTC); concentrated on its features and content to mark its unique identity; discussed the purposes behind the generation of a WTC; referred to some of the early attempts made to create a WTC in English and other (mostly non-Indian) languages; described the methodologies applied to create a WTC in an easy and useful manner; described the metadata information normally tagged to a WTC; identified the problems that are faced during the course of generating, storing and processing a WTC; and finally have attested the functional utility of a WTC in various domains of linguistics and language technology.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
Language corpora, from the very first date of their inception, have been a target of constant criticism by scholars from different domains of linguistics. In reality, there are people from many domains who join generative linguists in denying the importance of corpora in research, investigation and application. On the other hand, language corpora themselves have some limitations with regard to form, content and composition that cannot be ignored in the present scenario of corpus generation and application. In this chapter, we have tried to discuss these limitations in brief, to show how they are creating hurdles of several kinds in the progress of corpus linguistics, and how one can try to overcome them with the initiation and execution of some appropriate measures. First, we have delved into the criticisms that generative linguists have raised against corpus linguistics; discussed the paucity of balanced text representation in a corpus; highlighted the limitations in technical efficiency; discussed the preference for written text over speech data in the act of corpus generation; referred to the scarcity of dialogic texts in a corpus; discussed the paucity of pictorial elements in a corpus; looked into the scarcity of poetic texts in a corpus; and finally, reported on some other limitations normally attached to a corpus.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
It is always difficult to define the nature of language data since language texts often possess multiple properties, due to which the nature of a particular text may overlap with that of another. However, since it is assumed that a corpus should be marked with the nature of a text, it is necessary to understand how a corpus can be different based on the nature of text, although mutual interpolation across texts is a common feature in every natural language. Based on the nature of the text, in this chapter, we have argued that a ‘general corpus’ is meant for including all kinds of text available in a language; a ‘special corpus’ is meant to collect data of a special type and to be used in special situations; a ‘sample corpus’ should contain a sufficient amount of data from the major text types to be used as a representative sample of these text types; a ‘literary corpus’ should contain only samples from imaginative literary texts; a ‘monitor corpus’, by virtue of its name and nature, must be very large in size with data taken from all kinds of context and composition with an open possibility for it to be regularly upgraded and augmented; a ‘multimodal corpus’ is meant to contain texts in all forms (audio, video, textual, sign language, etc.); a ‘sublanguage corpus’ should contain a variety of language data compiled from the ‘subsets’ of the general language; and a ‘controlled language corpus’ should be exclusive in nature since it is meant to put a strong restriction on the grammar, style and vocabulary of a language for the writers of documents belonging to special domains.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
In this chapter, we have sketched out how language corpora can be classified based on the nature of the application of texts in various domains of linguistics and language technology. We have argued that a ‘parallel corpus’ should refer to the texts of the same domains obtained from different languages; a ‘translation corpus’ should include text samples that are accompanied by their translations in one or more languages (with original texts from a source language and their translations from one or many target languages); an ‘aligned corpus’ should be a kind of translation corpus where text samples from one language and their translations from another language are aligned, paragraph by paragraph, sentence by sentence, phrase by phrase, word by word, and even character by character; a ‘comparable corpus’ by definition should contain a pair of corpora from one language or from two or more languages, as the situation requires; a ‘reference corpus’ should be designed to provide comprehensive information about a language in its total linguistic identity on both a diachronic and a synchronic scale; a ‘learner corpus’ should be generated with a chosen collection of both written and spoken text samples produced by language learners; and an ‘opportunistic corpus’ should refer to a moderate collection of text samples that are obtained, converted and used free of charge by some novices or amateurs.
Archive | 2018
Niladri Sekhar Dash; S. Arulmozi
In this chapter, we have addressed some of the theoretical and practical issues relating to the generation, processing and management of a parallel translation corpus (PTC) with reference to some Indian languages. A PTC developed in a consortium-mode project under the aegis of DeitY, Govt. of India is discussed. Several issues relating to PTC development are discussed here for the first time, keeping in mind the ready application of parallel translation corpora in various domains of computational linguistics and applied linguistics. In a normative manner, we have defined here what a PTC is, described the process of its construction, and identified its primary features. These issues are brought into focus to justify the present work of trying to develop a PTC for Indian languages for future reference and application. Next, we have exemplified the processes of text alignment in a PTC; discussed the methods of text analysis; proposed the restructuring of translational units; defined the process of extraction of translational equivalents from a PTC; proposed the generation of a bilingual lexical database and termbank from a structured PTC; and finally have identified the areas where a PTC and information extracted from it may be utilized. Since the construction of a PTC is full of hurdles, we have tried to construct a roadmap with a focus on techniques and methodologies that may be applied in order to achieve the task.
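The idea of extracting translational equivalents from a sentence-aligned PTC can be illustrated with a minimal sketch. It is not the method used in the chapter, which the abstract does not specify. It assumes a simple co-occurrence approach scored with the Dice coefficient, and the English and German sentence pairs, the name ALIGNED_PAIRS and both function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data (an assumption, not taken from the chapter): each
# tuple is a source sentence and its translation, already aligned one to one.
ALIGNED_PAIRS = [
    ("the house is big", "das haus ist gross"),
    ("the house is old", "das haus ist alt"),
    ("the tree is big", "der baum ist gross"),
    ("a tree", "ein baum"),
    ("the book is old", "das buch ist alt"),
]


def dice_scores(aligned_pairs):
    """Score each (source word, target word) pair with the Dice coefficient.

    Counts are per aligned sentence pair: a word is counted once for every
    pair in which it occurs, however many times it appears in the sentence.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word of the pair.
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_equivalents(aligned_pairs):
    """Return, for each source word, its highest-scoring target word."""
    best = {}
    for (s, t), score in dice_scores(aligned_pairs).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


if __name__ == "__main__":
    lexicon = best_equivalents(ALIGNED_PAIRS)
    for word in ("house", "tree", "big"):
        print(word, "->", lexicon[word])
```

On a sample this small, content words such as "house" and "tree" pair with their translations, but very frequent function words can pair spuriously. This is one reason a bilingual lexical database built in this way would normally need much more data and manual validation.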