Kalika Bali
Microsoft
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Kalika Bali.
empirical methods in natural language processing | 2014
Yogarshi Vyas; Spandana Gella; Jatin Sharma; Kalika Bali; Monojit Choudhury
Code-mixing is frequently observed in user generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by presence of spelling variations, transliteration and non-adherance to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English codemixed text collated from Facebook forums, and explore language identification, back-transliteration, normalization and POS tagging of this data. Our results show that language identification and transliteration for Hindi are two major challenges that impact POS tagging accuracy.
international acm sigir conference on research and development in information retrieval | 2014
Parth Gupta; Kalika Bali; Rafael E. Banchs; Monojit Choudhury; Paolo Rosso
For many languages that use non-Roman based indigenous scripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multi-lingual space with more than one script which we refer to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to the documents written in both the scripts. Moreover, transliterated content features extensive spelling variations. In this paper, we formally introduce the concept of Mixed-Script IR, and through analysis of the query logs of Bing search engine, estimate the prevalence and thereby establish the importance of this problem. We also give a principled solution to handle the mixed-script term matching and spelling variation where the terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. We present an extensive empirical analysis of the proposed method along with the evaluation results in an ad-hoc retrieval setting of mixed-script IR where the proposed method achieves significantly better results (12% increase in MRR and 29% increase in MAP) compared to other state-of-the-art baselines.
human factors in computing systems | 2013
Sébastien Cuendet; Indrani Medhi; Kalika Bali; Edward Cutrell
Designing ICT systems for rural users in the developing world is difficult for a variety of reasons ranging from problems with infrastructure to wide differences in user contexts and capabilities. Developing regions may include huge variability in spoken languages, and users are often low- or non-literate, with very little experience interacting with digital technologies. Researchers have explored the use of text-free graphical interfaces as well as speech-based applications to overcome some of the issues related to language and literacy. While there are benefits and drawbacks to each of these approaches, they can be complementary when used together. In this work, we present VideoKheti, a mobile system using speech, graphics, and touch interaction for low-literate farmers in rural India. VideoKheti helps farmers to find and watch agricultural extension videos in their own language and dialect. In this paper, we detail the design and development of VideoKheti and report on a field study with 20 farmers in rural India who were asked to find videos based on a scenario. The results show that farmers could use VideoKheti, but their success still greatly depended on their education level. While participants were enthusiastic about using the system, the multimodal interface did not overcome many obstacles for low-literate users.
workshop on computational approaches to code switching | 2014
Kalika Bali; Jatin Sharma; Monojit Choudhury; Yogarshi Vyas
Code-Mixing is a frequently observed phenomenon in social media content generated by multi-lingual users. The processing of such data for linguistic analysis as well as computational modelling is challenging due to the linguistic complexity resulting from the nature of the mixing as well as the presence of non-standard variations in spellings and grammar, and transliteration. Our analysis shows the extent of Code-Mixing in English-Hindi data. The classification of Code-Mixed words based on frequency and linguistic typology underline the fact that while there are easily identifiable cases of borrowing and mixing at the two ends, a large majority of the words form a continuum in the middle, emphasizing the need to handle these at different levels for automatic processing of the data.
linguistic annotation workshop | 2009
Sandipan Dandapat; Priyanka Biswas; Monojit Choudhury; Kalika Bali
Alternative paths to linguistic annotation, such as those utilizing games or exploiting the web users, are becoming popular in recent times owing to their very high benefit-to-cost ratios. In this paper, however, we report a case study on POS annotation for Bangla and Hindi, where we observe that reliable linguistic annotation requires not only expert annotators, but also a great deal of supervision. For our hierarchical POS annotation scheme, we find that close supervision and training is necessary at every level of the hierarchy, or equivalently, complexity of the tagset. Nevertheless, an intelligent annotation tool can significantly accelerate the annotation process and increase the inter-annotator agreement for both expert and non-expert annotators. These findings lead us to believe that reliable annotation requiring deep linguistic knowledge (e.g., POS, chunking, Treebank, semantic role labeling) requires expertise and supervision. The focus, therefore, should be on design and development of appropriate annotation tools equipped with machine learning based predictive modules that can significantly boost the productivity of the annotators.
workshop on computational approaches to code switching | 2014
Gokul Chittaranjan; Yogarshi Vyas; Kalika Bali; Monojit Choudhury
We describe a CRF based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and therefore, can easily be replicated across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely, EnglishSpanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic (Ar-Ar) Dialects. The experimental results show a consistent performance across the language pairs.
acm symposium on computing and development | 2013
Kalika Bali; Sunayana Sitaram; Sébastien Cuendet; Indrani Medhi
Voice user interfaces for ICTD applications have immense potential in their ability to reach to a large illiterate or semi-literate population in these regions where text-based interfaces are of little use. However, building speech systems for a new language is a highly resource intensive task. There have been attempts in the past to develop techniques to circumvent the need for large amounts of data and technical expertise required to build such systems. In this paper we present the development and evaluation of an application specific speech recognizer for Hindi. We use the Salaam method [4] to bootstrap a high quality speech engine in English to develop a mobile speech based agricultural video search for farmers in India. With very little training data for a 79 word vocabulary we are able to achieve >90% accuracies for test and field deployments. We report some observations from field that we believe are critical to the effective development and usability of a speech application in ICTD.
empirical methods in natural language processing | 2016
Koustav Rudra; Shruti Rijhwani; Rafiya Begum; Kalika Bali; Monojit Choudhury; Niloy Ganguly
Linguistic research on multilingual societies has indicated that there is usually a preferred language for expression of emotion and sentiment (Dewaele, 2010). Paucity of data has limited such studies to participant interviews and speech transcriptions from small groups of speakers. In this paper, we report a study on 430,000 unique tweets from Indian users, specifically Hindi-English bilinguals, to understand the language of preference, if any, for expressing opinion and sentiment. To this end, we develop classifiers for opinion detection in these languages, and further classifying opinionated tweets into positive, negative and neutral sentiments. Our study indicates that Hindi (i.e., the native language) is preferred over English for expression of negative opinion and swearing. As an aside, we explore some common pragmatic functions of code-switching through sentiment detection.
Proceedings of the 9th International Conference (EVOLANG9) | 2012
Rishiraj Saha Roy; Monojit Choudhury; Kalika Bali
Searching information on the World Wide Web by issuing queries to commercial search engines is one of the most common activities engaged in by almost every Web user. Web search queries have a unique structure, which is more complex than just a bag-of-words, yet simpler than a natural language. This structure has been evolving over the past decade which is an artefact of the way search engines are evolving and aggressively using feedback from past users to serve current and future users better. In this paper, we argue that queries can be considered as an evolving protolanguage from functional, structural and dynamical points of view. Therefore, Web search logs, a perfectly preserved and rich dataset, can probably reveal several interesting facts about the evolution of protolanguage.
international conference on multimodal interfaces | 2009
Prasenjit Dey; Ramchandrula Sitaram; Rahul Ajmera; Kalika Bali
Multimodal systems, incorporating more natural input modalities like speech, hand gesture, facial expression etc., can make human-computer-interaction more intuitive by drawing inspiration from spontaneous human-human-interaction. We present here a multimodal input device for Indic scripts called the Voice Key Board (VKB) which offers a simpler and more intuitive method for input of Indic scripts. VKB exploits the syllabic nature of Indic language scripts and exploits the users mental model of Indic scripts wherein a base consonant character is modified by different vowel ligatures to represent the actual syllabic character. We also present a user evaluation result for VKB comparing it with the most common input method for the Devanagari script, the InScript keyboard. The results indicate a strong user preference for VKB in terms of input speed and learnability. Though VKB starts with a higher user error rate compared to InScript, the error rate drops by 55% by the end of the experiment, and the input speed of VKB is found to be 81% higher than InScript. Our user study results point to interesting research directions for the use of multiple natural modalities for Indic text input.