Suraj Maharjan
University of Alabama at Birmingham
Publication
Featured research published by Suraj Maharjan.
workshop on computational approaches to code switching | 2014
Thamar Solorio; Elizabeth Blair; Suraj Maharjan; Steven Bethard; Mona T. Diab; Mahmoud Ghoneim; Abdelati Hawwari; Fahad AlGhamdi; Julia Hirschberg; Alison Chang; Pascale Fung
We present an overview of the first shared task on language identification on code-switched data. The shared task included code-switched data from four language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA), Mandarin-English (MAN-EN), Nepali-English (NEP-EN), and Spanish-English (SPA-EN). A total of seven teams participated in the task and submitted 42 system runs. The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs. In contrast, the language pairs with the highest F-measure were SPA-EN and NEP-EN. The task made evident that language identification in code-switched data is still far from solved and warrants further research.
linguistic annotation workshop | 2015
Suraj Maharjan; Elizabeth Blair; Steven Bethard; Thamar Solorio
Code-switching, where a speaker switches between languages mid-utterance, is frequently used by multilingual populations worldwide. Despite its prevalence, limited effort has been devoted to developing computational approaches or even basic linguistic resources to support research into the processing of such mixed-language data. We present a user-centric approach to collecting code-switched utterances from social media posts, and develop language-universal guidelines for the annotation of code-switched data. We also present results for several baseline language identification models on our corpora and demonstrate that language identification in code-switched text is a difficult task that calls for deeper investigation.
ibero-american conference on artificial intelligence | 2014
Suraj Maharjan; Prasha Shrestha; Thamar Solorio; Ragib Hasan
Most natural language processing tasks deal with large amounts of data, which take a long time to process. A larger dataset and a good set of features generally help produce better results, but larger volumes of text and high-dimensional features also mean slower processing. Thus, natural language processing and distributed computing are a good match. In the PAN 2013 competition, the test runtimes for author profiling ranged from several minutes to several days. Most author profiling systems available now are inaccurate, slow, or both. Our system, written entirely in MapReduce, employs nearly 3 million features and still finishes the task in a fraction of the time taken by state-of-the-art systems, with better accuracy. Our system demonstrates that when we deal with a huge amount of data, a large number of features, or both, using distributed systems makes perfect sense.
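The MapReduce pattern mentioned in the abstract above can be illustrated with a minimal single-process sketch. The abstract does not say which features the system extracts, so the word unigram and bigram counts used here are an assumption, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical. In a real cluster the map and reduce steps would run on separate machines; here they run in one Python process purely to show the data flow.

```python
from collections import defaultdict


def map_phase(text, n=2):
    """Emit (feature, 1) pairs for word unigrams and word n-grams (assumed features)."""
    tokens = text.lower().split()
    for token in tokens:
        yield token, 1
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n]), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts emitted for one feature."""
    return key, sum(values)


# Toy documents, invented for illustration only.
docs = ["the cat sat", "the cat ran"]

emitted = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())
```

Because each document is mapped independently and each feature is reduced independently, both phases can be spread across many workers, which is what makes the approach attractive when the feature space is very large.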
workshop on computational approaches to code switching | 2016
Younes Samih; Suraj Maharjan; Mohammed Attia; Laura Kallmeyer; Thamar Solorio
This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second for Spanish-English with an F1-score of 0.90. The HHU-UH-G system introduces a novel unified neural network architecture for language identification in code-switched tweets for both Spanish-English and MSA-Egyptian dialect. The system makes use of word and character level representations to identify code-switching. For the MSA-Egyptian dialect the system does not rely on any kind of language-specific knowledge or linguistic resources, such as part-of-speech (POS) taggers, morphological analyzers, gazetteers or word lists, to obtain state-of-the-art performance.
empirical methods in natural language processing | 2017
Gustavo Aguilar; Suraj Maharjan; Adrian Pastor López Monroy; Thamar Solorio
Named Entity Recognition for social media data is challenging because of its inherent noisiness. In addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach by employing a more general secondary task of Named Entity (NE) segmentation together with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. We were able to obtain the first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.
asian himalayas international conference on internet | 2011
Rajendra Banjade; Suraj Maharjan
Recommendation systems apply statistical and knowledge discovery techniques to the problem of making product recommendations, and they are achieving widespread success in e-commerce. A successful recommendation system fulfils several purposes, and the choice of methodology significantly influences the quality of recommendations and other aspects, including scalability. As the volume of e-commerce data grows massively, the system should also be able to provide recommendations either by in-memory calculations or by offline calculations, both of which demand high performance. For a large number of customers and products, linear regression with proper model selection can provide significantly better results and performance. Recommendation engines are increasingly becoming a popular choice for solving the problem of content discovery, enabling users to find personally relevant content that they might not have known was available. In this paper, we consider the linear regression technique for analyzing large-scale datasets in order to produce useful recommendations for e-commerce customers through offline calculation of model results.
Proceedings of the First Workshop on Abusive Language Online | 2017
Niloofar Safi Samghabadi; Suraj Maharjan; Alan P. Sprague; Raquel Diaz-Sprague; Thamar Solorio
Although social media has made it easy for people to connect on a virtually unlimited basis, it has also opened doors to people who misuse it to undermine, harass, humiliate, threaten and bully others. There is a lack of adequate resources to detect and hinder its occurrence. In this paper, we present our initial NLP approach to detect invective posts as a first step to eventually detect and deter cyberbullying. We crawl data containing profanities and then determine whether or not it contains invective. Annotations on this data are improved iteratively by in-lab annotations and crowdsourcing. We pursue different NLP approaches containing various typical and some newer techniques to distinguish the use of swear words in a neutral way from those instances in which they are used in an insulting way. We also show that this model not only works for our data set, but also can be successfully applied to different data sets.
ibero-american conference on artificial intelligence | 2014
Prasha Shrestha; Suraj Maharjan; Gabriela De la Rosa; Alan P. Sprague; Thamar Solorio; Gary Warner
Classifying malware into correct families is an important task for anti-virus vendors. Currently, only some of them will recognize a particular malware. Even when they do, they either classify it into different families or use a generic family name, which does not provide much information. Our method for malware family identification is based on the observation that closely related malware have a heavy overlap of strings. We first created two kinds of prototypes from printable strings in the malware: one using term frequency–inverse document frequency (tf-idf) and the other using the prominent strings extracted from the vocabulary. We then used these prototypes for classification. We achieved an accuracy of 91.02% by considering the entire vocabulary and an accuracy of 80.52% by considering 20 prominent strings for each malware family. Our accuracy is high enough for our system to be used to classify even those malware that can confuse the anti-virus vendors.
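The tf-idf prototype idea in the abstract above can be sketched roughly as follows. The abstract does not state how prototypes are built or compared, so averaging the tf-idf vectors of each family and assigning a sample to the family with the highest cosine similarity are assumptions made for illustration. All function names, family labels, and strings below are hypothetical toy values, not data from the paper.

```python
import math
from collections import Counter, defaultdict


def tfidf_vector(strings, idf):
    """Weight each known string by its count in the sample times its idf."""
    tf = Counter(s for s in strings if s in idf)
    return {s: count * idf[s] for s, count in tf.items()}


def build_prototypes(samples):
    """samples: list of (family, printable_strings). Returns (prototypes, idf)."""
    n_docs = len(samples)
    doc_freq = Counter()
    for _, strings in samples:
        doc_freq.update(set(strings))
    # Smoothed idf: strings shared by every sample get the lowest weight.
    idf = {s: math.log((1 + n_docs) / (1 + df)) + 1.0 for s, df in doc_freq.items()}

    sums = defaultdict(Counter)
    sizes = Counter()
    for family, strings in samples:
        for s, weight in tfidf_vector(strings, idf).items():
            sums[family][s] += weight
        sizes[family] += 1
    # Assumed prototype: the mean tf-idf vector of a family's samples.
    prototypes = {
        family: {s: weight / sizes[family] for s, weight in vec.items()}
        for family, vec in sums.items()
    }
    return prototypes, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(s, 0.0) for s, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(strings, prototypes, idf):
    """Return the family whose prototype is most similar to the sample."""
    vec = tfidf_vector(strings, idf)
    return max(prototypes, key=lambda family: cosine(vec, prototypes[family]))


# Toy training data, invented for illustration only.
samples = [
    ("family_a", ["cmd.exe", "keylog.dll", "gate.php"]),
    ("family_a", ["cmd.exe", "keylog.dll", "mutex_a"]),
    ("family_b", ["cmd.exe", "ransom_note.txt", "encrypt_files"]),
    ("family_b", ["cmd.exe", "ransom_note.txt", "mutex_b"]),
]
prototypes, idf = build_prototypes(samples)
predicted = classify(["cmd.exe", "keylog.dll", "unseen_string"], prototypes, idf)
```

Strings that appear in every sample, such as `cmd.exe` in the toy data, receive a low idf weight, so the decision is driven mainly by strings that are distinctive for one family.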
CLEF (Working Notes) | 2014
Suraj Maharjan; Prasha Shrestha; Thamar Solorio
conference of the european chapter of the association for computational linguistics | 2017
Suraj Maharjan; John Arevalo; Manuel Montes; Fabio A. González; Thamar Solorio