
Publication


Featured research published by Shervin Malmasi.


Empirical Methods in Natural Language Processing | 2014

Language Transfer Hypotheses with Linear SVM Weights

Shervin Malmasi; Mark Dras

Language transfer, the characteristic second language usage patterns caused by native language interference, is investigated by Second Language Acquisition (SLA) researchers seeking to find overused and underused linguistic features. In this paper we develop and present a methodology for deriving ranked lists of such features. Using very large learner data, we show our method’s ability to find relevant candidates using sophisticated linguistic features. To illustrate its applicability to SLA research, we formulate plausible language transfer hypotheses supported by current evidence. This is the first work to extend Native Language Identification to a broader linguistic interpretation of learner data and address the automatic extraction of underused features on a per-native language basis.
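
As a rough illustration of the feature-ranking idea, the sketch below trains a one-vs-rest linear SVM over a toy bag-of-n-grams representation and reads per-L1 overuse/underuse candidates off the weight vectors. The corpus, feature set and ranking procedure here are illustrative placeholders, not the paper's actual setup.

```python
# Illustrative sketch only: rank candidate transfer features by linear SVM weight.
# The toy texts, labels and n-gram features are placeholders, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["she suggested me to go home", "I am agree with this idea",
         "he explained me the problem", "it depends of the context",
         "we discussed about the plan", "I am interesting in this topic"]
labels = ["L1_A", "L1_B", "L1_A", "L1_B", "L1_C", "L1_C"]   # hypothetical L1 labels

vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(texts)
svm = LinearSVC().fit(X, labels)          # one-vs-rest: one weight vector per L1

features = vec.get_feature_names_out()
for l1, weights in zip(svm.classes_, svm.coef_):
    ranked = sorted(zip(weights, features), reverse=True)
    print(l1, "overused:", [f for _, f in ranked[:3]],
          "underused:", [f for _, f in ranked[-3:]])
```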


Empirical Methods in Natural Language Processing | 2014

Arabic Native Language Identification

Shervin Malmasi; Mark Dras

In this paper we present the first application of Native Language Identification (NLI) to Arabic learner data. NLI, the task of predicting a writer’s first language from their writing in other languages, has mostly been investigated with English data but is now expanding to other languages. Using L2 texts from the newly released Arabic Learner Corpus and a combination of three syntactic feature types (CFG production rules, Arabic function words and part-of-speech n-grams), we demonstrate that these features are useful for this task. Our system achieves an accuracy of 41% against a baseline of 23%, providing the first evidence for classifier-based detection of language transfer effects in L2 Arabic. Such methods can be useful for studying language transfer, developing teaching materials tailored to students’ native language, and forensic linguistics. Future directions are discussed.
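
A minimal sketch of combining several pre-extracted syntactic feature views in one classifier is shown below. It assumes each document has already been converted into whitespace-separated POS-tag, production-rule and function-word streams (tagging and parsing are outside the snippet), and the toy values are hypothetical, not the paper's features.

```python
# Sketch: combine pre-extracted syntactic feature views (POS n-grams, CFG
# production rules, function words) into one classifier. The three toy
# documents and their feature streams are hypothetical.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

pos_docs  = ["PRP VBP JJ NN", "PRP VBD DT NN", "DT NN VBZ JJ"]
cfg_docs  = ["S->NP_VP NP->PRP", "S->NP_VP VP->VBD_NP", "NP->DT_NN VP->VBZ_ADJP"]
func_docs = ["fi ila", "min ala", "an hatta"]          # placeholder function words
labels = ["L1_A", "L1_B", "L1_C"]

token = r"(?u)\S+"    # keep arrows/underscores inside production-rule tokens
views = [CountVectorizer(token_pattern=token, ngram_range=(1, 2)) for _ in range(3)]
X = hstack([v.fit_transform(d) for v, d in
            zip(views, [pos_docs, cfg_docs, func_docs])])
clf = LinearSVC().fit(X, labels)
```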


Conference of the European Chapter of the Association for Computational Linguistics | 2014

Chinese Native Language Identification

Shervin Malmasi; Mark Dras

We present the first application of Native Language Identification (NLI) to non-English data. Motivated by theories of language transfer, NLI is the task of identifying a writer’s native language (L1) based on their writings in a second language (the L2). An NLI system was applied to Chinese learner texts using topic-independent syntactic models to assess their accuracy. We find that models using part-of-speech tags, context-free grammar production rules and function words are highly effective, achieving a maximum accuracy of 71%. Interestingly, we also find that when applied to equivalent English data, the model performance is almost identical. This finding suggests a systematic pattern of cross-linguistic transfer may exist, where the degree of transfer is independent of the L1 and L2.


North American Chapter of the Association for Computational Linguistics | 2015

Large-scale Native Language Identification with cross-corpus evaluation

Shervin Malmasi; Mark Dras

We present a large-scale Native Language Identification (NLI) experiment on new data, with a focus on cross-corpus evaluation to identify corpus- and genre-independent language transfer features. We test a new corpus and show it is comparable to other NLI corpora and suitable for this task. Cross-corpus evaluation on two large corpora achieves good accuracy and evidences the existence of reliable language transfer features, but lower performance also suggests that NLI models are not completely portable across corpora. Finally, we present a brief case study of features distinguishing Japanese learners’ English writing, demonstrating the presence of cross-corpus and cross-genre language transfer features that are highly applicable to SLA and ESL research.
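
The cross-corpus protocol can be summarised as: fit the feature space and classifier on one corpus, then score on a different corpus that shares the same L1 label set. The sketch below assumes the corpora are already loaded into parallel text/label lists (hypothetical names) and uses generic word n-grams rather than the paper's feature set.

```python
# Sketch of cross-corpus NLI evaluation: train on corpus A, test on corpus B.
# train_texts/train_l1s/test_texts/test_l1s are hypothetical placeholders for
# two already-loaded corpora sharing the same L1 labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def cross_corpus_accuracy(train_texts, train_l1s, test_texts, test_l1s):
    vec = CountVectorizer(ngram_range=(1, 2))
    X_train = vec.fit_transform(train_texts)   # feature space fit on corpus A only
    X_test = vec.transform(test_texts)         # corpus B mapped into the same space
    clf = LinearSVC().fit(X_train, train_l1s)
    return accuracy_score(test_l1s, clf.predict(X_test))
```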


Workshop on Innovative Use of NLP for Building Educational Applications | 2015

Measuring Feature Diversity in Native Language Identification

Shervin Malmasi; Aoife Cahill

The task of Native Language Identification (NLI) is typically solved with machine learning methods, and systems make use of a wide variety of features. Some preliminary studies have examined the effectiveness of individual features; however, no systematic study of feature interaction has been carried out. We propose a function to measure feature independence and analyze its effectiveness on a standard NLI corpus.
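
One way to make the idea concrete is to compare the correctness vectors of classifiers trained on different feature types with an association measure for binary data. The sketch below uses Yule's Q as an illustrative statistic; it is not necessarily the measure proposed in the paper.

```python
# Sketch: quantify diversity between two feature types by comparing the
# per-text correctness of classifiers trained on each. Yule's Q is shown as
# an illustrative association measure for binary data; the paper's specific
# statistic is not reproduced here.
import numpy as np

def yules_q(correct_a, correct_b):
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = np.sum(a & b)      # both classifiers correct
    n00 = np.sum(~a & ~b)    # both wrong
    n10 = np.sum(a & ~b)     # only A correct
    n01 = np.sum(~a & b)     # only B correct
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

# Q near 1: the feature types make the same errors (little complementarity);
# Q near 0 or negative: errors differ, so combining them may help.
print(yules_q([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```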


North American Chapter of the Association for Computational Linguistics | 2016

Predicting Post Severity in Mental Health Forums

Shervin Malmasi; Marcos Zampieri; Mark Dras

We present our approach to predicting the severity of user posts in a mental health forum. This system was developed to compete in the 2016 Computational Linguistics and Clinical Psychology (CLPsych) Shared Task. Our entry employs a meta-classifier which uses a set of base classifiers constructed from lexical, syntactic and metadata features. These classifiers were generated for both the target posts and their contexts, which included both preceding and subsequent posts. The output from these classifiers was used to train a meta-classifier, which outperformed all individual classifiers as well as an ensemble classifier. This meta-classifier was then extended to a Random Forest of meta-classifiers, yielding further improvements in classification accuracy. We achieved competitive results, ranking first among a total of 60 submitted entries in the competition.
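
A rough sketch of the stacking idea, using scikit-learn's StackingClassifier as a stand-in: base classifiers built from different feature views feed a meta-level learner. The feature views, hyperparameters and Random-Forest final estimator below are illustrative and not the shared-task system's actual configuration.

```python
# Sketch of a stacked (meta-classifier) setup, approximated with scikit-learn's
# StackingClassifier; the real system's features and meta-level differ.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

base_learners = [
    ("word_ngrams", make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())),
    ("char_ngrams", make_pipeline(TfidfVectorizer(analyzer="char_wb",
                                                  ngram_range=(2, 4)), LinearSVC())),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=RandomForestClassifier(n_estimators=200),
                           stack_method="decision_function", cv=3)
# stack.fit(train_posts, train_severity_labels)   # hypothetical forum data
```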


Workshop on Innovative Use of NLP for Building Educational Applications | 2015

Oracle and Human Baselines for Native Language Identification

Shervin Malmasi; Joel R. Tetreault; Mark Dras

We examine different ensemble methods, including an oracle, to estimate the upper limit of classification accuracy for Native Language Identification (NLI). The oracle outperforms state-of-the-art systems by over 10% and results indicate that for many misclassified texts the correct class label receives a significant portion of the ensemble votes, often being the runner-up. We also present a pilot study of human performance for NLI, the first such experiment. While some participants achieved modest results on our simplified setup with 5 L1s, they did not outperform our NLI system, and this performance gap is likely to widen on the standard NLI setup.
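
The oracle upper bound can be computed directly from the member predictions: a text counts as correct if any ensemble member predicts the true L1. The helper below is a sketch with toy labels, not the paper's evaluation code.

```python
# Sketch: an oracle ensemble counts a text as correct if ANY member classifier
# predicts the true L1, giving an upper bound on what a perfect combination of
# those members could achieve. Member predictions below are toy values.
import numpy as np

def oracle_accuracy(member_preds, gold):
    # member_preds: (n_members, n_texts) array of predicted L1 labels
    member_preds = np.asarray(member_preds)
    gold = np.asarray(gold)
    hit = (member_preds == gold).any(axis=0)   # correct if any member is right
    return hit.mean()

preds = [["FR", "DE", "ES"],    # classifier 1
         ["FR", "ES", "ZH"],    # classifier 2
         ["DE", "DE", "ES"]]    # classifier 3
print(oracle_accuracy(preds, ["FR", "DE", "ES"]))   # 1.0: every gold label hit by some member
```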


Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) | 2017

Findings of the VarDial Evaluation Campaign 2017

Marcos Zampieri; Shervin Malmasi; Nikola Ljubešić; Preslav Nakov; Ahmed M. Ali; Jörg Tiedemann; Yves Scherrer; Noëmi Aepli

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL 2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CDP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.


Workshop on Innovative Use of NLP for Building Educational Applications | 2015

The Jinan Chinese Learner Corpus

Maolin Wang; Shervin Malmasi; Mingxuan Huang

We present the Jinan Chinese Learner Corpus, a large collection of L2 Chinese texts produced by learners that can be used for educational tasks. The present work introduces the data and provides a detailed description. Currently, the corpus contains approximately 6 million Chinese characters written by students from over 50 different L1 backgrounds. This is a large-scale corpus of learner Chinese texts which is freely available to researchers either through a web interface or as a set of raw texts. The data can be used in NLP tasks including automatic essay grading, language transfer analysis and error detection and correction. It can also be used in applied and corpus linguistics to support Second Language Acquisition (SLA) research and the development of pedagogical resources. Practical applications of the data and future directions are discussed.


Natural Language Engineering | 2017

Multilingual native language identification

Shervin Malmasi; Mark Dras

We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author’s first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.

Collaboration


Dive into Shervin Malmasi's collaborations.

Top Co-Authors

Alexander Turchin
Brigham and Women's Hospital

Naoshi Hosomura
Brigham and Women's Hospital

Lee-Shing Chang
Brigham and Women's Hospital

Preslav Nakov
Qatar Computing Research Institute

Huabing Zhang
Peking Union Medical College Hospital

Alexa Rubin
Brigham and Women's Hospital