Prasha Shrestha | Researchain

Publication

Featured research published by Prasha Shrestha.

Ibero-American Conference on Artificial Intelligence | 2014

A Straightforward Author Profiling Approach in MapReduce

Suraj Maharjan; Prasha Shrestha; Thamar Solorio; Ragib Hasan

Most natural language processing tasks deal with large amounts of data, which take a long time to process. Larger datasets and a good set of features generally yield better results, but larger volumes of text and high-dimensional feature spaces also mean slower performance. Natural language processing and distributed computing are therefore a good match. In the PAN 2013 competition, the test runtimes for author profiling ranged from several minutes to several days. Most author profiling systems available now are inaccurate, slow, or both. Our system, written entirely in MapReduce, employs nearly 3 million features and still finishes the task in a fraction of the time taken by state-of-the-art systems, and with better accuracy. Our system demonstrates that when we deal with a huge amount of data and/or a large number of features, using distributed systems makes perfect sense.
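The map, shuffle, and reduce phases that such a system relies on can be illustrated with a small single-machine sketch. This is not the authors' implementation: the word-bigram feature, the function names, and the in-memory shuffle are all assumptions chosen only to show how feature counting decomposes into independent map and reduce steps.

```python
from collections import defaultdict
from itertools import chain

def mapper(doc_id, text, n=2):
    """Emit (feature, 1) for every word n-gram in one document.
    doc_id is unused here; a real job would carry it as the input key."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n]), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the counts emitted for one feature."""
    return key, sum(values)

def run_job(docs, n=2):
    """Simulate one MapReduce job over {doc_id: text} (hypothetical input shape)."""
    pairs = chain.from_iterable(mapper(d, t, n) for d, t in docs.items())
    return dict(reducer(k, vs) for k, vs in shuffle(pairs).items())

counts = run_job({"a": "I love this", "b": "love this song"})
```

Because each mapper call touches one document and each reducer call touches one feature, both phases can be spread across many machines, which is what makes millions of features tractable.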

Empirical Methods in Natural Language Processing | 2016

Why Do They Leave: Modeling Participation in Online Depression Forums

Farig Sadeque; Ted Pedersen; Thamar Solorio; Prasha Shrestha; Nicolas Rey-Villamizar; Steven Bethard

Depression is a major threat to public health, accounting for almost 12% of all disabilities and claiming the life of 1 out of 5 patients suffering from it. Since depression is often signaled by decreasing social interaction, we explored how analysis of online health forums may help identify such episodes. We collected posts and replies from users of several forums on healthboards.com and analyzed changes in their use of language and activity levels over time. We found that users in the Depression forum use fewer social words, and that some revealing phrases are associated with their last posts (e.g., "cut myself"). Our models based on these findings achieved an F1 of 94 for detecting users who will withdraw from a Depression forum by the end of a 1-year observation period.

Empirical Methods in Natural Language Processing | 2016

Analysis of Anxious Word Usage on Online Health Forums

Nicolas Rey-Villamizar; Prasha Shrestha; Farig Sadeque; Steven Bethard; Ted Pedersen; Arjun Mukherjee; Thamar Solorio

Online health communities and support groups are a valuable source of information for users suffering from a physical or mental illness. Users turn to these forums for moral support or advice on specific conditions, symptoms, or side effects of medications. This paper describes and studies the linguistic patterns of a community of support forum users over time, focusing on the use of anxiety-related words. We introduce a methodology to identify groups of individuals exhibiting linguistic patterns associated with anxiety, and the correlations between these patterns and other word usage. We find some evidence that participation in these groups does yield positive effects on their users by reducing the frequency of anxiety-related word use over time.

Empirical Methods in Natural Language Processing | 2015

Predicting Continued Participation in Online Health Forums

Farig Sadeque; Thamar Solorio; Ted Pedersen; Prasha Shrestha; Steven Bethard

Online health forums provide advice and emotional solace to their users from a social network of people who have faced similar conditions. Continued participation of users is thus critical to their success. In this paper, we develop machine learning models for predicting whether or not a user will continue to participate in an online health forum. The prediction models are trained and tested over a large dataset collected from the support-group-based social networking site dailystrength.org. We find that our models can predict continued participation with over 83% accuracy after as little as 1 month of observing the user’s activities, and that performance increases rapidly up to 1 year of observation. We also show that features such as the time since a user’s last activity are consistently predictive regardless of the length of the observation period, while other features, such as the number of times a user replies to others, decrease in predictiveness as the observation period grows.
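The two features named in the abstract, time since last activity and number of replies to others, could be derived from an activity log roughly as follows. This is only a sketch: the `(timestamp, kind)` event schema, the function name, and the feature names are hypothetical and are not taken from the paper.

```python
from datetime import datetime, timedelta

def participation_features(events, first_seen, observe_days):
    """Compute simple activity features inside an observation window.

    events: list of (timestamp, kind) with kind in {"post", "reply"}
            (a hypothetical schema, assumed for illustration).
    """
    window_end = first_seen + timedelta(days=observe_days)
    observed = [(t, k) for t, k in events if first_seen <= t <= window_end]
    # Fall back to the first-seen date if nothing happened in the window.
    last = max((t for t, _ in observed), default=first_seen)
    return {
        "days_since_last_activity": (window_end - last).days,
        "num_replies": sum(1 for _, k in observed if k == "reply"),
        "num_posts": sum(1 for _, k in observed if k == "post"),
    }

features = participation_features(
    [(datetime(2015, 1, 3), "post"),
     (datetime(2015, 1, 10), "reply"),
     (datetime(2015, 3, 1), "reply")],
    first_seen=datetime(2015, 1, 1),
    observe_days=30,
)
```

Varying `observe_days` reproduces the setup the abstract describes, in which the same features are recomputed over observation periods of different lengths.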

North American Chapter of the Association for Computational Linguistics | 2016

Semi-supervised CLPsych 2016 Shared Task System Submission

Nicolas Rey-Villamizar; Prasha Shrestha; Thamar Solorio; Farig Sadeque; Steven Bethard; Ted Pedersen

The 2016 CLPsych Shared Task is centered on the automatic triage of posts from a mental health forum, au.reachout.com. In this paper, we describe our method for this shared task. We used four different groups of features. These features are designed to capture stylistic and word patterns, together with psychological insights based on the Linguistic Inquiry and Word Count (LIWC) word list. We used a multinomial naive Bayes classifier as our base system. We were able to boost the accuracy of our approach with a semi-supervised step: we labeled some of the unlabeled data and thereby extended the number of training samples.
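One common way to label unlabeled data with a base classifier is self-training, sketched below with a small multinomial naive Bayes written from scratch. The abstract does not say how the unlabeled posts were selected, so the confidence threshold, the single pass over the unlabeled pool, the word-only features, and the labels "flag" and "ok" are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit multinomial naive Bayes counts on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_proba(model, tokens):
    """Return {label: probability} using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {l: math.exp(s - top) for l, s in log_scores.items()}
    z = sum(exp.values())
    return {l: v / z for l, v in exp.items()}

def self_train(docs, labels, unlabeled, threshold=0.8):
    """Add confidently predicted unlabeled docs to the training set, then refit.
    The threshold is an assumed selection rule, not the paper's."""
    model = train_nb(docs, labels)
    docs, labels = list(docs), list(labels)
    for tokens in unlabeled:
        probs = predict_proba(model, tokens)
        best = max(probs, key=probs.get)
        if probs[best] >= threshold:
            docs.append(tokens)
            labels.append(best)
    return train_nb(docs, labels), labels

model, grown_labels = self_train(
    [["help", "urgent", "crisis"], ["thanks", "great", "day"]],
    ["flag", "ok"],
    unlabeled=[["crisis", "urgent", "urgent"], ["day"]],
)
```

In this toy run only the first unlabeled post clears the threshold, so the training set grows by one example before the classifier is refit.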

Conference on Intelligent Text Processing and Computational Linguistics | 2015

Identification of Original Document by Using Textual Similarities

Prasha Shrestha; Thamar Solorio

When two documents share similar content, either accidentally or intentionally, it is usually unknown which of the two is the original source of that content. This knowledge can be crucial in order to charge or acquit someone of plagiarism, to establish the provenance of a document, or, in the case of sensitive information, to make sure that you can rely on the source of the information. Our system identifies the original document by using the idea that pieces of text written by the same author resemble each other more than they resemble text written by different authors. Given two documents with shared content, our system compares the shared part with the remaining text in both of the documents by treating them as bags of words. For cases when there is no reference text by one of the authors to compare against, our system makes predictions based on the similarity of the shared content to just one of the documents.
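The comparison described above can be sketched as follows. The abstract only states that the texts are treated as bags of words, so the cosine similarity measure, the simple regex tokenizer, and the function names here are assumptions, not the authors' exact method.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def likely_original(shared, rest_a, rest_b):
    """Return "A" or "B": the document whose remaining text
    resembles the shared passage more closely."""
    shared_bag = bag_of_words(shared)
    sim_a = cosine(shared_bag, bag_of_words(rest_a))
    sim_b = cosine(shared_bag, bag_of_words(rest_b))
    return "A" if sim_a >= sim_b else "B"

guess = likely_original(
    shared="the storm battered the harbor and the fishing boats",
    rest_a="fishing boats left the harbor before the storm",
    rest_b="quarterly revenue grew as software sales increased",
)
```

Here the remainder of document A shares vocabulary with the shared passage while the remainder of B does not, so A is picked as the likelier source.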

Conference on Intelligent Text Processing and Computational Linguistics | 2016

Large Scale Authorship Attribution of Online Reviews

Prasha Shrestha; Arjun Mukherjee; Thamar Solorio

Traditional authorship attribution methods focus on the scenario of a limited number of authors writing long pieces of text. These methods are engineered to work on a small number of authors and generally do not scale well to a corpus of online reviews where the candidate set of authors is large. However, attribution of online reviews is important as they are replete with deception and spam. We evaluate a new large-scale approach for predicting authorship via the task of verification on online reviews. Our evaluation considers a large number of possible candidate authors seen to date. Our results show that multiple verification models can be successfully combined to associate reviews with their correct author more than 78% of the time. We propose that our approach can be used to slow down or deter deceptive reviews in the wild.

Ibero-American Conference on Artificial Intelligence | 2014

Using String Information for Malware Family Identification

Prasha Shrestha; Suraj Maharjan; Gabriela De la Rosa; Alan P. Sprague; Thamar Solorio; Gary Warner

Classifying malware into the correct families is an important task for anti-virus vendors. Currently, only some vendors recognize a particular piece of malware. Even when they do, they either classify it into different families or use a generic family name, which does not provide much information. Our method for malware family identification is based on the observation that closely related malware samples have a heavy overlap of strings. We first created two kinds of prototypes from printable strings in the malware: one using term frequency–inverse document frequency (tf-idf) and the other using the prominent strings extracted from the vocabulary. We then used these prototypes for classification. We achieved an accuracy of 91.02% by considering the entire vocabulary and an accuracy of 80.52% by considering 20 prominent strings for each malware family. Our accuracy is high enough for our system to be used to classify even malware that can confuse the anti-virus vendors.
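A tf-idf prototype classifier of the kind described can be sketched in a few lines. The details below are assumptions rather than the authors' design: the idf is computed over families with add-one smoothing, samples are matched to prototypes by cosine similarity, and the family names and strings are made up.

```python
import math
from collections import Counter

def build_prototypes(samples_by_family):
    """Build one tf-idf vector per family.
    samples_by_family: {family: [[printable strings of one sample], ...]}"""
    tf = {fam: Counter(s for sample in samples for s in sample)
          for fam, samples in samples_by_family.items()}
    n_families = len(tf)
    # Document frequency: in how many families each string appears.
    df = Counter(s for counts in tf.values() for s in counts)
    # Smoothed idf (an assumed variant) keeps ubiquitous strings weakly weighted.
    idf = {s: math.log((1 + n_families) / (1 + d)) + 1.0 for s, d in df.items()}
    protos = {fam: {s: c * idf[s] for s, c in counts.items()}
              for fam, counts in tf.items()}
    return protos, idf

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(strings, protos, idf):
    """Assign a sample to the family with the most similar prototype."""
    counts = Counter(strings)
    vec = {s: c * idf[s] for s, c in counts.items() if s in idf}
    return max(protos, key=lambda fam: cosine(vec, protos[fam]))

# Hypothetical families and strings, for illustration only.
protos, idf = build_prototypes({
    "family_a": [["cmd.exe", "bot_id", "beacon"], ["bot_id", "inject"]],
    "family_b": [["cmd.exe", "encrypt", "ransom.txt"], ["encrypt", "pay"]],
})
label = classify(["bot_id", "cmd.exe"], protos, idf)
```

Strings such as `cmd.exe` that occur in every family receive the lowest weight, so the family-specific strings dominate the similarity score.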

CLEF (Working Notes) | 2013