Joseph D. Prusa
Florida Atlantic University
Publications
Featured research published by Joseph D. Prusa.
Journal of Big Data | 2015
Michael Crawford; Taghi M. Khoshgoftaar; Joseph D. Prusa; Aaron N. Richter; Hamzah Al Najada
Online reviews are often the primary factor in a customer’s decision to purchase a product or service, and are a valuable source of information that can be used to determine public opinion on these products or services. Because of their impact, manufacturers and retailers are highly concerned with customer feedback and reviews. Reliance on online reviews gives rise to the potential concern that wrongdoers may create false reviews to artificially promote or devalue products and services. This practice is known as Opinion (Review) Spam, where spammers manipulate and poison reviews (i.e., making fake, untruthful, or deceptive reviews) for profit or gain. Since not all online reviews are truthful and trustworthy, it is important to develop techniques for detecting review spam. By extracting meaningful features from the text using Natural Language Processing (NLP), it is possible to conduct review spam detection using various machine learning techniques. Additionally, reviewer information, apart from the text itself, can be used to aid in this process. In this paper, we survey the prominent machine learning techniques that have been proposed to solve the problem of review spam detection and the performance of different approaches for classification and detection of review spam. The majority of current research has focused on supervised learning methods, which require labeled data, a scarce resource when it comes to online review spam. Research on methods for Big Data is of interest, since there are millions of online reviews, with many more being generated daily. To date, we have not found any papers that study the effects of Big Data analytics for review spam detection. The primary goal of this paper is to provide a strong and comprehensive comparative study of current research on detecting review spam using various machine learning techniques and to devise a methodology for conducting further investigation.
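As a minimal sketch of the supervised pipeline the survey describes, the snippet below extracts NLP features with TF-IDF and trains a classifier to label reviews as spam or truthful. The toy reviews, labels, and choice of logistic regression are illustrative assumptions, not the method of any single surveyed paper.

```python
# Minimal supervised review-spam pipeline: TF-IDF text features + a classifier.
# The toy reviews/labels below are placeholders; real work would use a labeled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

reviews = ["Great product, works as advertised",
           "Best item ever!!! Buy now!!! Amazing!!!",
           "Arrived late and the box was damaged",
           "Five stars five stars five stars"]
labels = [0, 1, 0, 1]  # 1 = spam/deceptive, 0 = truthful (hypothetical labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),  # word uni/bi-grams
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(reviews, labels)
print(pipeline.predict(["Incredible!!! Best purchase ever!!!"]))
```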
Information Reuse and Integration | 2015
Joseph D. Prusa; Taghi M. Khoshgoftaar; David J. Dittman; Amri Napolitano
Sentiment classification of tweets is used for a variety of social sensing tasks and provides a means of discerning public opinion on a wide range of topics. A potential concern when performing sentiment classification is that the training data may contain class imbalance, which can negatively affect classification performance. A classifier trained on imbalanced data may be biased in favor of the majority class. One possible method of addressing this is to use data sampling to achieve a more balanced class distribution. In this work, we seek to observe how data sampling (using random undersampling with either a 50:50 or 35:65 positive:negative post-sampling class distribution ratio) affects classification performance on tweet sentiment data. Our experimental results show that random undersampling significantly improves classification performance in comparison to not using any data sampling. Furthermore, there is no significant difference between selecting a 50:50 or 35:65 post-sampling class distribution ratio.
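To make the sampling step concrete, here is a small sketch of random undersampling to a chosen post-sampling class distribution, written in plain NumPy. The 35:65 positive:negative target mirrors the ratio studied in the paper; the function itself is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def random_undersample(X, y, pos_fraction=0.35, rng=np.random.default_rng(0)):
    """Randomly discard negative (majority) instances until positives make up
    pos_fraction of the data, e.g. 0.35 for a 35:65 positive:negative split."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # number of negatives needed so that len(pos) / total == pos_fraction
    n_neg = int(len(pos_idx) * (1 - pos_fraction) / pos_fraction)
    neg_keep = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    keep = np.concatenate([pos_idx, neg_keep])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: 100 positives, 900 negatives -> about 100:185 (roughly 35:65) kept
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)
X_res, y_res = random_undersample(X, y, pos_fraction=0.35)
print(y_res.mean())  # close to 0.35
```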
Information Reuse and Integration | 2015
Joseph D. Prusa; Taghi M. Khoshgoftaar; David J. Dittman
Sentiment analysis of tweets requires the ability to reliably and accurately identify the emotional polarity (positive or negative) of instances. This can be challenging, particularly when the data quality is questionable due to noise or imbalance. Ensemble learning algorithms have been shown to offer superior performance compared to non-ensemble techniques in many domains, but have not been thoroughly studied in the domain of tweet sentiment classification. In this work, we compare the performance of two popular ensemble techniques, bagging and boosting. Both bagging and boosting are tested using seven different base learners. Additionally, we compare the performance of ensemble techniques to using each of the base learners with no ensemble technique. Each of the resulting 21 learning algorithms is trained and tested on two datasets: a large, automatically labeled, lower-quality dataset and a small, manually labeled, high-quality dataset. We find that, in general, ensemble learners achieve higher performance on both datasets, and that bagging is superior when data quality is a concern.
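A minimal scikit-learn sketch of this comparison: the same base learner is evaluated alone, inside bagging, and inside boosting. The base learner, ensemble sizes, and synthetic data are assumptions for illustration; the paper evaluates seven base learners on real tweet data.

```python
# Compare a base learner alone vs. wrapped in bagging and boosting (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
base = DecisionTreeClassifier(max_depth=5)

candidates = {
    "base only": base,
    "bagging": BaggingClassifier(base, n_estimators=10, random_state=0),
    "boosting": AdaBoostClassifier(base, n_estimators=10, random_state=0),
}
for name, clf in candidates.items():
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC = {auc:.3f}")
```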
Journal of Big Data | 2017
Joseph D. Prusa; Taghi M. Khoshgoftaar
Using traditional machine learning approaches, there is no single feature engineering solution for all text mining and learning tasks. Thus, researchers must determine and implement the best feature engineering approach for each text classification task; however, deep learning allows us to skip this step by extracting and learning high-level features automatically from low-level text representations. Convolutional neural networks, a popular type of neural network for deep learning, have been shown to be effective at performing feature extraction and classification for many domains including text. Recently, it was demonstrated that convolutional neural networks can be used to train classifiers from character-level representations of text. This approach achieved superior performance compared to classifiers trained on word-level text representations, likely due to the use of character-level representations preserving more information from the data. Training neural networks from character-level data requires a large volume of instances; however, the large volume of training data and the model complexity make training these networks a slow and computationally expensive task. In this paper, we propose a new method of creating character-level representations of text to reduce the computational costs associated with training a deep convolutional neural network. We demonstrate that our method of character embedding greatly reduces training time and memory use, while significantly improving classification performance. Additionally, we show that our proposed embedding can be used with padded convolutional layers to enable the use of current convolutional network architectures, while still facilitating faster training and higher performance than the previous approach for learning from character-level text.
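The Keras sketch below illustrates the general idea rather than the authors' exact architecture: characters are mapped through a small learned embedding (instead of a wide one-hot encoding), and "same"-padded convolutional layers preserve sequence length so a standard convolutional stack still applies. The vocabulary size, sequence length, and layer widths are all assumptions.

```python
# Character-level text CNN with a learned character embedding (illustrative).
from tensorflow.keras import layers, models

VOCAB_SIZE = 70   # assumed alphabet size (letters, digits, punctuation)
SEQ_LEN = 1014    # assumed fixed document length in characters
EMBED_DIM = 16    # small learned embedding vs. a 70-wide one-hot vector

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),                   # learned char embedding
    layers.Conv1D(256, 7, padding="same", activation="relu"),  # padded convolution
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 7, padding="same", activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                     # binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```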
Information Reuse and Integration | 2016
Brian Heredia; Taghi M. Khoshgoftaar; Joseph D. Prusa; Michael Crawford
Understanding the sentiment conveyed by a person is a crucial task in any social interaction. Moreover, it can be used to gain insight and understanding of views held by many people. Sentiment classification is not limited to human interaction, as text can also convey the sentiment of the author. Opinion mining in text is a long studied field in machine learning. This study focuses on two of the many text domains used in the field of sentiment analysis: reviews and tweets. In this study, we aim to determine the effect of performing cross-domain sentiment classification using either reviews or tweets as training data. We conduct an empirical investigation using two tweet datasets, one review dataset, and three classifiers. We conduct 18 experiments, varying the training dataset, the classifier used to build the model, and the dataset used to evaluate the model built. Our results show that training with tweets, for both datasets, yields an effective classifier for reviews. However, the converse, using reviews to classify sentiment in tweets, has the worst performance of all models, producing AUC values ranging from 0.59 to 0.65. Our best model is generated using tweets to train a Multinomial Naïve Bayes classifier and reviews to evaluate it. Multinomial Naïve Bayes was the best performing learner, producing the highest AUC in 5 out of the 6 combinations of training/test datasets. To the best of our knowledge, this study is the first to examine the effects of cross-domain sentiment classification using tweets and reviews.
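A sketch of the cross-domain setup: the vectorizer and classifier are fit on one domain (tweets) and evaluated on another (reviews), with AUC as the metric, as in the paper. The toy strings stand in for the real datasets.

```python
# Cross-domain sentiment: train on tweets, evaluate on reviews (sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this phone", "worst day ever", "so happy right now", "awful service"]
tweet_labels = [1, 0, 1, 0]
reviews = ["This phone is excellent and I love it", "Awful product, broke in a day"]
review_labels = [1, 0]

vec = CountVectorizer()           # vocabulary is fixed by the training domain
X_train = vec.fit_transform(tweets)
X_test = vec.transform(reviews)   # review terms unseen in tweets are dropped

clf = MultinomialNB().fit(X_train, tweet_labels)
scores = clf.predict_proba(X_test)[:, 1]
print("cross-domain AUC:", roc_auc_score(review_labels, scores))
```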
International Conference on Tools with Artificial Intelligence | 2015
Joseph D. Prusa; Taghi M. Khoshgoftaar; Amri Napolitano
Performing sentiment analysis of tweets by training a classifier is a challenging and complex task, requiring that the classifier can correctly and reliably identify the emotional polarity of a tweet. Poor data quality, due to class imbalance or mislabeled instances, may negatively impact classification performance. Ensemble learning techniques combine multiple models in an attempt to improve classification performance, especially on poor-quality or imbalanced data; however, these techniques do not address the high dimensionality present in tweet sentiment data and may require a prohibitive amount of resources to train on high dimensional data. This work addresses these issues by studying bagging and boosting combined with feature selection. These two techniques are denoted as Select-Bagging and Select-Boost, and seek to address both poor data quality and high dimensionality. We compare the performance of Select-Bagging and Select-Boost against feature selection alone. These techniques are tested with four base learners, two datasets, and ten feature subset sizes. Our results show that Select-Boost offers the highest performance, is significantly better than using no ensemble technique, and is significantly better than Select-Bagging for most learners on both datasets. To the best of our knowledge, this is the first study to focus on the effects of using ensemble learning in combination with feature selection for the purpose of tweet sentiment classification.
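The combination the paper names Select-Bagging/Select-Boost can be sketched as feature selection followed by the ensemble. The pipeline below uses chi-squared ranking, a fixed subset size, and synthetic data as assumptions, whereas the paper varies rankers, learners, and ten subset sizes.

```python
# Feature selection + boosting ("Select-Boost"-style) as a pipeline sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Non-negative features so chi-squared ranking applies (as with term counts).
X, y = make_classification(n_samples=1000, n_features=500, random_state=0)
X = abs(X)

select_boost = Pipeline([
    ("select", SelectKBest(chi2, k=100)),  # keep the 100 top-ranked features
    ("boost", AdaBoostClassifier(MultinomialNB(), n_estimators=10)),
])
select_boost.fit(X, y)
print("train accuracy:", select_boost.score(X, y))
```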
International Conference on Machine Learning and Applications | 2016
Brian Heredia; Taghi M. Khoshgoftaar; Joseph D. Prusa; Michael Crawford
Whether purchasing a product or searching for a new doctor, consumers often turn to online reviews for recommendations. Determining whether reviews are truthful is imperative to the consumer, so as not to be misled by false recommendations. Unfortunately, it is often difficult, or impossible, for humans to ascertain the validity of a review through reading the text; however, studies have shown machine learning methods perform well for detecting untruthful reviews. Previously, no studies have examined the effects of ensemble learners on the detection of untruthful reviews, despite these techniques being effective in related text classification domains. We seek to inform other researchers of the effects of ensemble techniques on the detection of spam reviews. To this aim, we evaluate four classifiers and three ensemble techniques using those four classifiers as base learners. We compare the results of Multinomial Naïve Bayes, C4.5, Logistic Regression, Support Vector Machine, Random Forest with 100, 250, and 500 trees, and Boosting and Bagging using the base learners. We found that none of the ensemble techniques tested were able to significantly improve review spam detection over standard Multinomial Naïve Bayes and thus, they are not worth the computational expense they incur.
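The comparison can be sketched as a loop over candidate models scored with cross-validated AUC. The model list only approximates the paper's (scikit-learn's CART decision tree stands in for C4.5), and the data is synthetic.

```python
# Compare individual learners and ensembles by cross-validated AUC (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
X = abs(X)  # keep features non-negative for MultinomialNB

models = {
    "MultinomialNB": MultinomialNB(),
    "DecisionTree (C4.5 analogue)": DecisionTreeClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC (SVM)": LinearSVC(max_iter=5000),
    "RandomForest-100": RandomForestClassifier(n_estimators=100, random_state=0),
    "RandomForest-500": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC = {auc:.3f}")
```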
Color Imaging Conference | 2016
Brian Heredia; Taghi M. Khoshgoftaar; Joseph D. Prusa; Michael Crawford
Understanding the sentiment conveyed by a person is an important part of any social interaction, and sentiment in text can provide valuable insight into an author's opinion. Sentiment analysis for text is a large field of research within machine learning, as it allows the sentiment of large numbers of text instances to be determined and used for various tasks, such as election prediction. Typically, a sentiment classifier is trained using data from the same domain it is intended to be applied to; however, there may not be sufficient training data within the given domain. Additionally, using data from multiple sources, including other related domains, may help create a more generalized sentiment classifier that can be applied to multiple domains. To this aim, we conduct an empirical study using sentiment data from two sources, online reviews and tweets. We first test the performance of sentiment analysis models built using a single data source for both in-domain and cross-domain classification. Then, we evaluate classifiers trained using instances randomly sampled from both sources. Additionally, we evaluate sampling different quantities of instances from both data sources to determine how many instances should be included in a training data set. We apply statistical tests to verify the significance of our results and find that using a combination of instances from reviews and tweets is similar to, or better than, any model trained from a single domain. Also, we found no significant difference in performance between classifiers trained with 100,000 or more combined training instances. These results are important as they indicate a more robust classifier can be trained by using a smaller number of in-domain instances augmented with instances from a related domain, rather than using purely in-domain instances. Thus, we recommend using a training data set composed of both tweets and reviews when training a sentiment classifier for use in predicting both tweet and review sentiment.
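As a sketch of the mixed-source training setup: sample a fixed budget of instances from each domain, pool them, and train one classifier on a shared vocabulary. The per-domain budget and the toy texts are placeholders for the paper's reviews and tweets.

```python
# Train one sentiment model on a pooled sample of two domains (sketch).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

tweets = ["love it", "hate it", "so good", "so bad"] * 50
tweet_y = [1, 0, 1, 0] * 50
reviews = ["excellent product", "terrible product", "works great", "total waste"] * 50
review_y = [1, 0, 1, 0] * 50

def sample(texts, ys, n):
    idx = rng.choice(len(texts), size=n, replace=False)
    return [texts[i] for i in idx], [ys[i] for i in idx]

n_per_source = 100  # assumed per-domain budget; the paper varies this quantity
tx, ty = sample(tweets, tweet_y, n_per_source)
rx, ry = sample(reviews, review_y, n_per_source)

vec = CountVectorizer()
X = vec.fit_transform(tx + rx)   # shared vocabulary across both domains
clf = MultinomialNB().fit(X, ty + ry)
print("train accuracy:", clf.score(X, ty + ry))
```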
International Conference on Machine Learning and Applications | 2015
Joseph D. Prusa; Taghi M. Khoshgoftaar; Amri Napolitano
Sentiment analysis of tweets is a popular method of opinion mining on social media. Many machine learning techniques exist that can improve the performance of classifiers trained to determine the sentiment or emotional polarity of a tweet; however, they are designed with different objectives and it is unclear which techniques are most beneficial. Additionally, these techniques may behave differently depending on data quality issues, such as class imbalance, a common problem when using real-world data. In an effort to determine which techniques are more important, we tested 12 techniques consisting of eight feature selection techniques, bagging, boosting, and data sampling with two post-sampling class ratios. Using five base learners, we compare these techniques against each other and against each base learner with no additional technique. We train and test each classifier on a balanced dataset and two imbalanced datasets with different class ratios. Additionally, we conduct statistical tests to determine if the differences observed between techniques are significant. Our results show that bagging and seven of the eight feature selection techniques significantly improve performance (compared to using no technique) on all three datasets, while boosting and data sampling are less beneficial for imbalanced tweet sentiment data. To the best of our knowledge, this is the first study comparing these three types of techniques on tweet sentiment data and the first to show that feature selection and ensemble techniques perform better than data sampling on tweet sentiment data.
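To illustrate the feature selection side of the comparison, the sketch below scores one ranker (chi-squared) at several subset sizes. The paper compares eight rankers; the ranker, sizes, base learner, and data here are assumptions.

```python
# Effect of feature-subset size with a chi-squared ranker (sketch).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=500, random_state=0)
X = abs(X)  # chi2 requires non-negative features, like term counts

for k in (50, 100, 200, 500):
    pipe = Pipeline([("select", SelectKBest(chi2, k=k)), ("nb", MultinomialNB())])
    auc = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean()
    print(f"top-{k} features: AUC = {auc:.3f}")
```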
Social Network Analysis and Mining | 2017
Brian Heredia; Taghi M. Khoshgoftaar; Joseph D. Prusa; Michael Crawford
As fake reviews become more prominent on the web, a method to differentiate between untruthful and truthful reviews becomes increasingly necessary. However, detection of false reviews may be difficult, as determining the validity of a review based solely on text can be nearly impossible for a human. In this study, we aim to determine the effectiveness of machine learning techniques, specifically ensemble techniques and the combination of feature selection and ensemble techniques, for the detection of spam reviews. In addition to traditional ensemble techniques, such as Boosting and Bagging, we employ techniques that combine ensemble methods with a form of feature selection: Select-Boost, Select-Bagging, and Random Forest. For Select-Boost and Select-Bagging, we combine the Boosting and Bagging approaches with three different feature rankers. Random Forest was performed using 100, 250, and 500 trees. Our results show that a combination of Select-Boost, multinomial naïve Bayes, and either Chi-squared or signal-to-noise feature ranking significantly outperforms all methods except Random Forest using 500 trees. There is no significant difference between the feature subset sizes tested when using Select-Boost with multinomial naïve Bayes, regardless of the feature selection technique employed. To the best of our knowledge, this is the first study to examine the effect of a combination of ensemble techniques and feature selection in the domain of spam review detection.
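Scikit-learn has no built-in signal-to-noise ranker, so the sketch below supplies one as a custom score function for SelectKBest and pairs it with boosted multinomial naive Bayes, approximating the Select-Boost configuration described above. The Golub-style formula, subset size, and synthetic data are assumptions.

```python
# "Select-Boost" with a custom signal-to-noise feature ranker (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def signal_to_noise(X, y):
    """Golub-style S2N ranking: |mu1 - mu0| / (sigma1 + sigma0) per feature."""
    X = np.asarray(X)
    X0, X1 = X[y == 0], X[y == 1]
    score = np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)
    return score, np.zeros_like(score)  # SelectKBest expects (scores, p-values)

X, y = make_classification(n_samples=1000, n_features=500, random_state=0)
X = abs(X)  # non-negative features, as with term counts

model = Pipeline([
    ("select", SelectKBest(signal_to_noise, k=100)),
    ("boost", AdaBoostClassifier(MultinomialNB(), n_estimators=10)),
])
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```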