Publications


Featured research published by Taghi M. Khoshgoftaar.


Advances in Artificial Intelligence | 2009

A survey of collaborative filtering techniques

Xiaoyuan Su; Taghi M. Khoshgoftaar

As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences of other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, and privacy protection, along with their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (which combine CF with other recommendation techniques), with examples of representative algorithms in each category and an analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state of the art, we attempt to present a comprehensive survey of CF techniques, which can serve as a roadmap for research and practice in this area.
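Below is a minimal sketch of the memory-based (user-based) variant surveyed here: an unknown rating is predicted as a similarity-weighted average of other users' ratings. The toy rating matrix, the cosine-similarity measure, and the helper names are illustrative assumptions, not material from the paper.

```python
# User-based collaborative filtering sketch (illustrative toy data).
import numpy as np

# Rows = users, columns = items; 0 marks an unknown rating.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity over the items both users have rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        s = cosine_sim(ratings[user], ratings[other])
        num += s * ratings[other, item]
        den += abs(s)
    return num / den if den else 0.0

print(predict(user=1, item=1))  # estimate user 1's rating of item 1
```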


IEEE Transactions on Systems, Man, and Cybernetics, Part A | 2010

RUSBoost: A Hybrid Approach to Alleviating Class Imbalance

C. Seiffert; Taghi M. Khoshgoftaar; J. Van Hulse; Amri Napolitano

Class imbalance is a problem that is common to many application domains. When examples of one class in a training data set vastly outnumber examples of the other class(es), traditional data mining algorithms tend to create suboptimal classification models. Several techniques have been used to alleviate the problem of class imbalance, including data sampling and boosting. In this paper, we present a new hybrid sampling/boosting algorithm, called RUSBoost, for learning from skewed training data. This algorithm provides a simpler and faster alternative to SMOTEBoost, which is another algorithm that combines boosting and data sampling. This paper evaluates the performances of RUSBoost and SMOTEBoost, as well as their individual components (random undersampling, synthetic minority oversampling technique, and AdaBoost). We conduct experiments using 15 data sets from various application domains, four base learners, and four evaluation metrics. RUSBoost and SMOTEBoost both outperform the other procedures, and RUSBoost performs comparably to (and often better than) SMOTEBoost while being a simpler and faster technique. Given these experimental results, we highly recommend RUSBoost as an attractive alternative for improving the classification performance of learners built using imbalanced data.
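For readers who want to try the technique, imbalanced-learn provides a RUSBoostClassifier modeled on this algorithm; the sketch below applies it to a synthetic imbalanced dataset. The dataset, parameter values, and evaluation choice are assumptions for illustration, not the paper's experimental setup.

```python
# RUSBoost on a synthetic imbalanced dataset (illustrative setup only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.ensemble import RUSBoostClassifier

# Synthetic data with roughly a 5% minority class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling is applied to the training data of each boosting round.
clf = RUSBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```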


IEEE Transactions on Software Engineering | 1992

The detection of fault-prone programs

John C. Munson; Taghi M. Khoshgoftaar

The use of the statistical technique of discriminant analysis as a tool for the detection of fault-prone programs is explored. A principal-components procedure was employed to reduce simple multicollinear complexity metrics to uncorrelated measures on orthogonal complexity domains. These uncorrelated measures were then used to classify programs into alternate groups, depending on the metric values of the program. The criterion variable for group determination was a quality measure of faults or changes made to the programs. The discriminant analysis was conducted on two distinct data sets from large commercial systems. The basic discriminant model was constructed from deliberately biased data to magnify differences in metric values between the discriminant groups. The technique was successful in classifying programs with a relatively low error rate. While the use of linear regression models has produced models of limited value, this procedure shows great promise for use in the detection of program modules with potential for faults.
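A rough sketch of the two-step procedure, expressed in modern scikit-learn terms, might look like the following: principal components reduce correlated complexity metrics to orthogonal measures, and discriminant analysis then classifies modules. The synthetic metric data and the pipeline choices are assumptions for illustration, not the paper's data or exact procedure.

```python
# PCA followed by linear discriminant analysis on synthetic complexity metrics.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# 200 modules x 8 correlated complexity metrics (stand-ins for LOC, cyclomatic number, ...).
metrics = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
fault_prone = (metrics[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),              # put metrics on a common scale
    PCA(n_components=3),           # uncorrelated "complexity domain" measures
    LinearDiscriminantAnalysis(),  # classify modules as fault-prone / not fault-prone
)
model.fit(metrics, fault_prone)
print("training accuracy:", model.score(metrics, fault_prone))
```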


Journal of Big Data | 2015

Deep learning applications and challenges in big data analytics

Maryam M. Najafabadi; Flavio Villanustre; Taghi M. Khoshgoftaar; Naeem Seliya; Randall Wald; Edin Muharemagic

Big Data Analytics and Deep Learning are two high-focus areas of data science. Big Data has become important as many organizations both public and private have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and un-categorized. In the present study, we explore how Deep Learning can be utilized for addressing some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks. We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future work by posing some questions, including defining data sampling criteria, domain adaptation modeling, defining criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.
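As a concrete illustration of the hierarchical, unsupervised representation learning discussed here, the sketch below trains a small autoencoder and reuses its bottleneck output as features. The architecture, the choice of Keras, and the random stand-in data are assumptions for illustration only, not the paper's experiments.

```python
# Tiny autoencoder: learn a compact representation from unlabeled data.
import numpy as np
from tensorflow.keras import layers, Model, Input

X = np.random.rand(1024, 64).astype("float32")  # stand-in for unlabeled data

inputs = Input(shape=(64,))
h1 = layers.Dense(32, activation="relu")(inputs)   # first level of abstraction
code = layers.Dense(8, activation="relu")(h1)      # compact high-level representation
h2 = layers.Dense(32, activation="relu")(code)
outputs = layers.Dense(64, activation="sigmoid")(h2)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# The encoder's output can serve as features for indexing or retrieval tasks.
encoder = Model(inputs, code)
features = encoder.predict(X, verbose=0)
print(features.shape)  # (1024, 8)
```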


IEEE Journal on Selected Areas in Communications | 1990

Predicting software development errors using software complexity metrics

Taghi M. Khoshgoftaar; John C. Munson

Predictive models that incorporate a functional relationship of program error measures with software complexity metrics and metrics based on factor analysis of empirical data are developed. Specific techniques for assessing regression models are presented for analyzing these models. Within the framework of regression analysis, the authors examine two separate means of exploring the connection between complexity and errors. First, the regression models are formed from the raw complexity metrics. Essentially, these models confirm a known relationship between program lines of code and program errors. The second methodology involves the regression of complexity factor measures and measures of errors. These complexity factors are orthogonal measures of complexity from an underlying complexity domain model. From this more global perspective, it is believed that there is a relationship between program errors and complexity domains of program structure and size (volume). Further, the strength of this relationship suggests that predictive models are indeed possible for the determination of program errors from these orthogonal complexity domains.
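The second methodology can be sketched as follows: extract orthogonal complexity factors from correlated metrics, then regress an error measure on the factor scores. The synthetic data and the scikit-learn FactorAnalysis/LinearRegression pairing below are assumptions for illustration, not the paper's models.

```python
# Factor-analytic complexity domains, then regression of error counts on them.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
raw_metrics = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))     # correlated metrics
errors = np.abs(raw_metrics[:, 0]) * 2 + rng.poisson(1.0, size=150)   # synthetic error counts

# Orthogonal complexity-domain measures (here, two factors).
factors = FactorAnalysis(n_components=2, random_state=1).fit_transform(raw_metrics)

reg = LinearRegression().fit(factors, errors)
print("R^2 on factor scores:", reg.score(factors, errors))
```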


IEEE Software | 1996

Early quality prediction: a case study in telecommunications

Taghi M. Khoshgoftaar; Edward B. Allen; Kalai Kalaichelvan; Nishith Goel

Predicting the quality of modules lets developers focus on potential problems and make improvements earlier in development, when it is more cost-effective. The authors applied discriminant analysis to identify fault-prone modules in a large telecommunications system prior to testing.


Automation of Software Test | 1996

EMERALD: software metrics and models on the desktop

John P. Hudepohl; Stephen J. Aud; Taghi M. Khoshgoftaar; Edward B. Allen; Jean Mayrand

As software becomes more and more sophisticated, industry has begun to place a premium on software reliability. The telecommunications industry is no exception. Consequently software reliability is a strategic business weapon in an increasingly competitive marketplace. In response to these concerns, BNR, Nortel, and Bell Canada developed the Enhanced Measurement for Early Risk Assessment of Latent Defects (Emerald), a decision support system designed to improve telecommunications software reliability. Emerald efficiently integrates software measurements, quality models, and delivery of results to the desktop of software developers. We have found that Emerald not only improves software reliability, but also facilitates the accurate correction of field problems. Our experiences developing Emerald have also taught us some valuable lessons about the implementation and adoption of this type of software tool.


International Conference on Tools with Artificial Intelligence | 2007

An Empirical Study of Learning from Imbalanced Data Using Random Forest

Taghi M. Khoshgoftaar; M. Golawala; J. Van Hulse

This paper discusses a comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka. RF is a relatively new learner, and to the best of our knowledge, only preliminary experimentation on the construction of random forest classifiers in the context of imbalanced data has been reported in previous work. Therefore, the contribution of this study is to provide an extensive empirical evaluation of RF learners built from imbalanced data. What should be the recommended default number of trees in the ensemble? What should the recommended value be for the number of attributes? How does the RF learner perform on imbalanced data when compared with other commonly-used learners? We address these and other related issues in this work.
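A hedged sketch of this kind of experiment, varying the number of trees and the number of attributes considered per split on imbalanced data, is shown below. The synthetic dataset, the parameter grid, and the AUC-based evaluation are illustrative assumptions, and scikit-learn stands in for the Weka implementation the study actually used.

```python
# Random forest on imbalanced data, sweeping tree count and attributes per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

for n_trees in (10, 100):
    for max_feats in ("sqrt", "log2"):
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    max_features=max_feats,
                                    random_state=0)
        auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
        print(f"trees={n_trees:>3} max_features={max_feats:<4} AUC={auc:.3f}")
```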


Journal of Systems and Software | 1995

A neural network approach for early detection of program modules having high risk in the maintenance phase

Taghi M. Khoshgoftaar; David L. Lanning

A neural network model is developed to classify program modules as either high or low risk based on multiple criterion variables. The inputs to the model include a selection of software complexity metrics collected from a telecommunications system. Two criterion variables are used for class determination: the number of changes to enhance the program modules, and the number of changes required to remove faults from the modules. The data were deliberately biased to magnify differences in metrics values between the discriminant groups. The technique displayed a low classification error rate. This success, and the absence of the data assumptions typical of statistical techniques, demonstrate the utility of neural networks in isolating high-risk modules where class determination is based on multiple quality metrics.
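A minimal sketch of the general approach, using a small scikit-learn multilayer perceptron in place of the paper's network, appears below. The synthetic complexity metrics and the single binary risk label are simplifying assumptions; the paper uses two criterion variables and real telecommunications data.

```python
# Small neural network classifying modules as high or low risk from complexity metrics.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
metrics = rng.normal(size=(300, 10))                        # complexity metrics per module
high_risk = (metrics[:, :2].sum(axis=1) > 0.5).astype(int)  # 1 = high risk, 0 = low risk

net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=2))
net.fit(metrics, high_risk)
print("training accuracy:", net.score(metrics, high_risk))
```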


Journal of Big Data | 2016

A survey of transfer learning

Karl R. Weiss; Taghi M. Khoshgoftaar; Dingding Wang

Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is that the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications of transfer learning. Lastly, software downloads for various transfer learning solutions are listed, along with a discussion of possible future research. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.
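As a small, concrete example of the transfer setting defined here, the sketch below trains a network on a data-rich source task, freezes its learned features, and fits only a new output layer on a small, shifted target task. The toy data, the Keras layer-freezing mechanism, and all parameter values are assumptions for illustration, not code from the survey.

```python
# Feature transfer: train on a source domain, reuse frozen features on a target domain.
import numpy as np
from tensorflow.keras import layers, Model, Input

def make_task(n, shift):
    """Toy binary task; `shift` moves the input distribution."""
    X = np.random.rand(n, 32).astype("float32") + shift
    y = (X.sum(axis=1) > 32 * (0.5 + shift)).astype("float32")
    return X, y

Xs, ys = make_task(5000, shift=0.0)  # source domain: plenty of labels
Xt, yt = make_task(100, shift=0.1)   # target domain: few labels, shifted distribution

# Train a source model, then reuse its feature extractor.
inp = Input(shape=(32,))
feat = layers.Dense(16, activation="relu", name="features")(inp)
out = layers.Dense(1, activation="sigmoid")(feat)
source_model = Model(inp, out)
source_model.compile(optimizer="adam", loss="binary_crossentropy")
source_model.fit(Xs, ys, epochs=5, verbose=0)

# Freeze the transferred layer and train only a new head on the target data.
source_model.get_layer("features").trainable = False
new_out = layers.Dense(1, activation="sigmoid")(source_model.get_layer("features").output)
target_model = Model(inp, new_out)
target_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
target_model.fit(Xt, yt, epochs=20, verbose=0)
print(target_model.evaluate(Xt, yt, verbose=0))
```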

Collaboration


Dive into Taghi M. Khoshgoftaar's collaborations.

Top Co-Authors

Amri Napolitano, Florida Atlantic University
Randall Wald, Florida Atlantic University
Edward B. Allen, Mississippi State University
Kehan Gao, Eastern Connecticut State University
Jason Van Hulse, Florida Atlantic University
David J. Dittman, Florida Atlantic University
Huanjing Wang, Western Kentucky University
Joseph D. Prusa, Florida Atlantic University
C. Seiffert, Florida Atlantic University