Tegjyot Singh Sethi
University of Louisville
Publications
Featured research published by Tegjyot Singh Sethi.
Journal of Intelligent Information Systems | 2016
Tegjyot Singh Sethi; Mehmed Kantardzic; Hanqing Hu
Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one-pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in the literature assume a 100% labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples, for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass and incremental, and to work with limited memory to perform any-time classification on demand. Experimental comparison with state-of-the-art concept drift handling systems demonstrates the GC3 framework's ability to provide high classification performance, using fewer models in the ensemble and with only 4-6% of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real-world data stream classification applications.
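The grid density sampling idea can be illustrated with a short sketch. The following is a minimal, hypothetical Python illustration (not the GC3 implementation): the feature space is partitioned into a uniform grid, and the labeling budget is spread round-robin across occupied cells so the labeled subset preserves the density profile of the stream. The cell resolution and budget values are assumptions for illustration.

```python
import numpy as np

def grid_density_sample(X, cells_per_dim=10, budget=50, rng=None):
    """Pick up to `budget` points, spread uniformly over occupied grid cells."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to its grid cell index along every dimension.
    idx = np.floor((X - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int)
    cells = {}
    for i, cell in enumerate(map(tuple, idx)):
        cells.setdefault(cell, []).append(i)
    # Round-robin over occupied cells: one point per cell per pass, so
    # dense regions cannot monopolize the labeling budget.
    members = [rng.permutation(v).tolist() for v in cells.values()]
    chosen = []
    while len(chosen) < budget and any(members):
        for m in members:
            if m and len(chosen) < budget:
                chosen.append(m.pop())
    return X[chosen]

X = np.random.default_rng(0).normal(size=(1000, 2))
print(grid_density_sample(X, budget=40).shape)  # (40, 2)
```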
Procedia Computer Science | 2015
Tegjyot Singh Sethi; Mehmed Kantardzic
Validating online stream classifiers has traditionally assumed the availability of labeled samples, which can be monitored over time to detect concept drift. However, labeling in streaming domains is expensive, time-consuming and, in certain applications such as land mine detection, not a possibility at all. In this paper, the Margin Density Drift Detection (MD3) approach is proposed, which can signal change using unlabeled samples and requires labeling only for retraining, in the event of a drift. The MD3 approach, when evaluated on 5 synthetic and 5 real-world drifting data streams, produced classification accuracy statistically equivalent to that of a fully labeled, accuracy-tracking drift detector, while requiring only a third of the samples to be labeled, on average.
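A minimal sketch of the margin density signal behind MD3, under the assumption of a linear SVM whose margin is |w·x + b| ≤ 1: the fraction of unlabeled samples falling inside the margin is compared against a reference value, and a large deviation triggers a request for labels. The 0.1 threshold below is an illustrative assumption, not the paper's setting.

```python
import numpy as np
from sklearn.svm import LinearSVC

def margin_density(clf, X):
    """Fraction of samples inside the margin of a fitted linear SVM."""
    return float(np.mean(np.abs(clf.decision_function(X)) <= 1.0))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] > 0).astype(int)
clf = LinearSVC(C=1.0).fit(X_train, y_train)

md_ref = margin_density(clf, X_train)                 # reference; no labels needed
X_chunk = rng.normal(loc=[0.8, 0.0], size=(500, 2))   # shifted, unlabeled chunk
if abs(margin_density(clf, X_chunk) - md_ref) > 0.1:  # assumed threshold
    print("drift suspected: request labels and retrain")
```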
Information Reuse and Integration | 2014
Elaheh Arabmakki; Mehmed Kantardzic; Tegjyot Singh Sethi
In the streaming data milieu, the input data distribution is not static, and the models generated must be updated when concept drift occurs in order to maintain classification performance. Updating a model requires retraining with the new incoming labeled samples. However, labeling data is a costly and time-consuming process, and algorithms which do not require all the instances in the stream to be labeled are needed. In this paper, a new Reduced Labeled Samples (RLS) framework is proposed, which can handle concept drift in imbalanced data streams by selectively labeling only those samples which are the most useful in characterizing the drift, thereby generating an updated model with fewer labeled samples. Experimental comparison with state-of-the-art imbalanced stream classification algorithms shows that the RLS framework achieves comparable or better performance while requiring only 18% of the samples to be labeled.
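As a rough sketch of the selective labeling idea, with classifier confidence standing in for RLS's drift-characterizing sample selection (which the paper defines more carefully), one might label only the least-confident fraction of each stream chunk. The confidence criterion and 18%-style budget are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_labeling(clf, X_chunk, frac=0.18):
    """Indices of the least-confident fraction of a stream chunk."""
    conf = clf.predict_proba(X_chunk).max(axis=1)  # top-class probability
    k = max(1, int(frac * len(X_chunk)))
    return np.argsort(conf)[:k]                    # lowest confidence first

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (rng.random(400) < 0.1).astype(int)            # imbalanced: ~10% minority
clf = LogisticRegression().fit(X, y)
to_label = select_for_labeling(clf, rng.normal(size=(200, 3)))
print(f"{len(to_label)} of 200 samples sent for labeling")
```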
Information Reuse and Integration | 2016
Tegjyot Singh Sethi; Mehmed Kantardzic; Elaheh Arabmakki
Machine learning models deployed in real world applications operate in a dynamic environment where the data distribution can change constantly. These changes, called concept drifts, cause the performance of the learned model to degrade over time. As such, it is essential to detect and adapt to changes in the data for the model to be of any real use. While model adaptation requires labeled data (for retraining), the detection process does not. Labeling data is time-consuming and expensive, and if data changes are infrequent, most of the labeling effort, spent on verification, is wasted. In this paper, an ensemble based detection method is proposed, which tracks the number of samples in the critical disagreement regions of the ensemble to detect concept drift from unlabeled data. The proposed algorithm is distribution and model independent, unsupervised, and can be used in an online incremental fashion. Experimental analysis on 4 real world concept drift datasets shows that the proposed methodology gives high prediction performance and a low false alarm rate, while using only 11.3% of the labels overall, on average.
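A minimal sketch of disagreement-based detection, assuming a bagged tree ensemble: the fraction of unlabeled samples on which the ensemble members disagree is tracked, and a sustained rise relative to the reference value suggests drift. The ensemble type and data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def disagreement_rate(ensemble, X):
    """Fraction of samples on which the base models do not all agree."""
    votes = np.array([est.predict(X) for est in ensemble.estimators_])
    return float(np.mean(votes.min(axis=0) != votes.max(axis=0)))

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 2))
y = (X[:, 0] > 0).astype(int)
ens = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=10).fit(X, y)

ref = disagreement_rate(ens, X)               # reference disagreement level
X_new = rng.normal(scale=0.2, size=(300, 2))  # unlabeled chunk crowding the boundary
print(ref, disagreement_rate(ens, X_new))     # a sustained rise suggests drift
```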
IEEE Recent Advances in Intelligent Computational Systems | 2011
Ch. V. M. K. Hari; Tegjyot Singh Sethi; B.S.S. Kaushal; Abhishek Sharma
One of the challenges faced by managers in the software industry today is the ability to accurately define the requirements of software projects early in the development phase. The cost-benefit analysis forms the basis of planning and decision making throughout the software development lifecycle. As such, there is a need for efficient software cost estimation techniques to make any endeavor viable. Software cost estimation is the process of predicting the amount of effort required to build a software project. In this paper, we propose a Particle Swarm Optimization (PSO) technique which operates on data sets clustered using the K-means clustering algorithm. PSO is employed to generate the parameters of the COCOMO model for each cluster of data values. The clusters and effort parameters are then used to train a Neural Network, via the Backpropagation technique, for classification of the data. We have tested the model on the COCOMO 81 dataset and compared the obtained values with the standard COCOMO model. By drawing on the experience captured by the Neural Network and the efficient tuning of parameters by PSO operating on clusters, the proposed model generates better results and can be applied efficiently to larger data sets.
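The tuning step can be sketched as follows, assuming the basic COCOMO form E = a · KLOC^b: a small particle swarm searches (a, b) to minimize the mean relative error over one cluster of projects. The swarm constants, bounds, and toy project data below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def mre(params, kloc, effort):
    """Mean relative error of effort = a * KLOC**b on one cluster."""
    a, b = params
    return np.mean(np.abs(effort - a * kloc**b) / effort)

def pso(fitness, bounds, n_particles=20, iters=100, rng=None):
    rng = np.random.default_rng(rng)
    lo, hi = np.array(bounds).T
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)
    pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([fitness(p) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

# Toy cluster of projects: size in KLOC, measured effort in person-months.
kloc = np.array([10.0, 25.0, 46.0, 70.0, 120.0])
effort = np.array([39.0, 96.0, 240.0, 458.0, 900.0])
a, b = pso(lambda p: mre(p, kloc, effort), bounds=[(1.0, 5.0), (0.8, 1.5)])
print(f"tuned model: effort ~ {a:.2f} * KLOC^{b:.2f}")
```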
Pacific Asia Workshop on Intelligence and Security Informatics | 2017
Tegjyot Singh Sethi; Mehmed Kantardzic; Joung Woo Ryu
The increasing scale and sophistication of cyberattacks has led to the adoption of machine learning based classification techniques at the core of cybersecurity systems. These techniques promise scale and accuracy, which traditional rule or signature based methods cannot. However, classifiers operating in adversarial domains are vulnerable to evasion attacks by an adversary who is capable of learning the behavior of the system by employing intelligently crafted probes. Classification accuracy in such domains provides a false sense of security, as detection can easily be evaded by carefully perturbing the input samples. In this paper, a generic data driven framework is presented to analyze the vulnerability of classification systems to black box probing based attacks. The framework uses an exploration-exploitation based strategy to understand an adversary's point of view of the attack-defense cycle. The adversary assumes a black box model of the defender's classifier and can launch indiscriminate attacks on it, without information about the defender's model type, training data or domain of application. Experimental evaluation on 10 real world datasets demonstrates that even models perceived by a defender as highly accurate (>90%) can be effectively circumvented with a high evasion rate (>95%, on average). The detailed attack algorithms, adversarial model and empirical evaluation serve as a starting point for the development of effective defense strategies.
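A minimal sketch of the probe-based attack setting, with a simple line search standing in for the paper's richer exploration-exploitation strategy: the attacker sees only accept/reject decisions from the black-box classifier, and moves a detected sample toward a known-benign anchor, keeping the smallest perturbation found that flips the decision. The classifier and data here are stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (X.sum(axis=1) > 0).astype(int)                # 1 = "malicious", detected
oracle = RandomForestClassifier(n_estimators=50).fit(X, y).predict  # labels only

def evade(x_attack, x_benign, oracle, probes=20):
    """Binary-search the line to a benign anchor for the decision flip point."""
    lo, hi = 0.0, 1.0          # fraction of the way toward the benign anchor
    for _ in range(probes):    # each iteration costs one query to the oracle
        mid = (lo + hi) / 2
        cand = (1 - mid) * x_attack + mid * x_benign
        lo, hi = (lo, mid) if oracle([cand])[0] == 0 else (mid, hi)
    return (1 - hi) * x_attack + hi * x_benign     # smallest flip found

x_adv = evade(X[y == 1][0], X[y == 0][0], oracle)
print("evades detection:", oracle([x_adv])[0] == 0)
```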
Expert Systems with Applications | 2017
Tegjyot Singh Sethi; Mehmed Kantardzic
Highlights: (i) a new classifier-independent, dynamic, unsupervised approach for detecting concept drift; (ii) a reduced number of false alarms and increased relevance of drift detection; (iii) results comparable to supervised approaches, which require fully labeled streams; (iv) a generalization of the notion of margin density as a signal to detect drifts; (v) experiments on cybersecurity datasets, showing efficacy for detecting adversarial drifts.

Classifiers deployed in the real world operate in a dynamic environment, where the data distribution can change over time. These changes, referred to as concept drift, can cause the predictive performance of the classifier to drop over time, thereby making it obsolete. To be of any real use, these classifiers need to detect drifts and be able to adapt to them over time. Detecting drifts has traditionally been approached as a supervised task, with labeled data constantly being used to validate the learned model. Although effective in detecting drifts, these techniques are impractical, as labeling is a difficult, costly and time-consuming activity. On the other hand, unsupervised change detection techniques are unreliable, as they produce a large number of false alarms. The inefficacy of the unsupervised techniques stems from the exclusion of the characteristics of the learned classifier from the detection process. In this paper, we propose the Margin Density Drift Detection (MD3) algorithm, which tracks the number of samples in the uncertainty region of a classifier as a metric to detect drift. The MD3 algorithm is a distribution-independent, application-independent, model-independent, unsupervised and incremental algorithm for reliably detecting drifts from data streams. Experimental evaluation on 6 drift-induced datasets and 4 additional datasets from the cybersecurity domain demonstrates that the MD3 approach can reliably detect drifts, with significantly fewer false alarms compared to unsupervised feature-based drift detectors. At the same time, it produces performance comparable to that of a fully labeled drift detector. The reduced false alarms enable the signaling of drifts only when they are most likely to affect classification performance. As such, the MD3 approach leads to a detection scheme which is credible, label efficient and general in its applicability.
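The incremental signaling rule can be sketched as follows: the margin density signal is updated per unlabeled sample with a forgetting factor, and a drift is flagged when it deviates from its reference by more than a sensitivity multiple of the reference deviation. The constants below are illustrative assumptions.

```python
class MarginDensityTracker:
    """Tracks margin density incrementally and flags large deviations."""
    def __init__(self, md_ref, sigma_ref, forgetting=0.99, sensitivity=2.0):
        self.md, self.md_ref = md_ref, md_ref
        self.sigma_ref, self.lam, self.theta = sigma_ref, forgetting, sensitivity

    def update(self, in_margin: bool) -> bool:
        """Feed one unlabeled sample; returns True when drift is signaled."""
        self.md = self.lam * self.md + (1 - self.lam) * float(in_margin)
        return abs(self.md - self.md_ref) > self.theta * self.sigma_ref

# in_margin would come from |decision_function(x)| <= 1 on each new sample.
tracker = MarginDensityTracker(md_ref=0.15, sigma_ref=0.02)
for t in range(60):
    if tracker.update(in_margin=True):   # a run of in-margin samples
        print(f"drift signaled at sample {t}: acquire labels and retrain")
        break
```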
Information Reuse and Integration | 2014
Tegjyot Singh Sethi; Mehmed Kantardzic; Elaheh Arabmakki; Hanqing Hu
The classification of streaming data requires learning in an environment where the distribution of the incoming data might change continuously. Stream classification methodologies need to adapt to these changes under limitations of time and memory resources. As such, it is not possible to expect all the samples in the stream to be labeled, as labeling is often time-consuming and expensive. In this paper, a new ensemble classification approach is proposed, which can handle spatio-temporal drifts in streams even when labeling is limited. The proposed methodology uses a grid density clustering approach to track drifts in the spatial configuration of the data, and maintains a set of classifier models local to each cluster to track its evolution over time. Structured weighted aggregation of the models across all clusters is performed to produce an overall effective prediction on a new sample. Additionally, a uniform sampling approach amenable to the grid representation of the clusters is proposed, which selects samples to be labeled while preserving the grid density information of the stream. This provides better selection of representative samples to be labeled, for improved drift detection and handling, while maintaining the labeling budget. Experimental comparison with state-of-the-art drift handling systems shows that the proposed methodology gives high classification performance, with a manageable ensemble size and with only 10% of the samples labeled.
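A minimal sketch of the weighted aggregation idea, with distance-to-centroid weighting standing in for the paper's structured weighting: each cluster keeps a local model, and a new sample's prediction is the proximity-weighted vote of all local models. The clustering, local models, and weighting scheme are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=600) > 0).astype(int)  # noisy labels

km = KMeans(n_clusters=4, n_init=10).fit(X)
local = [LogisticRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
         for c in range(4)]                       # one local model per cluster

def predict(x):
    dist = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    w = 1.0 / (dist + 1e-9)                       # nearer clusters weigh more
    votes = np.array([m.predict_proba([x])[0, 1] for m in local])
    return int(np.dot(w, votes) / w.sum() > 0.5)

print(predict(np.array([1.5, 1.5])))              # deep in the positive region: 1
```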
XXIV International Conference on Information, Communication and Automation Technologies (ICAT) | 2013
Hanqing Hu; Mehmed Kantardzic; Tegjyot Singh Sethi
In this paper, we propose an alternative to random selection for labeling extremely unbalanced stream data sets, where one class makes up only 1-10% of the entire data set. Labeling, especially when human resources are needed, is often time-consuming and expensive. In an extremely unbalanced data set, a large number of data points usually need to be labeled to obtain enough minority class samples. The goal of this research was to reduce the total number of samples needed when labeling training data for new classification models that update a streaming data ensemble classifier. Our proposed approach is to find minority class clusters using the grid density algorithm, and to sample minority class instances inside those regions. Results on a synthetic data set showed that the efficiency of our proposed approach varies with different grid sizes. Results on real world data sets confirmed that observation, and showed that when the data set has high dimensionality, dimensionality reduction was useful for reducing the number of grid cells in the data space, thereby increasing sampling efficiency. Our best results showed a 19.4% improvement for an eight-dimension data set without dimensionality reduction, and a 27.4% improvement for a thirty-six-dimension data set with dimensionality reduction.
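The dimensionality observation can be sketched directly: with a fixed number of cells per dimension, the number of potential grid cells grows exponentially with dimension, so in high dimensions nearly every point occupies its own cell and the density signal vanishes; projecting first (here with PCA) keeps the grid dense enough to locate minority-class regions. Grid size and data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def occupied_cells(X, cells_per_dim=5):
    """Number of distinct grid cells occupied by the points in X."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    idx = np.floor((X - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int)
    return len(set(map(tuple, idx)))

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 36))                        # 36-dimensional chunk
print(occupied_cells(X))                               # ~2000: one point per cell
print(occupied_cells(PCA(n_components=3).fit_transform(X)))  # dense, usable grid
```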
Advances in Information Technology | 2011
Tegjyot Singh Sethi; Ch. V. M. K. Hari; B.S.S. Kaushal; Abhishek Sharma
The modern day software industry has seen an increase in the number of software projects. With the increase in the size and scale of such projects, it has become necessary to perform an accurate requirement analysis early in the project development phase in order to carry out a cost-benefit analysis. Software cost estimation is the process of gauging the amount of effort required to build a software project. In this paper, we propose a Particle Swarm Optimization (PSO) technique which operates on data sets clustered using the K-means clustering algorithm. The PSO generates the parameter values of the COCOMO model for each of the clusters of data values. As clustering groups similar objects together, PSO tuning is more efficient; hence it generates better results and can be used on large data sets to give accurate estimates. We have tested the model on the COCOMO 81 dataset and compared the obtained values with the standard COCOMO model. It is found that the developed model provides a better estimation of effort.
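The clustering step can be sketched as follows, with a simple log-linear least-squares fit standing in for the PSO tuning: projects are grouped with K-means on log(size, effort) so each cluster holds similar projects, and COCOMO parameters are then calibrated per cluster. The toy project data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy projects: size in KLOC and measured effort in person-months.
kloc = np.array([2.0, 4.0, 8.0, 15.0, 30.0, 60.0, 100.0, 150.0])
effort = np.array([7.0, 15.0, 33.0, 68.0, 152.0, 340.0, 620.0, 980.0])

# Group similar projects; log scale because sizes span two decades.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.log(np.c_[kloc, effort]))
for c in range(2):
    k, e = kloc[labels == c], effort[labels == c]
    b, log_a = np.polyfit(np.log(k), np.log(e), 1)  # log E = log a + b log KLOC
    print(f"cluster {c}: effort ~ {np.exp(log_a):.2f} * KLOC^{b:.2f}")
```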