Ahsanul Haque
University of Texas at Dallas
Publications
Featured research published by Ahsanul Haque.
international conference on data engineering | 2016
Ahsanul Haque; Latifur Khan; Michael Baron; Bhavani M. Thuraisingham; Charu C. Aggarwal
To decide whether an update to a data stream classifier is necessary, existing sliding-window-based techniques monitor classifier performance on recent instances. If there is a significant change in classifier performance, these approaches determine a chunk boundary and update the classifier. However, monitoring classifier performance is costly due to the scarcity of labeled data. In our previous work, we presented a semi-supervised framework, SAND, which uses change detection on classifier confidence to detect concept drift. Unlike most approaches, it requires only a limited amount of labeled data to detect chunk boundaries and to update the classifier. However, SAND is expensive in terms of execution time due to exhaustive invocation of the change detection module. In this paper, we present an efficient framework that is based on the same principle as SAND but exploits dynamic programming and invokes the change detection module selectively. Moreover, we provide theoretical justification of the confidence calculation and show the effect of a concept drift on subsequent confidence scores. Experimental results demonstrate the efficiency of the proposed framework in terms of both accuracy and execution time.
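To make the core idea concrete, here is a minimal sketch of change detection on classifier confidence scores using a one-sided CUSUM-style test. The function name, drift magnitude `delta`, and `threshold` are illustrative assumptions, not the paper's exact statistical formulation.

```python
def detect_confidence_drift(confidences, delta=0.05, threshold=2.0):
    """Flag a drift when the cumulative drop in confidence relative to the
    running mean exceeds `threshold`. Returns the index of the detected
    change point, or None. (Illustrative CUSUM-style test, assumed here.)"""
    cusum = 0.0
    running_sum = 0.0
    for i, c in enumerate(confidences, start=1):
        running_sum += c
        mean = running_sum / i
        # Accumulate evidence that confidence fell below the mean by > delta.
        cusum = max(0.0, cusum + (mean - c - delta))
        if cusum > threshold:
            return i - 1  # chunk boundary: the classifier should be updated here
    return None

# Usage: stable confidences followed by a sustained drop (simulated drift).
scores = [0.9] * 50 + [0.4] * 20
print(detect_confidence_drift(scores))
```

Selective invocation then amounts to running such a test only when a cheap precondition (e.g., a dip in recent average confidence) suggests a drift may have started, rather than after every instance.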
pacific-asia conference on knowledge discovery and data mining | 2015
Ahsanul Haque; Latifur Khan; Michael Baron
Most approaches for classifying evolving data streams divide the stream into fixed-size chunks to address the infinite-length and concept-drift problems. These approaches suffer from a trade-off between performance and sensitivity. To address this problem, existing adaptive sliding window techniques determine chunk boundaries dynamically by detecting changes in the classifier error rate, which requires true labels for all data instances. However, true labels are scarce and often delayed in practice. In this paper, we propose an approach that determines dynamic chunk boundaries by detecting significant changes in classifier confidence scores using only a limited number of labeled data instances. Moreover, we integrate a suitable classification technique with it to form a complete semi-supervised framework that uses dynamic chunk boundaries to address concept drift and concept evolution efficiently. Results from experiments on benchmark data sets show the effectiveness of the proposed framework in handling both concept drift and concept evolution.
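One way a cluster-based semi-supervised classifier can produce such confidence scores is sketched below, combining a purity term (from the limited labeled instances per cluster) with an association term (closeness to the nearest cluster). This association/purity-style scoring is an assumption for illustration, not necessarily the paper's exact estimator.

```python
import numpy as np

def predict_with_confidence(x, centroids, radii, label_dists):
    """centroids: (k, d) cluster centers built from partially labeled chunks;
    radii: (k,) cluster radii; label_dists: (k, n_classes) per-cluster label
    frequencies estimated from the limited labeled instances."""
    d = np.linalg.norm(centroids - x, axis=1)
    j = int(np.argmin(d))                                # nearest cluster
    purity = label_dists[j].max() / max(label_dists[j].sum(), 1e-9)
    association = np.exp(-d[j] / max(radii[j], 1e-9))    # closeness to cluster
    label = int(np.argmax(label_dists[j]))
    return label, purity * association                   # confidence in [0, 1]

# Usage with two toy clusters:
cents = np.array([[0.0, 0.0], [5.0, 5.0]])
rads = np.array([1.0, 1.0])
dists = np.array([[9, 1], [0, 10]], dtype=float)
print(predict_with_confidence(np.array([0.2, -0.1]), cents, rads, dists))
```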
international conference on big data | 2014
Ahsanul Haque; Swarup Chandra; Latifur Khan; Charu C. Aggarwal
Machine learning algorithms used to evaluate a query on a graphical model can be broadly classified into exact and approximate inference algorithms. Exact inference algorithms use only network parameters to evaluate a query, but they are typically intractable on large networks due to exponential time and space complexity. Approximate inference algorithms, which include sampling-based and propagation-based methods, are widely used in practice to overcome this constraint, with a trade-off in accuracy. These approximate algorithms may also suffer from scalability issues when applied to large networks to achieve higher accuracy. To address this challenge, we have designed and implemented several MapReduce-based distributed versions of a specific type of approximate inference algorithm called Adaptive Importance Sampling (AIS). We compare and evaluate the proposed approaches using benchmark networks. Experimental results show that our proposed approaches achieve significant scaleup and speedup compared to the sequential method, while achieving similar accuracy asymptotically.
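The map/reduce decomposition can be illustrated with a toy version of adaptive importance sampling: mappers independently draw from the current proposal and emit weighted sums, the reducer aggregates them, and the proposal is adapted between rounds. The Gaussian target and proposal below are illustrative stand-ins; the actual algorithm samples from a Bayesian network.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
target = lambda x: np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)  # N(3,1)

def mapper(mu, n=1000):
    """One mapper: sample from proposal N(mu, 1), emit weighted sums."""
    x = rng.normal(mu, 1.0, n)
    w = target(x) / (np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi))
    return (np.sum(w * x), np.sum(w))  # partial sums for E[X] under the target

def reducer(a, b):
    return (a[0] + b[0], a[1] + b[1])

mu = 0.0  # deliberately poor initial proposal
for round_ in range(3):
    sx, sw = reduce(reducer, map(lambda _: mapper(mu), range(8)))
    mu = sx / sw  # adapt the proposal toward the weighted estimate
    print(f"round {round_}: estimated mean = {mu:.3f}")
```

Each round is embarrassingly parallel across mappers, which is what yields the scaleup; only the small adaptation step is sequential.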
international conference on cloud computing | 2014
Ahsanul Haque; Brandon Parker; Latifur Khan; Bhavani M. Thuraisingham
Big data stream mining poses inherent challenges that are not present in traditional data mining. Not only does a big data stream receive a large volume of data continuously, but it may also contain different types of features. Moreover, the concepts and features tend to evolve throughout the stream. Traditional data mining techniques are not sufficient to address these challenges. In our current work, we have designed a multi-tiered ensemble-based method, HSMiner, to address the aforementioned challenges and label instances in an evolving big data stream. However, this method requires building a large number of AdaBoost ensembles, one for each numeric feature, after receiving each new data chunk, which is very costly. Thus, HSMiner may face scalability issues when classifying big data streams. To address this problem, we propose three approaches to build this large number of AdaBoost ensembles using MapReduce-based parallelism, and we compare these approaches from different design perspectives. We also show empirically that these approaches enable the base method to achieve significant scalability and speedup.
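The parallelization pattern, training one AdaBoost ensemble per numeric feature as independent map tasks, can be sketched as follows. A process pool stands in for the MapReduce map phase, and sklearn's AdaBoostClassifier stands in for HSMiner's per-feature ensembles; neither is the paper's actual implementation.

```python
from concurrent.futures import ProcessPoolExecutor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

def train_feature_ensemble(args):
    """Map task: fit an AdaBoost ensemble on a single numeric feature."""
    feature_idx, X, y = args
    model = AdaBoostClassifier(n_estimators=20)
    model.fit(X[:, feature_idx:feature_idx + 1], y)
    return feature_idx, model

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
    tasks = [(i, X, y) for i in range(X.shape[1])]
    with ProcessPoolExecutor() as pool:
        ensembles = dict(pool.map(train_feature_ensemble, tasks))
    print(f"trained {len(ensembles)} per-feature ensembles")
```

Since the per-feature ensembles share no state during training, the work partitions cleanly across mappers, one per feature or per group of features.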
ieee international conference on cloud computing technology and science | 2013
Ahsanul Haque; Brandon Parker; Latifur Khan; Bhavani M. Thuraisingham
In our current work, we have proposed a robust multi-tiered ensemble-based method to address the challenges of labeling instances in an evolving data stream. The bottleneck of this method is that it needs to build AdaBoost ensembles for each of the numeric features, which can cause scalability issues since the number of features in a data stream can be very large. In this paper, we propose an intelligent approach to build this large number of AdaBoost ensembles with MapReduce-based parallelism. We show that this approach helps our base method achieve significant scalability without compromising classification accuracy. We analyze different aspects of our design to highlight the advantages and disadvantages of the approach, and we compare and analyze its performance in terms of execution time, speedup, and scaleup.
conference on information and knowledge management | 2016
Swarup Chandra; Ahsanul Haque; Latifur Khan; Charu C. Aggarwal
A typical data stream classification task involves predicting labels of data instances generated from a non-stationary process. Studies in the past decade have focused on this problem setting to address various challenges such as concept drift and concept evolution. Most techniques assume that class labels associated with unlabeled data instances become available soon after label prediction, for further training and drift detection. Moreover, training and test data distributions are assumed to be similar. These assumptions are not always true in practice. For instance, a semi-supervised setting that aims to utilize only a fraction of labels may induce bias during data selection; consequently, the resulting distributions of training and test instances may differ. In this paper, we present a novel stream classification problem setting involving two independent non-stationary data generating processes, relaxing the above assumptions. A source stream continuously generates labeled data instances whose distribution is biased compared to that of a target stream, which generates unlabeled data instances from the same domain. The problem, which we call Multistream Classification, is to predict the class labels of data instances in the target stream while utilizing labels available on the source stream. Since concept drift can occur asynchronously on these two streams, we design an adaptive framework that uses supervised concept drift detection on the biased source stream and unsupervised concept drift detection on the target stream. A weighted ensemble of classifiers is updated after each drift detection on either stream, while a bias correction mechanism leverages source information to predict labels of target instances whenever necessary. We empirically evaluate the multistream classifier's performance on both real-world and synthetic datasets, comparing it with various baseline methods and its variants.
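The bias-correction step can be illustrated with a small sketch: weight each labeled source instance by an estimated target/source density ratio before training, so the classifier better matches the target stream's distribution. The crude kernel-density ratio below is a simple stand-in, assumed for illustration, not the paper's estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio(X_src, X_tgt, sigma=1.0):
    """Crude ratio estimate: kernel density of target / kernel density of
    source, evaluated at each source point."""
    def kde(X_ref, X_query):
        d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1)
    return kde(X_tgt, X_src) / (kde(X_src, X_src) + 1e-12)

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, (500, 2)); y_src = (X_src.sum(1) > 0).astype(int)
X_tgt = rng.normal(0.5, 1.0, (500, 2))  # covariate-shifted target stream

w = density_ratio(X_src, X_tgt)
clf = LogisticRegression().fit(X_src, y_src, sample_weight=w)
print(clf.predict(X_tgt[:5]))
```

Source instances that fall in regions the target stream rarely visits receive low weight, which is exactly the correction needed when the source's label sampling is biased.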
computational intelligence and data mining | 2014
Ahsanul Haque; Swarup Chandra; Latifur Khan; Michael Baron
A graphical model represents the data distribution of a data generating process and inherently captures its feature relationships. This stochastic model can be used to perform inference, i.e., to calculate posterior probabilities, in various applications such as classification. Exact inference algorithms are known to be intractable on large networks due to exponential time and space complexity. Approximate inference algorithms are instead widely used in practice to overcome this constraint, with a trade-off in accuracy. Stochastic sampling is one such method, in which an approximate probability distribution is empirically evaluated using various sampling techniques. However, these algorithms may still suffer from scalability issues on large and complex networks. To address this challenge, we have designed and implemented several MapReduce-based distributed versions of a specific type of approximate inference algorithm called Adaptive Importance Sampling (AIS). We compare and evaluate the proposed approaches using benchmark networks. Experimental results show that our approaches achieve significant scaleup and speedup compared to the sequential algorithm, while achieving similar accuracy asymptotically.
international conference on data engineering | 2017
Ahsanul Haque; Swarup Chandra; Latifur Khan; Kevin W. Hamlen; Charu C. Aggarwal
Traditional data stream classification techniques assume that the stream is generated from a single non-stationary process. On the contrary, a recently introduced problem setting, referred to as Multistream Classification, involves two independent non-stationary data generating processes. One of them is the source stream, which continuously generates labeled data instances. The other is the target stream, which generates unlabeled test data instances from the same domain. The distribution represented by the source stream data is biased compared to that of the target stream. Moreover, these streams may have asynchronous concept drifts between them. The multistream classification problem is to predict the class labels of target stream instances while utilizing labeled data available from the source stream. In this paper, we propose an efficient solution for multistream classification by fusing drift detection into online data shift adaptation. Experimental results on benchmark data sets indicate significantly improved performance over the only existing approach for multistream classification.
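A sketch of what "fusing drift detection into online adaptation" can look like: each incoming target instance applies one stochastic update to a KLIEP-style kernel density-ratio model, and a sustained jump in the ratio values can double as an unsupervised drift signal. The 1-D toy setup, step size, kernel centers, and drift heuristic are all illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = lambda X, C: np.exp(-((X[:, None] - C[None, :]) ** 2) / 2.0)

X_src = rng.normal(0.0, 1.0, 300)      # labeled source sample (1-D toy)
centers = X_src[:20]                   # kernel centers drawn from the source
K_src = kernel(X_src, centers)         # used for the normalization constraint
alpha = np.ones(len(centers)) / len(centers)

def online_update(x_t, alpha, eta=0.05):
    """One stochastic ascent step on log w(x_t), then project/normalize so
    the estimated ratio averages to 1 over the source sample."""
    k_t = kernel(np.array([x_t]), centers)[0]
    w_t = max(alpha @ k_t, 1e-12)                  # current ratio at x_t
    alpha = np.maximum(alpha + eta * k_t / w_t, 0.0)
    alpha /= max((K_src @ alpha).mean(), 1e-12)    # enforce E_src[w] = 1
    return alpha, w_t

for x_t in rng.normal(0.8, 1.0, 500):  # shifted target stream
    alpha, w_t = online_update(x_t, alpha)
print("ratio near source mean vs. target mean:",
      float(kernel(np.array([0.0]), centers)[0] @ alpha),
      float(kernel(np.array([0.8]), centers)[0] @ alpha))
```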
conference on information and knowledge management | 2017
Ahsanul Haque; Zhuoyi Wang; Swarup Chandra; Bo Dong; Latifur Khan; Kevin W. Hamlen
Traditional data stream classification assumes that data is generated from a single non-stationary process. On the contrary, the multistream classification problem involves two independent non-stationary data generating processes. One of them is the source stream, which continuously generates labeled data. The other is the target stream, which generates unlabeled test data from the same domain. The distribution represented by the source stream data is biased compared to that of the target stream. Moreover, these streams may have asynchronous concept drifts between them. The multistream classification problem is to predict the class labels of target stream instances by utilizing labeled data from the source stream. This kind of scenario is often observed in real-world applications due to the scarcity of labeled data. The only existing approach for multistream classification uses separate drift detection on the two streams to address the asynchronous concept drift problem. If a concept drift is detected in either stream, it uses an expensive batch technique for data shift adaptation. These steps add significant execution overhead and limit its usability. In this paper, we propose an efficient solution for multistream classification by fusing drift detection into online data shift adaptation. We study the theoretical convergence rate and computational complexity of the proposed approach. Moreover, empirical results on benchmark data sets indicate significantly improved performance over the baseline methods.
international conference on data mining | 2016
Swarup Chandra; Ahsanul Haque; Latifur Khan; Charu C. Aggarwal
Many real-world applications exhibit scenarios where the distributions represented by training and test data are not similar but are related by a covariate shift, i.e., they have equal class conditional distributions but unequal covariate distributions. Traditional data mining techniques struggle to learn a good predictive model in the presence of covariate shift. Recent studies have proposed approaches to address this challenge by weighting training instances based on the density ratio between the test and training data distributions. Kernel Mean Matching (KMM) is a well-known method for estimating this density ratio, but its time complexity is cubic in the size of the training data. Therefore, KMM is not suitable for real-world applications, especially when the predictive model needs to be updated periodically with large training data. We address this challenge by taking fixed-size samples from the training and test data, performing independent computations on these samples, and combining the results to obtain an overall density ratio estimate. Our empirical evaluation demonstrates a large gain in execution time while achieving competitive accuracy on numerous benchmark datasets.
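The divide-and-combine idea can be sketched as follows: solve the standard KMM objective (minimize 0.5 b'Kb - kappa'b subject to 0 <= b <= B) on several fixed-size samples and average the resulting weights, rather than solving one cubic-cost QP on all the training data. Projected gradient descent stands in for a QP solver here, and the sample sizes, bandwidth, and combination rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kmm_small(X_tr, X_te, B=10.0, steps=500, lr=0.01):
    """Solve min 0.5 b'Kb - kappa'b s.t. 0 <= b <= B by projected gradient."""
    K = rbf(X_tr, X_tr)
    kappa = (len(X_tr) / len(X_te)) * rbf(X_tr, X_te).sum(axis=1)
    beta = np.ones(len(X_tr))
    for _ in range(steps):
        beta -= lr * (K @ beta - kappa)   # gradient step on the KMM objective
        beta = np.clip(beta, 0.0, B)      # project onto the box constraints
    return beta

X_tr = rng.normal(0.0, 1.0, (200, 2))
X_te = rng.normal(0.5, 1.0, (1000, 2))    # covariate-shifted test data

# Combine estimates from several fixed-size test samples.
betas = [kmm_small(X_tr, X_te[rng.choice(len(X_te), 100, replace=False)])
         for _ in range(10)]
beta = np.mean(betas, axis=0)             # combined density-ratio estimate
print(beta[:5])
```

Because each small problem is independent, the per-sample solves parallelize trivially, which is where the reported gain in execution time comes from.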