Archive | 2019

Performance Evaluation of Tree Ensemble Classification Models Towards Challenges of Big Data Analytics

 
 
 
 

Abstract


Big data analytics poses challenges such as effective and accurate real-time data mining, a lack of suitable tools and techniques, and the limits of in-memory processing. Tree-based ensemble methods (machine learning models) can perform this kind of large-scale analytical processing when combined with high-performance cluster computing (a special kind of distributed computing) and parallel processing. The Random Forest algorithm (a forest of randomized trees, that is, a tree ensemble) is considered for the performance evaluation because its trees are independent of one another and can be grown concurrently, which makes it a naturally parallel approach. It also offers good accuracy, handles noisy and imbalanced datasets, and is far less prone to overfitting on large datasets than a single tree model. Although significant improvements over the original approach are available, limitations remain regarding performance and streaming data: performance degrades as compute nodes are added, owing to redundant allocation of feature subsets in the hybrid approach of task and data parallelization, and the algorithm cannot handle stream data. These issues are identified, and a problem statement is formulated with the objective of achieving linearly scalable speedup and incremental processing capability for the random forest algorithm, so that it can perform predictive analytics over massive datasets in a cluster environment.
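
The task-parallel structure described in the abstract (every tree trained independently on its own bootstrap sample and random feature subset, then combined by majority vote) can be illustrated with a minimal sketch. This is not the chapter's implementation. It assumes one-split trees (stumps) in place of full trees and a local thread pool standing in for cluster workers, and the names `train_stump`, `fit_forest`, and `predict` are hypothetical. In CPython, threads do not give real CPU parallelism for this kind of work, so the sketch shows the structure only, not the speedup.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(task):
    """Fit a one-split tree on a bootstrap sample and a random feature subset."""
    X, y, n_sub, seed = task
    rng = random.Random(seed)
    rows = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
    features = rng.sample(range(len(X[0])), n_sub)          # random feature subset
    best = None
    for f in features:
        for t in sorted({X[i][f] for i in rows}):
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue  # split must leave samples on both sides
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # Degenerate sample (no valid split): always predict the majority class.
        label = Counter(y[i] for i in rows).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]  # (feature index, threshold, left label, right label)


def fit_forest(X, y, n_trees=25, n_sub=1, workers=4, seed=0):
    """Train all stumps concurrently; each task is independent of the others."""
    tasks = [(X, y, n_sub, seed + k) for k in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, tasks))


def predict(forest, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```

Calling `fit_forest(X, y)` on a list of numeric feature rows `X` with labels `y` returns the list of trained stumps, and `predict(forest, x)` classifies a single row. Because every task receives the full dataset and draws its own feature subset, nothing in this sketch prevents different workers from repeating the same work, which is the kind of redundancy the abstract identifies as a limit on scalability.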

Pages 141-154
DOI 10.1007/978-981-13-8300-7_12
Language English
