Combining Machine Learning Models using combo Library
Yue Zhao, Xuejian Wang, Cheng Cheng, Xueying Ding
H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected], [email protected], [email protected], [email protected]
Abstract
Model combination, often regarded as a key sub-field of ensemble learning, has been widely used in both academic research and industry applications. To facilitate this process, we propose and implement an easy-to-use Python toolkit, combo, to aggregate models and scores under various scenarios, including classification, clustering, and anomaly detection. In a nutshell, combo provides a unified and consistent way to combine both raw and pre-trained models from popular machine learning libraries, e.g., scikit-learn, XGBoost, and LightGBM. With accessibility and robustness in mind, combo is designed with detailed documentation, interactive examples, continuous integration, code coverage, and maintainability checks; it can be installed easily through the Python Package Index (PyPI) or https://github.com/yzhao062/combo.

Introduction
Recently, model combination has gained much attention in many real-world tasks and has served as the winning solution in numerous data science competitions such as Kaggle (2007). It is considered a sub-field of ensemble learning, which aims to achieve better prediction performance (2012). Despite that, its use often extends beyond machine learning; it has been applied in other domains, such as experimental design in clinical trials. Generally speaking, model combination has two key uses: stability improvement and performance boost. For instance, practitioners may run independent trials and then average the results to eliminate built-in randomness and uncertainty, so that more reliable results are obtained. Additionally, even in a non-ideal scenario, base models may make independent but complementary errors; the combined model can therefore yield better performance than any constituent model.

Although model combination is crucial for all sorts of learning tasks, dedicated Python libraries are absent. A few packages partly fulfill this purpose, but established libraries either exist as single-purpose tools like PyOD (2019) and pycobra (2018), or as part of general-purpose libraries like scikit-learn (2011).
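The stability gain from averaging independent trials can be illustrated with a small simulation; the estimator and noise model below are illustrative assumptions for this sketch, not taken from the paper:

```python
import random
import statistics

random.seed(0)

def noisy_estimate():
    """One independent trial: the true value 1.0 plus zero-mean Gaussian
    noise (an illustrative stand-in for a randomized model's output)."""
    return 1.0 + random.gauss(0, 0.5)

# Spread of single trials vs. the average of 10 independent trials.
singles = [noisy_estimate() for _ in range(1000)]
averaged = [statistics.mean(noisy_estimate() for _ in range(10))
            for _ in range(1000)]

# Averaging independent trials shrinks the spread of the estimate.
print(statistics.stdev(singles) > statistics.stdev(averaged))
```

The averaged estimates concentrate much more tightly around the true value, which is exactly the "stability improvement" use case described above.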
>>> from combo.models.classifier_dcs import DCS
Code Snippet 1: Demo of combo API with DCS

combo can fill this gap with four key advantages. Firstly, combo contains more than 15 combination algorithms, including both classical algorithms like dynamic classifier selection (DCS) (1997) and recent advancements like LSCP (2019). It can handle the combination operation for all sorts of tasks, such as classification, clustering, and anomaly detection. Secondly, combo works with both raw and pre-trained learning models from major libraries like scikit-learn, XGBoost, and LightGBM, given certain conditions are met. Thirdly, the models in combo are designed with unified APIs, detailed documentation, and interactive examples for easy use. Lastly, all combo models come with unit tests and are checked by continuous integration tools for robustness; code coverage and maintainability checks are also enabled for performance and sustainability. To the best of our knowledge, this is the first comprehensive framework for combining learning models and scores in Python, which is valuable for data practitioners, machine learning researchers, and data competition participants.

https://pycombo.readthedocs.io
https://mybinder.org/v2/gh/yzhao062/combo/master

Figure 1: Comparison of Selected Classifier Combination on Simulated Data

Core Scenarios

combo models for classification, clustering, and anomaly detection share unified APIs. Inspired by scikit-learn's API design, the models in combo all come with the following key methods: (i) the fit function processes the train data and gets the model ready for prediction; (ii) the predict function generates labels for the unknown test data once the model is fitted; (iii) predict_proba generates predictions in probability instead of the discrete labels given by predict; and (iv) fit_predict calls the fit function on the input data and then predicts on it (applicable to unsupervised models only).
Code Snippet 1 shows the use of the above APIs on DCS. Notably, fitted (pre-trained) models can be used directly by setting the pre_fitted flag; the fit process will then be skipped.
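The four-method pattern can be sketched with a minimal probability-averaging combiner. This is a toy stand-in, not combo's actual implementation; the class and the dummy base model below are illustrative:

```python
# A toy probability-averaging combiner mirroring the combo/scikit-learn
# API pattern (fit / predict / predict_proba). Illustrative only.
class AveragingCombiner:
    def __init__(self, base_estimators, pre_fitted=False):
        self.base_estimators = base_estimators
        self.pre_fitted = pre_fitted  # skip training for pre-trained models

    def fit(self, X, y):
        if not self.pre_fitted:
            for clf in self.base_estimators:
                clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # Average the class-probability outputs of all base models.
        probas = [clf.predict_proba(X) for clf in self.base_estimators]
        n = len(probas)
        return [[sum(p[i][j] for p in probas) / n
                 for j in range(len(probas[0][0]))]
                for i in range(len(X))]

    def predict(self, X):
        # Pick the class with the highest averaged probability.
        return [row.index(max(row)) for row in self.predict_proba(X)]

class ConstantModel:
    """Dummy base model returning fixed class probabilities."""
    def __init__(self, proba):
        self.proba = proba
    def fit(self, X, y):
        return self
    def predict_proba(self, X):
        return [self.proba for _ in X]

combiner = AveragingCombiner(
    [ConstantModel([0.9, 0.1]), ConstantModel([0.2, 0.8]),
     ConstantModel([0.1, 0.9])])
combiner.fit([[0], [1]], [0, 1])
print(combiner.predict([[0]]))  # averaged probas [0.4, 0.6] -> [1]
```

Setting pre_fitted=True in this sketch skips the training loop, mirroring how pre-trained models are plugged in.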
Classifier Combination aims to aggregate multiple base supervised classifiers in either a parallel or sequential manner. Selected classifier combination methods implemented in combo include stacking (meta-learning), dynamic classifier selection, dynamic ensemble selection, and a group of heuristic aggregation methods like averaging and majority vote. Fig. 1 shows how different frameworks behave on a simulated dataset with 300 points. The leftmost one is a simple kNN model (k=15), and the other three are combinations of five kNN models with different values of k.

Different from classifier combination, Cluster Combination is usually done in an unsupervised manner. The focus is on how to align the predicted labels generated by base clusterings, as cluster labels are categorical instead of ordinal. For instance, [0, 0, 1, 2, 2] and [1, 1, 0, 2, 2] are equivalent with appropriate alignment. Two classical clustering combination methods are therefore implemented to handle this: clustering combination using evidence accumulation (EAC) (2005) and Clusterer Ensemble (2006).

Anomaly Detection concentrates on identifying anomalous objects that deviate from the general data distribution (2019). The challenges of combining multiple outlier detectors lie in its unsupervised nature and extreme data imbalance. Two recent combination frameworks, LSCP (2019) and XGBOD (2018), are included in combo for unsupervised and semi-supervised detector combination.
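The label-alignment problem for cluster combination can be made concrete with a short sketch. This brute-force matcher over label permutations is an illustration of the alignment idea, not combo's EAC or Clusterer Ensemble code, and is only practical for small numbers of clusters:

```python
from itertools import permutations

def best_alignment(labels_a, labels_b):
    """Relabel labels_b so it agrees with labels_a on as many points as
    possible, trying every permutation of cluster ids (fine for small k)."""
    ids = sorted(set(labels_b))
    best, best_score = list(labels_b), -1
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        relabeled = [mapping[l] for l in labels_b]
        score = sum(a == b for a, b in zip(labels_a, relabeled))
        if score > best_score:
            best, best_score = relabeled, score
    return best

# Two partitions that differ only in how cluster ids are named:
print(best_alignment([0, 0, 1, 2, 2], [1, 1, 0, 2, 2]))  # -> [0, 0, 1, 2, 2]
```

After alignment the two partitions agree on every point, showing why categorical cluster labels must be matched before any vote or average is taken.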
Score Combination comes with more flexibility than the above tasks, as it only requires the output from multiple models, whether from a group of classifiers or outlier detectors. As a general-purpose task, score combination methods are easy to use without the need to initialize a dedicated class. Each aggregation method, e.g., average of maximum (AOM), can be invoked directly.
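As an illustration of such a functional aggregation method, the sketch below implements average-of-maximum in plain Python. combo ships its own implementation; this standalone version reflects the commonly described AOM scheme (split the detectors into random subgroups, take the maximum score within each subgroup, then average those maxima) and its signature is an assumption, not combo's API:

```python
import random

def aom(scores, n_buckets=2, seed=0):
    """Average of Maximum (AOM): randomly split the detectors into
    n_buckets subgroups, take each sample's max score within every
    subgroup, then average those maxima.

    scores: list of rows, one per sample; each row holds one score
    per detector.
    """
    n_detectors = len(scores[0])
    rng = random.Random(seed)
    idx = list(range(n_detectors))
    rng.shuffle(idx)
    size = n_detectors // n_buckets
    buckets = [idx[i * size:(i + 1) * size] for i in range(n_buckets)]
    buckets[-1].extend(idx[n_buckets * size:])  # leftovers join last bucket

    combined = []
    for row in scores:
        maxima = [max(row[j] for j in bucket) for bucket in buckets]
        combined.append(sum(maxima) / n_buckets)
    return combined

# Four detectors' outlier scores for two samples, combined with 2 buckets:
scores = [[0.1, 0.9, 0.2, 0.4],
          [0.8, 0.3, 0.7, 0.2]]
print(aom(scores, n_buckets=2))
```

Because AOM only consumes a score matrix, the same call works no matter which library produced the detectors, which is the flexibility the paragraph above describes.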
Conclusion and Future Directions

combo is a comprehensive Python library to combine models from major machine learning libraries. It supports four types of combination scenarios (classification, clustering, anomaly detection, and raw scores) with unified APIs, detailed documentation, and interactive examples. As avenues for future work, we will add combination frameworks for customized deep learning models (from TensorFlow, PyTorch, and MXNet), enable GPU acceleration and parallelization for scalability, and expand to more task scenarios such as imbalanced learning and regression.
References

[2007] Bell, R. M., and Koren, Y. 2007. Lessons from the Netflix prize challenge. SIGKDD Explorations.
[2005] Fred, A. L., and Jain, A. K. 2005. Combining multiple clusterings using evidence accumulation. PAMI.
[2018] Guedj, B., and Srinivasa Desikan, B. 2018. Pycobra: A Python toolbox for ensemble learning and visualisation. JMLR.
[2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; et al. 2011. Scikit-learn: Machine learning in Python. JMLR.
[1997] Woods, K.; Kegelmeyer, W. P.; and Bowyer, K. 1997. Combination of multiple classifiers using local accuracy estimates. PAMI.
[2018] Zhao, Y., and Hryniewicki, M. K. 2018. XGBOD: Improving supervised outlier detection with unsupervised representation learning. In IJCNN.
[2019] Zhao, Y.; Nasrullah, Z.; Hryniewicki, M. K.; and Li, Z. 2019. LSCP: Locally selective combination in parallel outlier ensembles. In SDM, 585–593. SIAM.
[2019] Zhao, Y.; Nasrullah, Z.; and Li, Z. 2019. PyOD: A Python toolbox for scalable outlier detection. JMLR.
[2012] Zhou, Z.-H. 2012. Ensemble Methods: Foundations and Algorithms. CRC Press.
[2006] Zhou, Z.-H., and Tang, W. 2006. Clusterer ensemble. KBS.