scikit-dyn2sel -- A Dynamic Selection Framework for Data Streams
Lucca Portes Cavalheiro, Jean Paul Barddal, Alceu de Souza Britto Jr, Laurent Heutte
Lucca Portes Cavalheiro [email protected]
Jean Paul Barddal [email protected]
Alceu de Souza Britto Jr. [email protected]
Graduate Program in Informatics (PPGIa), Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Brazil
Laurent Heutte [email protected]
Laboratoire d’Informatique, du Traitement de l’Information et des Systèmes (LITIS), Université de Rouen Normandie, Rouen, France
Abstract
Mining data streams is a challenge per se. A stream learner must be ready to deal with an enormous amount of data and with problems not present in batch machine learning, such as concept drift. Therefore, applying a batch-designed technique, such as dynamic selection of classifiers (DCS), also presents a challenge. The dynamic characteristic of ensembles that deal with streams presents barriers to the application of traditional DCS techniques in such classifiers. scikit-dyn2sel is an open-source Python library tailored for dynamic selection techniques in streaming data. Its development follows code quality and testing standards, including PEP8 compliance and automated high test coverage using codecov.io and circleci.com. Source code, documentation, and examples are made available on GitHub at https://github.com/luccaportes/Scikit-DYN2SEL.

Keywords: Dynamic Selection of Classifiers, Data Stream Mining
1. Introduction
Dynamic selection of classifiers (DCS) is a widely studied area in batch machine learning. Its application has provided significant gains on many types of data. When developing a new DCS technique, it is essential to compare it against the current state of the art. Nowadays, this is a straightforward task thanks to deslib (Cruz et al., 2020), a library that exposes most DCS methods through a familiar interface borrowed from scikit-learn (Pedregosa et al., 2011).

When dealing with DCS in data stream mining, however, there is no such convenience. Data stream mining differs from batch machine learning, and thus traditional DCS techniques cannot be applied as is. Concepts that are common in DCS, such as the validation set, are not naturally present in the streaming environment, which prevents the direct use of deslib (Cruz et al., 2020). In this paper, we propose scikit-dyn2sel, a framework for using and implementing DCS techniques in the data stream mining context.

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

Table 1: Methods contemplated in scikit-dyn2sel.

ApplyDCS                            DCSTechnique
DYNSE (Almeida et al., 2016)        KNORA-E (Ko et al., 2008)
DESDD (Albuquerque et al., 2019)    KNORA-U (Ko et al., 2008)
MDE (Zyblewski et al., 2019)        A Priori and A Posteriori (Giacinto and Roli, 1999)
                                    DCS-LA (Woods et al., 1996)
                                    DCS-RANK (Sabourin et al., 1993)
                                    KNOP (Cavalin et al., 2013)
                                    MCB (Huang and Suen, 1995)
                                    META-DES (Cruz et al., 2015)
2. Structure
The scikit-dyn2sel framework is built on top of deslib (Cruz et al., 2020) and scikit-multiflow (Montiel et al., 2018), a scikit-learn (Pedregosa et al., 2011) inspired library for data stream mining. All the methods for applying DCS follow the same interface as scikit-multiflow classifiers; the essential methods are partial_fit and predict, which are used, respectively, for updating the classifiers with new data and for computing predictions.

The framework is divided into four main classes. One of these is the DCSTechnique class, which contains the traditional DCS methods. The objective of this class is to output a prediction using an ensemble and a validation set, the latter being defined in the ValidationSet class. Some methods for applying DCS can be used directly on traditional online ensembles; however, many also cover the ensemble construction step, which is why each method inherits its ensemble from Ensemble. All of these classes are combined in the ApplyDCS class, from which the methods for applying DCS in data streams inherit. This class follows the same interface as scikit-multiflow (Montiel et al., 2018).

Another benefit of scikit-dyn2sel is that the traditional DCS techniques available in deslib (Cruz et al., 2020) are not re-implemented. Instead, they are encapsulated in the DCSTechnique class.

Table 1 presents all the DCS methods currently implemented in scikit-dyn2sel. The left part of the table displays the methods for applying dynamic selection techniques in data streams, and the right part displays the techniques themselves.
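To make the division of responsibilities more concrete, the following is a minimal pure-Python sketch of how an ensemble, a validation set, and a DCS technique could be combined behind a partial_fit/predict interface. It is not the scikit-dyn2sel implementation: every class name here (SlidingValidationSet, LocalAccuracySelector, ThresholdStub, ToyApplyDCS) is hypothetical, the validation set is assumed to be a simple sliding window, and the selection rule is a simplified local-accuracy criterion chosen only for illustration.

```python
from collections import deque


def squared_distance(u, v):
    # Squared Euclidean distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v))


class SlidingValidationSet:
    """Hypothetical validation set: a window over the latest labeled instances."""

    def __init__(self, max_size=200):
        self.window = deque(maxlen=max_size)

    def add(self, X, y):
        self.window.extend(zip(X, y))

    def nearest(self, x, k):
        # The k stored (instance, label) pairs closest to x.
        return sorted(self.window, key=lambda pair: squared_distance(pair[0], x))[:k]


class LocalAccuracySelector:
    """Hypothetical DCS technique: use the member most accurate around x."""

    def __init__(self, k=3):
        self.k = k

    def predict_one(self, ensemble, validation_set, x):
        region = validation_set.nearest(x, self.k)

        def local_hits(member):
            # Number of neighbours of x that this member classifies correctly.
            return sum(member.predict([xi])[0] == yi for xi, yi in region)

        return max(ensemble, key=local_hits).predict([x])[0]


class ThresholdStub:
    """Toy base classifier for this sketch: predicts 1 when x[0] > threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def partial_fit(self, X, y):
        return self  # the rule is fixed, so there is nothing to update

    def predict(self, X):
        return [int(x[0] > self.threshold) for x in X]


class ToyApplyDCS:
    """Hypothetical wrapper exposing the partial_fit / predict interface."""

    def __init__(self, ensemble, dcs_technique, validation_set):
        self.ensemble = ensemble
        self.dcs_technique = dcs_technique
        self.validation_set = validation_set

    def partial_fit(self, X, y):
        for member in self.ensemble:
            member.partial_fit(X, y)
        self.validation_set.add(X, y)  # recent labeled data doubles as validation data
        return self

    def predict(self, X):
        return [
            self.dcs_technique.predict_one(self.ensemble, self.validation_set, x)
            for x in X
        ]


# Toy concept: the label is 1 when x[0] > 0.8.
model = ToyApplyDCS(
    [ThresholdStub(0.2), ThresholdStub(0.8)],
    LocalAccuracySelector(k=3),
    SlidingValidationSet(),
)
model.partial_fit([[0.1], [0.3], [0.5], [0.7], [0.9], [0.95]], [0, 0, 0, 0, 1, 1])
predictions = model.predict([[0.5], [0.9]])
```

In this toy run the member with threshold 0.8 is the more accurate one in the neighbourhood of both query points, so it is selected for both. Real methods for applying DCS in streams also decide when to build or discard ensemble members, which this sketch leaves out.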
3. Open Source
scikit-dyn2sel is open to contributions from the community. It is hosted in a public repository on GitHub and licensed under the MIT license, a permissive license that allows, among other things, commercial use, distribution, modification, and private use.

    from skmultiflow.data import SEAGenerator
    from skmultiflow.evaluation import EvaluatePrequential
    from skmultiflow.trees import HoeffdingTree
    from dyn2sel.apply_dcs import DYNSEMethod
    from dyn2sel.dcs_techniques import KNORAE

    # DYNSE with Hoeffding trees as base classifiers and KNORA-E as the DCS technique
    clf = DYNSEMethod(HoeffdingTree(), chunk_size=1000,
                      dcs_method=KNORAE(), max_ensemble_size=10)
    # Synthetic SEA data stream
    gen = SEAGenerator()
    # Prequential (test-then-train) evaluation
    ev = EvaluatePrequential()
    ev.evaluate(gen, clf)

Figure 1: Usage example of scikit-dyn2sel.
4. Installation
The library can be installed via the Python package manager (pip) using “pip install scikit-dyn2sel”, or by directly cloning its GitHub repository.
5. Tests
To ensure the correct operation of the framework, unit tests were written for each main method in the library. When a new contribution to the code is proposed, a continuous integration tool (CircleCI) runs the tests to ensure that, if the contribution is accepted, the previously expected behavior of the methods is preserved. To measure the percentage of test coverage, Codecov is applied after the CircleCI tests pass. A contribution is only accepted if it does not decrease the test coverage percentage of the framework.
6. Code Quality
The code is fully compliant with the Python PEP8 standard. Compliance is ensured by the Black code formatting tool (Python Software Foundation, 2018), which is also run on CircleCI after each contribution proposal. Furthermore, the static analyzer Codacy is integrated into the GitHub repository, ensuring standardized code quality.
7. Usage
The usage of scikit-dyn2sel is straightforward. Since it follows the same interface as scikit-multiflow (Montiel et al., 2018), its methods can be run with the evaluators commonly used in that library, such as the prequential evaluator (Gama et al., 2013). Figure 1 shows how this can be done using the DYNSE (Almeida et al., 2016) method.
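For readers unfamiliar with prequential evaluation, the basic idea is test-then-train: each incoming instance is first used to test the model and only afterwards to update it. The sketch below illustrates that loop in plain Python under simplifying assumptions. It is not the scikit-multiflow evaluator; the function prequential_accuracy and the class MajorityClassLearner are hypothetical names introduced for illustration, and refinements such as sliding windows or fading factors are omitted.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y):
        self.counts.update(y)
        return self

    def predict(self, X):
        label = self.counts.most_common(1)[0][0] if self.counts else 0
        return [label for _ in X]


def prequential_accuracy(stream, model, pretrain_size=1):
    """Test-then-train accuracy over a stream of (x, y) pairs."""
    stream = iter(stream)
    # Warm up the model on the first instances without scoring them.
    for _ in range(pretrain_size):
        x, y = next(stream)
        model.partial_fit([x], [y])
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict([x])[0] == y)  # test first
        total += 1
        model.partial_fit([x], [y])  # then train on the same instance
    return correct / total if total else 0.0


stream = [([0.0], 1), ([0.1], 1), ([0.2], 0), ([0.3], 1), ([0.4], 0)]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Any object exposing partial_fit and predict can be passed as the model, which is why following the scikit-multiflow interface is enough for a classifier to be evaluated this way.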
Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
References
Regis Antonio Saraiva Albuquerque, Albert Franca Josua Costa, Eulanda Miranda dos Santos, Robert Sabourin, and Rafael Giusti. A decision-based dynamic ensemble selection method for concept drift, 2019.

P. R. L. D. Almeida, L. S. Oliveira, A. D. S. Britto, and R. Sabourin. Handling concept drifts using dynamic selection of classifiers. In , pages 989–995, Nov 2016. doi: 10.1109/ICTAI.2016.0153.

Paulo Cavalin, Robert Sabourin, and Ching Suen. Dynamic selection approaches for multiple classifier systems. Neural Computing and Applications, 22, 03 2013. doi: 10.1007/s00521-011-0737-9.

Rafael Cruz, Robert Sabourin, George Cavalcanti, and Tsang Ing Ren. Meta-des: A dynamic ensemble selection framework using meta-learning. Pattern Recognition, 48, 05 2015. doi: 10.1016/j.patcog.2014.12.003.

Rafael M. O. Cruz, Luiz G. Hafemann, Robert Sabourin, and George D. C. Cavalcanti. Deslib: A dynamic ensemble selection library in python. Journal of Machine Learning Research, 21(8):1–5, 2020. URL http://jmlr.org/papers/v21/18-144.html.

João Gama, Raquel Sebastião, and Pedro Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90:317–346, 10 2013. doi: 10.1007/s10994-012-5320-9.

G. Giacinto and F. Roli. Methods for dynamic classifier selection. In Proceedings 10th International Conference on Image Analysis and Processing, pages 659–664, Sep. 1999. doi: 10.1109/ICIAP.1999.797670.

Y. S. Huang and C. Y. Suen. A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):90–94, Jan 1995. ISSN 1939-3539. doi: 10.1109/34.368145.

Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto, Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recogn., 41(5):1718–1731, May 2008. ISSN 0031-3203.

Jacob Montiel, Jesse Read, Albert Bifet, and Talel Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72):1–5, 2018. URL http://jmlr.org/papers/v19/18-251.html.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Python Software Foundation (PSF). Black - the uncompromising python code formatter. https://github.com/psf/black, 2018.

M. Sabourin, A. Mitiche, D. Thomas, and G. Nagy. Classifier combination for hand-printed digit recognition. In Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR ’93), pages 163–166, Oct 1993. doi: 10.1109/ICDAR.1993.395758.

Kevin Woods, W. Philip Kegelmeyer Jr, and Kevin Bowyer. Combination of multiple classifiers using local accuracy estimates. In Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR 96), CVPR 96, page 391, USA, 1996. IEEE Computer Society. ISBN 0818672587.

Paweł Zyblewski, Paweł Ksieniewicz, and Michał Woźniak. Classifier selection for highly imbalanced data streams with minority driven ensemble. In Leszek Rutkowski, Rafał Scherer, Marcin Korytkowski, Witold Pedrycz, Ryszard Tadeusiewicz, and Jacek M. Zurada, editors, Artificial Intelligence and Soft Computing, pages 626–635, Cham, 2019. Springer International Publishing. ISBN 978-3-030-20912-4.