Appl. Soft Comput. | 2021

A distributed density estimation algorithm and its application to naive Bayes classification

 
 
 

Abstract


We consider the problem of learning a density function from observations of an unknown underlying model in a distributed setting, where the observations are partitioned across different sites. Applying commonly used density estimation methods such as the Gaussian Mixture Model (GMM) or Kernel Density Estimation (KDE) to distributed data requires an extensive amount of communication. A familiar approach to this issue is to sample a small subset of the data and collect it at a central node, where the density estimation algorithms are run. In this paper, we pursue an alternative to the sub-sampling approach by proposing the nested Log-Poly model. This model provides an accurate density estimate from a small-sized statistic of the entire data. In distributed settings, the small-sized statistics are transferred from the client nodes to a central node, where the estimation process is then run. The proposed model can be used in different learning tasks, such as classification in supervised learning and clustering in unsupervised learning. However, the properties of nested Log-Poly make it a suitable model for one-dimensional density estimation in distributed settings. This makes Log-Poly a good choice for the naive Bayes classifier, where a one-dimensional density estimate is required for every feature conditioned on the class label. We provide a theoretical analysis of the efficiency of our model in estimating a wide range of probability density functions. Our experiments show that nested Log-Poly outperforms state-of-the-art density estimators on several synthetic datasets. We compare the accuracy and the communication load of the naive Bayes classifier using nested Log-Poly and other related density estimators on several real datasets. The experimental results show that nested Log-Poly has a lower communication load while maintaining classification accuracy competitive with similar methods that use the entire data.
Moreover, we present a comprehensive comparison between nested Log-Poly and validated KDE with sub-sampling, in terms of the number of communicated variables and the number of bytes transferred between the clients and the central node. Nested Log-Poly provides accuracy comparable to that of validated KDE with sub-sampling while communicating fewer variables. However, our method needs to compute and transmit the variables with high precision in order to accurately capture the details of the underlying distributions.
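The communication pattern described above (each client sends a small statistic, and the central node works only with the aggregate) can be illustrated with a minimal sketch. The abstract does not specify which statistic nested Log-Poly transmits, so the example assumes power sums up to a fixed degree, which are the sufficient statistics of a density whose logarithm is a polynomial. The function names, the degree, and the sample values are hypothetical and are not taken from the paper; the density fitting step itself is omitted.

```python
from typing import List, Sequence


def client_statistic(samples: Sequence[float], degree: int) -> List[float]:
    """Return the power sums [n, sum x, sum x^2, ..., sum x^degree].

    Assumption: this is a stand-in for the small-sized statistic a client
    would transmit; the paper's actual statistic may differ.
    """
    return [sum(x ** k for x in samples) for k in range(degree + 1)]


def merge_statistics(stats: Sequence[Sequence[float]]) -> List[float]:
    """Central node: add the client statistics component by component."""
    return [sum(column) for column in zip(*stats)]


def empirical_moments(stat: Sequence[float]) -> List[float]:
    """Convert merged power sums into the moments E[x], ..., E[x^degree]."""
    n = stat[0]
    return [s / n for s in stat[1:]]


# Hypothetical one-dimensional feature values held at two sites.
site_a = [0.1, 0.4, 0.35]
site_b = [0.8, 0.55]
degree = 4

# Each site sends degree + 1 numbers, regardless of how many samples it holds.
merged = merge_statistics([
    client_statistic(site_a, degree),
    client_statistic(site_b, degree),
])
moments = empirical_moments(merged)
```

Under these assumptions, the merged statistic equals the one that would be computed on the pooled data, so the central node loses nothing by never seeing the raw samples. Because higher powers amplify rounding error, such statistics are sensitive to numerical precision, which is consistent with the high-precision requirement noted above.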

Volume 98
Pages 106837
DOI 10.1016/j.asoc.2020.106837
Language English
Journal Appl. Soft Comput.
