IEEE Transactions on Industrial Informatics | 2019

Random Sample Partition: A Distributed Data Model for Big Data Analysis

 
 
 

Abstract


With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different scenarios in industry. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
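The idea of disjoint blocks that each resemble the whole data set can be illustrated with a small single-machine sketch. It is not the paper's distributed procedure: the function name `random_sample_partition`, the shuffle-then-slice strategy, the synthetic Gaussian data, and the block count are all assumptions chosen for illustration.

```python
import random
import statistics


def random_sample_partition(records, num_blocks, seed=0):
    """Split records into num_blocks disjoint blocks after a random shuffle.

    Hypothetical helper: a shuffle-then-slice partition used only to
    illustrate the concept, not the method described in the paper.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding through the shuffled list yields disjoint blocks whose
    # sizes differ by at most one record.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


# Synthetic data standing in for a big data set (assumption for the demo).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = random_sample_partition(data, num_blocks=20)

# Statistic computed on the full data set, for comparison only.
full_mean = statistics.fmean(data)

# Estimate from a single block, touching 1/20 of the records.
block_mean = statistics.fmean(blocks[0])

# Averaging the estimates of a few blocks, still without a full scan.
ensemble_mean = statistics.fmean([statistics.fmean(b) for b in blocks[:5]])

print(f"full data mean:    {full_mean:.3f}")
print(f"one-block mean:    {block_mean:.3f}")
print(f"five-block mean:   {ensemble_mean:.3f}")
```

Because every record is assigned to a block at random, each block behaves like a simple random sample of the data, so an estimate computed on one block, or averaged over a few blocks, should land close to the full-data value at a fraction of the cost.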

Volume 15
Pages 5846-5854
DOI 10.1109/TII.2019.2912723
Language English
Journal IEEE Transactions on Industrial Informatics
