A Survey on Sampling and Profiling over Big Data (Technical Report)
Zhicheng Liu∗, Aoqian Zhang†
∗ Tsinghua University, [email protected]
† University of Waterloo, [email protected]
Abstract—Due to the development of internet technology and computer science, data is exploding at an exponential rate. Big data brings us new opportunities and challenges. On the one hand, we can analyze and mine big data to discover hidden information and obtain more potential value. On the other hand, the 5V characteristics of big data, especially Volume, i.e., the sheer amount of data, pose challenges to storage and processing. For many traditional data mining algorithms, machine learning algorithms and data profiling tasks, it is very difficult to handle such a large amount of data: processing it is highly demanding on hardware resources and time consuming. Sampling methods can effectively reduce the amount of data and help speed up data processing. Hence, sampling techniques have been widely studied and applied in the big data context, e.g., methods for determining sample size and combining sampling with big data processing frameworks. Data profiling is the activity of finding metadata of a data set and has many use cases, e.g., performing data profiling tasks on relational data, graph data, and time series data for anomaly detection and data repair. However, data profiling is computationally expensive, especially for large data sets. Therefore, this paper focuses on sampling and profiling in the big data context and investigates the application of sampling in different categories of data profiling tasks. The experimental results of the surveyed studies show that results obtained from sampled data are close to, or sometimes even exceed, the results obtained from the full data. Therefore, sampling plays an important role in the era of big data, and we have reason to believe that it will become an indispensable step in big data processing in the future.
Index Terms—Big Data, Large Amount, Sampling, Data Profiling
I. INTRODUCTION

We are in the era of big data. With the development of computer science and internet technology, data is exploding at an exponential rate. According to statistics, Google processes more than hundreds of PB of data per day, Facebook generates more than 10 PB of log data per month, Baidu processes nearly 100 PB of data per day, and Taobao generates dozens of terabytes of online transaction data every day [1]. In May 2011, the McKinsey Global Institute (MGI) released the report "The Next Frontier of Big Data: Innovation, Competition, and Productivity", which stated that big data has great potential in the European public sector, US health care, manufacturing, the US retail industry and location-based services. MGI estimates in the report that the mining and analysis of big data can generate 300 billion in potential value per year in the US medical sector and more than 149 billion in the European public sector [2]. It can be seen that there is great value behind big data. Therefore, mining the hidden value in big data makes a lot of sense. Big data is data so huge and complex that it is difficult or impossible for traditional systems and tools to process it [3]. More recently, IBM has used the "5Vs" model to characterize big data. In the "5Vs" model, Volume refers to the amount of data and is the most direct difficulty faced by traditional systems; Velocity means that data is generated quickly; Variety means that data sources and data types are diverse, including structured, semi-structured, and unstructured data; Value is the most important feature of big data, although the value density of the data is low; Veracity refers to the data quality of big data, where dirty data exists. Because big data is so large, data analysis and data mining over big data require high computing power and storage capacity. In addition, some classical mining algorithms that require several passes over the whole dataset may take hours or even days to produce a result [4].
A. Data Sampling
At present, there are two major strategies for data miningand data analysis: sampling and using distributed systems [5].The existing big data processing framework includes batchprocessing framework like Apache Hadoop, streaming dataprocessing framework like Apache Storm, hybrid processingframework like Apache Spark and Apache Flink. Samplingis a scientific method of selecting representative sample datafrom target data. Designing a big data sampling mechanismis to reduce the amount of data to a manageable size forprocessing [6]. Even if computer clusters are available, wecan use sampling such as block-level sampling to speed upbig data analysis [7].Different from distributed systems, sampling is a kindof data reduction method like filtering. Distributed systemsincrease computing power by adding hardware resources.However, a huge computing cost is not always affordablein practice. It is highly demanded to perform the computingunder limited resources. In this sense, sampling is very useful.Since the full amount of data is not used, the approximate re-sult is obtained from the sample data. Such approximate resultis quite useful in the context of big data. The computationalchallenge of big data means that sampling is essential andthe sampling methods chosen by researchers is also important[8]. Besides, the biases caused by sampling are also somethingneed to be considered.
Sampling or re-sampling uses less data to capture the overall characteristics of the whole dataset. Albattah [9] studies the role of sampling in big data analysis. He argues that even if we can handle the full amount of data, we do not have to do so. The study focuses on how sampling plays its role in specific fields of Artificial Intelligence and verifies it experimentally. The experimental results show that sampling not only reduces the data processing time, but also yields better results in some cases. Even though some sampled results are not as accurate as those on the original dataset, the loss is negligible compared to the greatly reduced processing time. As stated in [9], we believe that sampling can improve big data analysis and will become a preprocessing step in big data processing in the future.
When it comes to sampling, how to determine the sample size is a very important factor, and different scholars have proposed many methods to determine the sample size [10]–[13]. We also have to consider sampling bias when using sampling techniques. In addition, some scholars have studied the application of sampling techniques in the big data context, e.g., combining sampling with distributed storage and big data computing frameworks. These topics are introduced in detail in Section III.
B. Data Profiling
Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from large data sets [14]. Data profiling gathers metadata of data that can be used to find data to be mined and to import data into various tools for analysis, which is an important preparatory task [15]. There is currently no formal, universal or widely accepted definition of the distinction between data profiling and data mining. Abedjan et al. [16] consider that data profiling generates metadata for data sets that helps to understand and manage those data sets, whereas data mining is used to mine the hidden, less obvious knowledge behind the data. Of course, data profiling and data mining also have some overlapping tasks, such as association rule mining and clustering. In summary, the goal of data profiling is to generate summary information about the data to help understand the data, and the goal of data mining is to mine new insights from the data.
There are many use cases of data profiling, such as data profiling for missing data imputation [17], [18] or erroneous data repairing in relational databases [19]. However, data profiling itself faces computational challenges, especially for large data sets. Hence, alleviating the computational challenges of data profiling is very significant in the era of big data. As mentioned above, sampling for big data profiling is very valuable and meaningful. We give a brief introduction to data profiling in Section II.
C. Sampling for Data Profiling
In this paper, we focus on the sampling techniques used for big data profiling. Certainly, we first introduce data profiling and sampling technology separately. Among them, data profiling has been covered by outstanding survey papers such as [16]. Our core content is to introduce the application of sampling in data profiling tasks when facing large data sets.

Fig. 1: A classification of typical data profiling tasks [16].

In [16], the research on data profiling around the relational database is fully investigated and introduced, and a classification of data profiling (see Figure 1) is given. We investigate the sampling techniques for important data profiling tasks on single columns, multiple columns and dependencies according to the classification of data profiling in [16]. Some traditional sampling methods are introduced in [10], where methods of determining the sample size are mainly discussed, but less attention is paid to sampling in the big data context. Therefore, when discussing sampling technology below, we supplement it with applications of sampling in the big data scenario, e.g., block-based sampling.
Specifically, in order to ensure the comprehensiveness of the survey, we follow the systematic search method provided in [16], a comprehensive summary of data profiling techniques. As also illustrated in Figure 1, Abedjan et al. [16] categorize data profiling approaches into three aspects, from the elementary columns to the complex ones, i.e., (1) data profiling for single columns, (2) data profiling for multiple columns, and (3) data profiling for dependencies. While the sampling techniques for data profiling are not emphasized in [16], in our paper we extensively select the studies on sampling for data profiling in the aforesaid categories, respectively.

Fig. 2: A systematic search method for selecting studies, following the categorization in Figure 1 by [16].

Figure 2 presents the systematic search method for selecting studies, following the categorization in [16]. Following this method, we summarize the typical methods selected in each category in Table III.
The remainder of this paper is organized as follows. In Section II, we introduce the relevant knowledge of data profiling. In Section III, we introduce sampling techniques and some important factors in sampling. Next, we introduce the application of sampling for single-column data profiling tasks in Section IV, multi-column data profiling tasks in Section V, and dependencies in Section VI, based on the classification of data profiling tasks in [16]. Finally, in Section VII, we summarize the content of the article and propose some future works. The organizational structure of this article is shown in Figure 2.

II. DATA PROFILING
Before using or processing data, it is very important to have a general understanding of the data. Data profiling is the activity of finding metadata of a data set [16], [20], [21]; therefore it can provide basic information about the data to help people understand it. Data profiling is an important research area for many IT experts and scholars. It has many classic use cases, such as data integration, data quality, data cleansing, big data analysis, database management and query optimization [16], [20]. Abedjan et al. [16] mainly investigate data profiling for relational data. However, in addition to relational databases, many non-relational data sources also need data profiling [20], such as time series data [22]–[24], graph data [25]–[27], or heterogeneous data in dataspaces [28]–[30].
Data profiling tasks are classified in [16] and [20]. Abedjan et al. [16] classify the data profiling tasks of a single data source and divide them into single column data profiling, multiple columns data profiling and dependencies (see Figure 1). In fact, dependencies belong to multiple columns data profiling tasks; Abedjan et al. [16] put dependencies into a separate category and discuss them in detail. Naumann [20] classifies data profiling from single data sources to multiple data sources.
There are three challenges for data profiling: managing the input, performing the computation and managing the output [16], [20], [31]. In this article we focus on the second challenge, performing the computation, i.e., the computational complexity of data profiling. The computational complexity of data profiling depends on the number of rows and columns of the data. When the data set is very large, the computation of data profiling can be very expensive. This is why we care about sampling for big data profiling, in order to reduce the computational pressure and speed up the process of data profiling.

III. SAMPLING TECHNIQUES
In this section, we introduce common sampling techniques in Section III-A, the application of sampling in the big data context in Section III-B, methods of determining the sample size in Section III-C, and approaches to reducing sampling bias in Section III-D.
A. Common Sampling Techniques
Sampling refers to estimating the characteristics of the entire population through representative subsets of the population [10]. From a broad perspective, sampling involves probability and non-probability sampling. Probability sampling means that every unit in a finite population has a certain probability of being selected, not necessarily an equal one. Non-probability sampling is generally based on subjective ideas and inferences, e.g., common web questionnaires [32], [33]. The sampling methods mentioned below are all probability sampling methods. Sampling is often used in data profiling [16], data analysis [34], data mining [6], data visualization [35], machine learning [36], etc. The advantage of sampling is that algorithms or models can be run on a subset instead of the whole data set. Commonly used sampling techniques include simple random sampling [37], stratified sampling [38], systematic sampling [39], cluster sampling [40], oversampling and undersampling [41], [42], reservoir sampling [43], etc. Table I gives an overview of these common sampling methods.
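As a concrete illustration of two of the methods in Table I, the following Python sketch shows simple random sampling over an in-memory list and reservoir sampling (Algorithm R) over a stream whose length is unknown in advance; the data and sample sizes are made-up examples.

```python
import random

def simple_random_sample(records, k, seed=0):
    """Draw k records uniformly without replacement (each tuple equally likely)."""
    rng = random.Random(seed)
    return rng.sample(records, k)

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep a uniform sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # position to (maybe) replace
            if j < k:
                reservoir[j] = item
    return reservoir

if __name__ == "__main__":
    data = list(range(1_000_000))                        # toy "population"
    print(simple_random_sample(data, 5))
    print(reservoir_sample((x * x for x in data), 5))    # works in one pass over a stream
```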
B. Sampling in Big Data Context
In the era of big data, the application of sampling is particularly important due to the large volume of data. Moreover, sampling can be performed with the help of big data computing frameworks.
TABLE I: Common sampling methods
Simple random sampling: Extract a certain number of tuples such that each tuple is selected with equal probability.
Stratified random sampling: Tuples are divided into homogeneous groups and samples are drawn from each group.
Systematic sampling: Sample at regular intervals until the sample size is satisfied.
Cluster sampling: Tuples are divided into non-overlapping groups and some groups are randomly selected as the sample.
Oversampling and undersampling: Oversampling randomly repeats minority class samples, while undersampling randomly discards majority class samples to balance the data.
Reservoir sampling: Add tuples into a reservoir of fixed size when the size of the entire data set is unknown.

For example, He et al. [44] use MapReduce to sample from data which contains uncertainty, and He et al. [45] propose a block-based sampling (I-sampling) method for large-scale datasets whose data is already distributed over a cluster. The processing flow of I-sampling is shown in Figure 3.
Traditional sampling methods like simple random sampling, stratified sampling and systematic sampling are record-based. These record-based sampling methods require a complete pass over the whole dataset. Hence, they are commonly used for small or medium scale datasets on a single computer. Even when the whole dataset is already distributed over a cluster, it is very difficult to get a high-quality partition of the original dataset [46]. In the era of big data, data profiling tasks can be carried out on distributed systems, e.g., on HDFS data. Therefore, the block-based sampling proposed in [45] is a promising sampling method for data stored on distributed machines.
He et al. [45] propose a block-based sampling method for large-scale datasets. They take block-based sampling as one of the components of their big data learning framework, the asymptotic ensemble learning framework [47]. However, this block-based sampling method is suitable for data that is randomly ordered and not for records that are stored in an ordered manner. In order to solve this problem, they propose a general block-based sampling method named I-sampling.
I-sampling takes four steps to obtain a sample. Firstly, it divides the large-scale dataset into non-overlapping primary data blocks A_i. Secondly, it shuffles the primary data blocks A_i to get shuffled data blocks B_i; the purpose of shuffling is to disrupt the order of the original data. Thirdly, it randomly selects data from the B_i and puts it into basic blocks to form a block pool C. Finally, a certain number of basic blocks are randomly selected from the block pool, and the data in these blocks is taken as the sample. In experiments, He et al. [45] demonstrate that block-based sampling has essentially the same means and variances as simple random sampling, that the distribution of the I-sampling data is approximately the same as that of the original dataset, and that the RMSEs of an extreme learning machine trained on record-based random samples and on I-sampling samples are nearly the same.

Fig. 3: I-sampling workflow [45].

As a matter of fact, data contains uncertainty in many applications. For example, when we run an experiment such as sampling, uncertainty occurs because there are many potential results of the sampling. Uncertainty means the diversity of potential outcomes, which is unknown to us. Dealing with big data with an uncertain distribution is one of the most important issues in big data research [44].
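To make the four I-sampling steps described above concrete, the following Python sketch reproduces the block/shuffle/pool/select pipeline on a single in-memory list; the block size, pool size and data are illustrative assumptions and this is not the original distributed implementation of [45].

```python
import random

def i_sampling(records, block_size, pool_size, num_selected_blocks, seed=0):
    """Simplified, in-memory illustration of block-based I-sampling:
    (1) split into primary blocks A_i, (2) shuffle the blocks,
    (3) randomly draw data from the shuffled blocks into basic blocks forming pool C,
    (4) randomly select basic blocks from C and return their records as the sample."""
    rng = random.Random(seed)
    blocks = [records[i:i + block_size] for i in range(0, len(records), block_size)]  # (1)
    rng.shuffle(blocks)                                                               # (2)
    pool = []                                                                         # (3)
    for _ in range(pool_size):
        basic_block = [rng.choice(rng.choice(blocks)) for _ in range(block_size)]
        pool.append(basic_block)
    chosen = rng.sample(pool, min(num_selected_blocks, len(pool)))                    # (4)
    return [r for b in chosen for r in b]

if __name__ == "__main__":
    data = list(range(100_000))  # stands in for one machine's share of a large dataset
    print(len(i_sampling(data, block_size=1000, pool_size=20, num_selected_blocks=5)))
```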
The sample quality affects the accuracy of data profiling. The following example shows how to use MapReduce to accelerate sampling from a big data set with an uncertain distribution and to select a Minimal Consistent Subset with better sample quality. A Minimal Consistent Subset (MCS) is a consistent subset with a minimum number of elements.
He et al. [44] use MapReduce to sample from data which contains uncertainty. They propose a Parallel Sampling method based on Hyper Surface (PSHS) for big data with an uncertain distribution to obtain the MCS of the original sample set. Hyper Surface Classification (HSC) is a classification method based on the Jordan Curve Theorem, put forward by He et al. [48]. The MCS of HSC is a sample subset obtained by selecting one and only one representative sample from each unit included in the hyper surface. Some samples in the MCS are replaceable, while others are not, leading to the uncertainty of elements in the MCS [44]. Because of the large scale of the data, they use MapReduce, a well-known distributed computing framework, for parallel sampling.
The PSHS algorithm executes three kinds of MapReduce jobs. In the first job, based on the value of each dimension of the data, the map function places each sample in the region to which it belongs; the reduce function determines whether this region is pure and labels each region accordingly: pure or impure. In the second job, the corresponding decision tree is generated and the samples that have no effect on the generated decision tree are removed. The third job is the sampling job, where the map function places each sample in the pure region it belongs to according to the rules; within a pure region all samples have the same effect on building the classifier, hence the reduce task randomly selects one and only one sample from each region for building the MCS. The Minimal Consistent Subset selected by this parallel sampling is a good representation of the original data.

C. Determination of Sample Size
It is very important to select effective samples [49]. If the sample size is too small, we may reach an incorrect conclusion; if the sample size is too large, the computation time becomes too long. Therefore, when performing machine learning algorithms, data mining algorithms or data profiling tasks on large-scale datasets, choosing an appropriate sampling method and determining the sample size are important factors in whether a correct result (within the allowable error range) can be obtained.
There are some classic methods to determine the sample size. Singh and Masuku [10] have summarized these traditional methods in detail. For example, one can take the sample size used in other similar studies as the size of the sample in one's own study. Furthermore, one can determine the size of the sample according to published tables, which give the sample size for some given evaluation indicators and the size of the original dataset. Commonly used evaluation indicators include the level of precision, the level of confidence or risk, and the degree of variability. Another method is to calculate the sample size according to simple formulas based on sampling error, confidence, and P-value. A simple formula (1) for estimating the sample size [10] is as follows, where n is the sample size, N is the amount of raw data, e is the level of precision, and a 95% confidence level and P = .5 are assumed:

n = N / (1 + N * e^2)    (1)

When data mining algorithms are performed on massive amounts of data, much current research prefers to scale up the algorithms to deal with computational (time and memory) constraints, but some scholars focus on selecting how many samples are needed to run the algorithms. In data mining, a common criterion used to estimate the number of samples is Probably Close Enough (PCE). The convergence condition is determined using the PCE criterion to obtain the best sample size. PCE is expressed as formula (2), where D_i represents the sample data, D represents all data, ε is the error threshold on accuracy, and δ is a probability:

Pr[ |acc(D) - acc(D_i)| ≥ ε ] ≤ δ    (2)

Furthermore, Satyanarayana [11] proposes a dynamic adaptive sampling method for estimating the amount of data required for the learning curve to converge at each iteration. The author applies the Chebyshev inequality to derive an expression that estimates the number of instances at each iteration and takes advantage of classification accuracy in order to get more precise estimates of the next sample. The expression is formula (3), where D_i is the sample under consideration, acc(x_i) is the classification accuracy on each instance, ε is the approximation parameter and δ is the probability of failure:

m ≥ ( |D_i| / Σ_{i=1}^{|D_i|} acc(x_i) ) * (1/ε) * log(1/δ)    (3)

Satyanarayana [11] uses formula (4) to check convergence at each iteration, where D_i is the sample of the current iteration and D_{i-1} is the sample of the last iteration:

| (1/|D_i|) Σ_{i=1}^{|D_i|} acc(x_i) - (1/|D_{i-1}|) Σ_{i=1}^{|D_{i-1}|} acc(x_i) | < ε    (4)

When sampling is used in machine learning, the most appropriate number of samples is the one at which the accuracy reaches its maximum and increasing the number of samples no longer improves the accuracy of the learning algorithm. This is illustrated in Figure 4, where n_min is the minimum sample size.

Fig. 4: Learning curves [50].
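As a small numerical illustration of formula (1), the following snippet computes the sample size for a few dataset sizes at a fixed precision level; the numbers are purely illustrative.

```python
def sample_size_formula_1(population_size, precision):
    """Sample size by formula (1): n = N / (1 + N * e^2),
    assuming a 95% confidence level and P = .5."""
    n = population_size / (1.0 + population_size * precision ** 2)
    return int(round(n))

if __name__ == "__main__":
    for N in (1_000, 100_000, 10_000_000):
        # with e = 5% the required sample size saturates around 400 as N grows
        print(N, sample_size_formula_1(N, precision=0.05))
```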
In this case, there is a method for determining the minimum number of samples called sequential sampling. Sequential sampling means sampling sequentially and stopping only when a certain criterion is met. John and Langley [12] propose a method called Arithmetic Sampling. This method uses a schedule S_a = (n_0, n_0 + n_δ, n_0 + 2n_δ, n_0 + 3n_δ, ..., N) to find the minimum sample size, where n_0 is the starting sample size and n_δ is a fixed increment. Provost et al. [50] point out that the main drawback is that if the minimum sample size is a large multiple of n_δ, it takes many runs to reach convergence. Obviously, if n_δ is too small, it takes many iterations to reach convergence, and if n_δ is too large, it may skip the optimal sample size.
Singh et al. [13] propose another sequential sampling strategy for classification problems. They mention that data for training machine learning models typically originates from computer experiments such as simulations, and computer simulations are often computationally expensive. In order to ease the computational pressure, they use sampling to get as little data as possible. The sequential sampling starts with an initial small data set X_δ and iteratively grows the sample by taking training points at well-chosen locations δ in the input space until a stopping criterion is reached. The sequential sampling strategy chooses a representative set of data samples that focuses the sampling on those locations in the input space where the class labels change most rapidly, while making sure that no class regions are missed [13]. The sample update is given by formula (6), where the class labels Y_δ obtained by formula (5) are the result of the simulator evaluating X_δ. With this sequential sampling strategy, small and high-quality samples can be obtained.

Y_δ := f(X_δ)    (5)

S := S ∪ (X_δ, Y_δ)    (6)
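To illustrate the arithmetic-sampling schedule together with a convergence check in the spirit of formula (4), here is a minimal Python sketch; the evaluate_accuracy function is a placeholder standing in for training and evaluating a real model on a sample of the given size.

```python
import math
import random

def evaluate_accuracy(sample_size):
    """Placeholder learning curve: accuracy grows with sample size and flattens out."""
    return 0.95 - 0.4 * math.exp(-sample_size / 2000.0) + random.gauss(0, 0.002)

def arithmetic_sampling(n0, n_delta, n_max, eps):
    """Grow the sample along the schedule (n0, n0 + n_delta, n0 + 2*n_delta, ...)
    and stop once the accuracy gain between consecutive iterations drops below eps."""
    prev_acc = evaluate_accuracy(n0)
    size = n0
    while size + n_delta <= n_max:
        size += n_delta
        acc = evaluate_accuracy(size)
        if abs(acc - prev_acc) < eps:   # convergence check in the spirit of formula (4)
            return size, acc
        prev_acc = acc
    return size, prev_acc

if __name__ == "__main__":
    random.seed(0)
    print(arithmetic_sampling(n0=500, n_delta=500, n_max=50_000, eps=0.005))
```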
D. Sampling Error and Sampling Bias

Sampling error occurs when a randomly chosen sample does not reflect the underlying population purely by chance, whereas sampling bias occurs when the sample is not randomly chosen at all [51]. Sampling bias is one of the causes of sampling error, and the two are often confused. Sampling bias is caused by a failure of the sampling design, which cannot truly extract the sample randomly from the population [52].
There is a typical case of sampling error. The large Nurses Health Study tracked 48,470 postmenopausal women, aged between 30 and 63, for 10 consecutive years. The study concluded that hormone replacement therapy can reduce the incidence of severe coronary heart disease by nearly half [53]. Despite the large sample size, the study failed to recognize the atypical nature of the sample and the confounding of estrogen therapy with other active health habits [54]. This illustrates the importance of proper sampling methods and properly collected samples for reaching the right conclusions.
To be able to correctly select samples that represent the original data, Kim and Wang [55] focus on and address the problem of selection bias in the sampling process. Since big data is susceptible to selection bias, they propose two ways to reduce it. One is based on the inverse sampling method, which is divided into two stages. The first stage samples directly from the big data; this sample is easily affected by selection bias, so the importance of each element in the sample is computed to characterize the selection bias. In the second stage, the data sampled in the first stage is re-sampled according to the importance weight of each element. In this way, a correction of the selection bias is achieved. The other idea is data integration: they propose to use survey data together with big data and to correct the selection bias by means of the auxiliary information in the survey data.
From the perspective of official statisticians, Tam et al. [56] believe that big data is challenged by self-selection bias. Self-selection bias produces biased samples under non-probability sampling, and inferences from big data with this bias are affected. Thus, they outline methods for adjusting self-selection bias to estimate proportions, e.g., using pseudo weights and superpopulation models [57].
As a matter of fact, the case of the 2016 US presidential election studied in [58] shows precisely how the existence of self-selection bias ultimately leads to data deceiving us. Therefore, to draw correct conclusions from data, the quality of the data must be ensured. Probability sampling can guarantee the quality of the data. When probability sampling cannot be performed, the data will be affected by the Law of Large Populations (LLP): the large amount of data N will affect the estimation error. In summary, when sampling data, data quality must be taken into account, and high quality data sets should be given higher weight. This prevents our statistical inferences from being affected.
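Returning to the two-stage inverse sampling idea of [55], a minimal sketch is shown below: a first-stage biased sample is re-sampled with probabilities proportional to importance weights, here taken as inverse inclusion probabilities. The weighting scheme and the data are illustrative assumptions, not the estimator of [55].

```python
import random

def resample_with_importance_weights(first_stage, inclusion_prob, k, seed=0):
    """Second-stage re-sampling: items over-represented in the first stage
    (high inclusion probability) receive small weights, and vice versa."""
    rng = random.Random(seed)
    weights = [1.0 / inclusion_prob(x) for x in first_stage]
    return rng.choices(first_stage, weights=weights, k=k)

if __name__ == "__main__":
    rng = random.Random(1)
    population = [("young", rng.gauss(30, 5)) for _ in range(5000)] + \
                 [("old", rng.gauss(60, 5)) for _ in range(5000)]
    # first stage: a biased collection mechanism that over-samples the "young" group
    first_stage = [x for x in population
                   if rng.random() < (0.8 if x[0] == "young" else 0.2)]
    second_stage = resample_with_importance_weights(
        first_stage, lambda x: 0.8 if x[0] == "young" else 0.2, k=1000, seed=2)
    print(sum(1 for g, _ in second_stage if g == "old") / len(second_stage))  # back near 0.5
```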
IV. SAMPLING FOR SINGLE COLUMN DATA PROFILING
Single column data profiling tasks are divided into cardinalities, value distributions, patterns, data types, and domains [79]. Table II [16] lists typical metadata that may result from single-column data profiling. For some single-column data profiling tasks, such as decimals, which computes the maximum number of decimals in numeric values, simple sampling methods cannot guarantee reliable results, and identifying the domain of a column is often even more difficult and not fully automated [80]. Among these tasks, cardinalities, histograms and quantiles are often used by query optimizers, and therefore sampling techniques are more commonly applied to them. Specifically, Section IV-A introduces sampling for cardinality estimation and Section IV-B presents sampling for value distributions. More advanced statistics include probabilistic correlations on text attributes [81].
A. Sampling for Cardinality Estimation
TABLE II: Overview of single-column profiling tasks [16]

Cardinalities:
  num-rows: Number of rows
  value length: Measurements of value lengths (minimum, maximum, median, and average)
  null values: Number or percentage of null values
  distinct: Number of distinct values; sometimes called "cardinality"
  uniqueness: Number of distinct values divided by the number of rows
Value distributions:
  histogram: Frequency histograms (equi-width, equi-depth, etc.)
  constancy: Frequency of the most frequent value divided by the number of rows
  quartiles: Three points that divide the (numeric) values into four equal groups
  first digit: Distribution of the first digit in numeric values
Patterns, data types, and domains:
  basic type: Generic data type, such as numeric, alphabetic, alphanumeric, date, time
  data type: Concrete DBMS-specific data type, such as varchar, timestamp
  size: Maximum number of digits in numeric values
  decimals: Maximum number of decimals in numeric values
  patterns: Histogram of value patterns (Aa9...)
  data class: Semantic, generic data type, such as code, indicator, text, date/time, quantity, identifier
  domain: Classification of semantic domain, such as credit card, first name, city, phenotype

TABLE III: Summary of sampling for big data profiling tasks

Single column:
  Cardinality estimation: D̂_hybrid [59], GEE [60], AE [60], Distinct sampling [61]
  Histograms: Random sampling [62], Backing sample [63]
  Quantiles: Non-uniform random sampling [64], Improved random sampling [65]
Multiple columns:
  Correlations and association rules: Sequential random sampling without replacement [14], Two-phased sampling [4], ISbFIM [66]
  Clusters and outliers: Biased sampling [67]
  Summaries and sketches: Error-bounded stratified sampling [68], [69]
  Regression analysis: IBOSS [70], Random sampling without replacement [71]
Dependencies:
  Uniqueness: GORDIAN [72], HCA-Gordian [73]
  Functional dependencies: AID-FD [74], HYFD [75], CORDS [76], BRRSC [77]
  Inclusion dependencies: FAIDA [78]

Cardinalities or counts of values in a column are the most basic form of metadata [16]. Cardinalities usually include the number of rows, the number of null values and the number of distinct values, and are the most important type of metadata [82]. For some tasks, such as the number of rows and the number of null values, a single pass over a column yields the exact result. However, finding the number of distinct values may require sorting or hashing the values of the column [80]. When facing large data sets, counting the number of distinct values of an attribute faces memory and computation pressure. Therefore, the estimation of the number of distinct values based on sampling has been studied [59]–[61].
Haas et al. [59] propose several sampling-based estimators to estimate the number of distinct values of an attribute in a relational database. They use a large number of attribute value distributions from various real databases to compare these new estimators with those in the database and statistics literature. Their experimental results show that no estimator is optimal for all attribute value distributions, and that the larger the sampling fraction, the smaller the estimated mean absolute deviation. They therefore propose a sampling-based hybrid estimator D̂_hybrid, which achieves the highest precision on average at a given sampling fraction.
Similar to Haas et al., Charikar et al. [60] also obtain a negative result: no sampling-based estimator can guarantee small errors on input data of all distributions, unless a large fraction of the input data is sampled. They therefore propose a new estimator, the Guaranteed-Error Estimator (GEE), which is provably optimal. Although its error is small across inputs of different distributions, it does not exploit knowledge of particular distributions; for example, on low-skew data with a large number of distinct values, GEE does not perform very well in practice. They further propose a heuristic version of GEE called the Adaptive Estimator (AE), which avoids the problems encountered by GEE.
Different from the previous research using random sampling, Gibbons [61] proposes distinct sampling to accurately estimate the number of distinct values. Distinct sampling collects distinct samples in a single scan of the data, and the samples can be kept up to date under data deletions and insertions. On a truly confidential data set, Call-center, distinct sampling uses only 1% of the data and achieves a relative error of 1%-10%, while increasing the speed of report generation by 2-4 orders of magnitude. They compare distinct sampling with GEE and AE experimentally and show that on real-world data sets, distinct sampling performs much better than GEE and AE.
It is worth noting that Harmouch and Naumann [82] conduct an experimental survey on cardinality estimation, using GEE [60] as an example for evaluation on synthetic and real-world data sets. Their results show that the larger the sampling fraction, the smaller the average relative estimation error, and that for GEE to reach 1% relative error, it needs to see more than 90% of the data. In conclusion, when faced with large data sets, cardinality estimation requires a lot of memory; sampling can reduce memory consumption, but cannot guarantee reasonable accuracy on all input distributions.
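As an illustration of this family of estimators, the sketch below implements the commonly cited form of GEE, D̂ = sqrt(n/r) * f_1 + Σ_{j≥2} f_j, where n is the table size, r the sample size and f_j the number of values appearing exactly j times in the sample; treat it as a sketch of the idea rather than a faithful reproduction of [60].

```python
import random
from collections import Counter

def gee_distinct_estimate(sample, n):
    """GEE-style estimate of the number of distinct values:
    scale up the values seen exactly once in the sample by sqrt(n / r)."""
    r = len(sample)
    freq_of_freq = Counter(Counter(sample).values())   # f_j: #values seen exactly j times
    f1 = freq_of_freq.get(1, 0)
    rest = sum(count for j, count in freq_of_freq.items() if j >= 2)
    return (n / r) ** 0.5 * f1 + rest

if __name__ == "__main__":
    random.seed(0)
    column = [random.randint(1, 50_000) for _ in range(1_000_000)]   # toy column
    sample = random.sample(column, 10_000)                           # 1% sample
    print(len(set(column)), round(gee_distinct_estimate(sample, len(column))))
```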
B. Sampling for Value Distribution
Value distribution is a very important part of single-column data profiling. Histograms and quantiles are two typical forms used to represent value distributions: a histogram describes the frequency distribution of the data, while quantiles divide the data into several equal-sized parts.
1) Sampling for Histogram Construction:
Many commercial database systems maintain histograms to summarize the contents of large relations and permit efficient estimation of query result sizes for use in query optimizers [63]. A histogram describes the frequency distribution of attributes of interest: it groups attribute values into buckets and approximates the true attribute values and their frequencies based on summary statistics maintained in each bucket [83]. However, databases are updated frequently, so histograms also need to be updated accordingly, and recomputing histograms from scratch is expensive and unwise for large relations.
Gibbons et al. [63] propose sampling-based approaches for the incremental maintenance of approximate histograms. They use a "backing sample" to update histograms. A backing sample is a uniform random sample of the relation that is kept up to date in the presence of database updates. Random sampling can thus help to speed up histogram re-computation; for example, SQL Server recomputes histograms based on a random sample of the relation [62].
Chaudhuri et al. [62] focus on how large a sample is enough to construct a histogram. They propose a new error metric for approximate equi-depth histograms called the max error metric, given in formula (7), where b_j is the number of values in bucket j, k is the number of buckets and n is the number of records. A k-bucket histogram is said to be a δ-deviant histogram when Δ_max ≤ δ. The sufficient sample size r is given by formula (8), where δ ≤ n/k and γ is a predefined probability:

Δ_max = max_{1≤j≤k} | b_j - n/k |    (7)

r ≥ n ln(n/γ) / (k δ)    (8)

As mentioned above, histograms can be used to represent the distribution of data. In exploratory data analysis, analysts want to find a specific distribution among a large number of candidate histograms. The traditional approach is "generate and test", i.e., generating all possible histograms and then testing whether they meet the requirements. This approach is undesirable when the data set is large. Therefore, Macke et al. [84] propose a sampling-based approach called HistSim to identify the top-k closest histograms. The idea of HistSim is to use random sampling without replacement to collect samples for histogram construction, normalize the representation vector of each histogram, and use the l1 distance to calculate similarity. Furthermore, they propose FastMatch, which combines HistSim with block-based sampling and, in their experiments, obtains near-perfect accuracy with up to 35x speedup over approaches that do not use sampling on several real-world datasets.
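To make the sample-then-summarize idea concrete, the sketch below builds an approximate equi-depth histogram from a uniform random sample: bucket boundaries are taken at the sample's quantiles and bucket counts are scaled from sample frequencies. The bucket count and the data are illustrative assumptions.

```python
import random

def equi_depth_histogram_from_sample(sample, k, n_total):
    """Approximate k-bucket equi-depth histogram:
    boundaries come from the sorted sample, counts are scaled to the full table size."""
    s = sorted(sample)
    r = len(s)
    boundaries = [s[min(r - 1, (j * r) // k)] for j in range(1, k)]   # k-1 inner boundaries
    scale = n_total / r
    counts, lo = [], 0
    for b in boundaries + [s[-1]]:
        hi = lo
        while hi < r and s[hi] <= b:
            hi += 1
        counts.append(round((hi - lo) * scale))
        lo = hi
    return boundaries, counts

if __name__ == "__main__":
    random.seed(0)
    column = [random.expovariate(1.0) for _ in range(1_000_000)]  # skewed toy column
    sample = random.sample(column, 20_000)
    bounds, counts = equi_depth_histogram_from_sample(sample, k=4, n_total=len(column))
    print([round(b, 3) for b in bounds], counts)   # each bucket count is roughly 250,000
```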
2) Sampling for Quantile Finding:
Quantiles can be used to represent the distribution of values in a single column. Quantiles are used by query optimizers to provide selectivity estimates for simple predicates on table values [85]. Calculating exact quantiles on large data sets is time consuming and requires a lot of memory. For example, the quantile finding algorithm in [86] needs to store at least N/2 data elements to find the median, which is unacceptable in memory for large-scale data.
Therefore, Manku et al. [64] present a novel non-uniform random sampling scheme to find approximate quantiles with reduced memory requirements. Non-uniform means that the probability of selecting each element of the input differs: earlier elements in the input sequence are selected with larger probability than those arriving later. The process of quantile finding is shown in Figure 5: as the data arrives, an element is randomly selected from each data block and put into buffers; then, based on the sample, deterministic algorithms are run to find the quantiles.

Fig. 5: Sampling for quantile finding [64].

However, simply using random sampling and calculating quantiles on the sample may not be accurate enough for sensor networks. Hence, Huang et al. [65] propose a new sampling-based quantile computation algorithm for sensor networks that reduces the communication cost. To improve accuracy, they augment the random sample with additional information about the data, and they analyze how to add this information under both the flat model and the tree model. For example, in the flat model, each node first samples each data value independently with a certain probability p and computes its local rank; the samples and their local ranks are then sent to the base station, which estimates the rank of any value it receives so that quantile queries can be answered. In the end, they show experimentally that quantile computation in sensor networks based on this new sampling method reduces the total communication cost by one to two orders of magnitude compared with the previous method.

V. SAMPLING FOR MULTIPLE COLUMNS DATA PROFILING
As shown in Figure 1, multiple columns data profiling tasks include association rule mining [87], clusters and outliers [88], and summaries and sketches [16]. Besides, statistical methods such as regression analysis [89] can be used for multi-column analysis, i.e., analyzing the relationships among columns. Specifically, in Section V-A we investigate sampling for discovering association rules, Section V-B presents sampling for clusters and outliers, sampling for summaries and sketches is introduced in Section V-C, and in Section V-D we introduce sampling for regression analysis.
A. Sampling for Discovering Association Rules
The discovery of association rules is a typical data profiling problem for multiple columns. Algorithms for finding association rules need to scan the database several times; for large data sets, the time overhead of several scans is hard to accept. The large amount of data means that the input data, intermediate results and output patterns can be too large to fit into memory, which prevents many algorithms from executing [66]. Some scholars have proposed parallel or distributed methods to cope with the data volume [90], [91], but designing parallel or distributed algorithms is difficult.
Therefore, Zaki et al. [14] use sampling to obtain a sample of the transactions and find association rules based on that sample. They use sequential random sampling without replacement as their sampling method and use Chernoff bounds to obtain the sample size. They show experimentally that sampling can speed up the discovery of association rules by more than an order of magnitude while providing high accuracy (a small sketch of estimating itemset support from such a sample appears at the end of this subsection).
Chen et al. [4] propose a two-phase sampling-based algorithm to discover association rules in large databases. In the first phase, a large initial sample of transactions is randomly selected from the database and used to estimate the support of each individual item; these estimated supports are then used to trim the initial sample to a smaller final sample S0. In the second phase, an association-rule algorithm is run on the final sample S0 with the provided minimum support and confidence. In their experiments, the authors report 90-95% accuracy using the final sample S0, whose size is only 15-33% of the whole database. This again shows that sampling can be used to speed up data analysis and big data profiling.
Wu et al. [66] propose an Iterative Sampling based Frequent Itemset Mining method called ISbFIM. Like [14], Wu et al. [66] use random sampling, but the difference is that they sample iteratively to obtain multiple subsets and find frequent itemsets from these subsets. They can guarantee that the most frequent patterns of the entire data set have been enumerated, and they implement a MapReduce version of ISbFIM to demonstrate its scalability on big data. Because the volume of the input data is reduced, the problem that the input data, intermediate results or final frequent itemsets cannot be loaded into memory is solved, and traditional exhaustive search-based algorithms like Apriori can be adapted to the big data context.
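Below is a minimal sketch of the sampling idea used in these works: estimate an itemset's support from a random sample of transactions, with a Hoeffding-style bound used to choose the sample size. The bound and the toy transactions are illustrative assumptions and not the exact Chernoff-bound derivation of [14].

```python
import math
import random

def sample_size_hoeffding(eps, delta):
    """Number of sampled transactions so that the estimated support deviates from
    the true support by more than eps with probability at most delta (Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimated_support(transactions_sample, itemset):
    itemset = set(itemset)
    hits = sum(1 for t in transactions_sample if itemset <= t)
    return hits / len(transactions_sample)

if __name__ == "__main__":
    random.seed(0)
    items = list(range(100))
    db = [set(random.sample(items, 8)) | ({1, 2} if random.random() < 0.3 else set())
          for _ in range(200_000)]                       # toy transaction database
    m = sample_size_hoeffding(eps=0.01, delta=0.05)      # 18,445 transactions
    sample = random.sample(db, m)
    print(m, estimated_support(sample, {1, 2}))          # close to the planted ~0.3 support
```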
B. Sampling for Clustering and Anomaly Detection

Clustering segments similar records into the same group according to certain characteristics, and records that cannot be assigned to any group may be abnormal points. The challenge that clustering faces in the era of big data is again the data volume, since the clustering operation itself is computationally heavy. Shirkhorshidi et al. [92] divide big data clustering into two categories: single-machine clustering and multiple-machine clustering. Single-machine clustering reduces the amount of data using data reduction methods, e.g., sampling and dimensionality reduction. Multiple-machine clustering uses parallel distributed computing frameworks, e.g., MapReduce, and cluster resources to increase computing power.
Kollios et al. [67] propose biased sampling to speed up clustering and anomaly detection on big data. Unlike previous work, they take the data characteristics and the analysis goal into account during the sampling process. For the tasks of clustering and anomaly detection, Kollios et al. [67] consider the data density in the dataset and propose a biased sampling method to improve the accuracy of clustering and anomaly detection. The idea of biased sampling is to give the data points inside each cluster and the abnormal points a higher probability of being selected. To achieve this, they use density estimation to estimate the density around each data point. In their experiments, they show that density-based sampling yields better clustering results than uniform sampling.

Fig. 6: Application of biased sampling in clustering tasks [67].

Figure 6 shows the use of biased samples in clustering. Figure 6(a) is the distribution of the original data, which contains three classes with higher density. Figure 6(b) is the result of random sampling on the original data set, and Figure 6(c) is the result of applying biased sampling to the original data. Figure 6(d) shows 10 data points selected from each of the three categories clustered based on the random sample, and Figure 6(e) shows 10 data points selected from each of the three categories clustered based on the biased sample. Compared with the categories in the original data, the clustering results obtained from the biased samples are more accurate.
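A minimal sketch of density-biased sampling is shown below: a local density estimate (here, a naive count of neighbours within a radius) is turned into a selection probability, so that points in dense regions are more likely to enter the sample. The density estimator, the probability mapping and the data are illustrative assumptions, not the scheme of [67].

```python
import random

def density_biased_sample(points, radius, alpha, target_fraction, seed=0):
    """Select each point with probability proportional to (local density)^alpha:
    alpha > 0 favours dense (cluster) regions, alpha < 0 favours sparse regions/outliers."""
    rng = random.Random(seed)

    def density(p):
        # naive O(n^2) neighbour count; a grid or kd-tree would be used in practice
        return sum(1 for q in points
                   if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2)

    weights = [density(p) ** alpha for p in points]
    mean_w = sum(weights) / len(weights)
    return [p for p, w in zip(points, weights)
            if rng.random() < target_fraction * w / mean_w]

if __name__ == "__main__":
    rng = random.Random(1)
    centers = [(0, 0), (10, 10), (20, 0)]
    points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
              for cx, cy in centers for _ in range(500)]
    points += [(rng.uniform(-5, 25), rng.uniform(-5, 15)) for _ in range(50)]  # noise
    print(len(density_biased_sample(points, radius=1.0, alpha=1.0, target_fraction=0.1)))
```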
C. Sampling for Summaries and Sketches
Summaries or sketches can be produced by sampling or by hashing data values to a smaller domain [16]. Although different scholars have applied different sampling algorithms, the sampling algorithm most commonly used by data scientists is random sampling [69]. The main reason is that random sampling is the simplest to use and is the only technique commonly employed by data scientists to quickly gain insights from big data sets.
Rojas et al. [69] first interview 22 data scientists working on large data sets and find that they basically use random sampling or pseudo-random sampling, although these data scientists believe that other sampling techniques might achieve better results. The scientists then perform a data exploration task that uses different sampling methods to support the classification of more than 2 million samples generated from Wikipedia article edit records. The study shows that sampling techniques other than random sampling can also generate insights into the data and can help focus on different characteristics of the data without affecting the quality of data exploration, helping people understand the data. This shows that, with sampling, summaries or sketches of the data can be created to help scientists observe and understand the data.
Aggregation queries are also a way to generate summaries of data. Aggregation queries are computationally expensive because they need to traverse the data. In the era of big data, a single machine often cannot handle such a large amount of data, so aggregation queries over big data are often executed on distributed systems that scale to thousands of machines, using distributed computing frameworks such as Hadoop and Spark. Although distributed systems provide tremendous parallelism to improve performance, the processing cost of aggregation queries remains high [68]. An investigation of one cluster in [68] reveals that 90% of 2,000 data mining jobs are aggregation queries; these queries consume two thousand machine hours on average, and some of them take up to 10 hours.
Therefore, Yan et al. [68] use sampling to reduce the amount of data. When error bounds cannot be compromised and the data is sparse, they argue that conventional uniform sampling often requires high sampling rates and thus delivers limited or no performance gains. For example, uniform sampling with a 20% error bound and 95% confidence needs to consume 99.91% of the data whose distribution is shown in Figure 7. Hence, they propose error-bounded stratified sampling, a variant of stratified sampling [93] that relies on prior knowledge of the data distribution to reduce the sample size. An error bound means that the real value falls within an interval with high probability; sparse data means that most of the data is limited in range while the overall range is wide.

Fig. 7: Sparseness of one representative production data [93].

Taking the data distribution in Figure 7 as an example, error-bounded stratified sampling divides the data into two groups: one covers the head of the distribution and the other covers the tail. Because the value range of the first group is small, its sampling rate can also be small. Although the value range of the second group is large, most of the data falls into the first group, so even if all of the second group is taken as a sample, the overall sampling rate remains low. It is worth mentioning that the technique has been implemented in a Microsoft internal search query platform.
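The head/tail idea can be sketched as follows: split the values into a head stratum and a tail stratum at a threshold, sample each stratum at a different rate, and combine the stratum estimates into an overall sum estimate. The threshold, rates and data are illustrative assumptions rather than the algorithm of [68].

```python
import random

def stratified_sum_estimate(values, threshold, head_rate, tail_rate, seed=0):
    """Two-strata sampling: a cheap low-rate sample for the narrow 'head' values,
    a high-rate sample for the rare, wide-ranging 'tail' values."""
    rng = random.Random(seed)
    head = [v for v in values if v <= threshold]
    tail = [v for v in values if v > threshold]
    head_sample = [v for v in head if rng.random() < head_rate]
    tail_sample = [v for v in tail if rng.random() < tail_rate]
    est = 0.0
    if head_sample:
        est += sum(head_sample) / head_rate    # scale the head stratum back up
    if tail_sample:
        est += sum(tail_sample) / tail_rate
    sampled_fraction = (len(head_sample) + len(tail_sample)) / len(values)
    return est, sampled_fraction

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.uniform(0, 10) for _ in range(999_000)] + \
           [rng.uniform(1e4, 1e6) for _ in range(1_000)]      # sparse heavy tail
    est, frac = stratified_sum_estimate(data, threshold=10, head_rate=0.01, tail_rate=1.0)
    print(round(est), round(sum(data)), round(frac, 4))        # close estimate, ~1% sampled
```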
D. Sampling for Regression Analysis
Statistical analyses such as regression analysis can be used to analyze the relationship between multiple columns in a relation. Sauter [94] notes that statistics is about learning from data. Statistical methods are often used for data profiling, and they too encounter the problem of excessive data volume in the era of big data: statistical analysis of an entire big data set requires considerable computation and time.
Under the computational pressure of large data sets, many traditional statistical methods are no longer directly applicable. Although sampling can help with data reduction, the sampling errors it introduces need to be considered. For example, [70] mentions that in the context of linear regression, traditional sub-sampling methods are prone to introducing sampling errors and affect the covariance matrix of the estimator. Hence, they propose an information-based optimal subdata selection method called IBOSS. The goal of IBOSS is to select informative data points so that a small subdata set retains most of the information contained in the complete data. Simulation experiments show that IBOSS is fast and suitable for distributed parallel computing.
Jun et al. [71] propose to use sampling to divide big data into sub data sets in regression problems in order to reduce the computational burden. The traditional statistical approach is to sample from big data and then perform statistical analysis on the sample to infer properties of the population. Jun et al. [71] instead divide the big data, viewed as the population, into sub data sets of small size, each viewed as a sample, which is appropriate for big data analysis. They select regression analysis for their experiments. The traditional processing is shown in Figure 8, and their design is shown in Figure 9.

Fig. 8: Traditional big data regression analysis [71].
Fig. 9: Sampling-based partitioning big data regression analysis [71].

Their design consists of three steps: the first step generates M sub data sets using random sampling without replacement; the second step computes the regression parameters of each sub data set and averages the regression parameters over the M sub data sets; the third step uses the averaged parameters obtained in the second step as the estimate of the regression parameters of the entire data set. This design, which combines sampling and parallel processing, helps speed up regression analysis on big data. By experimenting with data sets from simulation and from the UCI machine learning repository, the authors show that the regression parameters obtained by distributed calculation on random samples are close to the regression parameters calculated on the entire data set. This provides a reference for statistical analysis of entire large data sets.
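The three-step design can be sketched with ordinary least squares on random partitions; numpy.linalg.lstsq stands in for whatever regression routine is actually used, and the synthetic data is illustrative.

```python
import numpy as np

def averaged_partition_regression(X, y, m_subsets, seed=0):
    """Fit OLS on M random sub data sets (drawn without replacement) and
    average the coefficient vectors, as in the partition-then-average design."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    parts = np.array_split(idx, m_subsets)
    coefs = []
    for p in parts:
        beta, *_ = np.linalg.lstsq(X[p], y[p], rcond=None)   # each fit could run in parallel
        coefs.append(beta)
    return np.mean(coefs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 1_000_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    y = X @ np.array([2.0, 1.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=n)
    print(averaged_partition_regression(X, y, m_subsets=20))   # close to [2, 1, -3, 0.5]
```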
VI. SAMPLING FOR DEPENDENCIES
A dependency is metadata that describes the relationship between columns of a relation, based on either value equality or similarity [95]. There are many use cases for dependencies. For example, unique column combinations are used for finding key attributes in a relation [72], functional dependencies can be used for schema normalization [96] or consistent query answering [97], and inclusion dependencies can suggest how to join two relations [16]. Inclusion dependencies together with functional dependencies form the most important data dependencies used in practice [98]. However, the discovery of dependencies is time and memory consuming, and many functional dependency discovery algorithms are not suitable for large data sets. Sampling can be employed to estimate the support and confidence measures of data dependencies [99], [100]. By sampling, a sufficiently small yet representative data set can be selected from the big data set. Hence, the choice of sampling method is very important; it helps to ensure that the estimated inaccuracy rate stays below a predefined bound with high confidence. Specifically, based on the classification of dependencies in [16], we investigate sampling for unique column combinations in Section VI-A, functional dependencies in Section VI-B, and inclusion dependencies in Section VI-C.
A. Sampling for Discovery of Unique Column Combinations
An important goal in data profiling is to find the right key for a relational table, e.g., the primary key. The step before key discovery is to discover unique column combinations. Unique column combinations are sets of columns whose values uniquely identify rows, and their discovery is an important data profiling task [101]. However, the discovery of unique column combinations is computationally expensive, and thus feasible only for small datasets or samples of large datasets. For large data sets, sampling is a promising method for knowledge discovery [102]: first select samples from the entire data set and obtain knowledge from the samples, and then use the entire data set to verify that the acquired knowledge is correct.
A typical algorithm for identifying key attributes is GORDIAN, proposed by Sismanis et al. [72]. The main idea of GORDIAN is to turn the problem of key identification into a cube computation problem, find non-keys through cube computation, and finally compute the complement of the non-key set to obtain the desired set of keys. The GORDIAN algorithm can therefore be divided into three steps: (i) create a prefix tree through a single pass over the data; (ii) find maximal non-uniques by traversing the prefix tree with pruning; (iii) derive the minimal keys from the set of maximal non-uniques. In order to make GORDIAN scalable to large datasets, Sismanis et al. combine GORDIAN with sampling. Experiments show that sampling-based GORDIAN can find all true keys and approximate keys using only a relatively small number of samples.
The GORDIAN algorithm is further developed by Abedjan and Naumann [73] to discover unique column combinations, since the existing algorithms are either brute-force or have high memory requirements and cannot be applied to big data sets. They propose a hybrid solution, HCA-Gordian, which combines the Gordian algorithm [72] with their new algorithm, the Histogram-Count-based Apriori Algorithm (HCA). The Gordian algorithm is used to find composite keys, and HCA is an optimized bottom-up algorithm with efficient candidate generation and statistics-based pruning. HCA-Gordian runs the Gordian algorithm on a smaller sample of the table to discover non-uniques, which are then used as pruning candidates when executing HCA on the entire table.
In their experimental setup, the sample size for the non-unique discovery preprocessing step is always 10,000 tuples. Especially when the amount of data is large and the number of uniques is small, the runtime of HCA-Gordian is lower than that of Gordian. For example, on real-world tables, the search speed of HCA-Gordian is four times faster than Gordian; and as the data set grows larger, e.g., the National file with 1,394,725 tuples, Gordian takes too long to run, while HCA-Gordian completes in only 115 seconds. In addition, when the number of detected non-uniques is high, the discovery quality of HCA-Gordian is better than that of Gordian.
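The sample-then-verify pattern described at the beginning of this subsection can be illustrated with a simple check: a column combination that already shows duplicates on a sample can be pruned immediately, while a combination that looks unique on the sample is only a candidate and must still be verified on the full table. The relation below is a made-up list of tuples.

```python
import random

def is_unique_on(rows, columns):
    """True iff the projection of rows onto the given columns contains no duplicate."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

if __name__ == "__main__":
    random.seed(0)
    # toy relation: (id, city, zip); 'id' is a key, ('city',) is not
    rows = [{"id": i, "city": f"c{i % 100}", "zip": 10_000 + i % 1000}
            for i in range(50_000)]
    sample = random.sample(rows, 2_000)
    for combo in (("city",), ("city", "zip"), ("id",)):
        on_sample = is_unique_on(sample, combo)
        # non-unique on the sample => certainly non-unique on the full relation;
        # unique on the sample     => only a candidate, must be verified on all rows
        verdict = is_unique_on(rows, combo) if on_sample else False
        print(combo, "sample-unique:", on_sample, "actually unique:", verdict)
```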
B. Sampling for Functional Dependencies

A functional dependency (FD) states that one set of attributes in a relation determines another set of attributes. For example, the functional dependency A → B means that for any two records in the relation, if their values on attribute set A are equal, then their values on attribute set B must also be equal. Bleifuss et al. [74] propose an approximate discovery strategy, AID-FD (Approximate Iterative Discovery of FDs), which sacrifices a certain amount of correctness in exchange for performance. AID-FD uses an incremental, focused sampling of tuple pairs to deduce non-FDs until a user-configured termination criterion is met. The authors demonstrate in experiments that AID-FD uses only 2%-40% of the time of an exact algorithm on the same data set, yet finds more than 99% of the functional dependencies.
Papenbrock and Naumann [75] note that current functional dependency discovery algorithms cannot process data sets with more than 50 columns and 1 million rows. Thus, they propose the sampling-based FD discovery algorithm HYFD. Three properties matter for sampling-based FD discovery algorithms and for HYFD in particular: completeness, minimality and proximity. HYFD combines column-efficient FD induction techniques with row-efficient FD search techniques in two phases. In Phase 1, it applies focused sampling techniques to select samples with a possibly large impact on the precision of the result and produces a set of FD candidates from the samples. In Phase 2, the algorithm applies row-efficient FD search techniques to validate the FD candidates produced in Phase 1. The sampling allows functional dependency discovery algorithms to be extended to large data sets.
In experiments, when the data set is not very large, the runtime of HYFD is almost always lower than that of other algorithms. When the data set exceeds 50 columns and 10 million rows, HYFD can still obtain results after a few days of computation, while other algorithms cannot complete at all, because their time complexity is exponential. This again demonstrates that sampling is important for data profiling, e.g., FD discovery.
Above, we mentioned the use of focused sampling to find functional dependencies. In the following, we discuss the use of random sampling to find soft functional dependencies. A "soft" functional dependency is defined relative to a "hard" one: a hard functional dependency is satisfied by the entire relation, while a soft functional dependency is almost satisfied, i.e., holds with high probability.
Ilyas et al. [76] propose the sampling-based CORDS, which automatically discovers correlations and soft functional dependencies between columns. Here, correlation refers to general statistical dependence, while a soft functional dependency means that the value of attribute C1 determines the value of attribute C2 with high probability. CORDS uses enumeration to generate pairs of columns that may be associated and heuristically prunes those column pairs that are unrelated with high probability. CORDS applies random sampling with replacement to generate the sample; in the implementation, only a few hundred rows of sample data are used, and the sample size is independent of the data size.
In an experiment evaluating the benefits of CORDS, a workload of 300 queries was run on the Accidents database; both the median and the worst-case query execution times with CORDS applied were better than those without it. Hence, CORDS remains efficient and scalable on large-scale datasets.
Approximate functional dependencies are similar in spirit to soft functional dependencies: an approximate functional dependency only requires the ordinary functional dependency to be satisfied by most tuples of a relation R [103], [104]. Of course, approximate functional dependencies subsume exact functional dependencies, which are satisfied by the entire relation. As noted in [103], when the amount of data is large, the time for functional dependency discovery grows exponentially. Kivinen and Mannila [103] therefore propose to discover approximate dependencies by random sampling. In fact, sampling can be used not only to find approximate functional dependencies but also to help verify exact ones [104]: if a functional dependency does not hold on the sample, it certainly does not hold on the whole relation, so such candidates can be removed immediately.
Functional dependencies must be satisfied by all tuples in a relation, whereas conditional functional dependencies (CFDs) only need to hold on the subset of tuples that satisfies certain patterns [105], and CFDs can be used for data cleaning [105], [106]. Fan et al. [107] propose three methods for CFD discovery. However, when the dataset is large, none of the dependency discovery algorithms scales well enough to discover minimal conditional functional dependencies.
When mining CFDs on big data, the volume issue has to be addressed. Li et al. [77] develop sampling algorithms that obtain a small, representative training set from large, low-quality datasets and discover CFDs on the samples. They use sampling for two reasons: first, CFD discovery needs to scan the dataset multiple times, and sampling reduces the amount of data to be scanned; second, sampling helps filter out dirty items from the low-quality dataset and select clean items as the training set. They define criteria for misleading tuples, i.e., tuples that are dirty, incomplete, or very similar to popular tuples, and then design a Representative and Random Sampling for CFDs (BRRSC), which resembles reservoir sampling [43] but incorporates these criteria into the sampling process. They further propose fault-tolerant CFD discovery and conflict-resolution algorithms to find CFDs. Experimental results show that their sampling-based CFD discovery algorithms can find valid CFD rules over billions of tuples in a reasonable time.
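The following sketch illustrates the flavor of a reservoir-style sampler that skips tuples flagged as misleading, in the spirit of BRRSC; the `is_misleading` predicate shown here (rejecting tuples with missing values) is purely illustrative and is not the criteria defined in [77].

```python
import random

def reservoir_sample_filtered(tuples, k, is_misleading, seed=0):
    """Reservoir sampling (as in Vitter's Algorithm R) that skips tuples
    flagged by the caller-supplied `is_misleading` predicate."""
    rng = random.Random(seed)
    reservoir, seen = [], 0
    for t in tuples:
        if is_misleading(t):
            continue                      # filter dirty/incomplete tuples up front
        seen += 1
        if len(reservoir) < k:
            reservoir.append(t)
        else:
            j = rng.randrange(seen)       # keep the new tuple with probability k / seen
            if j < k:
                reservoir[j] = t
    return reservoir

# Example predicate (illustrative only): treat tuples with missing values as misleading.
sample = reservoir_sample_filtered(
    tuples=[{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 4, "b": 5}],
    k=2,
    is_misleading=lambda t: any(v is None for v in t.values()),
)
```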
C. Sampling-based Test for Inclusion Dependency Candidates
An inclusion dependency (IND) states that every combination of values appearing in one set of attribute columns must also appear in another set of attribute columns [108]. Inclusion dependencies are therefore often used to discover foreign keys [98]. However, IND discovery is computationally expensive; one reason is that existing algorithms need to shuffle huge amounts of data to test IND candidates, which puts pressure on both computation and memory [78].
Under these circumstances, Kruse et al. [78] propose FAIDA, a fast approximate discovery algorithm for inclusion dependencies. To balance efficiency and correctness, FAIDA is guaranteed to find all INDs and reports false positives only with low probability. It uses Apriori-style algorithms [109], [110] to generate IND candidates, and its candidate testing relies on inverted indexes over the values and operates on a small sample of the input data. The sampling algorithm is applied to each table separately. Rather than using plain random sampling, they ensure that the sample of a table contains min{s, d_A} distinct values for each column A, where s is the sample size and d_A is the number of distinct values in column A.
In their experiments, the sample size is set to a default of 500. To verify FAIDA's efficiency, Kruse et al. [78] compare its runtime with BINDER [110], the state-of-the-art algorithm for exact IND discovery, on multiple datasets. On four datasets, FAIDA is consistently 5 to 6 times faster than BINDER while generating and testing almost the same number of IND candidates. In particular, on a dataset of 79.4 GB, BINDER takes 9 hours and 32 minutes to complete, whereas FAIDA needs only 1 hour and 47 minutes. Their evaluation shows that sampling-based FAIDA outperforms the state-of-the-art algorithm by a factor of up to six in runtime without reporting any false positives.
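A rough illustration of the per-column sampling idea is given below: each column keeps at most min{s, d_A} distinct values, and IND candidates that are already contradicted by these sampled values can be discarded safely. This is not FAIDA's actual inverted-index machinery; `full_distinct` (the exact distinct-value sets of the referenced columns) is an assumed input, and candidates that survive the pruning would still need verification.

```python
def column_samples(table, columns, s=500):
    """Collect up to min(s, d_A) distinct values per column (all of them when d_A <= s)."""
    samples = {c: set() for c in columns}
    for row in table:
        for c in columns:
            if len(samples[c]) < s:
                samples[c].add(row[c])
    return samples

def prune_ind_candidates(candidates, samples, full_distinct):
    """Discard A-included-in-B candidates whose sampled A-values are not all
    contained in the exact distinct-value set of B.

    Rejection here is always correct (a concrete counterexample value exists);
    kept candidates still require exact verification on the full data."""
    kept = []
    for a, b in candidates:
        if samples[a] <= full_distinct[b]:
            kept.append((a, b))
    return kept
```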
VII. SUMMARY AND FUTURE WORK
Data in various fields are growing at a large scale, and big data brings both new opportunities and new challenges. Analyzing and mining big data can yield a great deal of potential value, but the sheer amount of data poses great challenges for processing and storage, so data analysis, data mining, and data profiling on large datasets face heavy computation and time pressure. Increasing computing power with clusters of machines is one solution, but it is not always available, and designing distributed algorithms is often difficult. Hence, data reduction techniques such as sampling are very important. There is mature research on data profiling and on sampling individually, but little attention has been paid to sampling and profiling over big data, which is why this article focuses on sampling and profiling in the big data context. We first gave a brief introduction to data profiling and discussed the important factors of sampling in detail. Then, following the classification of data profiling in [16], we presented the application of sampling in single-column profiling, multi-column profiling, and dependency discovery. Table III summarizes the sampling techniques for the data profiling tasks investigated in this survey, indicating the widespread use of sampling in data profiling.
The survey above on "sampling and profiling over big data" mainly concerns relational data and rarely touches graph data or time series data. Since there is little research on sampling-based data profiling for graph or time series data, we outline some future directions below.
A. Sampling for Profiling Time Series Data
Many tasks on time series data need data profiling, e.g., matching heterogeneous events in a sequence [111], [112] with profiled patterns [113], [114], cleaning time series data [115] under speed constraints [22], or repairing timestamps according to given temporal constraints [116] such as sequential dependencies [117]. All these studies use data profiling to detect and repair erroneous temporal data, but the computational and time costs on large-scale temporal data streams can be high. Sampling for profiling time series data is therefore valuable and necessary.
In a time series data stream, exact results are often unnecessary, e.g., when computing quantiles or the probability distribution of speeds. Approximate results are valuable in such streams; for example, an approximate probability distribution of speeds can already support effective anomaly detection. When sampling time series data, estimating the probability distribution of speeds differs from discovering quantiles: the speed at a point depends on the adjacent data points, so sampling for speed computation requires a set of consecutive points within a window rather than isolated points. How to apply sampling to these data profiling tasks over time series data still needs further experimental analysis and research.
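As a sketch of window-based sampling for speed profiling, the function below draws random windows of consecutive points, computes speeds from adjacent observations inside each window, and reports approximate quantiles of the speed distribution; the window length, number of sampled windows, and quantile levels are illustrative assumptions, not values from any of the cited studies.

```python
import random

def sample_speed_quantiles(series, window=100, num_windows=50,
                           q=(0.05, 0.5, 0.95), seed=0):
    """Approximate quantiles of the speed distribution of a time series.

    `series` is a list of (timestamp, value) pairs sorted by timestamp.
    Whole windows are sampled because a speed needs two adjacent observations.
    """
    rng = random.Random(seed)
    speeds = []
    max_start = max(1, len(series) - window)
    for _ in range(num_windows):
        start = rng.randrange(max_start)
        chunk = series[start:start + window]
        for (t1, v1), (t2, v2) in zip(chunk, chunk[1:]):
            if t2 > t1:
                speeds.append((v2 - v1) / (t2 - t1))   # speed between adjacent points
    if not speeds:
        return {}
    speeds.sort()
    return {p: speeds[int(p * (len(speeds) - 1))] for p in q}
```

Such approximate speed quantiles could then serve, for instance, as thresholds for speed-constraint-based anomaly detection, while avoiding a full pass over the stream.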
B. Sampling for Profiling Graph Data
Data profiling is also heavily used on graph data, e.g., using Petri nets in process mining to recover missing events [118], [119] and clean event data [120], discovering keys for graphs and applying them to entity matching [121], or defining functional dependencies for graphs [25] and discovering them [122]. However, these approaches still struggle on large graphs. Fan et al. [121] prove that entity matching on graphs is NP-complete, and recursively defined keys for graphs bring further challenges; they therefore have to design two parallel scalable algorithms, one in MapReduce and one in a vertex-centric asynchronous model. To discover graph functional dependencies, Fan et al. [122] likewise have to handle large-scale graphs by designing effective pruning strategies, using parallel algorithms, and adding processors. As mentioned earlier, designing parallel algorithms is difficult; equivalently, profiling graph data faces computation and memory pressure on large graphs. It is therefore necessary and worthwhile to sample the graph data and carry out data profiling tasks on the sample. Yet sampling graph data is more difficult than sampling relational data. Leskovec and Faloutsos [41] conducted practical experiments on sampling from large graphs and concluded that the best-performing methods are those based on random walks and "forest fire", with sample sizes as low as 15% of the original graph. How to apply these graph sampling methods to the aforementioned graph-based data profiling tasks awaits further experiments and exploration, as sketched below.
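A minimal sketch of random-walk sampling with restarts, assuming an adjacency-list graph and the roughly 15% sample size reported in [41], is shown below; the restart probability and the re-seeding heuristic for walks that get stuck are illustrative choices rather than the exact procedure of [41].

```python
import random

def random_walk_sample(adj, target_fraction=0.15, restart_prob=0.15, seed=0):
    """Random-walk sampling with restarts over an undirected graph.

    `adj` maps every node to a list of its neighbours. The walk continues
    until roughly `target_fraction` of the nodes have been visited, and the
    subgraph induced by the visited nodes is returned.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    target = max(1, int(target_fraction * len(nodes)))
    start = current = rng.choice(nodes)
    visited = {start}
    stuck = 0
    while len(visited) < target:
        if rng.random() < restart_prob or not adj[current]:
            current = start                      # restart at the seed node
        else:
            current = rng.choice(adj[current])   # follow a random edge
        if current in visited:
            stuck += 1
            if stuck > 100 * target:             # jump elsewhere if the walk is trapped
                start = current = rng.choice(nodes)
                stuck = 0
        else:
            visited.add(current)
            stuck = 0
    # Induced subgraph on the visited nodes.
    return {u: [v for v in adj[u] if v in visited] for u in visited}
```

Whether profiling results computed on such a sampled subgraph (e.g., candidate graph keys or graph functional dependencies) transfer to the full graph is exactly the open question raised above.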
C. Sampling for Profiling Heterogeneous Data
Data profiling is also widely used for heterogeneous data, e.g., discovering matching dependencies (MDs) [123], [124], reasoning about matching rules [125], [126], and discovering a concise set of matching keys [127] and conditional matching dependencies (CMDs) [128]. These profiling tasks likewise face computational pressure in a big data context. In fact, MDs, DDs, and related data dependencies are all based on differential functions. When computing the measures for differential dependencies, sampling over pairwise comparisons is harder than sampling over tuples: given an instance of relation R with N tuples, the set M of pairwise comparisons contains N × (N − 1)/2 elements, which greatly enlarges the population to sample from. Moreover, many pairs in M are meaningless when computing the support of DDs [129], i.e., the proportion of pairs we actually want is very small. We must therefore increase the sampling rate in order to include such pairs in the sample and obtain approximate results that are as close as possible to the exact ones.
REFERENCES
[1] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. MONET, 19(2):171–209, 2014.
[2] James Manyika. Big data: The next frontier for innovation, competition, and productivity, 2011.
[3] J Anuradha et al. A brief introduction on big data 5Vs characteristics and Hadoop technology. Procedia Computer Science, 48:319–324, 2015.
[4] Bin Chen, Peter J. Haas, and Peter Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, pages 462–468, 2002.
[5] Albert Bifet. Mining big data in real time. Informatica (Slovenia), 37(1):15–20, 2013.
[6] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding. Data mining with big data. IEEE Trans. Knowl. Data Eng., 26(1):97–107, 2014.
[7] M. S. Mahmud, J. Z. Huang, S. Salloum, T. Z. Emara, and K. Sadatdiynov. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 3(2):85–101, June 2020.
[8] Sandra González-Bailón. Social science in the era of big data. Policy & Internet, 5(2):147–160, 2013.
[9] Waleed Albattah. The role of sampling in big data analysis. In Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, BDAW 2016, Blagoevgrad, Bulgaria, November 10-11, 2016, pages 28:1–28:5, 2016.
[10] Ajay S Singh and Micah B Masuku. Sampling techniques & determination of sample size in applied statistics research: An overview. International Journal of Economics, Commerce and Management, 2(11):1–22, 2014.
[11] Ashwin Satyanarayana. Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality. In IEEE 27th Canadian Conference on Electrical and Computer Engineering, CCECE 2014, Toronto, ON, Canada, May 4-7, 2014, pages 1–6, 2014.
[12] George H. John and Pat Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pages 367–370, 1996.
[13] Prashant Singh, Joachim Van Der Herten, Dirk Deschrijver, Ivo Couckuyt, and Tom Dhaene. A sequential sampling strategy for adaptive classification of computationally expensive data. Structural & Multidisciplinary Optimization, 55(4):1–14, 2017.
[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Wei Li, and Mitsunori Ogihara. Evaluation of sampling for data mining of association rules. In , pages 42–50, 1997.
[15] Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[16] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Profiling relational data: a survey. VLDB J., 24(4):557–581, 2015.
[17] Shaoxu Song, Aoqian Zhang, Lei Chen, and Jianmin Wang. Enriching data imputation with extensive similarity neighbors. PVLDB, 8(11):1286–1297, 2015.
[18] Shaoxu Song, Yu Sun, Aoqian Zhang, Lei Chen, and Jianmin Wang. Enriching data imputation under similarity rule constraints. IEEE Trans. Knowl. Data Eng., 32(2):275–287, 2020.
[19] Shaoxu Song, Han Zhu, and Jianmin Wang. Constraint-variance tolerant data repairing. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 877–892, 2016.
[20] Felix Naumann. Data profiling revisited. SIGMOD Record, 42(4):40–49, 2013.
[21] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Data profiling. In , pages 1432–1435, 2016.
[22] Shaoxu Song, Aoqian Zhang, Jianmin Wang, and Philip S. Yu. SCREEN: stream data cleaning under speed constraints. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 827–841, 2015.
[23] Aoqian Zhang, Shaoxu Song, and Jianmin Wang. Sequential data cleaning: A statistical approach. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 909–924, 2016.
[24] Fang Wang, Menggang Li, Yiduo Mei, and Wenrui Li. Time series data mining: A case study with big data analytics approach. IEEE Access, 8:14322–14328, 2020.
[25] Wenfei Fan, Yinghui Wu, and Jingbo Xu. Functional dependencies for graphs. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 1843–1857, 2016.
[26] Shaoxu Song, Hong Cheng, Jeffrey Xu Yu, and Lei Chen. Repairing vertex labels under neighborhood constraints. PVLDB, 7(11):987–998, 2014.
[27] Shaoxu Song, Boge Liu, Hong Cheng, Jeffrey Xu Yu, and Lei Chen. Graph repairing under neighborhood constraints. VLDB J., 26(5):611–635, 2017.
[28] David Maier, Alon Y. Halevy, and Michael J. Franklin. Dataspaces: Co-existence with heterogeneity. In Proceedings, Tenth International Conference on Principles of Knowledge Representation and Reasoning, Lake District of the United Kingdom, June 2-5, 2006, page 3, 2006.
[29] Shaoxu Song, Lei Chen, and Philip S. Yu. On data dependencies in dataspaces. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 470–481, 2011.
[30] Shaoxu Song, Lei Chen, and Philip S. Yu. Comparable dependencies over heterogeneous data. VLDB J., 22(2):253–274, 2013.
[31] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Data profiling: A tutorial. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1747–1751, 2017.
[32] Hans T Schreuder, Timothy G Gregoire, and Johann P Weyer. For what applications can probability and non-probability sampling be used? Environmental Monitoring and Assessment, 66(3):281–291, 2001.
[33] Ronaldo Iachan, Lewis Berman, Tonja M Kyle, Kelly J Martin, Yangyang Deng, Davia N Moyse, Deirdre Middleton, and Audie A Atienza. Weighting nonprobability and probability sample surveys in describing cancer catchment areas, 2019.
[34] Konstantinos Slavakis, Georgios B. Giannakis, and Gonzalo Mateos. Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge. IEEE Signal Process. Mag., 31(5):18–31, 2014.
[35] Rajeev Agrawal, Anirudh Kadadi, Xiangfeng Dai, and Frédéric Andrès. Challenges and opportunities with big data visualization. In Proceedings of the 7th International Conference on Management of computational and collective intElligence in Digital EcoSystems, Caraguatatuba, Brazil, October 25 - 29, 2015, pages 169–173, 2015.
[36] Lina Zhou, Shimei Pan, Jianwu Wang, and Athanasios V. Vasilakos. Machine learning on big data: Opportunities and challenges. Neurocomputing, 237:350–361, 2017.
[37] Cem Kadilar and Hulya Cingi. Ratio estimators in simple random sampling. Applied Mathematics and Computation, 151(3):893–902, 2004.
[38] Peter J Bickel and David A Freedman. Asymptotic normality and the bootstrap in stratified sampling. The Annals of Statistics, pages 470–482, 1984.
[39] HJG Gundersen and EB Jensen. The efficiency of systematic sampling in stereology and its prediction. Journal of Microscopy, 147(3):229–263, 1987.
[40] Ralph H Henderson and Thalanayar Sundaresan. Cluster sampling to assess immunization coverage: a review of experience with a simplified sampling method. Bulletin of the World Health Organization, 60(2):253, 1982.
[41] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20-23, 2006, pages 631–636, 2006.
[42] Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman, Simon Fong, Zuraida Khairudin, and Nik Nik Abdullah. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering, DaEng 2013, Kuala Lumpur, Malaysia, December 16-18, 2013, pages 13–22, 2013.
[43] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.
[44] Qing He, Haocheng Wang, Fuzhen Zhuang, Tianfeng Shang, and Zhongzhi Shi. Parallel sampling from big data with uncertainty distribution. Fuzzy Sets and Systems, 258:117–133, 2015.
[45] Yu-Lin He, Joshua Zhexue Huang, Hao Long, Qiang Wang, and Chenghao Wei. I-sampling: A new block-based sampling method for large-scale dataset. In , pages 360–367, 2017.
[46] Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big data. Comput. Graph. Forum, 32(3):421–430, 2013.
[47] Salman Salloum, Joshua Zhexue Huang, and Yu-Lin He. Empirical analysis of asymptotic ensemble learning for big data. In Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT 2016, Shanghai, China, December 6-9, 2016, pages 8–17, 2016.
[48] Qing He, Zhong-Zhi Shi, Li-An Ren, and ES Lee. A novel classification method based on hypersurface. Mathematical and Computer Modelling, 38(3-4):395–407, 2003.
[49] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, and Athanasios V. Vasilakos. Big data analytics: a survey. J. Big Data, 2:21, 2015.
[50] Foster J. Provost, David D. Jensen, and Tim Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15-18, 1999, pages 23–32, 1999.
[51] Tim Harford. Big data: A big mistake? Significance, 11(5):14–19, 2014.
[52] Jonathan Nagler and Joshua A Tucker. Drawing inferences and testing theories with big data. PS: Political Science & Politics, 48(1):84–88, 2015.
[53] Meir J Stampfer, Graham A Colditz, Walter C Willett, JoAnn E Manson, Bernard Rosner, Frank E Speizer, and Charles H Hennekens. Postmenopausal estrogen therapy and cardiovascular disease: ten-year follow-up from the nurses' health study. New England Journal of Medicine, 325(11):756–762, 1991.
[54] Robert M Kaplan, David A Chambers, and Russell E Glasgow. Big data and large sample size: a cautionary note on the potential for bias. Clinical and Translational Science, 7(4):342–346, 2014.
[55] Jae Kwang Kim and Zhonglei Wang. Sampling techniques for big data analysis in finite population inference. International Statistical Review, 2018.
[56] Siu-Ming Tam and Jae-Kwang Kim. Big data ethics and selection-bias: An official statistician's perspective. Statistical Journal of the IAOS, 34(4):577–588, 2018.
[57] R. Maisel. Inference from nonprobability samples. Public Opinion Quarterly, 36(3):407–407.
[58] Xiao-Li Meng et al. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2):685–726, 2018.
[59] Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. Sampling-based estimation of the number of distinct values of an attribute. In VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland, pages 311–322, 1995.
[60] Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 15-17, 2000, Dallas, Texas, USA, pages 268–279, 2000.
[61] Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 541–550, 2001.
[62] Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 436–447, 1998.
[63] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261–298, 2002.
[64] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, pages 251–262, 1999.
[65] Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu. Sampling based algorithms for quantile computation in sensor networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 745–756, 2011.
[66] Xian Wu, Wei Fan, Jing Peng, Kun Zhang, and Yong Yu. Iterative sampling based frequent itemset mining for big data. Int. J. Machine Learning & Cybernetics, 6(6):875–882, 2015.
[67] George Kollios, Dimitrios Gunopulos, Nick Koudas, and Stefan Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans. Knowl. Data Eng., 15(5):1170–1187, 2003.
[68] Ying Yan, Liang Jeff Chen, and Zheng Zhang. Error-bounded sampling for analytics on big sparse data. PVLDB, 7(13):1508–1519, 2014.
[69] Julian Ramos Rojas, Mary Beth Kery, Stephanie Rosenthal, and Anind K. Dey. Sampling techniques to improve big data exploration. In , pages 26–35, 2017.
[70] HaiYing Wang, Min Yang, and John Stufken. Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, 114(525):393–405, 2019.
[71] Sunghae Jun, Seung-Joo Lee, and Jea-Bok Ryu. A divided regression analysis for big data. International Journal of Software Engineering and Its Applications, 9(5):21–32, 2015.
[72] Yannis Sismanis, Paul Brown, Peter J. Haas, and Berthold Reinwald. GORDIAN: efficient and scalable discovery of composite keys. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 691–702, 2006.
[73] Ziawasch Abedjan and Felix Naumann. Advancing the discovery of unique column combinations. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, pages 1565–1570, 2011.
[74] Tobias Bleifuß, Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese, Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. Approximate discovery of functional dependencies for large datasets. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 1803–1812, 2016.
[75] Thorsten Papenbrock and Felix Naumann. A hybrid approach to functional dependency discovery. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 821–833, 2016.
[76] Ihab F. Ilyas, Volker Markl, Peter J. Haas, Paul Brown, and Ashraf Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 647–658, 2004.
[77] M. Li, H. Wang, and J. Li. Mining conditional functional dependency rules on big data. Big Data Mining and Analytics, 3(1):68–84, March 2020.
[78] Sebastian Kruse, Thorsten Papenbrock, Christian Dullweber, Moritz Finke, Manuel Hegner, Martin Zabel, Christian Zöllner, and Felix Naumann. Fast approximate discovery of inclusion dependencies. In Datenbanksysteme für Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), 6.-10. März 2017, Stuttgart, Germany, Proceedings, pages 207–226, 2017.
[79] Ruihong Huang, Zhiwei Chen, Zhicheng Liu, Shaoxu Song, and Jianmin Wang. TsOutlier: Explaining outliers with uniform profiles over IoT data. In , pages 2024–2027. IEEE, 2019.
[80] Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. Data Profiling. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2018.
[81] Shaoxu Song, Han Zhu, and Lei Chen. Probabilistic correlation-based similarity measure on text records. Inf. Sci., 289:8–24, 2014.
[82] Hazar Harmouch and Felix Naumann. Cardinality estimation: An experimental survey. PVLDB, 11(4):499–512, 2017.
[83] Robert Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, September 1980.
[84] Stephen Macke, Yiming Zhang, Silu Huang, and Aditya G. Parameswaran. Adaptive sampling for rapidly matching histograms. Proc. VLDB Endow., 11(10):1262–1275, 2018.
[85] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, pages 23–34, 1979.
[86] Ira Pohl. A minimum storage algorithm for computing the median. IBM TJ Watson Research Center, 1969.
[87] Lili Zhang, Wenjie Wang, and Yuqing Zhang. Privacy preserving association rule mining: Taxonomy, techniques, and metrics. IEEE Access, 7:45032–45047, 2019.
[88] Shaoxu Song, Chunping Li, and Xiaoquan Zhang. Turn waste into wealth: On simultaneous clustering and cleaning over dirty data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1115–1124, 2015.
[89] Aoqian Zhang, Shaoxu Song, Yu Sun, and Jianmin Wang. Learning individual models for imputation. In , pages 160–171. IEEE, 2019.
[90] Rakesh Agrawal and John C. Shafer. Parallel mining of association rules. IEEE Trans. Knowl. Data Eng., 8(6):962–969, 1996.
[91] David Wai-Lok Cheung, Jiawei Han, Vincent T. Y. Ng, Ada Wai-Chee Fu, and Yongjian Fu. A fast distributed algorithm for mining association rules. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, December 18-20, 1996, Miami Beach, Florida, USA, pages 31–42, 1996.
[92] Ali Seyed Shirkhorshidi, Saeed Reza Aghabozorgi, Ying Wah Teh, and Tutut Herawan. Big data clustering: A review. In Computational Science and Its Applications - ICCSA 2014 - 14th International Conference, Guimarães, Portugal, June 30 - July 3, 2014, Proceedings, Part V, pages 707–720, 2014.
[93] Sampling: Design and analysis. Technometrics, 42(2):223–224, 2009.
[94] Roger Sauter. Introduction to probability and statistics for engineers and scientists. Technometrics, 47(3):378, 2005.
[95] Shaoxu Song and Lei Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16:1–16:41, 2011.
[96] Thorsten Papenbrock and Felix Naumann. Data-driven schema normalization. In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017, pages 342–353, 2017.
[97] Xiang Lian, Lei Chen, and Shaoxu Song. Consistent query answers in inconsistent probabilistic databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 303–314, 2010.
[98] Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. Discovering interesting inclusion dependencies: application to logical database tuning. Inf. Syst., 27(1):1–19, 2002.
[99] Shaoxu Song, Lei Chen, and Hong Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 846–857, 2012.
[100] Shaoxu Song, Lei Chen, and Hong Cheng. Efficient determination of distance thresholds for differential dependencies. IEEE Trans. Knowl. Data Eng., 26(9):2179–2192, 2014.
[101] Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Anja Jentzsch, and Felix Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4):301–312, 2013.
[102] Jyrki Kivinen and Heikki Mannila. The power of sampling in knowledge discovery. In Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 24-26, 1994, Minneapolis, Minnesota, USA, pages 77–85, 1994.
[103] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In Database Theory - ICDT'92, 4th International Conference, Berlin, Germany, October 14-16, 1992, Proceedings, pages 86–98, 1992.
[104] Jixue Liu, Jiuyong Li, Chengfei Liu, and Yongfeng Chen. Discover dependencies from data - A review. IEEE Trans. Knowl. Data Eng., 24(2):251–264, 2012.
[105] Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for data cleaning. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 746–755, 2007.
[106] Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2):6:1–6:48, 2008.
[107] Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5):683–698, 2011.
[108] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. Efficient algorithms for mining inclusion dependencies. In Advances in Database Technology - EDBT 2002, 8th International Conference on Extending Database Technology, Prague, Czech Republic, March 25-27, Proceedings, pages 464–476, 2002.
[109] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. J. Intell. Inf. Syst., 32(1):53–73, 2009.
[110] Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. Divide & conquer-based inclusion dependency discovery. PVLDB, 8(7):774–785, 2015.
[111] Xiaochen Zhu, Shaoxu Song, Xiang Lian, Jianmin Wang, and Lei Zou. Matching heterogeneous event data. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 1211–1222, 2014.
[112] Yu Gao, Shaoxu Song, Xiaochen Zhu, Jianmin Wang, Xiang Lian, and Lei Zou. Matching heterogeneous event data. IEEE Trans. Knowl. Data Eng., 30(11):2157–2170, 2018.
[113] Xiaochen Zhu, Shaoxu Song, Jianmin Wang, Philip S. Yu, and Jiaguang Sun. Matching heterogeneous events with patterns. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 376–387, 2014.
[114] Shaoxu Song, Yu Gao, Chaokun Wang, Xiaochen Zhu, Jianmin Wang, and Philip S. Yu. Matching heterogeneous events with patterns. IEEE Trans. Knowl. Data Eng., 29(8):1695–1708, 2017.
[115] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. Time series data cleaning: From anomaly detection to anomaly repairing. PVLDB, 10(10):1046–1057, 2017.
[116] Shaoxu Song, Yue Cao, and Jianmin Wang. Cleaning timestamps with temporal constraints. PVLDB, 9(10):708–719, 2016.
[117] Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, and Divesh Srivastava. Sequential dependencies. PVLDB, 2(1):574–585, 2009.
[118] Jianmin Wang, Shaoxu Song, Xiaochen Zhu, and Xuemin Lin. Efficient recovery of missing events. PVLDB, 6(10):841–852, 2013.
[119] Jianmin Wang, Shaoxu Song, Xiaochen Zhu, Xuemin Lin, and Jiaguang Sun. Efficient recovery of missing events. IEEE Trans. Knowl. Data Eng., 28(11):2943–2957, 2016.
[120] Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, and Jian Pei. Cleaning structured event logs: A graph repair approach. In , pages 30–41, 2015.
[121] Wenfei Fan, Zhe Fan, Chao Tian, and Xin Luna Dong. Keys for graphs. PVLDB, 8(12):1590–1601, 2015.
[122] Wenfei Fan, Chunming Hu, Xueli Liu, and Ping Lu. Discovering graph functional dependencies. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 427–439. ACM, 2018.
[123] Shaoxu Song and Lei Chen. Discovering matching dependencies. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, pages 1421–1424, 2009.
[124] Shaoxu Song and Lei Chen. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng., 87:146–166, 2013.
[125] Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. Reasoning about record matching rules. PVLDB, 2(1):407–418, 2009.
[126] Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. Dynamic constraints for record matching. VLDB J., 20(4):495–520, 2011.
[127] Shaoxu Song, Lei Chen, and Hong Cheng. On concise set of relative candidate keys. PVLDB, 7(12):1179–1190, 2014.
[128] Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. Discovering conditional matching rules.