Proceedings of the 28th ACM International Conference on Information and Knowledge Management | 2019

Estimating the Number of Distinct Items in a Database by Sampling

Abstract

Counting the number of distinct items in a dataset is a well-known computational problem with numerous applications. When exact counting is infeasible, one must resort to approximation. One approach is to estimate the number of distinct items from a random sample. This is useful, for example, when the dataset is too large to process in full or when only a sample of the data is available, and it can also speed up the computation considerably. In statistics, this problem is known as the Unseen Species Problem. In this paper, we propose an estimation method for this problem that is especially suitable when the sample is much smaller than the entire set and each item is repeated relatively few times. Our method is simple compared with known methods, and its estimates are accurate enough to be useful on certain real-life datasets that arise in data mining scenarios. We demonstrate the method on real data, where the task is to estimate the number of duplicate URLs.

DOI 10.1145/3357384.3358084

Language English

Journal Proceedings of the 28th ACM International Conference on Information and Knowledge Management
