2021 IEEE 37th International Conference on Data Engineering (ICDE) | 2021
LogLog Filter: Filtering Cold Items within a Large Range over High Speed Data Streams
Abstract
Many real-world datasets arrive as data streams, and processing these streams is fundamental to applications such as anomaly detection. In this paper, we study three stream-processing tasks: computing item frequencies, finding top-k hot items, and detecting heavy changes. The sketches widely used for these tasks consume a large amount of memory, and their performance is easily degraded by the skewed distribution of data streams. To address this issue, the Cold Filter (CF) method was previously proposed to separate cold items from hot items and to record the frequencies of hot items in a separate structure. However, CF typically has a small filter range and is effective only for filtering cold items with small frequencies. In some real-world applications, the frequencies of cold items may reach hundreds or even tens of thousands. To address this limitation, we exploit the LogLog structure and develop a memory-efficient method, LogLog Filter (LLF), to accurately perform the three tasks above. LLF builds a register array in which each register approximately counts the sum of the frequencies of the items hashed into it. Our method remarkably enlarges the filter range of CF while using fewer bits: it requires only 4 bits per register to filter cold items with frequencies up to $2^{2^4}$ (that is, $2^{16} = 65{,}536$). We conduct extensive experiments on real-world and synthetic datasets, and the experimental results demonstrate the efficiency and effectiveness of our method.
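To make the register-array idea concrete, the following is a minimal Python sketch of a filter whose small registers grow roughly logarithmically with the count hashed into them. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give LLF's update rule, so a classic Morris-style probabilistic increment is swapped in here. The class name `LogLogStyleFilter`, its parameters, the single SHA-256-based hash, and the threshold test in `is_cold` are all hypothetical choices made for the example.

```python
import hashlib
import random


class LogLogStyleFilter:
    """Illustrative filter: an array of small registers with Morris-style updates.

    Assumption: this update rule is a stand-in, not the rule used by LLF.
    """

    def __init__(self, num_registers=1024, bits=4, seed=0):
        self.m = num_registers
        self.max_value = (1 << bits) - 1   # 15 for a 4-bit register
        self.registers = [0] * num_registers
        self.rng = random.Random(seed)     # seeded so runs are reproducible

    def _index(self, item):
        # Map an item (a string) to one register with a stable hash.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        i = self._index(item)
        c = self.registers[i]
        # Increment with probability 2^-c, so the register value tracks
        # roughly log2 of the total count hashed into it. The register
        # saturates at max_value.
        if c < self.max_value and self.rng.random() < 2.0 ** -c:
            self.registers[i] = c + 1

    def estimate(self, item):
        # Morris estimator: 2^c - 1 approximates the summed frequency of
        # all items sharing this register (not of this item alone).
        return (1 << self.registers[self._index(item)]) - 1

    def is_cold(self, item, threshold):
        # Hypothetical filtering rule: treat the item as cold while its
        # register's estimate is still below the threshold.
        return self.estimate(item) < threshold


f = LogLogStyleFilter(num_registers=8, bits=4, seed=1)
f.insert("flow-a")
print(f.estimate("flow-a"))  # 1: the first update to an empty register always increments
```

Under this stand-in rule, a saturated 4-bit register holds the value 15 and yields an estimate of $2^{15} - 1$. How LLF itself reaches a filter range of $2^{2^4}$ is not stated in the abstract. In a full pipeline, items that are no longer cold would be forwarded to a second structure that records hot-item frequencies, as the abstract describes for CF.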