A Case for Partitioned Bloom Filters
Paulo Sérgio Almeida
INESC TEC and University of Minho
Abstract
In a partitioned Bloom filter the m-bit vector is split into k disjoint parts of size m/k, one per hash function. Contrary to hardware designs, where they prevail, software implementations mostly adopt standard Bloom filters, considering partitioned filters slightly worse, due to the slightly larger false positive rate (FPR). In this paper, by performing an in-depth analysis, first we show that the FPR advantage of standard Bloom filters is smaller than thought; more importantly, by studying the per-element FPR, we show that standard Bloom filters have weak spots in the domain: elements which will be tested as false positives much more frequently than expected. This is relevant in scenarios where an element is tested against many filters, e.g., in packet forwarding. Moreover, standard Bloom filters are prone to exhibit extremely weak spots if naive double hashing is used, something occurring in several, even mainstream, libraries. Partitioned Bloom filters exhibit a uniform distribution of the FPR over the domain and are robust to the naive use of double hashing, having no weak spots. Finally, by surveying several usages other than testing set membership, we point out the many advantages of having disjoint parts: they can be individually sampled, extracted, added or retired, leading to superior designs for, e.g., SIMD usage, size reduction, test of set disjointness, or duplicate detection in streams. Partitioned Bloom filters are better, and should replace the standard form, both in general purpose libraries and as the base for novel designs.

A Bloom filter [3] is a probabilistic data structure to represent a set in a compact way. An element which has been inserted will always be reported as present; an element not in the set may erroneously be reported as present (i.e., false positives may arise), but the Bloom filter can be configured so that the probability of false positives is as low as desired.
Bloom filters are used in many settings, such as networking [7] and distributed systems [33].

A standard Bloom filter is a single array of m bits over which k independent hash functions range. When inserting an element, each of the k functions is used to produce an index, and the corresponding bit is set. When querying, an element is considered present if all bits in the positions given by the k hash functions are set.

A variant, partitioned Bloom filters, proposed by Mullin [25], divides the array into k disjoint parts of size m/k (assuming m is a multiple of k). Each of the k hash functions ranges over m/k, being used to set or test a bit in the corresponding part. The more obvious feature in partitioned Bloom filters is the complete independence of each of the k parts and of each corresponding bit setting/testing. This has some obvious advantages, such as parallel access to each part, which has made partitioned Bloom filters widely adopted in hardware implementations, such as in [9, 31], where they are sometimes called parallel Bloom signatures.

A hybrid variant divides the filter in k/h parts, with h hash functions per part, such as a hardware implementation in [12], where k/h independent multi-port memory cores, each allowing h accesses per cycle, are used. For hardware designs, an important consideration [31] is that using single-port SRAM, for the partitioned scheme, requires much less area than using k-ported SRAM for the standard scheme, or h-ported SRAM for the hybrid scheme, because the size of an SRAM cell increases quadratically with the number of ports. This settles the standard-versus-partitioned choice for hardware designs, leading them to opt for the partitioned variant.

Concerning software implementations, standard Bloom filters prevail. The general feeling towards partitioned Bloom filters is that they are almost the same as standard ones, but produce slightly worse false positive rates, especially in small Bloom filters.
This comes from the observation [21] that partitioned Bloom filters will have slightly more bits set than standard ones, and this slightly higher fill ratio (proportion of set bits) will result in a correspondingly higher false positive rate.

As we will demonstrate in this paper, the issue is more subtle, and this slight advantage comes at a substantial cost, including in the false positive rate itself. The main contributions of this paper are:

• Perform an in-depth analysis of the false positive rate in Bloom filters where we: provide a simpler explanation, compared with current literature, of why the standard formula is a strict lower bound of the true false positive rate; address the effect due to different hash functions colliding for a given element; obtain for the first time an exact formula for the per-element false positive rate, i.e., the expected false positive rate, for each specific element of the domain, over the range of filters that do not contain it.

• Point out the consequences for standard Bloom filters of the above hash collision problem, namely the occurrence of weak spots in the domain: elements which will be tested as false positives much more frequently than expected. This can be a problem both for small-capacity standard Bloom filters and for blocked Bloom filters [29], and its unexpectedly frequent occurrence can be as surprising as the Birthday Problem [8].

• Expose pitfalls when using Double Hashing with standard Bloom filters, of which many widespread libraries seem to be unaware, and contrast it with the robustness of partitioned Bloom filters in this matter.

• Survey usages for Bloom filters other than testing set membership, identifying many advantages that result from having disjoint parts that can be individually sampled, extracted, added or retired. We identify how the partitioned scheme leads to superior designs for SIMD techniques, testing set disjointness, reducing filter size, and duplicate detection in streams.

Figure 1: Standard Bloom filter using 4 hash functions.
While most Bloom filters are used to represent large sets, in some scenarios small Bloom filters are used. If a small false positive rate is also wanted, the combination of a small m and a (relatively) large k will cause, for a standard Bloom filter, a non-negligible probability that two or more of the k hash functions (applied to a given element) collide (produce the same index). Such a collision is illustrated in Figure 1, in yellow, where two of the 4 hash functions applied to y produce the same index, resulting in a total of three bits being set for y, instead of the expected 4 bits. Such intra-element hash collisions are not normally illustrated (or discussed) in Bloom filter presentations, which just focus on inter-element collisions, such as the one between x and y, in red.

In fact the surprisingly high probability of intra-element hash collisions is precisely an instance of the Birthday Problem, stated in 1927 by H. Davenport¹, as described in [8]. The probability that, for a given element, two or more of the k independent hash functions return the same value is:

$$1 - \frac{P(m, k)}{m^k}, \tag{1}$$

where $P(m, k)$ denotes the number of k-permutations of m. We now give some examples.

¹ But frequently misattributed to von Mises, who stated a similar but different version of the problem. Some archaeology about its origin can be found at [2].

Sets of words in small strings
Mullin [26] used Bloom filters to store sets of words occurring in strings (e.g., titles and authors of articles), typically up to 15 words per string, with filters ranging from 32 up to 256 bits, the most common one being 96 bits, and using 8 hash functions per filter. With m = 96 and k = 8, two or more hash functions will collide in about one out of four cases (25.88%), where the false positive error will be at least twice that expected from the classic formula (for filters that reached design capacity), or much higher than expected (for filters still far away from design capacity).

Figure 2: Partitioned Bloom filter using 4 hash functions, represented as a bidimensional bit array with one row per part.
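Equation 1 is easy to evaluate numerically. The following is a minimal sketch in Python (the function name `collision_probability` is ours, introduced only for illustration); it assumes k independent hash functions, each uniform over the m positions, and evaluates the expression for the parameter combinations mentioned in this section.

```python
from math import perm


def collision_probability(m: int, k: int) -> float:
    """Probability that at least two of k independent, uniform hash values
    over m positions coincide (Equation 1): 1 - P(m, k) / m**k."""
    # perm(m, k) is the number of k-permutations of m (ordered, no repetition).
    return 1 - perm(m, k) / m**k


if __name__ == "__main__":
    # Parameter pairs taken from the examples in the text.
    for m, k in [(96, 8), (64, 4), (64, 8), (512, 8), (512, 16)]:
        print(f"m = {m:3d}, k = {k:2d}: {collision_probability(m, k):.4f}")
```

Python integers are arbitrary precision, so the ratio is formed exactly before the single conversion to floating point, which keeps the result accurate even for larger m and k.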
Packet forwarding
Whitaker and Wetherall [34] used small Bloom filters in packets to detect possible forwarding loops in experimental routing protocols. In this case 64-bit filters were used, with “4 bits set to one”. With m = 64 and k = 4, two or more hash functions will collide 9.1 percent of the time. Interestingly, and differently from the more usual usage, in this case a given element (node) is tested against many Bloom filters (packets), and instead of using k hash functions for the element, a Bloom mask with exactly 4 ones at random positions is computed at start time, overcoming the collision problem.
Blocked Bloom filters
One problem with Bloom filters is the spreading of memory accesses, hurting performance. This is avoided by blocked Bloom filters (BBFs) [29], where the filter is divided into many blocks, each block a Bloom filter fitting into a single cache line (e.g., 512 bits), and using an extra hash function to select the block. For a very high precision filter, with k = 16 and m = 512, hash collisions will occur for 21 percent of elements, and even for a more usual setting of k = 8, there will be collisions for 5.3 percent of elements. For an extreme performance BBF that requires a single memory access, using word sized blocks, m = 64, for k = 8 we have collisions 36 percent of the time. So, the collision problem occurs in practice for BBFs.

It should be emphasized that using blocking is the only way that Bloom filters can remain performance-wise competitive with dictionary-based approaches (such as Cuckoo Filters [15] or Morton Filters [6]). Therefore, the scenario of a small Bloom filter (a block of a BBF) is important, even for “big data” using huge BBFs.

The above mentioned hash collision possibility is not a problem in partitioned Bloom filters because each of the k functions is used to set/test bits in a different part. While in standard Bloom filters hash collisions will lead to bit collisions (the same bit being used for different functions), in partitioned Bloom filters such hash collisions will not lead to bit collisions. This is illustrated in Figure 2, which shows a partitioned Bloom filter using 4 parts, represented as a bidimensional bit array with one row per part. It can be seen that even if two of the 4 hash functions applied to y produce the same value (column index), two different bits in the filter are set.

So, while for partitioned Bloom filters exactly k distinct bits in the filter are accessed, in standard Bloom filters up to k distinct bits are accessed (most times k bits, but sometimes fewer than k bits).
As we will see, this makes the standard false positive formula incorrect, producing a value lower than the actual one, and complicating the exact false positive calculation (something that has been addressed before) but it also produces a non-uniform distribution of the false positive rate, with the occurrence of weak spots in the domain, something that we address here for the first time.

Interestingly, in the original proposal by Bloom exactly k bits are set/tested. From [3]: “each message in the set to be stored is hash coded into a number of distinct bit addresses” and “where d is the number of distinct bits set to 1 for each message in the given set”. The original formula for false positive rate is consistent with this behavior. This fact seems to have been mostly ignored in the literature, one notable exception being [23] “In [Bl70], the assumption was that the k locations are chosen without repetitions; it is also possible to allow repetitions, which makes the program simpler” and more recently [17], which compares the original proposal with standard Bloom filters.

The original Bloom proposal is not practical, as it demands some extra effort to ensure exactly k distinct addresses, e.g., iterating over an unbounded family of hash functions until k different values have been produced (with the need to compare each new value to all the previous ones); or a way to directly produce a pseudo-random k-permutation of m, keyed by the element. And even if little cost seems to be required [30], practitioners typically would not be aware of the problem or solution, and would not bother to address such minutiae. So, it is not surprising that what became adopted as standard Bloom filters differs from the original proposal.

Partitioned Bloom filters, which differ both from the original and the standard ones, not only are immune to the birthday problem (being in a sense more in the spirit of the original proposal) but are also practical to implement.
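To make the difference in indexing concrete, here is a minimal sketch in Python of the two variants. The class names, the `positions` method, and the use of salted BLAKE2b to stand in for a family of k independent hash functions are all our own assumptions for illustration; they are not taken from any particular library.

```python
import hashlib


def _hash(item: bytes, seed: int, size: int) -> int:
    """Hypothetical hash family: member `seed` maps `item` into range(size)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % size


class StandardBloom:
    """All k hash functions range over the whole m-bit vector."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [False] * m

    def positions(self, item: bytes) -> list[int]:
        # Indices may repeat: an intra-element hash collision is a bit collision.
        return [_hash(item, i, self.m) for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self.positions(item):
            self.bits[p] = True

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p] for p in self.positions(item))


class PartitionedBloom(StandardBloom):
    """Hash function i ranges only over part i, of size m/k."""

    def __init__(self, m: int, k: int):
        if m % k != 0:
            raise ValueError("m must be a multiple of k")
        super().__init__(m, k)

    def positions(self, item: bytes) -> list[int]:
        part = self.m // self.k
        # Equal hash values land in different parts, hence in different bits.
        return [i * part + _hash(item, i, part) for i in range(self.k)]


if __name__ == "__main__":
    std, par = StandardBloom(64, 8), PartitionedBloom(64, 8)
    items = [f"item-{i}".encode() for i in range(20)]
    for item in items:
        print(item.decode(),
              "standard:", len(set(std.positions(item))),
              "partitioned:", len(set(par.positions(item))))
```

Counting the distinct positions per element shows the point made above: the partitioned variant always touches exactly k bits, while the standard variant touches at most k.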
We now do a theoretical analysis of the false positive rate, revisiting Bloom’s analysis, the standard analysis, existing improvements to the standard analysis producing a correct formula, and the formula for partitioned Bloom filters, and comparing standard with partitioned Bloom filters. In the next section we present a novel per-element false positive analysis, showing how the expected false positive rate behaves for different elements in the domain.

Original Bloom’s analysis
Bloom’s analysis [3] states that the probability of a bit still being zero after n elements are added is

$$\left(1 - \frac{k}{m}\right)^{n}, \tag{2}$$

which, contrary to what sometimes is said, is correct, but for the original Bloom proposal where exactly k distinct bits are set, and that the false positive rate is:

$$\left(1 - \left(1 - \frac{k}{m}\right)^{n}\right)^{k}. \tag{3}$$

The analysis is almost correct, but it suffers from the same problem as the standard analysis below. But it is irrelevant for standard Bloom filters used in practice, as they differ from the original Bloom proposal.

The standard analysis, by Mullin [25], and widely used, states that the probability of a bit still being zero after n elements are added is

$$\left(1 - \frac{1}{m}\right)^{kn}, \tag{4}$$

which is correct, and that the false positive rate is

$$F_a(n, m, k) = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}, \tag{5}$$

which is only approximate, as we discuss below.

There is one problem with the standard analysis, which has already been detected and corrected before. The standard analysis derives the false positive rate only as a function of the mean fill ratio p, as $p^k$. Even though this gives a very good approximation for large Bloom filters, given the high concentration of the fill ratio around its mean [24], it is not an exact formula.

Exact formulas for standard Bloom filters were developed [5, 11], by deriving the probability distribution of the fill ratio and weighing the false positive rate incurred by each concrete fill ratio with the probability of it occurring.
A similar result had already been derived in [23], for a Bloom filter variant divided in pages (essentially, a blocked Bloom filter with typically large blocks), and a formula for the original Bloom filters was derived more recently in [17].

A simpler strict lower bound argument
The standard formula, in Equation 5, has also been proven to be a strict lower bound for the true false positive rate in [5] using considerations of conditional probability, and to be a lower bound in [11] by resorting to Hölder’s inequality [18]. We now present a simpler and more elegant reasoning of why it is a strict lower bound. It results from a direct application of Jensen’s inequality [20]: for a strictly convex function, such as $f(x) = x^k$ when $k > 1$ and $x > 0$, and for a non-constant random variable R, such as the fill ratio,

$$f(E[R]) < E[f(R)]. \tag{6}$$

This means that, for $k > 1$, raising the expected fill ratio to the power of k, as done in the standard formula, produces a value always smaller than the expected value of the fill ratio raised to the power of k, which is what gives the exact average false positive rate.

As presented by the above mentioned works, computing the fill ratio distribution is an instance of the well known balls into bins experiment. It can be computed by resorting to the number of surjective functions from an n-set to an i-set, $e_{ni}$ [16], that can be directly derived using the inclusion-exclusion principle (in the complementary form) as:

$$e_{ni} = \sum_{j=0}^{i} (-1)^{j} \binom{i}{j} (i - j)^{n}. \tag{7}$$

The probability $B(n, m, i)$ of having exactly i non-empty bins, after throwing n balls randomly into m bins, is then:

$$B(n, m, i) = \frac{\binom{m}{i}\, e_{ni}}{m^{n}}. \tag{8}$$

The probability of having exactly i bits set after inserting n elements into an m sized standard Bloom filter using k hash functions is then:

$$S(n, m, k, i) = B(nk, m, i). \tag{9}$$

The false positive rate for a standard Bloom filter is then:

$$F_s(n, m, k) = \sum_{i=1}^{m} S(n, m, k, i) \left(\frac{i}{m}\right)^{k}. \tag{10}$$

As the k parts are independently set/tested, the expected false positive rate is the product of the individual expected rates, and so computed as the one for each part to the power of k. For each part, the standard formula, with k = 1, gives the exact part false positive rate, as the inequality in Equation 6 becomes an equality when k = 1. So, for a partitioned Bloom filter of size m, made up of k parts, each with m/k bits, the exact false positive rate when n elements were inserted is given by:

$$F_p(n, m, k) = \left(1 - \left(1 - \frac{k}{m}\right)^{n}\right)^{k}, \tag{11}$$

which is much simpler than the exact formula for standard Bloom filters (as well as the one for original Bloom filters, described in [17]). Interestingly, it coincides with Bloom’s formula for his original proposal, while being exact. The simplicity of this formula results from the conceptual simplicity: a partitioned Bloom filter can be seen as an AND of k independent single-hash filters, all used for insertions. It also translates to a simplicity of presentation, which is better, pedagogically, than standard Bloom filters, as it allows deriving a more complex (composite) concept in terms of a simpler one (each part).

Table 1: Comparison between partitioned and standard Bloom filters false positive rates, for different combinations of m and k, for filters at nominal occupation ($n = \frac{m}{k} \ln 2$), showing both the approximate ($F_a$) and the exact ($F_s$) values for standard filters, the value for partitioned filters ($F_p$) and the ratio $F_p/F_s$.

m      k    F_a          F_s          F_p          F_p/F_s
64     4    0.06244514   0.06423247   0.06676410   1.03941360
64     8    0.00227672   0.00260362   0.00316870   1.21703762
512    4    0.06126247   0.06148344   0.06176528   1.00458411
512    8    0.00375309   0.00381650   0.00389940   1.02172097
512    16   0.00001409   0.00001513   0.00001661   1.09783475
4096   4    0.06233016   0.06235819   0.06239353   1.00056676
4096   8    0.00385474   0.00386284   0.00387308   1.00265094
4096   16   0.00001486   0.00001499   0.00001516   1.01143019

Common folklore is that partitioned Bloom filters are not worth adopting over standard ones, e.g., in [21] “partitioned filters tend to have more 1’s than nonpartitioned filters, resulting in larger false positive probabilities”. But hash collisions, even though decreasing the fill ratio, increase the false positives for elements suffering the collision, and so the question is more subtle.
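The three formulas can be evaluated directly. The sketch below, in Python, implements Equations 5, 7, 8, 10 and 11 with exact rational arithmetic for the balls-into-bins distribution. The function names are ours, and taking the nominal capacity as the integer part of (m/k) ln 2 is an assumption on our side; it is the choice that reproduces the m = 64 rows of Table 1.

```python
from fractions import Fraction
from math import comb, floor, log


def surjections(n: int, i: int) -> int:
    """Number of surjective functions from an n-set to an i-set (Equation 7)."""
    return sum((-1) ** j * comb(i, j) * (i - j) ** n for j in range(i + 1))


def B(n: int, m: int, i: int) -> Fraction:
    """Probability of exactly i non-empty bins after n balls into m bins (Equation 8)."""
    return Fraction(comb(m, i) * surjections(n, i), m**n)


def F_a(n: int, m: int, k: int) -> float:
    """Standard (approximate) formula, Equation 5."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def F_s(n: int, m: int, k: int) -> float:
    """Exact false positive rate of a standard Bloom filter, Equation 10."""
    total = sum(B(n * k, m, i) * Fraction(i, m) ** k for i in range(1, m + 1))
    return float(total)


def F_p(n: int, m: int, k: int) -> float:
    """Exact false positive rate of a partitioned Bloom filter, Equation 11."""
    return (1 - (1 - k / m) ** n) ** k


def nominal_capacity(m: int, k: int) -> int:
    """Assumed rounding: integer part of (m/k) ln 2."""
    return floor(m / k * log(2))


if __name__ == "__main__":
    for m, k in [(64, 4), (64, 8)]:
        n = nominal_capacity(m, k)
        fa, fs, fp = F_a(n, m, k), F_s(n, m, k), F_p(n, m, k)
        print(f"m={m} k={k} n={n}: F_a={fa:.8f} F_s={fs:.8f} F_p={fp:.8f} "
              f"F_p/F_s={fp / fs:.8f}")
```

Exact fractions are used for the inclusion-exclusion sum because its terms alternate in sign and are very large, which makes a naive floating point evaluation unreliable.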
Using the exact formulas for each case, Table 1 shows how partitioned and standard Bloom filters compare, namely the ratio of false positives $F_p/F_s$, for some combinations of m and k for filters at full capacity with $n = \frac{m}{k} \ln 2$.

It can be seen that although partitioned filters have indeed slightly more false positives, the difference is less than what the standard formula ($F_a$) would suggest. The largest difference in Table 1 occurs for word sized filters (m = 64) with k = 8, with a 22% higher false positive rate, but even blocked Bloom filters normally aim for blocks of cache line size (m = 512).

Table 2: Ratio between partitioned and standard Bloom filters false positive rates, $F_p/F_s$, for different combinations of m, k, and occupation (fraction of the nominal capacity $n = \frac{m}{k} \ln 2$).

Table 2 shows the ratio of false positives $F_p/F_s$ for filters at different occupations (namely / , / , and / ) relative to the nominal capacity. The ratio increases somewhat for word sized filters and small occupations, but those occupations for those filters are degenerate cases, with just a few elements inserted, and negligible false positive rates, whether for standard or partitioned filters.

So, the average false positive rate is not relevant for making a choice between standard and partitioned Bloom filters. But as we discuss next, a more relevant issue is the distribution of false positives over the elements in the domain subject to being tested.

There are two ways that Bloom filters can be used, and two different points of view regarding false positives:

1. Filter point of view: having a filter, in which elements were inserted over time, test new elements using the filter.
2. Element point of view: for a specific element, test it against many different filters that show up, to see if the element is present in them.

The first usage is the more usual, for which we want to know the global average false positive rate.
The second usage corresponds to the packet forwarding scenario, where at each node (representing an element) many different filters arrive (each one representing a path that a packet took to reach the node). For this second usage we want to know, for each specific element in the domain, the average false positive rate over all possible filters (considering some fixed k, m, and n) that do not include the element. Particularly relevant is the question of whether this per-element rate is the same for all elements (the global average) or whether it is non-uniform, varying for different elements.

For partitioned Bloom filters, with k independent parts, accessed by k independent hash functions, the per-element false positive rate is the same for all elements, and equal to the global average. But for standard Bloom filters, the possibility of hash collisions makes some elements have fewer than k independent bits to test. We have thus a non-uniform distribution of false positives: for a given element having d < k different bit positions to test, the average false positive rate will be higher than for those elements for which no collisions occurred. Elements suffering collisions are then weak spots in the domain: they will be considered more often than expected as belonging to filters against which they are tested. As we will see, for elements suffering several hash function collisions, the false positive rate can be more than one order of magnitude larger than expected. We now derive an exact formula for the per-element false positive rate.

Consider a specific element e of the domain, having d different bit positions resulting from the k independent hash functions, where d ≤ k.
We want to know the average false positive rate $F_s(n, m, k, d)$ when e is tested against standard Bloom filters of size m where a set of n elements not containing e was inserted. A first observation is that the per-element rate cannot be obtained by simply going to the exact formula in Equation 10, where the fill ratio is raised to the power of k, and replacing $(i/m)^k$ with $(i/m)^d$, i.e.,

$$F_s(n, m, k, d) \neq \sum_{i=1}^{m} S(n, m, k, i) \left(\frac{i}{m}\right)^{d}. \tag{12}$$

The reason is that, given that there are d different positions, they are not independent, and we cannot use the independent testing assumption as for the k positions. This can be seen by a simple example of a filter with k = 2, m = 2, n = 1, and computing the false positive rate for elements with d = 2 different bits. When considering the case i = 1, i.e., one bit set in the filter, the fill ratio being 1/2, for d = 2 there is no possibility of a false positive, while using $(i/m)^d$ would give the erroneous $(1/2)^2 = 1/4$.

The correct formula for the probability of d different bits being set when i of the m bits in the filter are set is:

$$\prod_{j=0}^{d-1} \frac{i - j}{m - j}, \tag{13}$$

since the first of the d positions is one of the i bits set, the second is one of the remaining $i - 1$, the third one of the remaining $i - 2$ and so on. The probability is zero for d > i.

The correct formula for the per-element false positive rate is then obtained by averaging over the different possible numbers of bits set, weighted by their probability of occurring, as before, resulting in:

$$F_s(n, m, k, d) = \sum_{i=1}^{m} S(n, m, k, i) \prod_{j=0}^{d-1} \frac{i - j}{m - j}. \tag{14}$$

Table 3 shows how the per-element false positive rate compares with the (global) average false positive rate, showing the ratio $F_s(n, m, k, d)/F_s(n, m, k)$ for different numbers of hash collisions $c = k - d$, from no collision ($d = k$) up to three collisions ($d = k - 3$), for filters at different occupations (ratios relative to nominal capacity $n = \frac{m}{k} \ln 2$).

Table 3: Ratio between per-element and global false positive rate for standard Bloom filters, $F_s(n, m, k, d)/F_s(n, m, k)$, for different combinations of m, k, and hash collisions $c = k - d$, for filters at different occupations.

It can be seen that the false positive rate increases noticeably with the number of hash collisions that occur for the element being tested, in relation to the global average rate for the filter. This effect is more prevalent for small occupations, with false positive rates reaching two orders of magnitude larger than the global average for / occupation and three collisions. This may cause surprises in scenarios where a filter is dimensioned with some expectations about the false positive rate over its lifetime, from empty to full. Some elements will incur many more false positives than planned for, if using either the standard or exact formula for the global average.

Table 4: Probability of having some hash collision(s) and of having exactly $0 \le c \le 3$ hash collisions, for some combinations of k and m.

                     collisions
m     k    some     0        1        2        3
64    4    0.0911   0.9089   0.0894   0.0017   0.0000
64    8    0.3660   0.6340   0.3115   0.0510   0.0034
512   8    0.0535   0.9465   0.0525   0.0010   0.0000
512   16   0.2108   0.7892   0.1905   0.0192   0.0011

The question of how frequent those weak spots in the domain are, especially the “very weak” spots having more than one hash collision, is easily answered. The probability of an element being a weak spot is an instance of the birthday problem, as discussed above, with value given by Equation 1.
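As an illustrative check of Equation 14, the following Python sketch (function names are ours) evaluates the per-element rate with exact fractions. It reproduces the small k = 2, m = 2, n = 1 example and verifies a consistency property that follows from the law of total probability: weighting the per-element rates by the balls-into-bins probability B(k, m, d) of an element having d distinct positions recovers the global rate of Equation 10.

```python
from fractions import Fraction
from math import comb


def surjections(n: int, i: int) -> int:
    """Number of surjective functions from an n-set to an i-set (Equation 7)."""
    return sum((-1) ** j * comb(i, j) * (i - j) ** n for j in range(i + 1))


def B(n: int, m: int, i: int) -> Fraction:
    """Probability of exactly i non-empty bins after n balls into m bins (Equation 8)."""
    return Fraction(comb(m, i) * surjections(n, i), m**n)


def all_set_probability(i: int, m: int, d: int) -> Fraction:
    """Probability that d distinct positions are all among the i set bits (Equation 13)."""
    p = Fraction(1)
    for j in range(d):
        p *= Fraction(i - j, m - j)  # the factor for j == i is zero, so p == 0 when d > i
    return p


def F_s_element(n: int, m: int, k: int, d: int) -> Fraction:
    """Per-element false positive rate for an element with d distinct positions (Equation 14)."""
    return sum(B(n * k, m, i) * all_set_probability(i, m, d) for i in range(1, m + 1))


def F_s(n: int, m: int, k: int) -> Fraction:
    """Global exact false positive rate of a standard Bloom filter (Equation 10)."""
    return sum(B(n * k, m, i) * Fraction(i, m) ** k for i in range(1, m + 1))


if __name__ == "__main__":
    n, m, k = 11, 64, 4
    for c in range(k):
        d = k - c
        ratio = F_s_element(n, m, k, d) / F_s(n, m, k)
        print(f"collisions={c}: P={float(B(k, m, d)):.4f} ratio={float(ratio):.3f}")
```

Because the arithmetic is exact, the total-probability identity can be tested with equality rather than with a tolerance.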
For an m sized Bloom filter, the probability of the k hashes resulting in d different bits (i.e., $c = k - d$ collisions) is an instance of the balls into bins experiment, with value $B(k, m, d)$ as given by Equation 8.

Table 4 shows the probability of having some (one or more) hash collisions, and of having exactly $0 \le c \le 3$ collisions, for some combinations of k and m. It can be seen that collisions happen frequently not only in word sized filters (36% of elements for m = 64 and k = 8) but also for the important case of cache line sized (m = 512) blocks in blocked Bloom filters, reaching 21% for very high accuracy (k = 16) filters. Two collisions can happen with non-negligible frequency, in 5 percent of elements for the word sized filters with k = 8, or in two percent of elements in the (m = 512, k = 16) case. And while three collisions is indeed very rare, 3 in a thousand for the (m = 64, k = 8) filter or one in a thousand for the (m = 512, k = 16) filter, this is no consolation when those “unlucky” elements are subject to being tested against many filters.

One technique used to improve performance, by avoiding the need to compute k hash functions, is to resort to double hashing, which amounts to using two hash functions $\{h_1, h_2\}$ to simulate k hash functions. In the more naive form it amounts to computing $g_0, \ldots, g_{k-1}$ as:

$$g_i(x) = h_1(x) + i\,h_2(x) \bmod m$$

Figure 3: Effects of double hashing when inserting an element x in a standard (left) versus partitioned (right) Bloom filter, when $b = h_2(x)$ is 0, / , or / the size of the vector being indexed (filter or part).

The first time that double hashing was applied to Bloom filters seems to have been by Dillinger and Manolios [14], for model checking.
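The naive scheme is short enough to sketch directly. The Python fragment below (function names and the concrete hash values are ours, chosen only for illustration) generates the k indices for a standard filter and, under the assumption that each part of a partitioned filter is indexed by the same recurrence taken modulo the part size m/k, for a partitioned filter. It then counts how many distinct bits are touched for some values of the stride h2(x) that share factors with m.

```python
def standard_indices(h1: int, h2: int, k: int, m: int) -> list[int]:
    """Naive double hashing over the whole vector: g_i = h1 + i*h2 mod m."""
    return [(h1 + i * h2) % m for i in range(k)]


def partitioned_indices(h1: int, h2: int, k: int, m: int) -> list[int]:
    """Same recurrence, but index i is confined to part i (size m // k)."""
    part = m // k
    return [i * part + (h1 + i * h2) % part for i in range(k)]


if __name__ == "__main__":
    m, k, h1 = 512, 8, 7  # a cache-line sized block; h1 is an arbitrary value
    for h2 in (0, m // 2, m // 4, 5):
        std = len(set(standard_indices(h1, h2, k, m)))
        par = len(set(partitioned_indices(h1, h2, k, m)))
        print(f"h2 = {h2:3d}: standard touches {std} bits, partitioned touches {par}")

    # A "backward" element y generating the same index set as x in a standard filter.
    x = (10, 5)
    y = ((x[0] + (k - 1) * x[1]) % m, (m - x[1]) % m)
    print(set(standard_indices(*x, k, m)) == set(standard_indices(*y, k, m)))
    print(set(partitioned_indices(*x, k, m)) & set(partitioned_indices(*y, k, m)))
```

In the partitioned variant repeated index values are harmless, because the part number is implicit in the position of the index within the sequence.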
It was popularized after Mitzenmacher [21] showed that it could be used to implement a Bloom filter without any loss in the asymptotic false positive probability, and experimentally validated it for medium sized Bloom filters, starting with m = 10000 bits. However, small Bloom filters were not considered (e.g., a 512 bit block in a BBF) and, as usual, only the global false positive rate was considered.

Here we address small filters and the possibility of a non-uniform distribution of false positives, with weak spots in the domain. We show that indeed, standard Bloom filters, but not partitioned ones, are prone to even more problematic weak spots caused by the use of double hashing. Although more sophisticated variants, like enhanced double hashing or triple hashing, have been proposed, naive double hashing in particular has become relatively popular, and can be found in many Bloom filter implementations. Therefore, these issues have practical consequences.

Dillinger’s PhD dissertation [13], which includes a detailed study of different forms of double and triple hashing, already recognized the existence of pitfalls, especially in naive double hashing. It identified three issues, which we now show affect only standard, but not partitioned, Bloom filters.

Issue 1:
Some possibilities for $b = h_2(x)$ can result in many repetitions of the same index. The worst case would be if b = 0 (mod m), in which case all indices would be the same, but the existence of common factors between b and m also causes problems. Figure 3 shows some examples, with b = 0 , b = m/ and b = m/ . On the left, for standard Bloom filters, there is overwhelming index collision, which causes bit collisions, resulting in very weak spots. In a BBF with 512 bit blocks, one out of 512 elements in the domain will have a single bit set/tested, resulting in a disastrous 1/2 probability of them being tested as a false positive in filters at nominal capacity (1/2 fill ratio). Then, one out of 512 elements will have a 1/4 probability, and so on. For partitioned Bloom filters, index collisions do not cause bit collisions, resulting always in k bits being set/tested.

Figure 4: Full overlap between x and y when using double hashing in a standard Bloom filter, when $h_1(y) = h_1(x) + (k-1)\,h_2(x) \bmod m$ and $h_2(y) = m - h_2(x) \bmod m$ (left), and the lack of such overlap in a partitioned Bloom filter (right).

Figure 5: Partial overlap (yellow) between x (green) and y (blue) when using double hashing in a standard Bloom filter, when $h_2(x) = h_2(y) \bmod m$ (left), and the lack of such overlap in a partitioned Bloom filter (right).

Issue 2:
The indices generated by double hashing, used to index a standard Bloom filter, are treated as a set, not a sequence, and we can compute the same set going “forward” or going “backward”. Two elements x and y can have a full overlap of the k bits without both $h_1$ and $h_2$ colliding, if $h_1(y) = h_1(x) + (k-1)\,h_2(x) \bmod m$ and $h_2(y) = m - h_2(x) \bmod m$. For a partitioned Bloom filter, such overlap does not occur, as the different parts are indexed in order, and so we have effectively a sequence of indices. Figure 4 illustrates the full overlap between x and y, for a standard Bloom filter and the absence of overlap in a partitioned Bloom filter.

Issue 3:
Using double hashing in a standard Bloom filter is prone to partial overlapping of the k indices, namely when $h_2(x) = h_2(y) \bmod m$. This is illustrated in Figure 5. In the same figure, it can be seen that in partitioned Bloom filters such overlap does not occur.

Standard Bloom filters are thus subject to these anomalies, the more serious being the possibility of extreme weak spots, if naive double hashing is used. In theory, Issue 1 (which causes weak spots) is easy to overcome, by ensuring there are no collisions, e.g., in the popular case when m is a power of two by restricting $b = h_2(x)$ to produce odd numbers. In practice, implementers have been sold the idea that double hashing can be used harmlessly, and commonly do not take precautions, namely when the filter is parameterized, with m arbitrary and possibly small. This occurs even in mainstream libraries, such as in Google Core Libraries for Java [1]. Partitioned Bloom filters have the advantage of not being subject to such weak spots, and thus are robust to naive double hashing implementations.

It should be noted that if Issue 1 is addressed, the impact of double hashing on the global false positive rate is larger for partitioned Bloom filters than for the standard ones. This impact comes from the probability of the pair of indices for one element colliding with the pair from another element, i.e., $h_1(x) = h_1(y)$ and $h_2(x) = h_2(y)$ (modulo vector size). Between two elements it is $1/m^2$ for standard Bloom filters and $1/(m/k)^2$ for partitioned.

In practice, for large Bloom filters the contribution of double hashing to the global false positive rate is negligible, unless high accuracy filters are wanted, in which case care must be taken and triple hashing may be needed. For small filters, or in general when BBFs are used, neither double nor triple hashing should be used, as only a few bits per index are needed, and a single hash word can be split to obtain the k indices.
Concretely, in a BBF with 512 bit blocks and k = 8, we need 9 bits per index for standard and 6 bits per index for partitioned filters. This means that a partitioned scheme needs $8 \times 6 = 48$ bits per block and a single 64 bit hash word is enough for filters up to $2^{64-48} = 65536$ blocks, i.e., $2^{25} = 33554432$ bits, while if standard filters are used $8 \times 9 = 72$ bits per block are needed and a 64 bit hash word is not enough even for small filters. This reinforces the superiority of partitioned Bloom filters over standard ones.

Regardless of the false positive rate itself, the disjointness of the parts in a partitioned Bloom filter provides several advantages over standard filters, either in terms of obtaining fast implementations or making the partitioned scheme more flexible to be used in more scenarios, or as the base for further extensions. Each disjoint part can be sampled, extracted, added, or retired individually, leading to interesting outcomes. We conclude our case by surveying some of these usages and advantages.
In addition to improving memory accesses, through blocked Bloom filters, another way to improve performance is to use Single Instruction Multiple Data (SIMD) processor extensions, to test multiple bits in a single processor cycle. However, standard Bloom filters are not directly suitable to SIMD, because the k bits are spread over memory, needing an extra gather step to collect and place them appropriately, causing some slowdown.

A sophisticated SIMD approach [28] for standard Bloom filters uses precisely gather instructions to collect bits spread over memory. It achieves higher throughput, by testing different hashes of different elements at each step, but not lower latency of individual query operations.

Even using BBFs based on standard Bloom filter blocks is not directly suitable to SIMD, because the k bits are not placed over independent disjoint parts of the cache line (e.g., words) to be used together as a vector register. When introducing BBFs the authors already discussed SIMD usage, and to overcome this problem they propose using a table of k bit block-sized patterns. However, to avoid collisions between elements when indexing, the table cannot be too small, competing for cache usage.

Partitioned Bloom filters are more directly suitable to SIMD. A blocked Bloom filter using the partitioned scheme, with cache-line sized blocks and word sized parts, is perfect for SIMD, and arises as the natural combination of blocking and partitioning. This is precisely what Ultra-Fast Bloom Filters [22] have recently proposed. We may conjecture that, had partitioned Bloom filters been the norm at the time when BBFs were introduced, this combination would have appeared one decade earlier.
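The layout described above, cache-line sized blocks made of word sized parts, can be sketched in plain Python. This is only an illustration of the data layout under stated assumptions, not the design of [22]: the class name `PartitionedBlockedFilter`, the use of BLAKE2b as the hash, and the way the hash word is split are all hypothetical choices. In a real SIMD implementation the eight words of a block would sit in one 512-bit vector register and the or, and, and compare steps would each be a single vector operation; here they are ordinary loops.

```python
import hashlib

K = 8            # parts (words) per block, one index per part
WORD_BITS = 64   # size of each part
INDEX_BITS = 6   # log2(WORD_BITS): bits of hash consumed per index


class PartitionedBlockedFilter:
    """Blocked Bloom filter whose 512-bit blocks hold 8 disjoint 64-bit parts."""

    def __init__(self, num_blocks: int):
        # 64 - 8 * 6 = 16 hash bits remain to select the block.
        if not 1 <= num_blocks <= 1 << 16:
            raise ValueError("num_blocks must be between 1 and 65536")
        self.num_blocks = num_blocks
        self.blocks = [[0] * K for _ in range(num_blocks)]

    def _locate(self, item: bytes):
        digest = hashlib.blake2b(item, digest_size=8).digest()
        h = int.from_bytes(digest, "little")
        # One one-hot word per part: this list plays the role of the vector mask.
        mask = [1 << ((h >> (i * INDEX_BITS)) & (WORD_BITS - 1)) for i in range(K)]
        # Slightly biased if num_blocks is not a power of two.
        block = (h >> (K * INDEX_BITS)) % self.num_blocks
        return block, mask

    def add(self, item: bytes) -> None:
        block, mask = self._locate(item)
        words = self.blocks[block]
        for i in range(K):          # a single vector OR in a SIMD version
            words[i] |= mask[i]

    def __contains__(self, item: bytes) -> bool:
        block, mask = self._locate(item)
        # A single vector AND plus compare in a SIMD version.
        return all((w & m) == m for w, m in zip(self.blocks[block], mask))
```

Because each index lands in its own word, the mask has exactly one bit set per word and no gather step is needed to line bits up with vector lanes.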
Bloom filters can also be used for set union and intersection. Unlike union (bitwise or), which is exact, intersection of filters (bitwise and) over-represents the filter for the intersection: given sets $S_1$ and $S_2$, we have $F(S_1) \wedge F(S_2) \geq F(S_1 \cap S_2)$. In addition to testing for the presence of some element, an important use case is testing for set disjointness, i.e., that the intersection is an empty set. An example is checking whether two sets of addresses, representing a read-set and a write-set, are disjoint, when implementing transactional memory.

Using standard Bloom filters, being sure that the sets are disjoint is only possible when the resulting filter intersection is completely empty (all zeroes). Having fewer than k one bits is not enough, due to weak spots. As already noticed [19], even if the intersection result had a single bit it could be (even if extremely unlikely) due to an element, present in both sets, having the k hash functions collide.

Partitioned Bloom filters are much better suited for testing set disjointness, as it is enough that one of the k parts of the filter intersection is empty to conclude that the set intersection is empty. This was already exploited [9] for Speculative Multithreading. A comparison of set disjointness testing concluded [19] that the probability of false set-overlap reporting was substantially smaller for partitioned Bloom filters than standard Bloom filters. This probability, for standard ($P_s$) and partitioned ($P_p$) m sized filters with k hash functions, representing sets with $n_1$ and $n_2$ elements, compares as:

$$P_s = 1 - \left(1 - \frac{1}{m}\right)^{k^2 n_1 n_2} > 1 - \left(1 - \frac{k}{m}\right)^{n_1 n_2} > \left(1 - \left(1 - \frac{k}{m}\right)^{n_1 n_2}\right)^{k} = P_p.$$
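As a rough numeric check of this chain of inequalities, the three expressions can be evaluated directly, alongside the part-wise disjointness test itself. The function names and the sample parameters are assumptions chosen for illustration, and each part is represented as a Python integer used as a bit vector.

```python
def false_overlap_standard(m: int, k: int, n1: int, n2: int) -> float:
    """P_s: some of the (k*n1)*(k*n2) index pairs collide in an m-bit vector."""
    return 1 - (1 - 1 / m) ** (k * k * n1 * n2)


def false_overlap_one_part(m: int, k: int, n1: int, n2: int) -> float:
    """Middle term: some of the n1*n2 index pairs collide in one m/k-bit part."""
    return 1 - (1 - k / m) ** (n1 * n2)


def false_overlap_partitioned(m: int, k: int, n1: int, n2: int) -> float:
    """P_p: there is a collision in every one of the k parts."""
    return false_overlap_one_part(m, k, n1, n2) ** k


def provably_disjoint(parts_a, parts_b) -> bool:
    """True if at least one part of the bitwise AND is all zeroes."""
    return any((a & b) == 0 for a, b in zip(parts_a, parts_b))


if __name__ == "__main__":
    m, k, n1, n2 = 4096, 4, 16, 16   # illustrative values only
    print(false_overlap_standard(m, k, n1, n2))
    print(false_overlap_one_part(m, k, n1, n2))
    print(false_overlap_partitioned(m, k, n1, n2))
```

For a standard filter the analogous test would have to require the whole intersection to be zero, which is the stricter condition discussed above.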
This is intuitively easy to understand: the probability of a false set-overlap for a standard m sized filter, due to some of the $k n_1 \cdot k n_2$ pairs of indices colliding, is greater than the probability of such an overlap in a given m/k sized part for the partitioned scheme, which is substantially greater than the probability that there is an overlap in each of the k parts.

Sometimes it is useful to obtain a smaller sized, lower accuracy, version of a Bloom filter. Either because the filter was overdimensioned and we do not need the resulting overly high accuracy; or we want to obtain an explicitly lower accuracy view (but enough for some purpose), e.g., to ship over the network, wanting to save bandwidth.

A standard Bloom filter is not suitable for this purpose because of the mingling of bits from different hash functions. What can be done is to use the same k hashes, but remap the indices to a smaller $m'$ sized vector (preferably with m some multiple of $m'$), moving the bit in position i to i modulo $m'$, and using modulo $m'$ indexing for the new filter. The problem is that the resulting fill rate renders the filter, when not immediately useless, having an overly high false positive rate, when comparing with the optimal for the new smaller size and the same number of elements [27].

Partitioned Bloom filters are much better for this purpose. Due to the disjointness of the k parts, we can simply consider the first $k'$ parts as a smaller Bloom filter, e.g., to be shipped elsewhere. For the worst case of a filter already at full capacity, the new one will provide the optimal false positive rate for the new smaller size. Considerable size reductions are viable, which would render a standard Bloom filter useless due to the fill rate approaching 1. The same paper proposes Block-partitioned Bloom filters, composed of several blocks (each a standard filter, with insertions in each, and using AND for queries), to be able to extract some blocks as a new filter. It mentions that maximum size flexibility is achieved by using one hash per block, i.e., by using a partitioned Bloom filter.
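Taking the first $k'$ parts as a smaller filter can be sketched as below. This is a toy illustration under stated assumptions: the class name `PartitionedBloomFilter`, the method name `reduced`, and the per-part hash (BLAKE2b over the part number followed by the item) are hypothetical, and each part is held as a Python integer used as a bit vector.

```python
import hashlib


class PartitionedBloomFilter:
    """Toy partitioned Bloom filter: k disjoint parts, one hash per part."""

    def __init__(self, k: int, part_size: int, parts=None):
        if not 1 <= k <= 255:
            raise ValueError("k must be between 1 and 255")
        self.k = k
        self.part_size = part_size
        self.parts = list(parts) if parts is not None else [0] * k

    def _index(self, i: int, item: bytes) -> int:
        # Hash i depends only on the part number, so part i of the reduced
        # filter is indexed exactly as part i of the original.
        digest = hashlib.blake2b(bytes([i]) + item, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.part_size

    def add(self, item: bytes) -> None:
        for i in range(self.k):
            self.parts[i] |= 1 << self._index(i, item)

    def __contains__(self, item: bytes) -> bool:
        return all(
            (self.parts[i] >> self._index(i, item)) & 1 for i in range(self.k)
        )

    def reduced(self, k_prime: int) -> "PartitionedBloomFilter":
        """Return a smaller, lower accuracy filter made of the first k_prime parts."""
        if not 1 <= k_prime <= self.k:
            raise ValueError("k_prime must be between 1 and k")
        return PartitionedBloomFilter(k_prime, self.part_size, self.parts[:k_prime])
```

Since no bits are moved or merged, every element of the original is still reported as present by the reduced filter, and the fill rate of each retained part is unchanged.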
Bloom filter based approaches to achieve queries over a sliding window of an infinite stream tend to be space inefficient. Traditionally they have been based either on some variation of Counting Bloom filters [4], on storing the insertion timestamp [35], or using several disjoint segments which can be individually added and retired, one example being Double Buffering [10]. This uses a pair of active and warm-up Bloom filters, using the active for queries and inserting in both until the warm-up is half-full, at which point it becomes the active, the previous active is discarded and a new empty warm-up is added.

While with standard Bloom filters a segment must be a whole filter, partitioned Bloom filters can be used as a base for better designs, in which each disjoint part can be treated as a segment.
Age-Partitioned Bloom Filters [32] use k + l (for some configurable l) parts in a circular buffer, using the k most “recent” parts for insertions, discarding (zeroing) the “oldest” part after each generation (batch of insertions), and testing for the presence of k adjacent matches for queries. This results in the currently best Bloom filter based design for querying a sliding window over a stream.

Frequently, a focus on one small difference in a quantitative aspect misses the whole picture. Partitioned Bloom filters have thus been considered worse than standard, and frequently not adopted, due to having slightly more false positives. This is ironic given that the difference amounts to a negligible variation of capacity, for the same false positive rate.

In this paper we have shown how much simpler, more elegant, robust and versatile partitioned Bloom filters are. The simplicity of the exact formula results from the conceptual simplicity of them being essentially the AND of single-hash filters. Standard Bloom filters have a more complex nature due to the possibility of intra-element hash collisions, with a resulting complex exact formula, normally approximated, leading sometimes to surprises.

But essentially, we have shown how standard Bloom filters exhibit a non-uniform distribution of the false positive probability, with weak spots in the domain: elements that are reported much more frequently as false positives than expected. This is an aspect that has been neglected in the literature. Moreover, the issue of weak spots is much aggravated when naive double hashing is used. Even though easily circumventable, many libraries, including mainstream ones, suffer from this anomaly. The lesson seems to be that practitioners frequently skim over results, failing to notice subtle problems. Partitioned Bloom filters have a uniform distribution of false positives over the domain, with no weak spots, even if naive double hashing is used. Moreover, the need for fewer hash bits makes schemes such as double hashing less warranted.

Finally, going beyond set-membership test, by surveying other usages, the flexibility of being able to sample, extract, add or retire individual parts becomes clear, showing the partitioned scheme to be better. As the hardware community already did, partitioned Bloom filters should be widely adopted by software implementers, replacing standard Bloom filters as the new normal.
References

[1] Dimitris Andreou and Kurt Alfred Kluever. BloomFilterStrategies in Google Core Libraries for Java. https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilterStrategies.java, 2011 (accessed September 8, 2020).
[2] Pat’s Blog. Who created the birthday problem, and even one more version. https://pballew.blogspot.com/2011/01/who-created-birthday-problem-and-even.html, 2011 (accessed May 26, 2020).
[3] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[4] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, and George Varghese. An improved construction for counting Bloom filters. In Algorithms - ESA 2006, 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings, pages 684–695, 2006.
[5] Prosenjit Bose, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, and Yihui Tang. On the false-positive rate of Bloom filters. Information Processing Letters, 108(4):210–213, 2008.
[6] Alexander Breslow and Nuwan Jayasena. Morton filters: Faster, space-efficient cuckoo filters via biasing, compression, and decoupled logical sparsity. PVLDB, 11(9):1041–1055, 2018.
[7] Andrei Broder and Michael Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2004.
[8] W. W. Rouse Ball. Revised by H. S. M. Coxeter. Mathematical Recreations and Essays. Macmillan, 11th edition, 1939.
[9] Luis Ceze, James Tuck, Josep Torrellas, and Calin Cascaval. Bulk disambiguation of speculative threads in multiprocessors. ACM SIGARCH Computer Architecture News, 34(2):227–238, 2006.
[10] Francis Chang, Kang Li, and Wu-chang Feng. Approximate caches for packet classification. In Proceedings IEEE INFOCOM 2004, The 23rd Annual Joint Conference of the IEEE Computer and Communications Societies, Hong Kong, China, March 7-11, 2004, pages 2196–2207, 2004.
[11] Ken Christensen, Allen Roginsky, and Miguel Jimeno. A new analysis of the false positive rate of a Bloom filter. Information Processing Letters, 110(21):944–949, 2010.
[12] Sarang Dharmapurikar, Praveen Krishnamurthy, Todd Sproull, and John Lockwood. Deep packet inspection using parallel Bloom filters. In High Performance Interconnects, 2003. Proceedings. 11th Symposium on, pages 44–51. IEEE, 2003.
[13] Peter C. Dillinger. Adaptive approximate state storage. PhD thesis, Northeastern University, 2010.
[14] Peter C. Dillinger and Panagiotis Manolios. Fast and accurate bitstate verification for SPIN. In Susanne Graf and Laurent Mounier, editors, Model Checking Software, 11th International SPIN Workshop, Barcelona, Spain, April 1-3, 2004, Proceedings, volume 2989 of Lecture Notes in Computer Science, pages 57–75. Springer, 2004.
[15] Bin Fan, David G. Andersen, Michael Kaminsky, and Michael Mitzenmacher. Cuckoo filter: Practically better than Bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies, CoNEXT 2014, Sydney, Australia, December 2-5, 2014, pages 75–88, 2014.
[16] F Gerrish. 63.29. Surjections from an m-set to an n-set. The Mathematical Gazette, 63(426):259–261, 1979.
[17] Fabio Grandi. On the analysis of Bloom filters. Inf. Process. Lett., 129:35–39, 2018.
[18] O. Hölder. Ueber einen mittelwertsatz. Nachrichten von der Königl. Gesellschaft der Wissenschaften und der Georg-Augusts-Universität zu Göttingen, (2):38–47, 1889.
[19] Mark C. Jeffrey and J. Gregory Steffan. Understanding Bloom filter intersection for lazy address-set disambiguation. In Rajmohan Rajaraman and Friedhelm Meyer auf der Heide, editors, SPAA 2011: Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, CA, USA, June 4-6, 2011 (Co-located with FCRC 2011), pages 345–354. ACM, 2011.
[20] J. L. W. V. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30:175–193, 1906.
[21] Adam Kirsch and Michael Mitzenmacher. Less hashing, same performance: Building a better Bloom filter. Random Struct. Algorithms, 33(2):187–218, 2008.
[22] Jianyuan Lu, Ying Wan, Yang Li, Chuwen Zhang, Huichen Dai, Yi Wang, Gong Zhang, and Bin Liu. Ultra-fast Bloom filters using SIMD techniques. IEEE Trans. Parallel Distrib. Syst., 30(4):953–964, 2019.
[23] Udi Manber and Sun Wu. An algorithm for approximate membership checking with application to password security. Information Processing Letters, 50(4):191–197, 1994.
[24] Michael Mitzenmacher. Compressed Bloom filters. IEEE/ACM Transactions on Networking (TON), 10(5):604–612, 2002.
[25] James K. Mullin. A second look at Bloom filters. Communications of the ACM, 26(8):570–571, 1983.
[26] James K. Mullin. Accessing textual documents using compressed indexes of arrays of small Bloom filters. The Computer Journal, 30(4):343–348, 1987.
[27] Odysseas Papapetrou, Wolf Siberski, and Wolfgang Nejdl. Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed Parallel Databases, 28(2-3):119–156, 2010.
[28] Orestis Polychroniou and Kenneth A. Ross. Vectorized Bloom filters for advanced SIMD processors. In Alfons Kemper and Ippokratis Pandis, editors, Tenth International Workshop on Data Management on New Hardware, DaMoN 2014, Snowbird, UT, USA, June 23, 2014, pages 6:1–6:6. ACM, 2014.
[29] Felix Putze, Peter Sanders, and Johannes Singler. Cache-, hash-, and space-efficient Bloom filters. ACM Journal of Experimental Algorithmics, 14, 2009.
[30] Charles S. Roberts. Partial-match retrieval via the method of superimposed codes. Proceedings of the IEEE, 67(12):1624–1642, 1979.
[31] Daniel Sanchez, Luke Yen, Mark D Hill, and Karthikeyan Sankaralingam. Implementing signatures for transactional memory. In , pages 123–133. IEEE, 2007.
[32] Ariel Shtul, Carlos Baquero, and Paulo Sérgio Almeida. Age-partitioned Bloom filters. CoRR, abs/2001.03147, 2020.
[33] Sasu Tarkoma, Christian Esteve Rothenberg, and Eemil Lagerspetz. Theory and practice of Bloom filters for distributed systems. IEEE Communications Surveys & Tutorials, 14(1):131–155, 2012.
[34] Andrew Whitaker and David Wetherall. Forwarding without loops in Icarus. In Open Architectures and Network Programming Proceedings, 2002 IEEE, pages 63–75. IEEE, 2002.
[35] Linfeng Zhang and Yong Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In 28th IEEE International Conference on Distributed Computing Systems (ICDCS 2008), 17-20 June 2008, Beijing, China