A Genetic Algorithm for Obtaining Memory Constrained Near-Perfect Hashing
Dan Domnița∗†, Ciprian Oprișa∗†
∗Bitdefender
†Technical University of Cluj-Napoca
{ddomnita, coprisa}@bitdefender.com

Abstract—The problem of fast item retrieval from a fixed collection is often encountered in most computer science areas, from operating system components to databases and user interfaces. We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size. The standard open-addressing double-hashing approach is improved with a non-linear transformation that can be parametrized in order to ensure a uniform distribution of the data in the hash table. The optimal parameter is determined using a genetic algorithm. The paper results show that near-perfect hashing is faster than binary search, yet uses less memory than perfect hashing, being a good choice for memory-constrained applications where search time is also critical.
I. INTRODUCTION
The ability to quickly look up an element in a given collection is very important for various applications, from operating system components to databases or user interfaces. A handful of techniques were developed over time, each having advantages and disadvantages and performing better or worse under specific constraints. This paper addresses the problem of searching in a fixed collection that can be pre-processed offline, where the only permitted operations are searches (no insertion or deletion operations).

Linear search takes optimal space: by keeping the collection unordered, no extra data is required. This method takes O(n) time, since every time an element is searched for, the entire collection needs to be traversed. Binary search does better, by keeping the elements ordered and performing the search in O(log n) time. The space is also optimal, since no extra data is required. The technique takes advantage of the problem constraint that no insertion or deletion is allowed after the collection is built.

Hash tables have an average search time of O(1). However, due to hash collisions, the number of actual comparisons necessary for finding an element, or deciding that it is not present in the hash table, may vary. The basic idea of hash tables is to determine the position of each element through a hash function. Generally, hash functions are not guaranteed to be injective, meaning that hash collisions can occur. The collisions can be treated by chaining and open addressing [1]. For open addressing, the fill factor α is defined as the ratio between the number of elements in the hash table and the hash table size. The fill factor represents a trade-off between memory usage and search speed. It is proven in [1] that the average number of comparisons required for a search is 1/(1 − α). A large value of α will ensure efficient memory usage but will also increase the number of required comparisons.

The concept of perfect hashing has been introduced in [2], providing a data structure with worst-case O(1) look-up time. The approach is based on chaining rather than open addressing, and although the memory consumption is O(n), memory constraints may prohibit its usage.

This paper will present near-perfect hashing, a method that optimizes the search operation in an open addressing hash table by employing a genetic algorithm to select a hash function that minimizes the number of comparisons.

Security applications can benefit from fast searches in a fixed collection. The authors of [3] and [4] show how machine learning models can be optimized for malware detection. A recurring operation in both papers is the search in fixed collections. By reducing the running time for such operations, the overall algorithm can be improved.

The next section will discuss similar attempts to optimize the number of comparisons in hash table searches. The third section describes in detail the hash table search and the genetic algorithm used for selecting the best hash function. Section IV presents a new method to compute the average number of comparisons for a given fill factor. The experimental results in section V show that near-perfect hashing is a compromise between perfect hashing, which provides speed but has a larger memory footprint, and binary search, which has optimal memory usage but a larger running time. The last section presents the conclusions and future work.

II. RELATED WORK
Czech, Havas, and Majewski showed that a function for order preserving minimal perfect hashing can be found [5]. Their work is based on random graphs for generating order preserving minimal perfect hash functions. The hash function contains multiple hash functions, some of which are universal hash functions. The solution is both time and space optimal. We use a simpler hash function, but we lose precision. In a 1997 paper, Czech, Havas, and Majewski further developed the theory of perfect hashing and proved some lower and upper bounds for minimal perfect hashing [6].

Botelho, Pagh and Ziviani found an algorithm that constructs near-perfect hash structures in practical time [7]. Special focus has been accorded to the space that the structure requires, their solution providing near-optimal space.

Limasset, Rizk, Chikhi and Peterlongo offer an algorithm for finding minimal perfect hash functions, which is space-efficient and collision-free on static sets [8]. The hash table is represented as a bitmap. They map the initial set of keys to a bitmap: if a key is mapped without a collision, its position is marked with 1, otherwise with 0. A new set is formed with all the keys that collided at the previous step. The new set is used to create a new bitmap using a new hash function, and so on, until every key is mapped. The hash table is the concatenation of the bitmaps. This method is best used if we only want to know whether a key is in the hash table. If we want to store additional information with the key, this method becomes space inefficient.

Botelho, Brandão and Ziviani used Bloom filters to store data [9]. The dispersion of data inside the Bloom filter is made by using perfect hashing. Their data structure is built in linear time and uses near-optimal space.
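For intuition, the cascade of bitmaps from [8] can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the authors' implementation; the hash_level helper, the per-level seeding and the load parameter gamma are assumptions made for the example (keys are assumed distinct).

    import hashlib

    def hash_level(key: bytes, level: int, size: int) -> int:
        # A fresh hash function per level, derived by salting with the
        # level number (illustrative choice, not the paper's hash family).
        digest = hashlib.blake2b(key, salt=level.to_bytes(16, "little")).digest()
        return int.from_bytes(digest[:8], "little") % size

    def build_bitmap_cascade(keys: list, gamma: float = 1.0) -> list:
        """Build the concatenated-bitmap structure: at each level, keys that
        land alone on a position are marked with 1; colliding keys retry at
        the next level with a new hash function."""
        levels = []
        remaining = keys
        level = 0
        while remaining:
            size = max(1, int(gamma * len(remaining)))
            counts = [0] * size
            for key in remaining:
                counts[hash_level(key, level, size)] += 1
            levels.append([1 if c == 1 else 0 for c in counts])
            # Keys that collided (position hit 2+ times) move to the next level.
            remaining = [k for k in remaining
                         if counts[hash_level(k, level, size)] > 1]
            level += 1
        return levels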
III. ALGORITHM DESCRIPTION
A. The probing function
Near-perfect hashing uses the open addressing principle, where the position of an element x in the hash table is given by a probing function that also takes as input the attempt number. If the computed position is occupied by a different element, the attempt number is increased and the position is re-calculated, until the searched element is found or a free position is encountered. The probing function is based on double hashing [10], [11], a technique that approximates uniform open addressing and proves successful in avoiding the clustering effect.

Our probing function is a modified version of the original one and is presented in Equation 1 (the operator ⊗ denotes bitwise XOR). This equation computes the position where we will attempt to insert/search the element x at attempt att; h_1 and h_2 are regular hash functions, used for the double hashing technique.

P_k(x, att) = ((h_1(x) ⊗ k) + (h_2(x) ⊗ k) · att) mod N    (1)

Equation 1 extends the double hashing probing by performing the bitwise XOR operation between the results of the two hash functions h_1 and h_2 and a constant k. Different values for the constant k will lead to different element distributions in the hash table, some of them being closer to the uniform distribution than others.

The goal of the genetic algorithm described in subsection III-C is to find the value of k that optimizes the fitness function described in subsection III-B.
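A minimal Python sketch of the probing and search logic described above. The concrete hash functions h_1 and h_2 are stand-ins, since the paper does not fix a particular pair; any two independent hash functions would do.

    def h1(x: int) -> int:
        # Stand-in hash function (the paper does not specify h1 and h2).
        return (x * 2654435761) & 0xFFFFFFFF

    def h2(x: int) -> int:
        # Second, independent stand-in hash for the double hashing step.
        return ((x * 0x9E3779B1) ^ (x >> 16)) & 0xFFFFFFFF

    def probe(x: int, att: int, k: int, n: int) -> int:
        # Equation 1: position of element x at attempt att, for XOR key k.
        return ((h1(x) ^ k) + (h2(x) ^ k) * att) % n

    def search(x: int, table: list, k: int):
        """Open-addressing lookup; returns (found, number of comparisons)."""
        n = len(table)
        for att in range(n):
            pos = probe(x, att, k, n)
            if table[pos] is None:        # free slot: x is not in the table
                return False, att + 1
            if table[pos] == x:           # occupied by x itself
                return True, att + 1
        return False, n                   # all positions probed, x absent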
B. The fitness function

The fitness function will measure the quality of a given solution. For a hash table, we are interested in the number of comparisons performed by the algorithm until it finds the searched element or until it decides that it is not present in the hash table. This number of comparisons can be evaluated in terms of its average case or worst case value. A constant λ ∈ (0, 1) will ensure a trade-off between the two cases, as in Equation 2.

F(k) = λ · AVG-COMP(k) + (1 − λ) · WORST-COMP(k)    (2)
Algorithm 1 COMPUTE-FITNESS(k, keySet, α)
Require: the XOR key k, a set of keys to test on keySet and a fill factor α
Ensure: the fitness for the given XOR key k
 1: table ← BUILD-HASH-TABLE(keySet.toInsert, k, α)
 2: totalComp, maxComp ← 0, 0
 3: for key ∈ keySet.toSearch do
 4:   nrComp ← SEARCH-COMPARISONS(key, table)
 5:   totalComp ← totalComp + nrComp
 6:   if nrComp > maxComp then
 7:     maxComp ← nrComp
 8:   end if
 9: end for
10: return λ · totalComp / |keySet.toSearch| + (1 − λ) · maxComp

Algorithm 1 describes how this fitness function is computed. The input keySet has two fields: keySet.toInsert, whose keys will be inserted in the hash table, and keySet.toSearch, whose keys will be searched. The set of keys to be searched contains both elements that should be found and elements that should not be found.

First of all, the hash table is built at line 1. The next line initializes both the total number of comparisons and the maximum number of comparisons to 0. The for loop at lines 3-9 searches each key from keySet.toSearch in the hash table and computes the number of comparisons. This number is added to the total and replaces the maximum, if greater. The last line of the algorithm returns the fitness value, computed as in Equation 2.

The algorithm complexity depends on the size of keySet and on the fill factor α. If we consider both the insert and the search operations to have the complexity O(1/(1 − α)), then the total algorithm complexity is O(|keySet| · 1/(1 − α)).
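A direct Python transcription of Algorithm 1, reusing the probe/search sketch above. The table sizing derived from α, the insertion loop, and the value of the λ constant are assumptions spelled out for the example; the paper leaves them at pseudocode level.

    LAMBDA = 0.5  # assumed trade-off between average and worst case (Equation 2)

    def build_hash_table(keys: list, k: int, alpha: float) -> list:
        # Table size chosen so that len(keys) / size ≈ alpha (fill factor).
        size = max(len(keys) + 1, int(round(len(keys) / alpha)))
        table = [None] * size
        for x in keys:
            for att in range(size):
                pos = probe(x, att, k, size)
                if table[pos] is None or table[pos] == x:
                    table[pos] = x
                    break
            # A well-behaved probe sequence is assumed to reach a free slot.
        return table

    def compute_fitness(k: int, to_insert: list, to_search: list,
                        alpha: float) -> float:
        """Algorithm 1: weighted average/worst number of search comparisons."""
        table = build_hash_table(to_insert, k, alpha)
        total_comp, max_comp = 0, 0
        for key in to_search:
            _, nr_comp = search(key, table, k)
            total_comp += nr_comp
            max_comp = max(max_comp, nr_comp)
        return LAMBDA * total_comp / len(to_search) + (1 - LAMBDA) * max_comp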
C. Genetic algorithm description

A genetic algorithm is a metaheuristic inspired by natural selection [12]. Genetic algorithms are used to probe a sample space that is too big to search exhaustively, but where any data point can be accessed at any time.

We will use a genetic algorithm to find the best k to be used in the hash function presented in Equation 1. The idea behind the XOR operation with the number k is to minimize the number of collisions as much as possible. We try to minimize the number of collisions between the data inside the static dataset, and also the number of collisions between the data in the dataset and data not in the dataset. We do this because we are trying to minimize the number of comparisons needed for both a successful search and an unsuccessful search.
Algorithm 2 GEN-ALG(keySet, α)
Require: a set of keys to test on keySet and a fill factor α
Ensure: the best XOR key k to use in the hash function
 1: pop ← ⋃_{i=1}^{PSIZE} {RAND()}
 2: genNr, lastImprove, maxFitness ← 0, 0, 0
 3: while genNr < θ1 and genNr − lastImprove < θ2 do
 4:   genNr ← genNr + 1
 5:   for i = 1 → |pop| do
 6:     fitness[i] ← COMPUTE-FITNESS(pop[i], keySet, α)
 7:   end for
 8:   if max(fitness) > maxFitness then
 9:     maxFitness ← max(fitness)
10:    lastImprove ← genNr
11:  end if
12:  newPop ← SELECT-TOP(pop, fitness, ELITE_SIZE)
13:  while |newPop| < PSIZE do
14:    k1, k2 ← ROULETTE-SELECT(pop, fitness)
15:    k'1, k'2 ← CROSSOVER(k1, k2)
16:    newPop ← newPop ∪ {k'1, k'2}
17:  end while
18:  for i = ELITE_SIZE + 1 → |newPop| do
19:    newPop[i] ← MUTATE(newPop[i])
20:  end for
21:  pop ← newPop
22: end while
23: return pop[arg max_{1≤i≤|pop|} fitness[i]]

The genetic algorithm starts with a population of
PSIZE sample points (called individuals), the first generation (line 1). It will run until a certain condition is met (e.g. a specific number of generations has passed since the algorithm started, or there has been a certain number of generations in which the maximum fitness did not change). The population size PSIZE is fixed, set at the algorithm start.

Every individual in the population will be evaluated in order to compute the fitness value (line 6). In order to compute the fitness function we need the average number of comparisons and the maximum number of comparisons needed for searching in the hash table, as detailed in the previous subsection.

The next step for the genetic algorithm is to select the individuals for the next generation. There are many strategies for selection, such as roulette wheel selection, elitism and tournament. A more detailed explanation can be found in [13] by Shukla, Pandey and Mehrotra.

The top ELITE_SIZE individuals ranked by fitness will automatically survive to the next generation (line 12). This strategy, called elitism, will ensure that the most fit individuals will also be found in the next generation, so the overall largest fitness will never decrease.

The rest of the individuals for the next generation are obtained by applying the crossover operator on individuals selected by the roulette wheel strategy (lines 13-17). For this strategy, every individual has a probability of being selected equal to its fitness value divided by the generation's total fitness.

The crossover operator is a binary operator that works on the binary representation of the individuals. In a generic context, there is a predetermined number of crossover points, and for each crossover point its location in the binary representation is established. Using these crossover points, the binary representation is "cut" into multiple segments. The resulting segments are mixed, producing two new individuals.

The binary representation of our individuals is a number represented on 32 bits. We chose a single crossover point, splitting the individual into two 16-bit numbers. The numbers containing the less significant information from the two individuals are swapped.

If a genetic algorithm is implemented only with this information and these strategies, the algorithm is likely to get stuck in a local optimum. To prevent that from happening, a new operator is added. The mutation operator is used to randomly flip bits of an individual. Not every individual is mutated: the probability of mutation is best kept between 5% and 10%, as shown by Haupt in [14]. After the probability of mutation is determined, we compute the number of bits to be flipped, then randomly choose those bits and flip them. This operator is applied at line 19.

The number of iterations performed by the algorithm is determined by two constants, θ1 and θ2. The first constant limits the total number of iterations, while the second one limits the number of iterations that the algorithm performs without improving the best solution so far.

The algorithm ends by returning the individual with the highest fitness from the last computed generation.
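A compact sketch of the operators described above: 32-bit individuals, a single crossover point at the 16-bit boundary, roulette wheel selection, and bit-flip mutation. The 7% mutation rate and the population constants are assumptions for illustration; also, since Equation 2 yields a cost (smaller is better) while Algorithm 2 selects for maximum fitness, the weights passed to roulette_select are assumed to be an increasing transform of solution quality, e.g. the reciprocal of F(k).

    import random

    PSIZE, ELITE_SIZE = 50, 4
    MUTATION_RATE = 0.07          # within the 5-10% range suggested in [14]

    def roulette_select(pop: list, fitness: list):
        # Two parents, each drawn with probability proportional to its
        # share of the total fitness.
        return random.choices(pop, weights=fitness, k=2)

    def crossover(k1: int, k2: int):
        # Single crossover point: swap the lower 16 bits of the two parents.
        hi1, hi2 = k1 & 0xFFFF0000, k2 & 0xFFFF0000
        lo1, lo2 = k1 & 0x0000FFFF, k2 & 0x0000FFFF
        return hi1 | lo2, hi2 | lo1

    def mutate(k: int) -> int:
        # Flip each of the 32 bits independently with probability MUTATION_RATE.
        for bit in range(32):
            if random.random() < MUTATION_RATE:
                k ^= 1 << bit
        return k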
IV. THEORETICAL NUMBER OF COMPARISONS

This section will present an alternative proof, different from the one described in [1], for the fact that the average number of comparisons for a hash table with open addressing and fill factor α is 1/(1 − α).

The hash table can be abstracted as a sequence of bits, the probability for a bit to be 1 being equal to the fill factor α, while the probability for a 0 bit is 1 − α. A search for a given key starts from the position given by the hash function and continues as long as we encounter 1 bits (they correspond to occupied positions), until the element is found or a 0 bit is encountered.

If we encounter the sequence 0, one comparison is needed. If we encounter a sequence of k bits of 1 followed by a 0 bit, we will require k + 1 comparisons. Since the double hashing ensures a uniform distribution, we can assume each bit is independent. In this case, the probability to encounter such a sequence is given by Equation 3.

P(\underbrace{11\ldots 1}_{k\text{ bits of }1}\,0) = \underbrace{\alpha \cdot \alpha \cdots \alpha}_{k\text{ times}} \cdot (1-\alpha) = \alpha^k (1-\alpha) \qquad (3)

The expected number of comparisons will be obtained by summing the lengths of the sequences multiplied by their probabilities.

E = \sum_{k=0}^{N-1} (k+1) \cdot P(\underbrace{11\ldots 1}_{k\text{ bits of }1}\,0)
  = \sum_{k=0}^{N-1} (k+1) \cdot \alpha^k (1-\alpha)
  = (1-\alpha) \sum_{k=0}^{N-1} (k+1) \cdot \alpha^k

The sum above can be computed using the derivation trick. We will consider the function f_k(x) = x^{k+1}. Its derivative is f'_k(x) = (k+1) \cdot x^k. Since the sum of the derivatives equals the derivative of the sum, the expression above becomes:

E = (1-\alpha) \sum_{k=0}^{N-1} f'_k(\alpha)
  = (1-\alpha) \Big( \sum_{k=0}^{N-1} f_k(\alpha) \Big)'
  = (1-\alpha) \Big( \sum_{k=0}^{N-1} \alpha^{k+1} \Big)'
  = (1-\alpha) \Big( \frac{\alpha^{N+1} - \alpha}{\alpha - 1} \Big)'
  = (1-\alpha) \cdot \frac{\big((N+1)\alpha^N - 1\big)(\alpha - 1) - \big(\alpha^{N+1} - \alpha\big)}{(\alpha - 1)^2}
  = \frac{N\alpha^{N+1} - (N+1)\alpha^N + 1}{1-\alpha}

Since α < 1 and N is a large number, α^N ≈ 0. This means that the expected number of comparisons becomes:

E \approx \frac{1}{1-\alpha} \qquad (4)
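The closed form is easy to sanity-check numerically. The short simulation below fills i.i.d. bits at fill factor α and counts probes until the first 0; it is an illustrative check under the independence assumption above, not part of the paper's experiments.

    import random

    def expected_comparisons(alpha: float, trials: int = 200_000) -> float:
        """Monte Carlo estimate of E: probe i.i.d. bits that are 1 with
        probability alpha and count probes until the first 0."""
        total = 0
        for _ in range(trials):
            comparisons = 1
            while random.random() < alpha:   # position occupied: keep probing
                comparisons += 1
            total += comparisons
        return total / trials

    # E.g. alpha = 0.5 yields an estimate near 2.0, matching Equation 4.
    for alpha in (0.1, 0.5, 0.9):
        print(alpha, expected_comparisons(alpha), 1 / (1 - alpha))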
V. EXPERIMENTAL RESULTS

A. Evaluating the number of comparisons against theoretical expectation
As proved in section IV, the expected number of comparisons for searching an element in a hash table with fill factor α is 1/(1 − α). The first experiment presented in this section will show that a hash function carefully chosen using the genetic algorithm outperforms this expectation.

Figure 1 plots the average number of comparisons against the fill factor α, from the values in Table I. The experimental results are the average numbers of comparisons measured in our experiments. The theoretical expectation is computed depending on the fill factor, as in Equation 4, while the speedup represents the difference between the expected and measured values, as a percentage of the expected value.

Fig. 1. Average number of comparisons by fill factor (experimental result vs. theoretical expectation)
TABLE I
EXPERIMENTAL VS THEORETICAL NUMBER OF COMPARISONS

α     1/(1 − α)   Experimental   Speedup
0.1   1.11        1.08            2.8%
0.2   1.25        1.19            4.8%
0.3   1.43        1.32            7.6%
0.4   1.67        1.50           10.0%
0.5   2.00        1.73           13.5%
0.6   2.50        2.07           17.2%
0.7   3.33        2.60           22.0%
0.8   5.00        3.60           28.0%
0.9  10.00        6.27           37.3%
As Figure 1 and Table I show, by applying the genetic algorithm in order to select the hash function, we obtain better results, with greater speedups for greater fill factors. For instance, if the fill factor is 0.5, our hash table will require 13.5% fewer comparisons.

B. Comparison with binary search
The previous subsection showed that by carefully selecting the hash function, using a genetic algorithm, we can obtain better performance than the theoretical expectation. In this subsection we will compare our results with those obtained with binary search, in order to choose the right fill factor.

Figure 2 shows the average number of comparisons performed by the near-perfect hashing algorithm to find an element in the hash table, for various fill factors. As expected, the number of comparisons increases with the fill factor but remains relatively constant as the number of elements increases. The plot also contains the average number of comparisons performed by the binary search algorithm, which is greater than the number of comparisons for near-perfect hashing, even for a fill factor α = 0.9.

Fig. 2. Average number of comparisons by input size (near-perfect hashing at several fill factors vs. binary search)
Fig. 3. Worst number of comparisons by input size (near-perfect hashing at several fill factors vs. binary search)
Although the average case is the most important in practice, there are situations when we are interested in the worst-case scenario, so we also plotted the worst number of comparisons in Figure 3. The figure shows that for fill factors up to α = 0.5, the worst number of comparisons for near-perfect hashing is still smaller than the worst number of comparisons for binary search. For larger fill factors, binary search is better than near-perfect hashing in the worst-case scenario.

C. Comparison with perfect hashing
The previous subsection showed that our method is faster than binary search, even in the worst-case scenario, if we use a fill factor α = 0.5. Such a fill factor means that we use twice as much memory as the most compact representation of the dataset (the one used by binary search). This subsection will show that even if we do not match the performance of perfect hashing, we use less memory.

According to [2] and [1], a perfect hash table is an array of pointers, of the same size or greater than the number of elements n, each pointer pointing to a secondary array whose size is the number of collisions at that position, squared. Using the assumptions that a pointer occupies the same size as an element in the hash table, that the hash table has size n, and that position i stores c_i elements, the total size (in number of elements, not in bytes) of the hash table is given by Equation 5.

size_{ph}(n) = n + \sum_{i=0}^{n-1} c_i^2 \qquad (5)
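Equation 5 is straightforward to compute from the first-level bucket counts. A small sketch, assuming a simple first-level hash (Python's built-in hash is used here purely as a placeholder):

    def perfect_hash_table_size(keys: list, first_level_hash=hash) -> int:
        """Equation 5: n pointers plus one quadratic secondary array per bucket."""
        n = len(keys)
        counts = [0] * n
        for key in keys:
            counts[first_level_hash(key) % n] += 1   # c_i: keys landing in slot i
        return n + sum(c * c for c in counts)

With a suitably chosen first-level hash, the expected value of the quadratic term is linear in n, which is why the overall memory of perfect hashing stays O(n) [2].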
The experimental comparison between the table size for perfect hashing, binary search and near-perfect hashing is depicted in Figure 4.

Fig. 4. Table size by number of elements (perfect hashing, near-perfect hashing with α = 0.5, binary search)
As expected, the binary search approach takes the least amount of memory. A near-perfect hash table constructed by the technique described in this paper, with a fill factor α = 0.5, takes twice as much memory, as half the positions in the hash table are unoccupied. The experiments showed that for perfect hashing, the amount of memory used is about 3 times as much as for binary search, and 50% more than the amount for near-perfect hashing.

The number of comparisons of the perfect hash method is constant: usually one on the first level and one on the second level, but this may vary depending on the hash function. The hash function tends to be more complicated than ours, especially on very large sets, so more time is spent to find the hash value.
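For contrast with the probing loop of Equation 1, a two-level (FKS-style [2]) lookup touches exactly one first-level slot and one second-level slot. The sketch below is schematic: the per-bucket hash parameters are assumptions, chosen offline so that each secondary array of c_i² slots is collision-free.

    def perfect_hash_lookup(x, first_level, buckets) -> bool:
        """Two-level lookup: one comparison at each level.

        first_level(x) selects a bucket; each bucket carries its own
        collision-free hash (seed chosen offline) over c_i**2 slots."""
        bucket = buckets[first_level(x)]
        if bucket is None:
            return False
        slots, bucket_hash = bucket          # secondary array and its hash
        return slots[bucket_hash(x) % len(slots)] == x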
VI. CONCLUSION

This paper described the concept of near-perfect hashing, used for searching in a fixed collection faster than using binary search and with a smaller memory footprint than perfect hashing.

The presented approach modifies the double hashing probing by adding a parameter k that affects the function in a non-linear way. A genetic algorithm that determines the best value for k, given the fixed collection, is presented.

The experimental results compare the performance of near-perfect hashing with regular hashing, binary search and perfect hashing. Our approach is faster than regular hashing, as the number of comparisons in the search function is lower, while the memory usage is the same. Compared with the binary search technique, near-perfect hashing is faster in the average case, even for large fill factors like 0.9. In worst-case terms, a fill factor of 0.5 ensures that near-perfect hashing is still faster. Compared to perfect hashing, the number of comparisons is greater, but the memory footprint is smaller, perfect hashing requiring about 50% more memory.

The presented technique can be used for solving various problems where fast data retrieval in a fixed collection is necessary.

ACKNOWLEDGMENT
Research supported, in part, by EC H2020 SMESEC GA
REFERENCES

[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, Cambridge, 2001, vol. 2.
[2] M. L. Fredman, J. Komlós, and E. Szemerédi, "Storing a sparse table with O(1) worst case access time," Journal of the ACM (JACM), vol. 31, no. 3, pp. 538–544, 1984.
[3] D. Gavrilut, R. Benchea, and C. Vatamanu, "Optimized zero false positives perceptron training for malware detection," in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on. IEEE, 2012, pp. 247–253.
[4] ——, "Practical optimizations for perceptron algorithms in large malware dataset," in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on. IEEE, 2012, pp. 240–246.
[5] Z. J. Czech, G. Havas, and B. S. Majewski, "An optimal algorithm for generating minimal perfect hash functions," Information Processing Letters, vol. 43, no. 5, pp. 257–264, 1992.
[6] ——, "Perfect hashing," Theoretical Computer Science, vol. 182, no. 1-2, pp. 1–143, 1997.
[7] F. C. Botelho, R. Pagh, and N. Ziviani, "Practical perfect hashing in nearly optimal space," Information Systems, vol. 38, no. 1, pp. 108–131, 2013.
[8] A. Limasset, G. Rizk, R. Chikhi, and P. Peterlongo, "Fast and scalable minimal perfect hashing for massive key sets," arXiv preprint arXiv:1702.03154, 2017.
[9] F. C. Botelho, W. C. Brandão, and N. Ziviani, "Minimal perfect hashing and Bloom filters made practical," in Proceedings of the IADIS International Conference Applied Computing, 2011, pp. 465–470.
[10] L. J. Guibas and E. Szemeredi, "The analysis of double hashing," Journal of Computer and System Sciences, vol. 16, no. 2, pp. 226–274, 1978.
[11] G. S. Lueker and M. Molodowitch, "More analysis of double hashing," Combinatorica, vol. 13, no. 1, pp. 83–96, 1993.
[12] D. Whitley, "A genetic algorithm tutorial," Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.
[13] A. Shukla, H. M. Pandey, and D. Mehrotra, "Comparative review of selection techniques in genetic algorithm," in Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), 2015 International Conference on. IEEE, 2015, pp. 515–519.
[14] R. L. Haupt, "Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors," in Antennas and Propagation Society International Symposium, 2000. IEEE, vol. 2, 2000, pp. 1034–1037.