A Genetic Algorithm for Obtaining Memory Constrained Near-Perfect Hashing
Dan Domnița∗†, Ciprian Oprișa∗†
∗Bitdefender
†Technical University of Cluj-Napoca
{ddomnita, coprisa}@bitdefender.com

Abstract—The problem of fast item retrieval from a fixed collection is often encountered in most computer science areas, from operating system components to databases and user interfaces. We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size. The standard open-addressing double-hashing approach is improved with a non-linear transformation that can be parametrized in order to ensure a uniform distribution of the data in the hash table. The optimal parameter is determined using a genetic algorithm. The paper results show that near-perfect hashing is faster than binary search, yet uses less memory than perfect hashing, being a good choice for memory-constrained applications where search time is also critical.
I. INTRODUCTION
The ability to quickly look up an element in a given collection is very important for various applications, from operating system components to databases or user interfaces. A handful of techniques were developed over time, each having advantages and disadvantages and performing better or worse under specific constraints. This paper addresses the problem of searching in a fixed collection that can be pre-processed offline, where the only permitted operations are searches (no insertion or deletion operations).

Linear search takes optimal space: by keeping the collection unordered, no extra data is required. This method takes O(n) time, since every time an element is searched for, the entire collection needs to be traversed. Binary search does better, by keeping the elements ordered and performing the search in O(log n) time. The space is also optimal, since no extra data is required. The technique takes advantage of the problem constraint that no insertion or deletion is allowed after the collection is built.

Hash tables have an average search time of O(1). However, due to hash collisions, the number of actual comparisons necessary for finding an element, or deciding that it is not present in the hash table, may vary. The basic idea of hash tables is to determine the position of each element through a hash function. Generally, hash functions are not guaranteed to be injective, meaning that hash collisions can occur. The collisions can be treated by chaining and open addressing [1]. For open addressing, the fill factor α is defined as the ratio between the number of elements in the hash table and the hash table size. The fill factor represents a trade-off between memory usage and search speed. It is proven in [1] that the average number of comparisons required for a search is 1/(1 − α). A large value of α will ensure efficient memory usage but will also increase the number of required comparisons.

The concept of perfect hashing has been introduced in [2], providing a data structure with worst-case O(1) look-up time. The approach is based on chaining rather than open addressing, and although the memory consumption is O(n), memory constraints may prohibit its usage.

This paper will present near-perfect hashing, a method that optimizes the search operation in an open addressing hash table by employing a genetic algorithm to select a hash function that minimizes the number of comparisons.

Security applications can benefit from fast searches in a fixed collection. The authors of [3] and [4] show how machine learning models can be optimized for malware detection. A recurring operation in both papers is the search in fixed collections. By reducing the running time for such operations, the overall algorithm can be improved.

The next section will discuss similar attempts to optimize the number of comparisons in hash table searches. The third section describes in detail the hash table search and the genetic algorithm used for selecting the best hash function. Section IV presents a new method to compute the average number of comparisons for a given fill factor. The experimental results in section V show that near-perfect hashing is a compromise between perfect hashing, which provides speed but has a larger memory footprint, and binary search, which has optimal memory usage but a larger running time. The last section presents the conclusions and future work.

II. RELATED WORK
Czech, Havas, and Majewski showed that a function for order preserving minimal perfect hashing can be found [5]. Their work is based on random graphs for generating order preserving minimal perfect hash functions. The hash function contains multiple hash functions, some of which are universal hash functions. The solution is both time and space optimal. We use a simpler hash function, but we lose precision. In a 1997 paper, Czech, Havas, and Majewski further developed the theory of perfect hashing and proved some lower and upper bounds for minimal perfect hashing [6].

Botelho, Pagh and Ziviani found an algorithm that constructs near-perfect hash structures in practical time [7]. Special focus has been accorded to the space that the structure requires, their solution providing near-optimal space.

Limasset, Rizk, Chikhi and Peterlongo offer an algorithm for finding minimal perfect hash functions, which is space-efficient and collision-free on static sets [8]. The hash table is represented as a bitmap. They map the initial set of keys to a bitmap: if a key is mapped without a collision, its position is marked with 1, otherwise with 0. A new set is formed with all the keys that collided at the previous step. The new set is used to create a new bitmap using a new hash function, and so on, until every key is mapped. The hash table is the concatenation of the bitmaps. This method is best used if we only want to know whether a key is in the hash table. If we want to store additional information with the key, this method becomes space inefficient.

Botelho, Brandão and Ziviani used Bloom filters to store data [9]. The dispersion of data inside the Bloom filter is made by using perfect hashing. Their data structure is built in linear time and uses near-optimal space.
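For intuition, the cascade of bitmaps from [8] can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the authors' implementation; the hash_level helper, the per-level seeding and the load parameter gamma are assumptions made for the example (keys are assumed distinct).

    import hashlib

    def hash_level(key: bytes, level: int, size: int) -> int:
        # A fresh hash function per level, derived by salting with the
        # level number (illustrative choice, not the paper's hash family).
        digest = hashlib.blake2b(key, salt=level.to_bytes(16, "little")).digest()
        return int.from_bytes(digest[:8], "little") % size

    def build_bitmap_cascade(keys: list, gamma: float = 1.0) -> list:
        """Build the concatenated-bitmap structure: at each level, keys that
        land alone on a position are marked with 1; colliding keys retry at
        the next level with a new hash function."""
        levels = []
        remaining = keys
        level = 0
        while remaining:
            size = max(1, int(gamma * len(remaining)))
            counts = [0] * size
            for key in remaining:
                counts[hash_level(key, level, size)] += 1
            levels.append([1 if c == 1 else 0 for c in counts])
            # Keys that collided (position hit 2+ times) move to the next level.
            remaining = [k for k in remaining
                         if counts[hash_level(k, level, size)] > 1]
            level += 1
        return levels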
III. ALGORITHM DESCRIPTION
A. The probing function
Near-perfect hashing uses the open addressing principle, where the position of an element x in the hash table is given by a probing function that also takes as input the attempt number. If the computed position is occupied by a different element, the attempt number is increased and the position is re-calculated, until the searched element is found or a free position is encountered. The probing function is based on double hashing [10], [11], a technique that approximates uniform open addressing and proves successful in avoiding the clustering effect.

Our probing function is a modified version of the original one and is presented in Equation 1 (the operator ⊗ denotes bitwise XOR). This equation computes the position where we will attempt to insert/search the element x at attempt att; h_1 and h_2 are regular hash functions, used for the double hashing technique.

P_k(x, att) = ((h_1(x) ⊗ k) + (h_2(x) ⊗ k) · att) mod N    (1)

Equation 1 extends the double hashing probing by performing the bitwise XOR operation between the results of the two hash functions h_1 and h_2 and a constant k. Different values for the constant k will lead to different element distributions in the hash table, some of them being closer to the uniform distribution than others.

The goal of the genetic algorithm described in subsection III-C is to find the value of k that optimizes the fitness function described in subsection III-B.
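A minimal Python sketch of the probing and search logic described above. The concrete hash functions h_1 and h_2 are stand-ins, since the paper does not fix a particular pair; any two independent hash functions would do.

    def h1(x: int) -> int:
        # Stand-in hash function (the paper does not specify h1 and h2).
        return (x * 2654435761) & 0xFFFFFFFF

    def h2(x: int) -> int:
        # Second, independent stand-in hash for the double hashing step.
        return ((x * 0x9E3779B1) ^ (x >> 16)) & 0xFFFFFFFF

    def probe(x: int, att: int, k: int, n: int) -> int:
        # Equation 1: position of element x at attempt att, for XOR key k.
        return ((h1(x) ^ k) + (h2(x) ^ k) * att) % n

    def search(x: int, table: list, k: int):
        """Open-addressing lookup; returns (found, number of comparisons)."""
        n = len(table)
        for att in range(n):
            pos = probe(x, att, k, n)
            if table[pos] is None:        # free slot: x is not in the table
                return False, att + 1
            if table[pos] == x:           # occupied by x itself
                return True, att + 1
        return False, n                   # all positions probed, x absent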
B. The fitness function

The fitness function will measure the quality of a given solution. For a hash table, we are interested in the number of comparisons performed by the algorithm until it finds the searched element or until it decides that it is not present in the hash table. This number of comparisons can be evaluated in terms of its average case or worst case value. A constant λ ∈ (0, 1) will ensure a trade-off between the two cases, as in Equation 2.

F(k) = λ · AVG-COMP(k) + (1 − λ) · WORST-COMP(k)    (2)
Algorithm 1 COMPUTE-FITNESS(k, keySet, α)
Require: the XOR key k, a set of keys to test on keySet and a fill factor α
Ensure: the fitness for the given XOR key k
 1: table ← BUILD-HASH-TABLE(keySet.toInsert, k, α)
 2: totalComp, maxComp ← 0, 0
 3: for key ∈ keySet.toSearch do
 4:   nrComp ← SEARCH-COMPARISONS(key, table)
 5:   totalComp ← totalComp + nrComp
 6:   if nrComp > maxComp then
 7:     maxComp ← nrComp
 8:   end if
 9: end for
10: return λ · totalComp / |keySet.toSearch| + (1 − λ) · maxComp

Algorithm 1 describes how this fitness function is computed. The input keySet has two fields: keySet.toInsert, whose keys will be inserted in the hash table, and keySet.toSearch, whose keys will be searched. The set of keys to be searched contains both elements that should be found and elements that should not be found.

First of all, the hash table is built at line 1. The next line initializes both the total number of comparisons and the maximum number of comparisons to 0. The for loop at lines 3-9 searches each key from keySet.toSearch in the hash table and computes the number of comparisons. This number is added to the total and replaces the maximum, if greater. The last line of the algorithm returns the fitness value, computed as in Equation 2.

The algorithm complexity depends on the size of keySet and on the fill factor α. If we consider both the insert and the search operations to have the complexity O(1/(1 − α)), then the total algorithm complexity is O(|keySet| · 1/(1 − α)).
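A direct Python transcription of Algorithm 1, reusing the probe/search sketch above. The table sizing derived from α, the insertion loop, and the value of the λ constant are assumptions spelled out for the example; the paper leaves them at pseudocode level.

    LAMBDA = 0.5  # assumed trade-off between average and worst case (Equation 2)

    def build_hash_table(keys: list, k: int, alpha: float) -> list:
        # Table size chosen so that len(keys) / size ≈ alpha (fill factor).
        size = max(len(keys) + 1, int(round(len(keys) / alpha)))
        table = [None] * size
        for x in keys:
            for att in range(size):
                pos = probe(x, att, k, size)
                if table[pos] is None or table[pos] == x:
                    table[pos] = x
                    break
            # A well-behaved probe sequence is assumed to reach a free slot.
        return table

    def compute_fitness(k: int, to_insert: list, to_search: list,
                        alpha: float) -> float:
        """Algorithm 1: weighted average/worst number of search comparisons."""
        table = build_hash_table(to_insert, k, alpha)
        total_comp, max_comp = 0, 0
        for key in to_search:
            _, nr_comp = search(key, table, k)
            total_comp += nr_comp
            max_comp = max(max_comp, nr_comp)
        return LAMBDA * total_comp / len(to_search) + (1 - LAMBDA) * max_comp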
C. Genetic algorithm description

A genetic algorithm is a metaheuristic inspired by natural selection [12]. Genetic algorithms are used to probe a sample space that is too big to search exhaustively, but where any data point can be accessed at any time.

We will use a genetic algorithm to find the best k to be used in the hash function presented in Equation 1. The idea behind the XOR operation with the number k is to minimize the number of collisions as much as possible. We try to minimize the number of collisions between the data inside the static dataset, and also the number of collisions between the data in the dataset and data not in the dataset. We do this because we are trying to minimize the number of comparisons needed for both a successful search and an unsuccessful search.
Algorithm 2 GEN-ALG(keySet, α)
Require: a set of keys to test on keySet and a fill factor α
Ensure: the best XOR key k to use in the hash function
 1: pop ← ⋃_{i=1}^{PSIZE} {RAND()}
 2: genNr, lastImprove, maxFitness ← 0, 0, 0
 3: while genNr < θ1 and genNr − lastImprove < θ2 do
 4:   genNr ← genNr + 1
 5:   for i = 1 → |pop| do
 6:     fitness[i] ← COMPUTE-FITNESS(pop[i], keySet, α)
 7:   end for
 8:   if max(fitness) > maxFitness then
 9:     maxFitness ← max(fitness)
10:    lastImprove ← genNr
11:  end if
12:  newPop ← SELECT-TOP(pop, fitness, ELITE_SIZE)
13:  while |newPop| < PSIZE do
14:    k1, k2 ← ROULETTE-SELECT(pop, fitness)
15:    k'1, k'2 ← CROSSOVER(k1, k2)
16:    newPop ← newPop ∪ {k'1, k'2}
17:  end while
18:  for i = ELITE_SIZE + 1 → |newPop| do
19:    newPop[i] ← MUTATE(newPop[i])
20:  end for
21:  pop ← newPop
22: end while
23: return pop[arg max_{1≤i≤|pop|} fitness[i]]

The genetic algorithm starts with a population of
PSIZE sample points (called individuals), the first generation (line 1). It will run until a certain condition is met (e.g. a specific number of generations has passed since the algorithm started, or there has been a certain number of generations in which the maximum fitness did not change). The population size PSIZE is fixed, set at the algorithm start.

Every individual in the population will be evaluated in order to compute the fitness value (line 6). In order to compute the fitness function we need the average number of comparisons and the maximum number of comparisons needed for searching in the hash table, as detailed in the previous subsection.

The next step for the genetic algorithm is to select the individuals for the next generation. There are many strategies for selection, such as roulette wheel selection, elitism and tournament. A more detailed explanation can be found in [13] by Shukla, Pandey and Mehrotra.

The top ELITE_SIZE individuals ranked by fitness will automatically survive to the next generation (line 12). This strategy, called elitism, will ensure that the most fit individuals will also be found in the next generation, so the overall largest fitness will never decrease.

The rest of the individuals for the next generation are obtained by applying the crossover operator on individuals selected by the roulette wheel strategy (lines 13-17). For this strategy, every individual has a probability of being selected equal to its fitness value divided by the generation's total fitness.

The crossover operator is a binary operator that works on the binary representation of the individuals. In a generic context, there is a predetermined number of crossover points, and for each crossover point its location in the binary representation is established. Using these crossover points, the binary representation is "cut" into multiple segments. The resulting segments are mixed, producing two new individuals.

The binary representation of our individuals is a number represented on 32 bits. We chose a single crossover point, splitting the individual into two 16-bit numbers. The numbers containing the less significant information from the two individuals are swapped.

If a genetic algorithm is implemented only with this information and these strategies, the algorithm is likely to get stuck in a local optimum. To prevent that from happening, a new operator is added. The mutation operator is used to randomly flip bits of an individual. Not every individual is mutated: the probability of mutation is best kept between 5% and 10%, as shown by Haupt in [14]. After the probability of mutation is determined, we compute the number of bits to be flipped, then randomly choose those bits and flip them. This operator is applied at line 19.

The number of iterations performed by the algorithm is determined by two constants, θ1 and θ2. The first constant limits the total number of iterations, while the second one limits the number of iterations that the algorithm performs without improving the best solution so far.

The algorithm ends by returning the individual with the highest fitness from the last computed generation.
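A compact sketch of the operators described above: 32-bit individuals, a single crossover point at the 16-bit boundary, roulette wheel selection, and bit-flip mutation. The 7% mutation rate and the population constants are assumptions for illustration; also, since Equation 2 yields a cost (smaller is better) while Algorithm 2 selects for maximum fitness, the weights passed to roulette_select are assumed to be an increasing transform of solution quality, e.g. the reciprocal of F(k).

    import random

    PSIZE, ELITE_SIZE = 50, 4
    MUTATION_RATE = 0.07          # within the 5-10% range suggested in [14]

    def roulette_select(pop: list, fitness: list):
        # Two parents, each drawn with probability proportional to its
        # share of the total fitness.
        return random.choices(pop, weights=fitness, k=2)

    def crossover(k1: int, k2: int):
        # Single crossover point: swap the lower 16 bits of the two parents.
        hi1, hi2 = k1 & 0xFFFF0000, k2 & 0xFFFF0000
        lo1, lo2 = k1 & 0x0000FFFF, k2 & 0x0000FFFF
        return hi1 | lo2, hi2 | lo1

    def mutate(k: int) -> int:
        # Flip each of the 32 bits independently with probability MUTATION_RATE.
        for bit in range(32):
            if random.random() < MUTATION_RATE:
                k ^= 1 << bit
        return k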
IV. THEORETICAL NUMBER OF COMPARISONS

This section will present an alternative proof, different from the one described in [1], for the fact that the average number of comparisons for a hash table with open addressing and fill factor α is 1/(1 − α).

The hash table can be abstracted as a sequence of bits, the probability for a bit to be 1 being equal to the fill factor α, while the probability for a 0 bit is 1 − α. A search for a given key starts from the position given by the hash function and continues as long as we encounter 1 bits (they correspond to occupied positions), until the element is found or a 0 bit is encountered.

If we encounter the sequence 0, one comparison is needed. If we encounter a sequence of k bits of 1 followed by a 0 bit, we will require k + 1 comparisons. Since the double hashing ensures a uniform distribution, we can assume each bit is independent. In this case, the probability to encounter such a sequence is given by Equation 3.

P(\underbrace{11\ldots 1}_{k\text{ bits of }1}\,0) = \underbrace{\alpha \cdot \alpha \cdots \alpha}_{k\text{ times}} \cdot (1-\alpha) = \alpha^k (1-\alpha) \qquad (3)

The expected number of comparisons will be obtained by summing the lengths of the sequences multiplied by their probabilities.

E = \sum_{k=0}^{N-1} (k+1) \cdot P(\underbrace{11\ldots 1}_{k\text{ bits of }1}\,0)
  = \sum_{k=0}^{N-1} (k+1) \cdot \alpha^k (1-\alpha)
  = (1-\alpha) \sum_{k=0}^{N-1} (k+1) \cdot \alpha^k

The sum above can be computed using the derivation trick. We will consider the function f_k(x) = x^{k+1}. Its derivative is f'_k(x) = (k+1) \cdot x^k. Since the sum of the derivatives equals the derivative of the sum, the expression above becomes:

E = (1-\alpha) \sum_{k=0}^{N-1} f'_k(\alpha)
  = (1-\alpha) \Big( \sum_{k=0}^{N-1} f_k(\alpha) \Big)'
  = (1-\alpha) \Big( \sum_{k=0}^{N-1} \alpha^{k+1} \Big)'
  = (1-\alpha) \Big( \frac{\alpha^{N+1} - \alpha}{\alpha - 1} \Big)'
  = (1-\alpha) \cdot \frac{\big((N+1)\alpha^N - 1\big)(\alpha - 1) - \big(\alpha^{N+1} - \alpha\big)}{(\alpha - 1)^2}
  = \frac{N\alpha^{N+1} - (N+1)\alpha^N + 1}{1-\alpha}

Since α < 1 and N is a large number, α^N ≈ 0. This means that the expected number of comparisons becomes:

E \approx \frac{1}{1-\alpha} \qquad (4)
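The closed form is easy to sanity-check numerically. The short simulation below fills i.i.d. bits at fill factor α and counts probes until the first 0; it is an illustrative check under the independence assumption above, not part of the paper's experiments.

    import random

    def expected_comparisons(alpha: float, trials: int = 200_000) -> float:
        """Monte Carlo estimate of E: probe i.i.d. bits that are 1 with
        probability alpha and count probes until the first 0."""
        total = 0
        for _ in range(trials):
            comparisons = 1
            while random.random() < alpha:   # position occupied: keep probing
                comparisons += 1
            total += comparisons
        return total / trials

    # E.g. alpha = 0.5 yields an estimate near 2.0, matching Equation 4.
    for alpha in (0.1, 0.5, 0.9):
        print(alpha, expected_comparisons(alpha), 1 / (1 - alpha))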
V. EXPERIMENTAL RESULTS

A. Evaluating the number of comparisons against theoretical expectation
As proved in section IV, the expected number of comparisons for searching an element in a hash table with fill factor α is 1/(1 − α). The first experiment presented in this section will show that a hash function carefully chosen using the genetic algorithm outperforms this expectation.

Figure 1 plots the average number of comparisons against the fill factor α, from the values in Table I. The experimental results are the average numbers of comparisons measured in our experiments. The theoretical expectation is computed depending on the fill factor, as in Equation 4, while the speedup represents the difference between the expected and measured values, as a percentage of the expected value.

Fig. 1. Average number of comparisons by fill factor (experimental result vs. theoretical expectation)
TABLE I
EXPERIMENTAL VS THEORETICAL NUMBER OF COMPARISONS

α     1/(1 − α)   Experimental   Speedup
0.1   1.11        1.08            2.8%
0.2   1.25        1.19            4.8%
0.3   1.43        1.32            7.6%
0.4   1.67        1.50           10.0%
0.5   2.00        1.73           13.5%
0.6   2.50        2.07           17.2%
0.7   3.33        2.60           22.0%
0.8   5.00        3.60           28.0%
0.9  10.00        6.27           37.3%
As Figure 1 and Table I show, by applying the genetic algorithm in order to select the hash function, we obtain better results, with greater speedups for greater fill factors. For instance, if the fill factor is 0.5, our hash table will require 13.5% fewer comparisons.

B. Comparison with binary search
The previous subsection showed that by carefully selecting the hash function, using a genetic algorithm, we can obtain better performance than the theoretical expectation. In this subsection we will compare our results with those obtained with binary search, in order to choose the right fill factor.

Figure 2 shows the average number of comparisons performed by the near-perfect hashing algorithm to find an element in the hash table, for various fill factors. As expected, the number of comparisons increases with the fill factor but remains relatively constant as the number of elements increases. The plot also contains the average number of comparisons performed by the binary search algorithm, which is greater than the number of comparisons for near-perfect hashing, even for a fill factor α = 0.9.

Fig. 2. Average number of comparisons by input size (near-perfect hashing at several fill factors vs. binary search)
Fig. 3. Worst number of comparisons by input size (near-perfect hashing at several fill factors vs. binary search)
Although the average case is the most important in practice, there are situations when we are interested in the worst-case scenario, so we also plotted the worst number of comparisons in Figure 3. The figure shows that for fill factors up to α = 0.5, the worst number of comparisons for near-perfect hashing is still smaller than the worst number of comparisons for binary search. For larger fill factors, binary search is better than near-perfect hashing in the worst-case scenario.

C. Comparison with perfect hashing
The previous subsection showed that our method is faster than binary search, even in the worst-case scenario, if we use a fill factor α = 0.5. Such a fill factor means that we use twice as much memory as the most compact representation of the dataset (the one used by binary search). This subsection will show that even if we do not match the performance of perfect hashing, we use less memory.

According to [2] and [1], a perfect hash table is an array of pointers, of the same size or greater than the number of elements n, each pointer pointing to a secondary array whose size is the number of collisions at that position, squared. Using the assumptions that a pointer occupies the same size as an element in the hash table, that the hash table has size n, and that position i stores c_i elements, the total size (in number of elements, not in bytes) of the hash table is given by Equation 5.

size_{ph}(n) = n + \sum_{i=0}^{n-1} c_i^2 \qquad (5)
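Equation 5 is straightforward to compute from the first-level bucket counts. A small sketch, assuming a simple first-level hash (Python's built-in hash is used here purely as a placeholder):

    def perfect_hash_table_size(keys: list, first_level_hash=hash) -> int:
        """Equation 5: n pointers plus one quadratic secondary array per bucket."""
        n = len(keys)
        counts = [0] * n
        for key in keys:
            counts[first_level_hash(key) % n] += 1   # c_i: keys landing in slot i
        return n + sum(c * c for c in counts)

With a suitably chosen first-level hash, the expected value of the quadratic term is linear in n, which is why the overall memory of perfect hashing stays O(n) [2].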
The experimental comparison between the table size for perfect hashing, binary search and near-perfect hashing is depicted in Figure 4.

Fig. 4. Table size by number of elements (perfect hashing, near-perfect hashing with α = 0.5, binary search)
As expected, the binary search approach takes the least amount of memory. A near-perfect hash table constructed by the technique described in this paper, with a fill factor α = 0.5, takes twice as much memory, as half the positions in the hash table are unoccupied. The experiments showed that for perfect hashing, the amount of memory used is about 3 times as much as for binary search, and 50% more than the amount for near-perfect hashing.

The number of comparisons of the perfect hash method is constant: usually one on the first level and one on the second level, but this may vary depending on the hash function. The hash function tends to be more complicated than ours, especially on very large sets, so more time is spent to find the hash value.
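For contrast with the probing loop of Equation 1, a two-level (FKS-style [2]) lookup touches exactly one first-level slot and one second-level slot. The sketch below is schematic: the per-bucket hash parameters are assumptions, chosen offline so that each secondary array of c_i² slots is collision-free.

    def perfect_hash_lookup(x, first_level, buckets) -> bool:
        """Two-level lookup: one comparison at each level.

        first_level(x) selects a bucket; each bucket carries its own
        collision-free hash (seed chosen offline) over c_i**2 slots."""
        bucket = buckets[first_level(x)]
        if bucket is None:
            return False
        slots, bucket_hash = bucket          # secondary array and its hash
        return slots[bucket_hash(x) % len(slots)] == x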
VI. CONCLUSION

This paper described the concept of near-perfect hashing, used for searching in a fixed collection faster than using binary search and with a smaller memory footprint than perfect hashing.

The presented approach modifies the double hashing probing by adding a parameter k that affects the function in a non-linear way. A genetic algorithm that determines the best value for k, given the fixed collection, is presented.

The experimental results compare the performance of near-perfect hashing with regular hashing, binary search and perfect hashing. Our approach is faster than regular hashing, as the number of comparisons in the search function is lower, while the memory usage is the same. Compared with the binary search technique, near-perfect hashing is faster in the average case, even for large fill factors like 0.9. In worst-case terms, a fill factor of 0.5 ensures that near-perfect hashing is still faster. Compared to perfect hashing, the number of comparisons is greater, but the memory footprint is smaller, perfect hashing requiring about 50% more memory.

The presented technique can be used for solving various problems where fast data retrieval in a fixed collection is necessary.

ACKNOWLEDGMENT
Research supported, in part, by EC H2020 SMESEC GA
REFERENCES

[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, Cambridge, 2001, vol. 2.
[2] M. L. Fredman, J. Komlós, and E. Szemerédi, "Storing a sparse table with O(1) worst case access time," Journal of the ACM (JACM), vol. 31, no. 3, pp. 538–544, 1984.
[3] D. Gavrilut, R. Benchea, and C. Vatamanu, "Optimized zero false positives perceptron training for malware detection," in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on. IEEE, 2012, pp. 247–253.
[4] ——, "Practical optimizations for perceptron algorithms in large malware dataset," in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on. IEEE, 2012, pp. 240–246.
[5] Z. J. Czech, G. Havas, and B. S. Majewski, "An optimal algorithm for generating minimal perfect hash functions," Information Processing Letters, vol. 43, no. 5, pp. 257–264, 1992.
[6] ——, "Perfect hashing," Theoretical Computer Science, vol. 182, no. 1-2, pp. 1–143, 1997.
[7] F. C. Botelho, R. Pagh, and N. Ziviani, "Practical perfect hashing in nearly optimal space," Information Systems, vol. 38, no. 1, pp. 108–131, 2013.
[8] A. Limasset, G. Rizk, R. Chikhi, and P. Peterlongo, "Fast and scalable minimal perfect hashing for massive key sets," arXiv preprint arXiv:1702.03154, 2017.
[9] F. C. Botelho, W. C. Brandão, and N. Ziviani, "Minimal perfect hashing and Bloom filters made practical," in Proceedings of the IADIS International Conference Applied Computing, 2011, pp. 465–470.
[10] L. J. Guibas and E. Szemeredi, "The analysis of double hashing," Journal of Computer and System Sciences, vol. 16, no. 2, pp. 226–274, 1978.
[11] G. S. Lueker and M. Molodowitch, "More analysis of double hashing," Combinatorica, vol. 13, no. 1, pp. 83–96, 1993.
[12] D. Whitley, "A genetic algorithm tutorial," Statistics and Computing, vol. 4, no. 2, pp. 65–85, 1994.
[13] A. Shukla, H. M. Pandey, and D. Mehrotra, "Comparative review of selection techniques in genetic algorithm," in Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), 2015 International Conference on. IEEE, 2015, pp. 515–519.
[14] R. L. Haupt, "Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors," in Antennas and Propagation Society International Symposium, 2000. IEEE, vol. 2, 2000, pp. 1034–1037.