BrePartition: Optimized High-Dimensional kNN Search with Bregman Distances
Yang Song, Yu Gu, Rui Zhang, Senior Member, IEEE, and Ge Yu, Member, IEEE
Abstract—Bregman distances (also known as Bregman divergences) are widely used in machine learning, speech recognition and signal processing, and kNN searches with Bregman distances have become increasingly important with the rapid advances of multimedia applications. Data in multimedia applications such as images and videos are commonly transformed into spaces of hundreds of dimensions. Such high-dimensional spaces have posed significant challenges for existing kNN search algorithms with Bregman distances, which could only handle data of medium dimensionality (typically less than 100). This paper addresses the urgent problem of high-dimensional kNN search with Bregman distances. We propose a novel partition-filter-refinement framework. Specifically, we propose an optimized dimensionality partitioning scheme to solve several non-trivial issues. First, an effective bound from each partitioned subspace to obtain exact kNN results is derived. Second, we conduct an in-depth analysis of the optimized number of partitions and devise an effective strategy for partitioning. Third, we design an efficient integrated index structure for all the subspaces together to accelerate the search processing. Moreover, we extend our exact solution to an approximate version by a trade-off between accuracy and efficiency. Experimental results on four real-world datasets and two synthetic datasets show the clear advantage of our method in comparison to state-of-the-art algorithms.
Index Terms—Bregman Distance, High-Dimensional, kNN Search, Dimensionality Partitioning.

• Y. Song, Y. Gu, and G. Yu are with the School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China. E-mail: [email protected], {guyu, yuge}@mail.neu.edu.cn.
• R. Zhang is with the School of Computing and Information Systems, The University of Melbourne, Parkville VIC 3010, Australia. E-mail: [email protected].
• Corresponding author: G. Yu

1 INTRODUCTION

Bregman distances (also called Bregman divergences), as a generalization of a wide range of non-metric distance functions (e.g., the Squared Mahalanobis Distance and the Itakura-Saito Distance), play an important role in many multimedia applications such as image and video analysis and retrieval, speech recognition and time series analysis [1], [2], [3]. Metric measurements (such as the Euclidean distance) must satisfy the basic properties of metric spaces, such as non-negativity, symmetry and the triangle inequality. Although empirically proved successful, metric measurements represented by the Euclidean distance are actually inconsistent with the human perception of similarity [4], [5]. Examples from [6] and [4] illustrate that the distance measurement is not metric when comparing images. As can be seen in Fig. 1(a), the moon and the apple are similar in shape, the pentagram and the apple are similar in color, but there is no similarity between the moon and the pentagram. In this case, our perception of similarity violates the triangle inequality and illustrates that human beings are often comfortable deploying or using non-metric dissimilarity measurements instead of metric ones, especially on complex data types [6], [7]. Likewise, as shown in Fig. 1(b) [4], both the man and the horse are perceptually similar to their composition, but the two obviously differ from each other. Therefore, it is not appropriate to employ the Euclidean distance as the distance measurement in many practical scenarios.

Since Bregman distances have the capability of exploring the implicit correlations of data features [8], they have been widely used in recent decades in a variety of applications, including image retrieval, image classification and sound processing [7], [9], [10]. Over the last several years, they have also been used in many practical applications. For example, they are employed to measure the closeness between Hermitian Positive-Definite (HPD) matrices to realize target detection in clutter [11]. They are also used as the similarity measurements of the registration functional to combine various types of image characterization as well as spatial and statistical information in image registration [12], and to apply multi-region information to express the global information in image segmentation [13]. In addition, they are applied in graph embedding, matrix factorization and tensor factorization in the field of social network analysis [14]. Among the operations that employ Bregman distances, kNN queries are demanded as a core primitive or a fundamental step in the aforementioned applications [15], [16], [17].

In addition, data in multimedia applications such as images and videos are commonly transformed into spaces of hundreds of dimensions, such as the commonly used datasets (Audio, Fonts, Deep, SIFT, etc.) illustrated in our experimental evaluation.
Existing approaches [6], [18] for kNN searches with Bregman distances focus on designing index structures for Bregman distances, but these index structures perform poorly in high-dimensional space due to either large overlaps between clusters or intensive computation.

Fig. 1: Examples. (a) Example 1 (shape and color similarity). (b) Example 2 [4].

Aiming at these situations, this paper addresses the urgent problem of high-dimensional kNN searches with Bregman distances. We propose a partition-filter-refinement framework. We first partition a high-dimensional space into many low-dimensional subspaces. Then, range queries in all the subspaces are performed to generate candidates. Finally, the exact kNN results are evaluated by refinement from the candidates. However, realizing such a framework requires solving the following three challenges:
• Bound: In metric space, bounds are usually derived based on the triangle inequality, but in non-metric space the triangle inequality does not hold. Therefore, it is challenging to derive an efficient bound for Bregman distances, which are not metric.
• Partition: How to partition the dimensions to achieve the best efficiency is a challenge. We need to work out both how many partitions to create and at which dimensions to partition.
• Index: It is challenging to design an I/O-efficient index that handles all the dimension partitions in a unified manner. Specifically, by effectively organizing the data points on disk, the index should adapt to our proposed partition strategy and facilitate the data's reusability across partitions.

To address the above challenges, we make the following contributions in this paper:

• We mathematically derive the upper bounds between the given query point and an arbitrary data point in each subspace based on the Cauchy inequality, and the proper upper bounds are selected as the searching bounds for these subspaces. The final candidate set is the union of the candidate subsets of all the subspaces. We theoretically prove that the kNN candidate set obtained following our bound contains the kNN results.

• For dimensionality partitioning, we observe a trade-off between the number of partitions and the search time. Therefore, we derive the algorithm's time complexity and the optimized number of partitions in theory. Furthermore, a strategy named Pearson Correlation Coefficient-based Partition (PCCP) is proposed to reduce the size of the candidate set by partitioning highly-correlated dimensions into different subspaces.

• After dimensionality partitioning, we employ BB-trees in the partitioned low-dimensional subspaces and design an integrated and disk-resident index structure named BB-forest. BB-trees can handle low-dimensional data efficiently, so they work well in our framework since the data has been partitioned into low-dimensional subspaces. In addition, the data points are well-organized on disk based on our proposed PCCP to improve the data's reusability across all the subspaces, so that the I/O cost can be reduced.

• In order to improve the search efficiency while ensuring comparable accuracy with a probability guarantee, we make a trade-off between efficiency and accuracy and propose a solution for approximate kNN search based on the data distribution.

• Extensive experimental evaluations demonstrate the high efficiency of our approach. Our algorithm, named BrePartition, clearly outperforms state-of-the-art algorithms in running time and I/O cost.

The rest of the paper is structured as follows. Section 2 presents the related works. The preliminaries and overview are discussed in Section 3. We present the derivation of the upper bound in Section 4 and the dimensionality partitioning scheme in Section 5. The index structure, BB-forest, is described in Section 6. We present the overall framework in Section 7. The extended solution for approximate kNN search is presented in Section 8. Experimental results are discussed in Section 9. Finally, we conclude our work in Section 10.
2 RELATED WORKS

kNN search is a fundamental problem in many application domains. Here we review existing works on the kNN search problem in both metric and non-metric spaces.
The metric similarity search problem is a classic topic, and a plethora of methods exist for speeding up nearest neighbor retrieval. Existing methods include tree-based data structures such as the KD-tree [19], R-tree [20] and B+-tree variations [21], [22], [23], and transformation-based methods including Product Quantization (PQ)-based methods [24], [25], the Locality Sensitive Hashing (LSH) family [26], [27], [28], [29] and other similarity search methods based on various data embedding or dimensionality reduction techniques [30], [31], [32]. These methods cannot be utilized in non-metric space, where the metric postulates, such as symmetry and the triangle inequality, are not followed.
Due to the widespread use of Bregman distances in multimedia applications, a growing body of work is tailored for kNN search with Bregman distances. The prime technique is the Bregman Voronoi diagram derived by Nielsen et al. [33]. Soon after that, the Bregman Ball tree (BB-tree) was introduced by Cayton [18]. BB-trees are built by a hierarchical space decomposition via k-means, sharing a similar flavor with the KD-tree. In a BB-tree, the clusters appear in terms of Bregman balls, and a filtering condition in the dual space on the Bregman distance from a query to a Bregman ball is computed for pruning out portions of the search space. Nielsen et al. [34] extend the BB-tree to symmetrized Bregman distances. Cayton [35] explores an algorithm to solve the range query based on the BB-tree. Nevertheless, in higher dimensions, considerable overlap between clusters is incurred and too many nodes have to be traversed during a kNN search, so the efficiency of the BB-tree is dramatically degraded, sometimes even below that of a linear scan. Zhang et al. [6] devise a novel solution to handle the class of Bregman distances by expanding and mapping data points in the original space to a new extended space. It employs typical index structures, the R-tree and the VA-file, to solve exact similarity search problems. However, it is also inefficient for more than 100 dimensions, because too many returned candidates in a filter-refinement model lead to intensive computation overhead of Bregman distances.

Towards more efficient search processing, there has been increasing attention on approximate search methods for Bregman distances [4], [18], [34], [36], [37], [38]. These approximate methods achieve efficiency gains at the price of losing accuracy. For example, the state-of-the-art approximate solution [34], which is designed for high-dimensional space, exploits the data's distribution and employs a variational approximation to estimate the explored nodes during backtracking in the BB-tree. Nevertheless, none of these methods can provide precision guarantees, while some of them cannot be applied to high-dimensional space [18], [36], [37], [38].
There also exist many works in the context of non-metric similarity search that are not limited to Bregman distances. Space-embedding techniques [39], [40], [41], [42] embed non-metric spaces into a Euclidean one where two points that are close to each other in the original space are more likely to be close to each other in the new space. Distance-mapping techniques transform the non-metric distance by modifying the distance function itself while preserving the original distance orderings. Skopal [43] develops the TriGen algorithm to derive an efficient mapping function among concave functions by using the distance distribution of the database. The NM-tree [44] combines the M-tree and the TriGen algorithm for non-metric search. Liu et al. [45] propose a simulated-annealing-based technique to derive optimized transform functions while preserving the original similarity orderings. Chen et al. [46] employ constant shifting embedding with a suitable clustering of the dataset for more effective lower bounds. Recently, a representative technique based on Distance-Based Hashing (DBH) [47] was presented, in which a general similarity indexing method for non-metric distance measurements is designed by optimizing hashing functions.
In addition, Dyndex [48], as the most impressive classification-based technique, answers similarity search by performing classification of the query point, categorizing points into classes. These methods degrade dramatically in performance when dealing with high-dimensional data. There also exists an approximate solution called the Navigable Small World graph with controllable Hierarchy (HNSW) [49], which can be extended to non-metric space. However, it is not a disk-resident solution, while we mainly focus on disk-resident solutions in this paper.

We summarize the properties of representative non-metric search methods in Table 1. In Table 1, NM means that the method adopts the distance functions of non-metric space instead of metric space, BDS means that the method is designed specifically for Bregman distances, and HD means that the method works well in high-dimensional space (more than 100 dimensions). Our proposed solution BrePartition is the first algorithm that possesses all four desired properties compared to existing algorithms.

TABLE 1: Non-metric search methods

Name                            NM   BDS   HD   Exact
BrePartition (our solution)     √    √     √    √
Zhang et al. [6]                √    √          √
BB-tree [18], [35]              √    √          √
Bregman Voronoi diagram [33]    √    √          √
BB-tree variants [34]           √    √
Ackermann et al. [36]           √    √
Abdullah et al. [37]            √    √
Coviello et al. [34]            √    √     √
Ferreira et al. [38]            √    √
Non-metric LSH [4]              √    √     √
FastMap [39]                    √               √
Wang et al. [40]                √               √
Boostmap [41]                   √
Athitsos et al. [42]            √               √
TriGen [43]                     √               √
NM-tree [44]                    √
Liu et al. [45]                 √
LCE [46]                        √
DBH [47]                        √               √
Dyndex [48]                     √               √
HNSW [49]                       √          √
3 PRELIMINARIES AND OVERVIEW
Given a d-dimensional vector space S, a query point $y = (y_1, y_2, \ldots, y_d)$ and an arbitrary data point $x = (x_1, x_2, \ldots, x_d)$, the Bregman distance between x and y is defined as

$$D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle,$$

where $f(\cdot)$ is a convex function mapping points in S to real numbers, $\nabla f(y)$ is the gradient of $f(\cdot)$ at y, and $\langle \cdot, \cdot \rangle$ denotes the dot product between two vectors. When different convex functions are employed, Bregman distances define several well-known distance functions. Some representatives include:

• Squared Mahalanobis Distance:
The given $f(x) = x^T Q x$ generates $D_f(x, y) = (x - y)^T Q (x - y)$, which can be considered as a generalization of the squared Euclidean distance.

• Itakura-Saito Distance (ISD):
When the given $f(x) = -\sum_{i=1}^{d} \log x_i$, the distance is the Itakura-Saito distance, denoted by $D_f(x, y) = \sum_{i=1}^{d} \left( \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right)$.

• Exponential Distance (ED):
When the given $f(x) = \sum_{i=1}^{d} e^{x_i}$, the Bregman distance is represented as $D_f(x, y) = \sum_{i=1}^{d} \left( e^{x_i} - (x_i - y_i + 1) e^{y_i} \right)$. In this paper, we name it the exponential distance.

In addition, our method can be applied to most measures belonging to Bregman distances, such as Shannon entropy, Burg entropy, the $l_p$-quasi-norm and the $l_p$-norm, except the KL-divergence, since it is not cumulative after the dimensionality partitioning.

Our method consists of the precomputation and the search processing. In the precomputation, we first partition the dimensions (described in Section 5). Second, we construct BB-trees in the partitioned subspaces and integrate them to form a BB-forest (described in Section 6). Third, we transform the data points into tuples for computing the searching bound (described in Section 4). During the search processing, we first transform the query point into a triple and compute the bound used for the range query (described in Section 4). Second, we perform the range query for the candidates. Finally, the kNN results are evaluated from these candidates. The whole process is illustrated in Fig. 2. We summarize the frequently used symbols in Table 2.

Fig. 2: Overview. (Precomputation: partition the high-dimensional space into several subspaces; construct BB-trees in subspaces and integrate them for a BB-forest; transform high-dimensional data points into tuples. Query: transform the given query point into a triple; compute the upper bound for range query; perform the range query and refine the candidates for the kNN results.)

TABLE 2: Frequently used symbols

Symbol       Explanation
S            dataset
n            number of data points
d            dimensionality of each data point
k            number of returned points that users require
x            data point
y            query point
P(x)         transformed data point
Q(y)         transformed query point
D_f(x, y)    the Bregman distance between x and y
UB(x, y)     the upper bound of the Bregman distance between x and y
M            number of partitions
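To make these definitions concrete, here is a small NumPy sketch (ours, not the authors' Java implementation; the function names are hypothetical) that evaluates the generalized Bregman distance from a separable convex function and its per-coordinate derivative, instantiating the Itakura-Saito and exponential distances:

```python
import numpy as np

def bregman_distance(f, grad_f, x, y):
    """Generalized Bregman distance D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# Itakura-Saito distance: f(x) = -sum(log x_i), defined for positive vectors.
f_is = lambda v: -np.sum(np.log(v))
grad_is = lambda v: -1.0 / v

# Exponential distance: f(x) = sum(exp(x_i)).
f_exp = lambda v: np.sum(np.exp(v))
grad_exp = lambda v: np.exp(v)

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.8, 1.5, 0.9])

# Both reproduce the closed forms above, e.g. sum(x/y - log(x/y) - 1) for ISD.
print(bregman_distance(f_is, grad_is, x, y))    # Itakura-Saito distance
print(bregman_distance(f_exp, grad_exp, x, y))  # exponential distance
```

For the squared Mahalanobis distance one would instead supply $f(x) = x^T Q x$ and its gradient $2Qx$.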
4 DERIVATION OF BOUND
For kNN queries, it is crucial to avoid an exhaustive search by deriving an effective bound as a pruning condition. We exploit the property of Bregman distances, and an upper bound is derived from the Cauchy inequality.

Given the dataset S, suppose $x = (x_1, \ldots, x_d)^T$ and $y = (y_1, \ldots, y_d)^T$ are two d-dimensional vectors in S. After dimensionality partitioning, x is partitioned into M disjoint subspaces represented by M subvectors. These M subvectors are denoted by

$$x_{i\cdot} = \left( x_{\lceil d/M \rceil \times (i-1) + 1}, \ldots, x_{\lceil d/M \rceil \times i} \right)^T = \left( x_{i1}, \ldots, x_{i\lceil d/M \rceil} \right)^T,$$

where $1 \le i \le M$. Vector y is partitioned in the same manner, with each part denoted by $y_{i\cdot}$ ($1 \le i \le M$). By the Cauchy inequality, we can prove Theorem 1 below, which can be used to derive the upper bound between arbitrary $x_{i\cdot}$ and $y_{i\cdot}$ ($1 \le i \le M$) in the same subspace.

Theorem 1.
The upper bound between $x_{i\cdot}$ and $y_{i\cdot}$ ($1 \le i \le M$) is derived as

$$D_f(x_{i\cdot}, y_{i\cdot}) \le \alpha_x^{(i)} + \alpha_y^{(i)} + \beta_{yy}^{(i)} + \sqrt{\gamma_x^{(i)} \times \delta_y^{(i)}}.$$

For simplicity, $\alpha_x^{(i)}$, $\alpha_y^{(i)}$, $\beta_{yy}^{(i)}$, $\gamma_x^{(i)}$ and $\delta_y^{(i)}$ are used to denote the following quantities:

$$\alpha_x^{(i)} = \sum_{j=1}^{\lceil d/M \rceil} f(x_{ij}), \qquad \alpha_y^{(i)} = -\sum_{j=1}^{\lceil d/M \rceil} f(y_{ij}),$$

$$\beta_{xy}^{(i)} = -\sum_{j=1}^{\lceil d/M \rceil} \left( x_{ij} \times \frac{\partial f(y)}{\partial y_{ij}} \right), \qquad \beta_{yy}^{(i)} = \sum_{j=1}^{\lceil d/M \rceil} \left( y_{ij} \times \frac{\partial f(y)}{\partial y_{ij}} \right),$$

$$\gamma_x^{(i)} = \sum_{j=1}^{\lceil d/M \rceil} x_{ij}^2, \qquad \delta_y^{(i)} = \sum_{j=1}^{\lceil d/M \rceil} \left( \frac{\partial f(y)}{\partial y_{ij}} \right)^2.$$
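To see where the bound comes from (the formal proof is in the supplementary file), the decomposition below is our reconstruction from the definitions above, assuming a separable f; the only sign-indefinite term, $\beta_{xy}^{(i)}$, is bounded by the Cauchy inequality:

```latex
% Decompose the Bregman distance in subspace i and bound the cross term:
D_f(x_{i\cdot}, y_{i\cdot})
  = \underbrace{\textstyle\sum_j f(x_{ij})}_{\alpha_x^{(i)}}
    \underbrace{- \textstyle\sum_j f(y_{ij})}_{\alpha_y^{(i)}}
    \underbrace{- \textstyle\sum_j x_{ij}\,\frac{\partial f(y)}{\partial y_{ij}}}_{\beta_{xy}^{(i)}}
    + \underbrace{\textstyle\sum_j y_{ij}\,\frac{\partial f(y)}{\partial y_{ij}}}_{\beta_{yy}^{(i)}},
\qquad
\beta_{xy}^{(i)}
  \le \sqrt{\textstyle\sum_j x_{ij}^2}\,
      \sqrt{\textstyle\sum_j \big(\tfrac{\partial f(y)}{\partial y_{ij}}\big)^2}
  = \sqrt{\gamma_x^{(i)} \times \delta_y^{(i)}}.
```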
Proof. Please see Section 1 in the supplementary file.

Based on Theorem 1, we can derive the upper bound between two arbitrary data points in each subspace. Furthermore, we can obtain the upper bound between two data points in the original high-dimensional space by summing up the M upper bounds from all the subspaces. Theorem 2 proves this.

Theorem 2.
The Bregman distance between x and y is bounded by the sum of the upper bounds from the corresponding subspaces. Formally, $D_f(x, y) \le UB(x, y)$, where $UB(x, y) = \sum_{i=1}^{M} UB(x_{i\cdot}, y_{i\cdot})$.

Proof. Please see Section 2 in the supplementary file.

Therefore, when dealing with a kNN search, we first compute the upper bounds between the given query point and every data point in the dataset. Then the k-th smallest upper bound is selected, and its components are used as the searching bounds to perform range queries in their corresponding subspaces as the filter processing. The union of the searching result sets of all the subspaces is the final candidate set for the refinement processing. Theorem 3 proves that all the kNN points are in the final candidate set.

Fig. 3: Transformation. (Each partitioned data point $x_{i\cdot}$ is transformed into $(\alpha_x^{(i)}, \gamma_x^{(i)})$ and each query part $y_{i\cdot}$ into $(\alpha_y^{(i)}, \beta_{yy}^{(i)}, \delta_y^{(i)})$, from which $UB(x_{i\cdot}, y_{i\cdot})$ upper-bounds $D_f(x_{i\cdot}, y_{i\cdot})$.)
Theorem 3. The candidate set of each subspace is denoted by $C_i$ ($1 \le i \le M$), and the final candidate set is $C = C_1 \cup C_2 \cup \cdots \cup C_M$. The kNN points of the given query point y exist in C.

Proof. Please see Section 3 in the supplementary file.

Based on the upper bound derived in Theorem 1, the two components $\alpha_x$ and $\gamma_x$ of the data points in all the subspaces can be computed offline. This can be considered as the precomputation, in which each partitioned multi-dimensional data point in each subspace is transformed into a two-dimensional tuple denoted by $P(x) = (\alpha_x, \gamma_x)$ (see Fig. 3 for an illustration). Once the query point is given, we only need to compute the values of $\alpha_y$, $\beta_{yy}$ and $\delta_y$, which can be considered as a triple denoted by $Q(y) = (\alpha_y, \beta_{yy}, \delta_y)$, to obtain the upper bounds, while the computation overhead is very small. Relying on this precomputation, the search processing is dramatically accelerated.

Algorithm 1 describes the process of computing the upper bound between two arbitrary data points from their corresponding transformed vectors. Algorithm 2 and Algorithm 3 describe the transformations of a partitioned data point and the partitioned query point, respectively. Algorithm 4 describes the process of determining the searching bound for each subspace.
Algorithm 1 UBCompute(P(x), Q(y))
Input: the transformed vectors of x and y, P(x) = (α_x, γ_x) and Q(y) = (α_y, β_yy, δ_y).
Output: the upper bound of x and y, ub.
  ub = α_x + α_y + β_yy + sqrt(γ_x × δ_y);
  return ub;

Algorithm 2 PTransform(X)
Input: the subvectors' set of a partitioned data point, X = {x_1, x_2, ..., x_M}.
Output: the transformed tuples' set of the partitioned data point, P = {P(x_1), P(x_2), ..., P(x_M)}.
  for i = 1 to M do
    Compute α_x^(i) and γ_x^(i);
    P(x_i) = (α_x^(i), γ_x^(i));
  end for
  P = {P(x_1), P(x_2), ..., P(x_M)};
  return P;

Algorithm 3 QTransform(Y)
Input: the subvectors' set of the partitioned query point, Y = {y_1, y_2, ..., y_M}.
Output: the transformed triples' set of the partitioned query point, Q = {Q(y_1), Q(y_2), ..., Q(y_M)}.
  for i = 1 to M do
    Compute α_y^(i), β_yy^(i) and δ_y^(i);
    Q(y_i) = (α_y^(i), β_yy^(i), δ_y^(i));
  end for
  Q = {Q(y_1), Q(y_2), ..., Q(y_M)};
  return Q;

Algorithm 4 QBDetermine(S_t, Q)
Input: the transformed dataset S_t = {P_1, P_2, ..., P_n}, the transformed query point Q.
Output: the set containing the M subspaces' searching bounds, QB.
  UB = ∅, QB = ∅;
  for i = 1 to n do
    ub_i = 0;
    for j = 1 to M do
      ub_ij = UBCompute(P_ij, Q_j); // Algorithm 1
      ub_i += ub_ij;
      QB_i = QB_i ∪ {ub_ij};
    end for
    UB = UB ∪ {ub_i};
  end for
  Sort UB and find the k-th smallest ub_t (1 ≤ t ≤ n);
  QB = QB_t = {ub_t1, ub_t2, ..., ub_tM};
  return QB;
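The following NumPy sketch (ours; the names are hypothetical, and a separable convex f with per-coordinate derivative df is assumed) mirrors Algorithms 1-3: it precomputes the tuples for a partitioned point, builds the triples for a partitioned query, and evaluates the per-subspace upper bound:

```python
import numpy as np

def p_transform(subvectors, f):
    """Algorithm 2: each subvector x_i becomes the tuple (alpha_x, gamma_x)."""
    return [(np.sum(f(xi)), np.sum(xi ** 2)) for xi in subvectors]

def q_transform(subvectors, f, df):
    """Algorithm 3: each query subvector y_i becomes (alpha_y, beta_yy, delta_y)."""
    return [(-np.sum(f(yi)), np.sum(yi * df(yi)), np.sum(df(yi) ** 2))
            for yi in subvectors]

def ub_compute(p, q):
    """Algorithm 1: UB = alpha_x + alpha_y + beta_yy + sqrt(gamma_x * delta_y)."""
    alpha_x, gamma_x = p
    alpha_y, beta_yy, delta_y = q
    return alpha_x + alpha_y + beta_yy + np.sqrt(gamma_x * delta_y)

# Example with the exponential distance (f applied per coordinate).
f, df = np.exp, np.exp
x_parts = [np.array([1.0, 2.0]), np.array([0.5, 1.5])]   # M = 2 partitions
y_parts = [np.array([0.8, 1.5]), np.array([0.9, 1.2])]
ubs = [ub_compute(p, q) for p, q in zip(p_transform(x_parts, f),
                                        q_transform(y_parts, f, df))]
print(ubs, sum(ubs))  # per-subspace bounds and the overall bound UB(x, y)
```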
5 DIMENSIONALITY PARTITIONING

According to the Cauchy inequality, we can prove that

$$\sum_{i=1}^{M_1} \sqrt{\gamma_x^{(i)} \times \delta_y^{(i)}} > \sum_{i=1}^{M_2} \sqrt{\gamma_x^{(i)} \times \delta_y^{(i)}}, \quad \text{when } M_1 < M_2.$$

This indicates that more partitions bring tighter bounds. Furthermore, we can prove that there is an exponential relationship between the upper bound and the number of partitions. However, more partitions lead to more online computation overhead, so determining the number of partitions is a non-trivial issue. Besides, how we partition the dimensions may also affect the efficiency. In this section, the optimized number of partitions is derived theoretically, and a heuristic strategy is devised to reduce the candidate set via partitioning.

5.1 Optimized Number of Partitions

More partitions lead to a tighter bound, which intuitively indicates a smaller candidate set. But more partitions also incur more computation overhead when calculating the upper bounds from more subspaces. There is thus a trade-off between the number of partitions and the efficiency. In this part, we theoretically analyze the online processing to derive the optimized number of partitions for the optimized efficiency. The online processing mainly includes two parts: one is to compute the upper bounds between the given query point and every data point in the dataset and to sort these upper bounds to determine the searching bound; the other is to obtain the candidate points by performing range queries and to refine the candidate points for the results. Below we derive the time complexities of the online kNN search.

As a prerequisite, we transform the given query into a triple:

$$Q(y) = \left( -\sum_{i=1}^{d} f(y_i), \ \sum_{i=1}^{d} y_i \times \frac{\partial f(y)}{\partial y_i}, \ \sum_{i=1}^{d} \left( \frac{\partial f(y)}{\partial y_i} \right)^2 \right).$$

When computing the upper bounds, these triples in the M partitions can be computed in O(d) time. Since we have already transformed the multi-dimensional data points in each subspace into tuples, we can compute the upper bounds in O(Mn) time over all M subspaces. The time complexity of summing up all n points' upper bounds is also O(Mn), and the time complexity of sorting them to find the k-th smallest one is O(n log k). Therefore, the whole time complexity of transforming the query point, computing the upper bounds, summing them up and sorting them to find the k-th smallest one is O(d + Mn + n log k).

In the filter-refinement processing, we specify a parameter λ (0 < λ < 1) to describe the pruning effect of the searching bound for the complexity analysis. We suppose that the data points are stored in BB-trees and each leaf node is full. We set the capacity of each leaf node in a BB-tree to C, so the number of leaf nodes is n/C. Thus the number of accessed leaf nodes during the range queries can be estimated as λn/C. Following the method proposed in [35], the secant method is employed to determine whether a cluster and the searching range intersect, or whether one contains the other, when performing the range queries. As the time consumption of this determination is negligible, the time complexity of searching in the M BB-trees is equivalent to that of searching in a binary tree, i.e., $O(M \times \frac{\lambda n}{C} \log \frac{n}{C})$. In BB-trees, the capacity of each leaf node increases with the size of the corresponding dataset since we have to restrict the height of the tree, so the value of n/C can be considered a constant.
Consequently, the time complexity of searching in the BB-trees can be ignored.

Once we have identified the clusters that intersect the given searching range, all the data points in these clusters are loaded from disk into memory as the candidate points, which are then refined to obtain the kNN points. Thus we should first estimate the size of the candidate set. There is an exponential relationship between the upper bound UB and the number of partitions M, which we describe as $UB = A\alpha^M$, where $0 < \alpha < 1$. In addition, we assume the parameter λ, which describes the pruning effect, is proportional to the upper bound UB and is represented as $\lambda = \beta UB$. Based on these, the number of candidate points is $\beta A \alpha^M n$. Actually, the final candidate set is the union of all the subspaces' candidate subsets. Here we directly consider the candidate set obtained by searching in the original space with the summed searching bound as the final candidate set, since it is a very good approximation of the union of all the subspaces' candidate subsets, which is verified by experiments. Therefore, the refinement process takes $O(\beta A \alpha^M n d + \beta A \alpha^M n \log k)$ time. The first term represents the time complexity of computing the Bregman distances between each candidate point and the query, while the second represents the time complexity of evaluating the kNN points.

Therefore, the online time complexity is $O(d + Mn + n \log k + \beta A \alpha^M n d + \beta A \alpha^M n \log k)$ in total, and we target minimizing this total time cost.

Theorem 4.
For any user-specified k (0 < k ≤ n), the time complexity is minimized by setting the number of partitions

$$M = \log_\alpha \frac{n}{-\mu \ln \alpha \, (d + \log k)}, \quad \text{where } \mu = \beta A n.$$

Proof. Please see Section 4 in the supplementary file.

In particular, since the number of partitions must be determined offline, we set the value of k to 1 when computing M; this has little impact on the number of partitions, since log k is negligible compared to d. Moreover, the result computed by Theorem 4 may not be an integer, hence we compute the time costs for both rounding up and rounding down and choose the better value of M. A and α can be determined by fitting the function $UB = A\alpha^M$ through two arbitrary points' UB values and the corresponding M, where the points are randomly selected from the dataset. And β can be determined by computing the proportion of the points within each sample's UB to n.

As M increases, the I/O cost decreases exponentially, and the corresponding time consumption can be estimated as $\frac{\beta A \alpha^M n}{Bv}$, where B denotes the disk's page size and v denotes the disk's IOPS. Since the IOPS of current mainstream SSDs is very high, the loss in I/O time is negligible compared to the gains in CPU running time for the optimized partition number. When using a slower storage device, this I/O time can be added to the above cost model when deriving the optimized number of partitions.
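As a sketch of how Theorem 4 can be applied in practice (ours; the sampled (M, UB) pairs and the β value below are made-up inputs for illustration), one can fit A and α from two samples and then evaluate and round the optimized M:

```python
import math

def fit_ub_model(m1, ub1, m2, ub2):
    """Fit UB = A * alpha^M through two sampled (M, UB) points."""
    alpha = (ub2 / ub1) ** (1.0 / (m2 - m1))
    A = ub1 / alpha ** m1
    return A, alpha

def optimized_partitions(n, d, A, alpha, beta, k=1):
    """Theorem 4: M = log_alpha(n / (-mu * ln(alpha) * (d + log k))), mu = beta*A*n."""
    dk = d + math.log(k)            # with k = 1, log k = 0, as in Algorithm 5
    mu = beta * A * n
    m = math.log(n / (-mu * math.log(alpha) * dk), alpha)
    # The optimum is rarely an integer: evaluate the modeled online cost at
    # floor and ceil and keep the cheaper one.
    cost = lambda M: M * n + beta * A * alpha ** M * n * dk
    return min(math.floor(m), math.ceil(m), key=cost)

# Example: fit A, alpha from two sampled (M, UB) pairs, then compute M.
A, alpha = fit_ub_model(10, 250.0, 40, 60.0)
print(optimized_partitions(n=54387, d=192, A=A, alpha=alpha, beta=0.01))
```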
5.2 Pearson Correlation Coefficient-based Partition (PCCP)

Initially, we simply choose an equal and contiguous strategy for the dimensionality partitioning. Since the final candidate set is the union of the candidate subsets of all the subspaces, the size of the candidate set depends on the size of the candidate subset of each partition. Therefore, we attempt to develop a strategy for the dimensionality partitioning that further reduces the size of the final candidate set.

Intuitively, when the number of candidate points generated from each partition is constant, if the intersection between these partitions' candidate sets is small, the final candidate set will be large. In the worst case, the final candidate set reaches its maximum when their intersection is empty. Conversely, large overlaps among these partitions' candidate sets lead to a smaller candidate set. At best, each of them is equivalent to the final candidate set. Therefore, our objective is to reduce the size of the entire candidate set by making the partitions' candidate sets overlap as much as possible.

Simply put, if two partitions were identical, their corresponding candidate sets would overlap completely, leading to a smaller candidate set. Therefore, it is crucial to measure and compare the similarities between different partitions. However, there is no reasonable indicator satisfying this requirement. To address this issue, we simplify partitions to dimensions and employ the Pearson correlation coefficient to measure the correlations between different dimensions, using these correlations to indicate the similarities. Dimensions with strong correlations are assigned to different partitions, so that the dimensions in each partition are uniformly distributed. This heuristic algorithm is named Pearson Correlation Coefficient-based Partition (PCCP).

Given a d-dimensional dataset, we attempt to divide the d dimensions into M partitions, where each partition contains ⌈d/M⌉ dimensions. The correlation between two arbitrary dimensions X and Y is measured by the Pearson correlation coefficient $r(X, Y) = cov(X, Y) / \sqrt{var(X) var(Y)}$, where cov(X, Y) represents the covariance between X and Y, and var(X) and var(Y) represent the variances of X and Y, respectively. Specifically, we only consider the level of the dimensions' correlations and ignore their signs, so we use the absolute values of the Pearson correlation coefficients. Our proposed strategy consists of two steps (see the sketch after this subsection):

Assignment: We assign the d dimensions to groups of size M and attempt to ensure high similarities between the dimensions in each group. In detail, we first select a dimension randomly and insert it into a group, then find the dimension having the largest absolute Pearson correlation coefficient with the selected dimension and insert it into the same group. Next, the dimension which has the largest absolute Pearson correlation coefficient with any already-inserted dimension is selected and inserted into the current group. Following these steps, we continue searching and inserting until M dimensions have been inserted and a group is formed. The above steps are repeated until all d dimensions are assigned to ⌈d/M⌉ different groups.

Partitioning: We select one dimension from every group and insert it into the current partition, so that each partition has ⌈d/M⌉ dimensions. The above steps are repeated until the M partitions are obtained.

Fig. 4 illustrates the process with an example on a six-dimensional dataset whose dimensions are denoted by a, b, c, d, e and f. In the assignment, we randomly select a dimension such as a, and compute the correlations between a and the other dimensions. We find |r(a, e)| is the largest, so we assign a and e to the first group. Thereafter, we compute the correlations between e and the other dimensions, since we have already computed the correlations between a and the other dimensions. We find |r(e, f)| is the largest among the correlations between a and the other dimensions except e, as well as the correlations between e and the other dimensions except a, so we insert f into the group containing a and e. The others, b, c and d, are assigned to the second group. After the assignment, we repeatedly select a dimension from each of the two groups and insert the pair into a partition. The procedure is repeated three times, and the final partitions are obtained as in Fig. 4.

Fig. 4: An example of PCCP.
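A minimal sketch of PCCP (ours; it assumes a NumPy data matrix with one column per dimension and picks the seed dimension arbitrarily) implementing the assignment and partitioning steps greedily:

```python
import numpy as np

def pccp(data, M):
    """Greedy PCCP: group highly-correlated dimensions, then spread each
    group across the M partitions. Returns a list of M dimension-index lists."""
    d = data.shape[1]
    corr = np.abs(np.corrcoef(data, rowvar=False))  # |Pearson| between dimensions
    np.fill_diagonal(corr, -1.0)                    # ignore self-correlation
    unassigned = set(range(d))
    groups = []
    while unassigned:
        seed = unassigned.pop()                     # arbitrary starting dimension
        group = [seed]
        while len(group) < M and unassigned:
            # dimension most correlated with any dimension already in the group
            best = max(unassigned, key=lambda j: max(corr[i][j] for i in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    # Partitioning: the i-th partition takes the i-th member of every group,
    # so strongly-correlated dimensions land in different partitions.
    return [[g[i] for g in groups if i < len(g)] for i in range(M)]

data = np.random.rand(1000, 6)   # toy 6-dimensional dataset
print(pccp(data, M=3))           # e.g. [[0, 1], [4, 2], [5, 3]]
```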
6 INDEXING SUBSPACES FOR kNN SEARCH
After partitioning, the original high-dimensional space is divided into many low-dimensional subspaces. This enables us to use existing indexing techniques that are efficient for low-dimensional spaces, such as [6], [18]. Although the previous BB-tree [18] is only designed for kNN search, an algorithm specifically solving the range query based on the BB-tree is explored in [35] and shows good performance. Therefore, we employ the BB-tree for the range queries in each partition's filtering process and adopt the algorithm in [35] in this paper. We construct a BB-tree for each subspace, and all the BB-trees form a BB-forest.

Note that directly conducting kNN queries in each partition cannot obtain a correct candidate set, so the kNN algorithm and the filtering condition proposed in [18] cannot be simply extended using the partition strategy. On the other hand, the range query algorithm proposed in [35] cannot directly solve the kNN problem. Therefore, our framework is essentially different from the existing BB-tree based filtering algorithms.

Intuitively, if we store each node's corresponding cluster center and radius, together with the pointers to its children or the addresses of the data points in the cluster, the BB-tree can be simply extended to disk to process large-scale datasets. Fig. 5 illustrates the disk-resident integrated index structure, which consists of the M BB-trees constructed in the M partitioned subspaces. In each BB-tree, the intermediate nodes store their corresponding clusters' centers and radii, denoted by C.center and C.radius, respectively. The leaf nodes store not only the clusters' centers and radii, but also the addresses of the data points in their corresponding clusters, denoted by P.address, which are used for reading the original high-dimensional data points from the disks.

According to [35], which develops an algorithm for efficient range search with Bregman distances, only each cluster's center and radius are required to determine
whether two clusters intersect or one cluster contains the other during the search processing. Therefore, the disk-resident version of the BB-tree still works for efficient range search with Bregman distances. However, since our proposed BB-forest is an integrated index structure, how to organize the data on disk is a major issue to address. If the clusters in these subspaces are significantly different, different BB-trees' candidate nodes obtained by range queries will index different data points, which causes more I/O accesses when reading high-dimensional data points from the disks.

Fig. 5: BB-forest. (M disk-resident BB-trees; intermediate nodes store C.center, C.radius and child pointers N.pointer; leaf nodes store C.center, C.radius and the data point addresses P.address.)

Benefiting from our devised PCCP, similar clusters are obtained in different subspaces. Therefore, after the dimensionality partitioning, we construct a BB-tree in a randomly selected subspace, and the original high-dimensional data points indexed by each leaf node are organized on disk according to the order of these leaf nodes. At the same time, the address of each point, containing the disk number and offset, is recorded. During the construction of the other BB-trees, the recorded addresses of these data points are stored in the leaf nodes for indexing. Since the candidate points in different subspaces are, at best, the same, range queries in different subspaces tend to read the same part of the disks, which reduces I/O accesses.
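The node layout described above can be summarized in a short sketch (ours; the field names follow the paper's C.center, C.radius, N.pointer and P.address labels, while everything else is an assumption):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BBTreeNode:
    center: List[float]                  # C.center: Bregman ball center
    radius: float                        # C.radius: Bregman ball radius
    children: List["BBTreeNode"] = field(default_factory=list)   # N.pointer
    addresses: List[Tuple[int, int]] = field(default_factory=list)
    # P.address (leaves only): (disk number, offset) of each indexed point

@dataclass
class BBForest:
    """One BB-tree per partitioned subspace. All trees share the point
    addresses recorded when the first tree was built, so range queries in
    different subspaces tend to touch the same disk pages (thanks to PCCP)."""
    trees: List[BBTreeNode]
```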
7 OVERALL FRAMEWORK
In this paper, we solve the kNN search problem with Bregman distances in high-dimensional space by our proposed partition-filter-refinement framework. The framework consists of the precomputation in Algorithm 5 and the search processing in Algorithm 6. In Algorithm 5, we first determine the number of partitions (Line 2), which is introduced in Section 5.1. Then, we perform the dimensionality partitioning on the dataset based on our proposed strategy PCCP, denoted as PCCP(S) (Line 3), which is introduced in Section 5.2. Next, we transform the original data points in each subspace into tuples used for computing the upper bounds later (Lines 4-7), which is introduced in Section 4. Finally, the partitioned data points are indexed by the BB-trees of all the subspaces, forming the BB-forest (Line 8); the construction of the BB-forest is denoted as BBFConstruct(S_p) and is introduced in Section 6.

Algorithm 5 BrePartitionConstruct(S, n, d, A, α, β)
Input: S, n, d, A, α, β.
Output: S_t, BBF.
1: S_t = ∅, BBF = ∅;
2: $M = \log_\alpha \frac{n}{-\beta A n d \ln \alpha}$;
3: S_p = PCCP(S);
4: for each X ∈ S_p do
5:   P = PTransform(X); // Algorithm 2
6:   S_t = S_t ∪ P;
7: end for
8: BBF = BBFConstruct(S_p);
9: return S_t, BBF;

In Algorithm 6, we first rearrange the query point according to the partitioned data points (Line 2). Second, we transform the query point into triples in each subspace (Line 3) and compute the searching bounds for all the subspaces (Line 4), which is introduced in Section 4. Third, we perform range queries with the computed searching bounds over all the BB-trees in the BB-forest to retrieve the candidates (Lines 5-7). Finally, we evaluate the candidate points for the kNN points and return the result set (Lines 8-9).

Algorithm 6 BrePartitionSearch(S_p, y, k, S_t, BBF)
Input: S_p, y, k, S_t, BBF.
Output: Res.
1: Cand = ∅, Res = ∅;
2: Rearrange the query point to obtain Y;
3: Q = QTransform(Y); // Algorithm 3
4: QB = QBDetermine(S_t, Q); // Algorithm 4
5: for each T_j ∈ BBF do
6:   Cand = Cand ∪ T_j.rangeQuery(Y[j], QB[j]);
7: end for
8: Evaluate the kNN result set Res = {x_1, x_2, ..., x_k} from Cand;
9: return Res;
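Putting the pieces together, the compressed sketch below (ours) mirrors the search processing of Algorithm 6. It reuses the hypothetical helpers from the earlier sketches; the BB-tree range query is a stub returning candidate point indices, and the per-point tuple transformation, shown inline here for brevity, would be precomputed in a real system:

```python
import numpy as np

def knn_search(data, parts, trees, query, k, f, df):
    """Partition-filter-refinement: compute per-subspace upper bounds, take the
    k-th smallest summed bound as the range-query radius per subspace, union
    the candidates, then refine with exact Bregman distances."""
    y_parts = [query[p] for p in parts]
    q = q_transform(y_parts, f, df)                       # Algorithm 3
    ubs = np.array([[ub_compute(pt, qt) for pt, qt in
                     zip(p_transform([x[p] for p in parts], f), q)]
                    for x in data])                       # precomputed in practice
    t = np.argsort(ubs.sum(axis=1))[k - 1]                # Algorithm 4
    cand = set()
    for j, tree in enumerate(trees):                      # filter (Lines 5-7)
        cand |= tree.range_query(y_parts[j], ubs[t][j])   # stub BB-tree call
    refine = lambda i: bregman_distance(f, df, data[i], query)
    return sorted(cand, key=refine)[:k]                   # refinement (Line 8)
```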
EARCH
In previous sections, we mainly concentrate on the exactretrieval of k NN searches. However, through our research,we can tighten the bounds derived above for comparableapproximate results with probability guarantees to im-prove the efficiency. We present an approximate solutionwith probability guarantees. By the solution, given aquery point q and a probability guarantee p ( < p ≤ ),the retrieval k points are the exact k NN points of q withthe probability guarantee p .According to the previous description in Theorem 1,we relax the item β xy by employing Cauchy-inequalityand an exact bound for the k NN search can be derived.The exact searching bound can be simply represented inthe form of κ + µ , where κ = d X i =1 f ( x i ) − d X i =1 f ( y i ) + d X i =1 ( y i × ∂f ( y ) ∂y i ) which isn’t affected when computing the upper bound,and µ = vuut d X i =1 x i × d X i =1 ∂f ( y ) ∂y i which is obtained by relaxing β xy . Therefore, the search-ing bound is actually determined by µ . In our proposed X - (cid:539) (cid:541) Xc (cid:541) Exact Range Approximate Range - (cid:539) Fig. 6: Exact and Approximate Casesapproximate k NN search solution, µ is tightened by mul-tiplying a coefficient denoted as c ( < c ≤ ) in thecontext that the distribution of each dimension’s data isknown. With this tighter bound, we can obtain the k NNresults more efficiently within a smaller bound with theprobability guarantee. Therefore, we mainly focus on howto derive the value of c when the value of µ is known. Proposition 1.
We define the following two events A and B, which describe the exact condition and the approximate condition, respectively, and an event C, which indicates that Bregman distances are non-negative:

$$A: \beta_{xy} < \mu, \qquad B: \beta_{xy} \le c\mu \ (0 < c \le 1), \qquad C: \beta_{xy} \ge -\kappa.$$

When the distribution of each dimension in the dataset is known and the given probability guarantee is p, the value of c is

$$c = \Psi_{\beta_{xy}}^{-1} \left( p \Psi_{\beta_{xy}}(\mu) + (1 - p) \Psi_{\beta_{xy}}(-\kappa) \right) / \mu,$$

where $\Psi_{\beta_{xy}}$ and $\Psi_{\beta_{xy}}^{-1}$ denote the variable $\beta_{xy}$'s cumulative distribution function (cdf) and its inverse function, respectively.

Proof. Please see Section 5 in the supplementary file.

Proposition 1 shows that the exact bound is tightened by multiplying it by an approximate coefficient to obtain approximate results with probability guarantees, under the condition that the distribution of each dimension in the dataset is known. Fig. 6 shows the exact and approximate cases intuitively. Given the exact bound, we first compute the approximate coefficient in the original space, and the exact bounds obtained in all partitions are multiplied by the approximate coefficient to become each partition's new approximate bound.
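As an illustration of Proposition 1, the sketch below (ours) computes c when β_xy is modeled as normally distributed, using SciPy's cdf/ppf for Ψ and Ψ⁻¹; the normal model and all the numbers are assumptions, standing in for a distribution fitted to the data as in the footnote above:

```python
from scipy.stats import norm

def approx_coefficient(mu, kappa, p, beta_mean, beta_std):
    """Proposition 1: c = Psi^{-1}(p*Psi(mu) + (1-p)*Psi(-kappa)) / mu,
    with Psi the cdf of beta_xy (modeled here as Normal(beta_mean, beta_std))."""
    psi = norm(loc=beta_mean, scale=beta_std)
    target = p * psi.cdf(mu) + (1.0 - p) * psi.cdf(-kappa)
    return psi.ppf(target) / mu

# Example: tighten the bound with a 0.9 probability guarantee.
c = approx_coefficient(mu=12.0, kappa=3.0, p=0.9, beta_mean=2.0, beta_std=4.0)
print(c)  # 0 < c <= 1; the approximate searching bound becomes kappa + c*mu
```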
9 EXPERIMENTS
We conduct extensive experiments on four real datasets and two synthetic datasets to verify the efficiency of our methods. In this section, we present the experimental results.
9.1 Experimental Setup

According to Table 1, we select the two state-of-the-art techniques [6], [18] designed for exact kNN search with Bregman distances as baselines, denoted by "VAF" and "BBT", respectively. Our method is denoted by "BP" (the abbreviation of "BrePartition") and its approximate version by "ABP" (the abbreviation of "Approximate BrePartition"). Three key parameters are shown in Table 3. We measure the index construction time, I/O cost and CPU time to verify the efficiency of the benchmarks and our proposed method. These methods are implemented in Java. All experiments were done on a PC with an Intel Core i7-2600M 3.40GHz CPU, 8 GB memory and a 1 TB WD Blue 3D NAND SATA SSD, running Windows x64.

TABLE 3: Parameters

Parameter               Varying Range
Result size k           20, 40, 60, 80, 100
Dimensionality (Fonts)  100, 200, 300, 400
Data size (Sift)        2M, 4M, 6M, 8M, 10M
Four real-life datasets, Audio, Fonts, Deep and Sift, and two synthetic datasets are summarized in Table 4. Normal is a 200-dimensional synthetic dataset of 50000 points generated by sampling the standard normal distribution. Uniform is a 200-dimensional synthetic dataset of 50000 points generated by sampling the uniform distribution over [0, 1]. Normal and Uniform are only used for evaluating our proposed approximate solution. For all these datasets, 50 points are randomly selected as the query sets, and we randomly select 50 samples for computing A, α and β.

TABLE 4: Datasets

Dataset   n         d    M   Page Size  Measure
Audio     54387     192  28  32KB       ED
Fonts     745000    400  50  128KB      ISD
Deep      1000000   256  37  64KB       ED
Sift      11164866  128  22  64KB       ED
Normal    50000     200  25  32KB       ED
Uniform   50000     200  21  32KB       ISD

Fig. 7: Index Construction Time.

Fig. 8: I/O Cost (Impact of M). (Panels: (a) Audio, (b) Fonts, (c) Deep, (d) Sift; curves for k = 20, 60, 100.)

Fig. 9: Running Time (Impact of M). (Panels: (a) Audio, (b) Fonts, (c) Deep, (d) Sift; curves for k = 20, 60, 100.)

9.2 Index Construction Time

In Fig. 7, we illustrate the index construction time of all three testing methods on the six testing datasets. The construction of the VA-file is the fastest among all the methods on all the datasets, while the index structures constructed based on Bregman balls, the BB-forest and the BB-tree, increase the construction time by at least one order of magnitude because of the time-consuming clustering process. The construction of the BB-tree is slower than that of our proposed BB-forest, since the clustering becomes less effective as the dimensionality increases. In addition, the construction of the VA-file also becomes slower in high-dimensional spaces.
9.3 Evaluation of Dimensionality Partitioning

In our method, we develop the dimensionality partitioning technique to solve the kNN search problem in high-dimensional space. Through theoretical analysis, the optimized number of partitions M can be derived, and we obtain the optimized numbers of partitions on the four datasets according to Theorem 4, as shown in Table 4. In addition, a novel strategy named PCCP is explored. In this part, we evaluate the impact of the parameter M and of PCCP on the efficiency of our method and validate that the derived value of M is optimized.

9.3.1 Impact of M

We show the results on the four real datasets to demonstrate the impact of M on the efficiency. We vary k over 20, 60 and 100 to evaluate the I/O cost and the running time when the value of M is set within a certain range; the results are shown in Figs. 8 and 9. From Fig. 8, the I/O cost decreases as M increases in all cases. Moreover, the I/O cost decreases more and more slowly as M increases. In Fig. 9, the trend of the running time is different from that of the I/O cost: it generally descends first and then ascends.

9.3.2 Validation of the Optimized M

As shown in Fig. 8, the I/O cost decreases more and more slowly as M increases. This is because the number of candidate points decreases at a slower and slower rate as the searching range shrinks at a slower and slower rate. As can be seen in Fig. 8, the I/O cost is reduced at a low speed after M reaches the optimized value. Meanwhile, from Fig. 9, when the values of M are set to our derived optimized numbers of partitions on the four datasets, the CPU running time is minimal in all cases. In addition, the random reads of current mainstream SSDs range from about 5k IOPS to 500k IOPS, so the efficiency gains achievable by further reducing the number of I/Os are negligible compared to the gains in CPU running time. Therefore, the experimental results validate that our derived value of M is optimized.

9.3.3 Impact of PCCP

In this part, we evaluate how PCCP influences the efficiency of our method by testing it both with and without PCCP; the results are shown in Fig. 10. The default value of k is 20. From the experimental results, both the I/O cost and the running time are clearly reduced when PCCP is applied, which indicates that PCCP can reduce the number of candidate points, leading to lower I/O costs and time consumption. Moreover, the experimental results demonstrate that our proposed index structure, BB-forest, can avoid some invalid disk accesses relying on PCCP. We also evaluate the impact of randomly selecting the first dimension in PCCP on the performance; the experimental results are shown in Section 7 of the supplementary file.
Fig. 10: Impact of PCCP. (a) I/O cost. (b) Running time.
9.4 I/O Cost

In this part, we evaluate the I/O cost by varying k from 20 to 100; the experimental results are shown in Fig. 11. We extend the memory-resident BB-tree to a disk-resident index structure following the idea of our proposed BB-forest. As k increases, BrePartition outperforms the other two methods in I/O cost in almost all testing cases. This benefits from the derived bound's good pruning effect, which leads to a small number of candidate points. Moreover, BrePartition performs well on all datasets, which shows that our proposed method can be applied to both large-scale and high-dimensional datasets.

From Fig. 11, we find that the I/O cost of the VA-file is lower than that of the BB-tree, owing to two aspects. On the one hand, the lower and upper bounds computed by its search algorithm are reasonable, even though less tight than ours. On the other hand, the whole space has been compactly divided using a large number of bits via the VA-file, which can reduce the size of the candidate set, so fewer points are loaded into main memory. Even so, loading more candidate points and all the disk-resident vector approximations causes extra I/O costs. Moreover, the BB-tree performs the worst among all the testing methods. This is because large overlaps among clusters in the high-dimensional space cause more candidates, which must be checked for the kNN results.
In Fig. 12, we evaluate the CPU’s running time by varying k to test the efficiency of our method. In all testing cases,BrePartition performs consistently the best regardless ofthe dimension and the size of datasets, which showsour method’s high scalability. This is mainly because wepartition the original high-dimensional space into severallow-dimensional subspaces and build our proposed inte-grated index structure, BB-forest to accelerate the searchprocessing.Similarly, compared to BrePartition, VA-file doesn’tshow satisfying results on the CPU’s running time. This isbecause it’s required to scan all the vector approximationsrepresenting the data points when pruning the uselesspoints for the candidate points. Besides, the reverse map-ping from the vector approximations to their correspond-ing original points when verifying the candidate points is time-consuming as well. Similar to the I/O cost, BB-treeshows the worst results, because too many candidates tobe checked cause performance degradation in the high-dimensional space. In this section, we evaluate how the dimensionality ofa dataset influences the efficiency of these methods. Weselect Fonts as the testing dataset and evaluate the I/Ocost and the CPU’s running time in different dimensionsof the dataset. We vary the dimensionality from 10 to400 and the experimental results are shown in Fig. 13. InBrePartition, when the dimensionality of testing changes,the number of partitions changes as well. According toTheorem 4, we set the computed optimized number ofpartitions to 3, 9, 13, 29 and 50, when the dimensionalityis 10, 50, 100, 200 and 400, respectively.The experimental results illustrate that both the I/Ocost and the running time increase monotonically asthe dimensionality of the dataset increases. BrePartitiondemonstrates the best performance as before, and theindicators show slighter increases as the dimensionalityincreases, which illustrates that it’s applicable to variousdimensional spaces. This is because BrePartition can de-rive appropriate bounds for different dimensional spaces.The growth rates of I/O cost and running time in VA-file increase as the dimensionality increases, which showsits poor scalability in high-dimensional space. BB-treeillustrates good performances in 10-dimensional space,but its I/O cost and running time increase significantly,which indicates it’s only an efficient method in the low-dimensional space.
In this part, we select the dataset Sift and set the datasize from 2000000 to 10000000 to test these algorithms’I/O cost and CPU’s running time in different scales ofdatasets. The results are illustrated in Fig. 14. Based onthe above Theorem 4, the value of data size n impacts littleon the value of partition’s number M , so we consistentlyselect 22 as the number of partitions regardless of changesin the data size.From the experimental results, the I/O cost and therunning time almost increase linearly as the size of thedataset increases. BrePartition requires the lowest I/O costand running time indicating its good scalability with theincreasing data amount. In addition, VA-file demonstratescomparable performances in I/O cost and running timeas well. Compared to BrePartition, BB-tree’s I/O cost andrunning time are multiplied since it isn’t suitable for high-dimensional spaces as explained in Section 9.4 and 9.5. We evaluate our proposed approximate solution in thissection. Our solution is compared with the state-of-the-artapproximate solution proposed in [34] denoted as ”Var” inour paper. We firstly define overall ratio denoted as OR = k P ki =1 D f ( p i ,q ) D f ( p ∗ i ,q ) , where p i is the i -th point returned and p ∗ i is the exact i -th NN point. Overall ratio describes the I/ O C o s t ( ) k BPVAFBBT (a) Audio I/ O C o s t ( ) k BPVAFBBT (b) Fonts I/ O C o s t ( ) k BPVAFBBT (c) Deep I/ O C o s t ( ) k BPVAFBBT (d) Sift
Fig. 11: I/O Cost when varying k (BP, VAF, BBT): (a) Audio, (b) Fonts, (c) Deep, (d) Sift.

Fig. 12: Running Time (s) when varying k (BP, VAF, BBT): (a) Audio, (b) Fonts, (c) Deep, (d) Sift.

Fig. 13: Impact of Dimensionality (Fonts): (a) I/O cost, (b) Running time.

Fig. 14: Impact of Data Size (Sift): (a) I/O cost, (b) Running time.
Since "Var" is based on the traditional BB-tree, we also extend the memory-resident BB-tree to a disk-resident index structure following the idea of our proposed BB-forest in order to test the I/O cost. In addition, the running time is also measured to evaluate its efficiency. The results are shown in Fig. 15. Fig. 15(a) shows the overall ratios obtained by varying k from 20 to 100. Generally, a larger k leads to a larger overall ratio. Moreover, we evaluate the overall ratio when the probability guarantee p is set to 0.7, 0.8 and 0.9, respectively. From the experimental results, the overall ratio decreases as p increases, which indicates that a higher probability guarantee leads to a higher accuracy. Compared with "Var", our solution performs better in almost all cases on Normal. Similarly, we evaluate the I/O cost and the running time by varying k and p. The experimental results are shown in Figs. 15(b) and 15(c). Generally, the I/O cost and the running time increase as k increases, and we observe the opposite trends when varying p, because the search range is extended as p increases. Since "Var" reduces the number of nodes checked during the search by exploiting the data's distribution, its I/O cost and running time are reduced. Even so, the I/O cost and the running time of "Var" are larger than those of our solution in most cases, which indicates that our approximate solution yields higher efficiency while ensuring accuracy. The experimental results on Uniform are similar to those on Normal; they are shown in Section 6 of the supplementary file.

Fig. 15: Evaluations of Approximate Solution (Normal): (a) Overall Ratio, (b) I/O Cost, (c) Running Time.
10 CONCLUSION
In this paper, we address the important problem of high-dimensional kNN search with Bregman distances and propose a dimensionality partitioning approach named BrePartition. BrePartition follows a partition-filter-refinement framework. We have proposed a number of novel techniques to overcome the challenges of the problem. First, we derive an effective upper bound based on the Cauchy inequality as the pruning condition to significantly reduce the number of candidates that need to be checked closely. Second, we optimize the dimensionality partitioning by computing the optimized number of partitions to reduce the running time and by devising a strategy called PCCP to further reduce the size of the candidate set. Third, we design an integrated index structure, named BB-forest, which consists of BB-trees for all the individual subspaces. In addition, we extend our exact solution to an approximate version that exploits the data's distribution. Experimental results demonstrate that our method yields significant performance improvements in CPU running time and I/O cost. In future work, we will further improve the existing brief approximate solution and propose a more efficient solution by converting the Bregman distance into the Euclidean distance and employing traditional metric search methods to solve high-dimensional kNN search with Bregman distances. In addition, we will also improve our BB-forest so that it can support inserting and deleting large-scale data more efficiently.

REFERENCES

[1] J. Goldberger, S. Gordon, and H. Greenspan, "An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures," in Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), 2003, pp. 487–493.
[2] F. Nielsen and R. Nock, "On approximating the smallest enclosing Bregman balls," in Proceedings of the 22nd ACM Symposium on Computational Geometry (SoCG), 2006, pp. 485–486.
[3] B. Long, Z. M. Zhang, and P. S. Yu, "Graph partitioning based on link distributions," in Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, 2007, pp. 578–583.
[4] Y. Mu and S. Yan, "Non-metric locality-sensitive hashing," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[5] J. Laub, J. H. Macke, K. Müller, and F. A. Wichmann, "Inducing metric violations in human similarity judgements," in Advances in Neural Information Processing Systems 19, 2006, pp. 777–784.
[6] Z. Zhang, B. C. Ooi, S. Parthasarathy, and A. K. H. Tung, "Similarity search on Bregman divergence: Towards non-metric indexing," PVLDB, vol. 2, no. 1, pp. 13–24, 2009.
[7] J. Puzicha, Y. Rubner, C. Tomasi, and J. M. Buhmann, "Empirical evaluation of dissimilarity measures for color and texture," in Proceedings of the International Conference on Computer Vision, 1999, pp. 1165–1172.
[8] X. Li, T. Pi, Z. Zhang, X. Zhao, M. Wang, X. Li, and P. S. Yu, "Learning Bregman distance functions for structural learning to rank," IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1916–1927, 2017.
[9] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, "Bridging the gap: Query by semantic example," IEEE Trans. Multimedia, vol. 9, no. 5, pp. 923–938, 2007.
[10] R. Gray, A. Buzo, A. Gray, and Y. Matsuyama, "Distortion measures for speech processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 367–376, 1980.
[11] X. Hua, Y. Cheng, H. Wang, Y. Qin, and D. Chen, "Geometric target detection based on total Bregman divergence," Digital Signal Processing, vol. 75, pp. 232–241, 2018.
[12] D. P. L. Ferreira, E. Ribeiro, and C. A. Z. Barcelos, "A variational approach to non-rigid image registration with Bregman divergences and multiple features," Pattern Recognition, vol. 77, pp. 237–247, 2018.
[13] D. Cheng, D. Shi, F. Tian, and X. Liu, "A level set method for image segmentation based on Bregman divergence and multi-scale local binary fitting," Multimedia Tools Appl., vol. 78, no. 15, pp. 20585–20608, 2019.
[14] A. Okuno and H. Shimodaira, "Hyperlink regression via Bregman divergence," CoRR, vol. abs/1908.02573, 2019.
[15] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems 18, 2005, pp. 1473–1480.
[16] S. Ramaswamy and K. Rose, "Fast adaptive Mahalanobis distance-based search and retrieval in image databases," in Proceedings of the International Conference on Image Processing (ICIP), 2008, pp. 181–184.
[17] T. Dong, Y. Ishikawa, and C. Xiao, "Top-k similarity search over Gaussian distributions based on KL-divergence," JIP, vol. 24, no. 1, pp. 152–163, 2016.
[18] L. Cayton, "Fast nearest neighbor retrieval for Bregman divergences," in Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), 2008, pp. 112–119.
[19] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[20] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in SIGMOD'84, Proceedings of Annual Meeting, 1984, pp. 47–57.
[21] H. V. Jagadish, B. C. Ooi, K. Tan, C. Yu, and R. Zhang, "iDistance: An adaptive B+-tree based indexing method for nearest neighbor search," ACM Trans. Database Syst., vol. 30, no. 2, pp. 364–397, 2005.
[22] A. Arora, S. Sinha, P. Kumar, and A. Bhattacharya, "HD-index: Pushing the scalability-accuracy boundary for approximate kNN search in high-dimensional spaces," PVLDB, vol. 11, no. 8, pp. 906–919, 2018.
[23] R. Zhang, B. C. Ooi, and K. Tan, "Making the pyramid technique robust to query types and workloads," in Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004, pp. 313–324.
[24] J. Heo, Z. L. Lin, and S. Yoon, "Distance encoded product quantization for approximate k-nearest neighbor search in high-dimensional space," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2084–2097, 2019.
[25] Y. Liu, H. Cheng, and J. Cui, "PQBF: I/O-efficient approximate nearest neighbor search by product quantization," in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM), 2017, pp. 667–676.
[26] Y. Liu, J. Cui, Z. Huang, H. Li, and H. T. Shen, "SK-LSH: An efficient index structure for approximate nearest neighbor search," PVLDB, vol. 7, no. 9, pp. 745–756, 2014.
[27] J. Gao, H. V. Jagadish, W. Lu, and B. C. Ooi, "DSH: Data sensitive hashing for high-dimensional k-NN search," in International Conference on Management of Data (SIGMOD), 2014, pp. 1127–1138.
[28] L. Chen, Y. Gao, B. Zheng, C. S. Jensen, H. Yang, and K. Yang, "Pivot-based metric indexing," PVLDB, vol. 10, no. 10, pp. 1058–1069, 2017.
[29] W. Liu, H. Wang, Y. Zhang, W. Wang, and L. Qin, "I-LSH: I/O efficient c-approximate nearest neighbor search in high-dimensional space," in Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE), 2019, pp. 1670–1673.
[30] Y. Hwang, B. Han, and H. Ahn, "A fast nearest neighbor search algorithm by nonlinear embedding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3053–3060.
[31] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin, "SRS: Solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index," PVLDB, vol. 8, no. 1, pp. 1–12, 2014.
[32] Y. Gu, Y. Guo, Y. Song, X. Zhou, and G. Yu, "Approximate order-sensitive k-NN queries over correlated high-dimensional data," IEEE Trans. Knowl. Data Eng., vol. 30, no. 11, pp. 2037–2050, 2018.
[33] F. Nielsen, J. Boissonnat, and R. Nock, "On Bregman Voronoi diagrams," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007, pp. 746–755.
[34] E. Coviello, A. Mumtaz, A. B. Chan, and G. R. G. Lanckriet, "That was fast! Speeding up NN search of high dimensional distributions," in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013, pp. 468–476.
[35] L. Cayton, "Efficient Bregman range search," in Advances in Neural Information Processing Systems 22, 2009, pp. 243–251.
[36] M. R. Ackermann and J. Blömer, "Coresets and approximate clustering for Bregman divergences," in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2009, pp. 1088–1097.
[37] A. Abdullah, J. Moeller, and S. Venkatasubramanian, "Approximate Bregman near neighbors in sublinear time: beyond the triangle inequality," Int. J. Comput. Geometry Appl., vol. 23, no. 4-5, pp. 253–302, 2013.
[38] D. P. L. Ferreira, B. M. Rocha, and C. A. Z. Barcelos, "Nearest neighbor search on total Bregman balls tree," in Proceedings of the Symposium on Applied Computing (SAC), 2017, pp. 120–124.
[39] C. Faloutsos and K. Lin, "FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets," in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 1995, pp. 163–174.
[40] X. Wang, J. T. Wang, K. Lin, D. E. Shasha, B. A. Shapiro, and K. Zhang, "An index structure for data mining and clustering," Knowl. Inf. Syst., vol. 2, no. 2, pp. 161–184, 2000.
[41] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios, "BoostMap: A method for efficient approximate similarity rankings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 268–275.
[42] V. Athitsos, M. Hadjieleftheriou, G. Kollios, and S. Sclaroff, "Query-sensitive embeddings," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005, pp. 706–717.
[43] T. Skopal, "Unified framework for fast exact and approximate search in dissimilarity spaces," ACM Trans. Database Syst., vol. 32, no. 4, p. 29, 2007.
[44] T. Skopal and J. Lokoc, "NM-tree: Flexible approximate similarity search in metric and non-metric spaces," in Database and Expert Systems Applications, 19th International Conference (DEXA), 2008, pp. 312–325.
[45] D. Liu and K. A. Hua, "Transfer non-metric measures into metric for similarity search," in Proceedings of the 17th International Conference on Multimedia, 2009, pp. 693–696.
[46] L. Chen and X. Lian, "Efficient similarity search in nonmetric spaces with local constant embedding," IEEE Trans. Knowl. Data Eng., vol. 20, no. 3, pp. 321–336, 2008.
[47] P. Jangyodsuk, P. Papapetrou, and V. Athitsos, "Optimizing hashing functions for similarity indexing in arbitrary metric and nonmetric spaces," in Proceedings of the 2015 SIAM International Conference on Data Mining (SDM), 2015, pp. 828–836.
[48] K. Goh, B. Li, and E. Y. Chang, "DynDex: A dynamic and non-metric space indexer," in Proceedings of the 10th ACM International Conference on Multimedia, 2002, pp. 466–475.
[49] Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs," CoRR, vol. abs/1603.09320, 2016.
Yang Song received the BE degree in automation from Northeastern University, China, in 2016. He is currently working toward the Ph.D. degree in computer application technology at Northeastern University, China. His research interests include query processing and query optimization.
Yu Gu received his Ph.D. degree in computer software and theory from Northeastern University, China, in 2010. Currently, he is a professor at Northeastern University, China. His current research interests include big data processing, spatial data management and graph data management. He is a senior member of the China Computer Federation (CCF).
Rui Zhang is a Professor at the School of Computing and Information Systems of the University of Melbourne. His research interests include big data and machine learning, particularly in spatial and temporal data analytics, database indexing, chatbots and recommender systems. Professor Zhang has won several awards including the Future Fellowship by the Australian Research Council in 2012, the Chris Wallace Award for Outstanding Research by the Computing Research and Education Association of Australasia (CORE) in 2015, and a Google Faculty Research Award in 2017.