A Model of Optimal Network Structure for Decentralized Nearest Neighbor Search
Alexander Ponomarenko, Irina Utkina, Mikhail Batsyn
National Research University Higher School of Economics
Abstract
One approach to the nearest neighbor search problem is to build a network whose nodes correspond to the given set of indexed objects. In this case the search for the closest object can be thought of as a search for a node in the network. A procedure in a network is called decentralized if it uses only local information about the visited nodes and their neighbors. Networks whose structure allows a decentralized search procedure, started from any node, to perform nearest neighbor search efficiently are of particular interest, especially for purely distributed systems. Several algorithms that construct such networks have been proposed in the literature. However, the following questions arise: “Are there network models in which decentralized search can be performed faster?”; “What are the optimal networks for decentralized search?”; “What are their properties?”. In this paper we give partial answers to these questions. We propose a mathematical programming model for the problem of determining an optimal network structure for decentralized nearest neighbor search. We have found an exact solution for a regular lattice of size 4x4 and heuristic solutions for sizes from 5x5 to 7x7. As distance functions we use the $L_1$, $L_2$ and $L_\infty$ metrics. We hope that our results and the proposed model will initiate the study of optimal network structures for decentralized nearest neighbor search.
Introduction
Nearest neighbor search arises in many fields of computer science. The problem of building a data structure for nearest neighbor search is formulated as follows. Let $D$ be a domain and $d : D \times D \to \mathbb{R}_{[0;+\infty)}$ be a distance function. One needs to preprocess a finite set $X \subseteq D$ so that the search for the object in $X$ closest to any given query $q \in D$ is as fast as possible. Many methods have been proposed. Of particular interest is the case when nearest neighbor search must run in a distributed environment without any central coordination point. For this case a natural approach is to build a network whose nodes correspond to the given set $X$. The search for the closest object can then be thought of as a search for a node in the network. Moreover, a distributed environment, especially in the p2p case, requires that all procedures involved in the search or indexing processes be decentralized. This means that every procedure has only local information about the visited nodes and their neighbors and has no access to information about the whole structure of the network. As a rule, such an approach implies searching via the greedy walk algorithm [1,3,7,8] or its modifications [6,9]. Thus many p2p systems, including DHT protocols [2,4,5], use the same search algorithm but employ different distance functions and have different network structures.
In the present paper we address the problem of optimal network structure for nearest neighbor search (NNS). We emphasize that for any fixed input set there exists an optimal network structure with respect to the chosen search algorithm. To study the properties of such networks, we present a Boolean non-linear mathematical programming model of optimal network structure. The objective is to minimize the expected number of distance computations made by the greedy walk algorithm to find the nearest neighbor for an arbitrary query starting from an arbitrary node.
As a first step we solve this problem for the case when the input set $X$ corresponds to the set of nodes of a two-dimensional regular lattice. We have found an exact solution for size 4x4 and heuristic solutions for sizes from 5x5 to 7x7. As distance functions we use the $L_1$, $L_2$ and $L_\infty$ metrics.
Mathematical formulation
We consider a network as a graph $G(V, E)$ with vertex set $V = X = \{1, \dots, n\}$ and edge set $E \subset V \times V$. Let $d(i, q)$ be the distance between vertex $i$ and query $q$. The neighborhood of vertex $i$ is defined as $N(i) = \{j \in V : (i, j) \in E\}$. We denote by $f_q$ the probability of query $q$ for a discrete domain, and by $f(q)$ the probability density function of a query for a continuous domain.
Decentralized Search Algorithm - Greedy Walk
The goal of the search algorithm is to find the vertex of graph $G$ that is closest to the query (the target vertex), moving from one vertex to another along the edges $E$ of $G$. The search is based on the information related to the vertices. During the search the algorithm can calculate the distance between the query and the vertices it knows. Below is the pseudocode of the greedy walk algorithm.

    GreedyWalk(s ∈ V, q ∈ V)   // s - starting vertex, q - query
        c ← argmin_{y ∈ N(s)} d(y, q)
        if d(c, q) < d(s, q) then
            return GreedyWalk(c, q)
        else
            return s

Starting from vertex $s$, the algorithm calculates the distance $d(y, q)$ between query $q$ and every neighbor $y$ of $s$. After that the algorithm is called recursively for the vertex $c$ closest to $q$. The algorithm stops at a vertex whose neighborhood contains no vertices closer to the query than the vertex itself. The greedy walk algorithm can also be considered as a process of routing a search message in a network. At each step the node (vertex) that has received the message (the message holder) passes it to the neighbor closest to the query according to the function $d$.
Mathematical programming model
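As a concrete reference for the model that follows, the greedy walk pseudocode above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the helper names `lattice_neighbors` and `l1`, the choice of a 3x3 lattice with grid edges only, and the tie-breaking rule (the first listed neighbor wins) are all assumptions made for illustration.

```python
from itertools import product


def l1(a, b):
    """L1 (Manhattan) distance between two lattice points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lattice_neighbors(size):
    """Neighbor lists N(v) for a size x size regular lattice (grid edges only)."""
    nodes = set(product(range(size), repeat=2))
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps if (x + dx, y + dy) in nodes]
        for (x, y) in nodes
    }


def greedy_walk(neighbors, dist, start, query):
    """Iterative form of GreedyWalk(s, q); returns the list of visited vertices."""
    path = [start]
    current = start
    while neighbors[current]:
        # c <- argmin of d(y, q) over y in N(current); ties go to the first neighbor
        closest = min(neighbors[current], key=lambda y: dist(y, query))
        if dist(closest, query) < dist(current, query):
            current = closest
            path.append(current)
        else:
            break  # no neighbor is closer to the query than the current vertex
    return path


path = greedy_walk(lattice_neighbors(3), l1, (0, 0), (2, 2))
print(path)
```

On this lattice every step reduces the L1 distance to the query by exactly one, so the walk from (0, 0) to (2, 2) visits five vertices and stops at the target.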
Not every graph has a structure suitable for searching via greedy walk. In our model we require that the greedy walk, started from an arbitrary vertex, reach the target vertex for any target. In general, this requires the graph to contain the Delaunay (Delone) graph as a subgraph. Similarly to the Kleinberg model [1], in this paper we consider the particular case when the vertices are nodes of a regular lattice with integer coordinates. In this case the Delaunay graph is just the set of edges of the regular lattice. The complexity of the search algorithm is measured as the number of different vertices for which the distance to the query has been calculated. We take this number as the objective function. Equations (1)-(9) define a Boolean non-linear programming formulation of the optimal graph structure problem.
Decision variables
$$x_{ij} = \begin{cases} 1, & \text{if edge } (i, j) \text{ belongs to the solution} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
$$y^{k}_{ij} = \begin{cases} 1, & \text{if vertex } k \text{ belongs to the greedy walk from } i \text{ to } j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
Objective function
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{q \in D} O(i, j_q) \, f_q \quad \text{(discrete domain)} \tag{3a}$$
$$\frac{1}{n} \sum_{i=1}^{n} \int_{D} O(i, j_q) \, f(q) \, dq \quad \text{(continuous domain)} \tag{3b}$$
where
$$j_q = \arg\min_{j = 1, \dots, n} d(j, q) \tag{4}$$
$$O(i, j_q) = \left| \{\, l \in V : \exists k \;\; x_{lk} = 1, \; y^{k}_{ij_q} = 1 \,\} \right| \tag{5}$$
Constraints
$$x_{ii} = 0 \quad \forall i \in V \tag{6}$$
$$y^{i}_{ij_q} = y^{j_q}_{ij_q} = 1 \quad \forall i, j_q \in V \tag{7}$$
$$\sum_{k=1}^{n} x_{lk} \, y^{k}_{ij_q} \ge y^{l}_{ij_q} \quad \forall i, j_q, l \in V \tag{8}$$
$$l^{*} = \arg\min_{l \in V : \, x_{kl} = 1} d(l, q) \;\Rightarrow\; y^{l^{*}}_{ij_q} \ge y^{k}_{ij_q} \quad \forall q \in D, \; \forall i, k \in V \tag{9}$$
Decision variables $x_{ij}$ (1) determine the adjacency matrix of the optimal graph that we want to find. Indicator variables $y^{k}_{ij}$ (2) are used to calculate the number of operations $O(i, j_q)$ performed during the search from vertex $i$ to vertex $j_q$, which is the vertex closest to the query $q$ (the target vertex) (4). In our case it is the number of different vertices for which the distance to the query has been calculated. This is equal to the cardinality of the union of the neighborhoods of the vertices $k$ for which $y^{k}_{ij_q} = 1$ (5).
Since we want to find the optimal graph in the general case (for any starting vertex and any query), our objective is to minimize the average number of operations required for the search algorithm to reach a target vertex (3a, 3b, 4, 5). Constraint (6) guarantees that there are no loops in the graph, and constraint (7) requires GreedyWalk($i$, $j_q$) to start from vertex $i$ and stop at vertex $j_q$. Constraint (8) links variables $x_{ij}$ and $y^{k}_{ij}$ and requires that the greedy walk go through one of the neighbors of vertex $l$ if it goes through $l$. Constraint (9) describes the greedy strategy of the algorithm: if vertex $k$ belongs to the greedy walk from vertex $i$ to vertex $j_q$ ($y^{k}_{ij_q} = 1$), then its neighbor $l^{*}$, the closest to the query $q$ among all its neighbors $l$, must also belong to this greedy walk ($y^{l^{*}}_{ij_q} = 1$). The presented model is applicable to an arbitrary metric space.
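To make the objective concrete, the sketch below evaluates (3a) by brute force for a given graph. It rests on several assumptions that are not taken from the paper: the domain is $D = V$ with uniform $f_q = 1/n$, the metric is $L_1$, the graph is undirected and has no isolated vertices, and ties between equally close neighbors are broken by sorted order. The names `lattice_graph`, `operations` and `objective` are hypothetical. The count $O(i, j_q)$ follows equation (5) literally, as the size of the union of the neighborhoods of the visited vertices.

```python
from itertools import product


def l1(a, b):
    """L1 (Manhattan) distance between two lattice points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lattice_graph(size):
    """Adjacency sets of the size x size lattice; x_ij = 1 becomes 'j in adj[i]'."""
    nodes = list(product(range(size), repeat=2))
    adj = {v: set() for v in nodes}
    for x, y in nodes:
        for w in ((x + 1, y), (x, y + 1)):
            if w in adj:
                adj[(x, y)].add(w)
                adj[w].add((x, y))
    return adj


def operations(adj, dist, start, query):
    """Return (O(start, j_q), vertex where the walk stops), following equation (5)."""
    current, evaluated = start, set()
    while True:
        evaluated |= adj[current]  # distances to all neighbors of a visited vertex
        closest = min(sorted(adj[current]), key=lambda y: dist(y, query))
        if dist(closest, query) >= dist(current, query):
            return len(evaluated), current
        current = closest


def objective(adj, dist):
    """Equation (3a) under the assumption D = V and uniform f_q = 1/n."""
    nodes = list(adj)
    total = 0
    for i, q in product(nodes, repeat=2):
        ops, reached = operations(adj, dist, i, q)
        if reached != q:
            raise ValueError(f"greedy walk from {i} missed target {q}")
        total += ops
    return total / len(nodes) ** 2


grid = lattice_graph(4)
baseline = objective(grid, l1)
grid[(0, 0)].add((3, 3))  # hypothetical long-range edge, added in both directions
grid[(3, 3)].add((0, 0))
with_shortcut = objective(grid, l1)
print(baseline, with_shortcut)
```

Re-evaluating `objective` after adding or removing edges gives a simple way to compare candidate structures on small lattices. It is an evaluation routine only, not a substitute for the branch and bound search or the heuristic.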
In the next section we present the results for the particular case when the vertices are the nodes of a two-dimensional regular lattice and the distance function is $L_1$, $L_2$, or $L_\infty$.
Computational Experiments and Results
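For lattice nodes with integer coordinates, the three distance functions can be written as below. This is a small sketch that assumes the metrics are the standard $L_1$, $L_2$ and $L_\infty$ norms of the coordinate difference. The function names `l1`, `l2` and `linf` are illustrative.

```python
import math


def l1(a, b):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    """L2 (Euclidean) distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linf(a, b):
    """L-infinity (Chebyshev) distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


a, b = (0, 0), (3, 4)
print(l1(a, b), l2(a, b), linf(a, b))
```

Any of these can be passed as the distance function $d$ to a greedy walk routine. The metrics differ in how many neighbors tie at the same distance to the query, so the walk can take a different path under each metric.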
In this work we suppose that the input set corresponds to the nodes of a two-dimensional regular lattice and that the domain is such that all nodes have the same probability of being the nearest neighbor for a query. In this case nearest neighbor search can be thought of as a node discovery procedure: we need to find a given node in the network. Obviously, we can find the optimal graph structure by checking all possible configurations of the set of edges. However, the number of all possible configurations grows as $2^{n^2 - n}$. To find an exact solution we have implemented a branch and bound algorithm. The exact solutions found by this algorithm for the 4x4 regular lattice are presented in Fig. 1. The solutions found by our heuristic are presented in Figs. 2-4.
[Figure panels: (a) $L_1$, (b) $L_2$, (c) $L_\infty$, each annotated with its value of $f$]
Fig. 1. Exact solutions found by our branch and bound algorithm for the 4x4 regular lattice
[Figure panels: (a) $L_1$, (b) $L_2$, (c) $L_\infty$, each annotated with its value of $f$]
Fig. 2. Solutions found by our heuristic for the 5x5 regular lattice
[Figure panels: (a) $L_1$, (b) $L_2$, (c) $L_\infty$, each annotated with its value of $f$]
Fig. 3. Solutions found by our heuristic for the 6x6 regular lattice
[Figure panels: (a) $L_1$, (b) $L_2$, (c) $L_\infty$, each annotated with its value of $f$]
Fig. 4. Solutions found by our heuristic for the 7x7 regular lattice
Conclusion and Future Work
We have proposed a Boolean non-linear programming model to determine an optimal graph structure that minimizes the complexity of nearest neighbor search by the greedy walk algorithm. We have found an exact solution for a regular lattice of size 4x4 and presented the results found by our heuristic for sizes from 5x5 to 7x7 with the three most popular distances: $L_1$, $L_2$ and $L_\infty$. However, we realize that the most important characteristic to be studied is the asymptotic behavior of the objective function. Therefore our future work will focus on improving the efficiency of our exact and heuristic algorithms so that larger instances can be studied. We also plan to develop models describing optimal network structures for approximate nearest neighbor search. We hope that this work will draw attention to the study of graph structures optimal for decentralized nearest neighbor search.
Acknowledgments
This research is conducted in LATNA Laboratory, National Research University Higher School of Economics, and supported by RSF grant 14-41-00039.
References
1. Kleinberg, J. (2000, May). The small-world phenomenon: An algorithmic perspective. In Proceedings of the thirty-second annual ACM symposium on Theory of computing (pp. 163-170). ACM.
2. Maymounkov, P., & Mazieres, D. (2002). Kademlia: A peer-to-peer information system based on the XOR metric. In Peer-to-Peer Systems (pp. 53-65). Springer Berlin Heidelberg.
3. Beaumont, O., Kermarrec, A. M., Marchal, L., & Rivière, É. (2007). VoroNet: A scalable object network based on Voronoi tessellations. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International (pp. 1-10). IEEE.
4. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, (4), 149-160.
5. Rowstron, A., & Druschel, P. (2001, November). Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware 2001 (pp. 329-350). Springer Berlin Heidelberg.
6. Malkov, Y., Ponomarenko, A., Logvinov, A., & Krylov, V. (2014). Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 61-68.
7. Malkov, Y., Ponomarenko, A., Logvinov, A., & Krylov, V. (2012). Scalable distributed algorithm for approximate nearest neighbor search problem in high dimensional general metric spaces. In Similarity Search and Applications (pp. 132-147). Springer Berlin Heidelberg.
8. Beaumont, O., Kermarrec, A. M., & Rivière, É. (2007). Peer to peer multidimensional overlays: Approximating complex structures. In Principles of Distributed Systems (pp. 315-328). Springer Berlin Heidelberg.
9. Ruiz, G., Chávez, E., Graff, M., & Téllez, E. S. (2015). Finding Near Neighbors Through Local Search. In Similarity Search and Applications (pp. 103-109). Springer International Publishing.