Toward Metric Indexes for Incremental Insertion and Querying
Pre-Print (2017)
Edward Raff [email protected]
Laboratory for Physical Sciences
Booz Allen Hamilton
University of Maryland, Baltimore County
Charles Nicholas [email protected]
University of Maryland, Baltimore County
Abstract
In this work we explore the use of metric index structures, which accelerate nearest neighbor queries, in the scenario where we need to interleave insertions and queries during deployment. This use-case is inspired by a real-life need in malware analysis triage, and is surprisingly understudied. Existing literature tends to either focus on only final query efficiency, often does not support incremental insertion, or does not support arbitrary distance metrics. We modify and improve three algorithms to support our scenario of incremental insertion and querying with arbitrary metrics, and evaluate them on multiple datasets and distance metrics while varying the value of k for the desired number of nearest neighbors. In doing so we determine that our improved Vantage-Point tree of Minimum-Variance performs best for this scenario.

Keywords: nearest neighbor, incremental, search, metric index, metric space
1. Introduction
Many applications are built on top of distance metrics and nearest neighbor queries, and have achieved better performance through the use of metric indexes. A metric index is a data structure used to answer neighbor queries that accelerates these queries by avoiding unnecessary distance computations. The indexes we will look at in this work require the use of a valid distance metric (i.e., one that obeys the triangle inequality, symmetry, and indiscernibility) and return exact results. Such indexes can be used to accelerate basic classification and similarity search, as well as many popular clustering algorithms like k-Means (Lloyd, 1982; Kanungo et al., 2002), density based clustering algorithms like DBSCAN (Biçici and Yuret, 2007; Campello et al., 2013), and visualization algorithms like t-SNE (van der Maaten, 2014; Maaten and Hinton, 2008; Tang et al., 2016; Narayan et al., 2015). However, most works assume that the data to be indexed is static, and that there will be no need to update the index over time. Even when algorithms are developed with incremental updates, the evaluation of such methods is not done in such a context. In this work we seek to evaluate metric indexes for the case of incremental insertion and querying. Because these methods are not readily available, we modify three existing indexes to support incremental insertion and querying.

Our interest in this area is particularly motivated by an application in malware analysis, where we maintain a database of known malware of interest. Malware may be inserted into the database with information about malware type, method of execution, suspected origin, or suspected author. When an analyst is given new malware to dissect, the process can be made more efficient if a similar malware sample has already been processed, and so we want to efficiently query the database to retrieve potentially related binaries.
This triaging task is a common problem in malware analysis, often related to malware family detection (Hu et al., 2013; Gove et al., 2014; Walenstein et al., 2007; Jang et al., 2011). Once done, the analyst may decide the binary should be added to the database. In this situation our index would be built once, and have insertions into the database regularly intermixed with queries. This read/write ratio may depend on workload, but is unfortunately not supported by current index structures that support arbitrary distance metrics. This scenario inspires our work to build and develop such indexes, which we test on a wider array of problems than just malware. We do this in part because the feature representations that are informative for malware analysis may change, along with the distance metrics used, and so a system that works with a wide variety of distance measures is appropriate.

To emphasize the importance of such malware triage, we note it is critical from a time saving perspective. Such analysis requires extensive expertise, and it can take an expert analyst upward of 10 hours to dissect a single binary (Mohaisen and Alrawi, 2013). Being able to identify a related binary that has been previously analyzed may yield significant time savings. The scale of this problem is also significant. A recent study of 100 million computers found that 94% of files were unique (Li et al., 2017), meaning exact hashing approaches such as MD5 sums will not help, and similarity measures between files are necessary. In terms of incremental addition of files, in 2014 most anti-virus vendors were adding 2 to 3 million new binaries each month (Spafford, 2014).

Given our motivation, we will review the related work to our own in section 2. We will review and modify three algorithms for incremental insertion and querying in section 3, followed by the evaluation details, datasets and distance metrics in section 4.
Evaluations of our modifications and their impact will be done in section 5, followed by an evaluation of the incremental insertion and querying scenario in section 6. Finally, we will present our conclusions in section 7.
2. Related Work
There has been considerable work in general for retrieval methods based on k nearest neighbor queries, and many of the earlier works in this area did support incremental insertion and querying, but did not support arbitrary distance metrics. One of the earliest methods was the Quad-Tree (Finkel and Bentley, 1974), which was limited to two-dimensional data. This was quickly extended with the kd-tree, which also supported insertions, but additionally supported arbitrary dimensions and deletions as well (Bentley, 1975). However, the kd-tree did not support arbitrary metrics, and was limited to the euclidean and similar distances. Similar work was done for the creation of R-trees, which supported the insertion and querying of shapes, and updating the index should an entry's shape change (Guttman, 1984). However, improving the query performance of R-trees involved inserting points in a specific order, which requires having the whole dataset available from the onset (Kamel and Faloutsos, 1994), and still did not support arbitrary metrics.

The popular ball-tree algorithm was one of the first efforts to devise and evaluate multiple construction schemes, some of which required all the data to be available at the onset, while others could be done incrementally as data became available (Omohundro, 1989). This is similar to our work in that we devise new incremental insertion strategies for two algorithms, though Omohundro (1989) does not evaluate incremental insertions and querying. This ball-tree approach was limited to the euclidean distance, primarily from the use of a mean data-point computed at every node. Other early work that used the triangle inequality to avoid distance computations had this same limitation (Fukunage and Narendra, 1975). While almost all of these early works in metric indexes supported incremental insertion, none contain evaluation of the indexes under the assumption of interleaved insertions and queries.
These works also do not support arbitrary distance metrics.

The first algorithm for arbitrary metrics was the metric-tree structure (Uhlmann, 1991a,b), which used the distance to a randomly selected point to create a binary tree. This was independently developed, slightly extended, and more thoroughly evaluated to become the Vantage-Point tree we explore in this work (Yianilos, 1993). However, these methods did not support incremental insertion. We will modify and further improve the Vantage-Point tree in section 3.

Toward the creation of provable bounds for arbitrary distance metrics, the concept of the expansion constant c was introduced by Karger and Ruhl (2002). The expansion constant is a property of the current dataset under a given metric, and describes a linear relationship between the radius around a point and the number of points contained within that radius. That is to say, if the radius from any arbitrary point doubles, the number of points contained within that radius should increase by at most a constant factor. Two of the algorithms we look at in this work, as discussed in section 3, make use of this property.

The first practical algorithm to make use of the expansion constant was the Cover-tree (Beygelzimer et al., 2006), which showed practical speed-ups across multiple datasets and values of k. Their results were generally shown under L_p norm distances, but also included an experiment using the string edit distance. Later work then simplified the Cover-tree algorithm and improved performance, demonstrating its benefit on a wider variety of datasets and distance metrics (Izbicki and Shelton, 2015). Of the algorithms for metric indexes, the Cover-tree is the only one we are aware of with an incremental construction approach, and so we consider it one of our methods of interest in section 3.
While the Cover-tree construction algorithm is described as an incremental insertion process, the more efficient variant proposed by Izbicki and Shelton (2015) includes a bound which requires the whole dataset in advance to calculate, preventing the efficient interleaving of insertions and queries.

Another algorithm we consider is the Random Ball Cover (RBC), which was designed for making effective use of GPUs with the euclidean distance (Cayton, 2012). Despite testing on only the euclidean distance, the algorithm and proof do not rely on this assumption, and will work with any arbitrary distance metric. We consider the RBC in this work due to its random construction, which allows us to devise an incremental construction procedure
that closely matches the original design and maintains the same performance characteristics. While the Random Ball Cover has inspired a number of GPU based follow-ups (Li and Amenta, 2015; Kim et al., 2013; Gieseke et al., 2014), we do not assume that a GPU will be used in our work.

Li and Malik (2016) develop an indexing scheme that supports incremental updates, but it only works for the euclidean distance. They also do not evaluate the performance as insertions and queries are interleaved.

1. The original Cover-tree did not have this issue, and so would meet our requirements for incremental insertion. We consider the newer variant since it is the most efficient.
3. Metric Indexes Used
Given the existing literature of metric indexes, there appear to be no readily available methods that suit our needs. For this reason we take three algorithms and modify them for incremental index construction and querying. In particular, we adapt the Random Ball Cover, Vantage-Point tree, and Cover-tree algorithms for incremental insertion. As classically presented, the first two methods are not designed for this use case. While the original cover tree algorithm did support incremental insertions, its improved variants do not. More importantly, as we will show in section 5, the Cover-tree has worse than brute-force performance with one of our distance metrics. With our modifications we satisfy three goals that have not yet been achieved in a single data structure:

1. New datapoints can be added to the index at any point
2. We can efficiently query the index after every insertion
3. The index can be efficiently used with any distance metric
Figure 1: Example partitionings for all three algorithms. Red circles indicate the radius out to which one node covers the space. (a) Cover-trees produce a hierarchy of circles, but each node may have a variable number of children. Each node has a radius that upper bounds the distance to all of its children, and circles may partially overlap. (b) Vantage-Point trees divide the space using a hierarchy of circles. The in/outside of each space acts as a hard boundary when subdividing. (c) RBC selects a subset of representatives, and each point is assigned to its nearest representative (relationships marked with a dashed blue line).
While the latter point would seem satisfied by the original Cover-tree algorithm, our results indicate a degenerate case where the Cover-tree performs significantly worse than a brute force search. For this reason we consider it to have not satisfied our goals.

We also contribute improvements to both the Random Ball Cover and Vantage-Point Tree structures that further reduce the number of distance computations needed by improving the rate at which points are pruned out. These improvements can dramatically increase their effective pruning rate, which leads us to alter our conclusions about which method should be used in the general case.

In the below descriptions, we will use S to refer to the set of points currently in the index, and n = |S| as the number of such points. A full review of all details related to the three methods is beyond the scope of this work, but we will provide the details necessary to understand what our contributions are to each approach.

3.1 Cover-tree

The Cover-tree (Beygelzimer et al., 2006) is a popular method for accelerating nearest neighbor queries, and one of the first practical metric indexes to have a provable bound using the expansion constant c (Karger and Ruhl, 2002). The Cover-tree can be constructed in O(c^6 n log n) time, and answer queries in O(c^12 log n) time. Izbicki and Shelton (2015) developed the Simplified Cover Tree, which reduces the practical implementation details and increases efficiency in both runtime and avoiding distance computations. To reproduce the Simplified Cover Tree algorithm without any nearest-neighbor errors, we had to make two slight modifications to the algorithm as originally presented. These adjustments are detailed in section A.

The Cover-tree algorithm, as its name suggests, stores the data as a tree structure where each node represents only one data point and may have any number of children nodes.
The tree is constructed via incremental insertions, which means we require no modifications to the construction algorithm to support our use case. However, at query time it is necessary for each node p in the tree to compute a maxdist, which is the maximum distance from the point represented by node p to any of its descendant nodes. This maxdist value is used at every level of the tree to prune children nodes from the search path. Insertions can cause re-organizations of the tree, resulting in the need to re-compute maxdist bounds. For this reason the Simplified Cover-tree cannot be used to efficiently query the index between consecutive insertions.

Because of the re-balancing and re-organization that occurs during tree construction, it is not trivial to selectively update the maxdist value based on the changes that have occurred. Instead we will use an upper bound on the value of maxdist. Each node in the tree maintains a maximum child radius of the form 2^l, where l is an integer. This also upper bounds the maxdist value of any node by 2^(l+1) (Izbicki and Shelton, 2015). This will allow us to answer queries without having to update maxdist, but results in a loosening of the bound. The performance of this upper bounded version of the Cover-tree we will refer to as Cover_B, and it is more naturally suited to the use case of interleaved insertions and queries.
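To make the substitution concrete: a node at level l guarantees every descendant lies within 2^(l+1) of its point, so a subtree can be skipped whenever even that loose radius cannot beat the current k'th-best distance. A minimal sketch of this pruning test (the function and field names are our own, not the paper's implementation):

```python
def can_prune(dist_q_to_node, level, kth_best_dist):
    """True if the subtree rooted at a Cover-tree node can be skipped.

    dist_q_to_node: d(query, node's point)
    level: the node's integer level l, so 2**(l + 1) upper bounds maxdist
    kth_best_dist: distance to the current k'th-best candidate
    """
    maxdist_upper_bound = 2.0 ** (level + 1)
    # No descendant can be closer than d(q, p) - maxdist; prune if that
    # best case is still worse than a candidate we already have.
    return dist_q_to_node - maxdist_upper_bound > kth_best_dist
```

Replacing maxdist with 2^(l+1) only loosens the subtracted radius, so pruning remains correct but fires less often.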
2. Izbicki and Shelton also introduced a Nearest Ancestor Cover Tree, but we were unable to replicate these results. The reported performance difference between these two variants was not generally large, and so we use only the simplified variant.
3. The maximum number of children is actually bounded by the expansion constant c.

We note as well that this relaxation of the maxdist based bound represents a compromise between the simplified approach proposed by Izbicki and Shelton and the original formulation by Beygelzimer et al. In the latter case, the 2^(l+1) bound is used to prune branches, but all branches are traversed simultaneously. In the former, the maxdist bound is used to descend the tree one branch at a time, and the nearest neighbor found so far is used to prune out new branches. By replacing maxdist with 2^(l+1), we fall somewhere in-between the two approaches: using a looser bound to prune, but still avoiding traversing all branches. In our extensive tests of these algorithms, we discovered two issues with the original specification of the simplified Cover-tree. These are detailed in section A, along with our modifications that restore the Cover-tree's intended behavior.

3.2 Vantage-Point Tree

The Vantage-Point tree (Yianilos, 1993; Uhlmann, 1991a) (VP-tree) is one of the first data structures proposed for accelerating neighbor searches using an arbitrary distance metric. The construction of the VP-tree results in a binary tree, where each node p represents one point from the dataset, the "vantage point". The vantage point splits its descendants into a low and high range based on their distance from the aforementioned vantage point, with half of the child vectors in each range. For each range, we also have a nearest and farthest value, and an example of how these are used is given in Figure 2.
Figure 2: Example of a node in a VP-tree, with the vantage point in the center. The low-near bound is in red, the distance to the point closest to the center. The low-far (blue) and high-near (green) bracket the boundary of the median. No points can fall between these bounds. The farthest away point provides the high-far bound in orange.
This tree structure is built top-down, and iteratively splits the remaining points into two groups at each node in the tree. Rather than continue splitting until each node has no children, there is instead a minimum split size b. This is because there would otherwise be too few points from which we can obtain good low/high bounds. Instead, once the number of datapoints is ≤ b, we create a "bucket" leaf node that stores the points together and uses the distance from each point to its parent node to do additional pruning.

At construction time, since each split is done by breaking the tree in half, the maximum depth of the tree is O(log n) and construction takes O(n log n) time. Assuming the bounds are successful in pruning most branches, the VP-tree then answers queries in O(log n) time.

The bucketing behavior can provide practical runtime performance improvements as well. Some of this comes from better caching behavior, as bucket values will be accessed in a sequential pattern, which avoids search branches that can be more difficult to accurately predict for hardware with speculative execution. This can be done for the VP-tree because its structure is static once created, whereas the Cover-tree cannot create bucket nodes due to the re-balancing done during construction.
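For intuition, the four bounds of Figure 2 let a search decide which children of a node can possibly hold a neighbor within the current search radius τ. A small sketch of that decision (our own naming, assuming d(query, vantage point) has been computed):

```python
def branches_to_search(dist, tau, low_near, low_far, high_near, high_far):
    """Return which children of a VP-tree node may contain a point within
    distance tau of the query, where dist = d(query, vantage point).

    Any such neighbor lies in the annulus [dist - tau, dist + tau] around
    the vantage point, so a branch is visited only if its [near, far]
    shell overlaps that annulus.
    """
    branches = []
    if dist - tau <= low_far and dist + tau >= low_near:
        branches.append("low")
    if dist - tau <= high_far and dist + tau >= high_near:
        branches.append("high")
    return branches
```

The larger the gap between low_far and high_near, the more often one branch falls outside the annulus and is pruned, which motivates the minimum-variance split described later in this section.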
3.2.1 Incremental Insertion

While the Cover-tree required minimal changes since its construction is already incremental, we must define a new method to support such a style for the VP-tree. To support incremental insertions into a VP-tree, we must first find a location with which to store the new datapoint x. This can be done quite easily by descending the tree via the low/high bounds stored for each point, and updating the bounds as we make the traversal. Once we reach a leaf node, x is simply inserted into the bucket list. However, we do not expand the leaf node as soon as its size exceeds b.

Ideally, the bounds will be changed infrequently as we insert new points, and getting a better estimate of the initial bound values should minimize this occurrence. For this reason we expand a bucket once it reaches a size of b^2. This gives us a larger sample size with which to estimate the four bound values. We use the value b^2 as a simple heuristic that follows our intuition that a larger sample is needed for better estimates, allows us to maintain the fast construction time of the VP algorithm, and results in an easy to implement and replicate procedure.

Algorithm 1
Insert into VP-tree
Require: VP-tree root node p, and new datapoint x to insert into the tree.
  while p is not a leaf node do
    dist ← d(x, p.vp)
    if dist < (p.low_far + p.high_near)/2 then
      p.low_far ← max(dist, p.low_far)
      p.low_near ← min(dist, p.low_near)
      p ← p.lowChild
    else
      p.high_far ← max(dist, p.high_far)
      p.high_near ← min(dist, p.high_near)
      p ← p.highChild
  Add x to bucket leaf node p
  if |p.bucket| > b^2 then
    Select a vantage point from p.bucket and create a new split, adding two children nodes to p.
  return

Thus our insertion procedure is given in Algorithm 1, and is relatively simple. Assuming the tree remains relatively balanced, we will have an insertion time of O(log n). This will also maintain the query time of O(log n).

We also introduce a new modification to the VP-tree construction procedure that reduces search time by enhancing the ability of the standard VP-tree search procedure to prune out branches of the tree. This is done by using an extension of the insight from subsubsection 3.2.1, that we want to make our splits only when we have enough information to do so. That is, once we have enough data to make a split, choosing the median distance from the vantage point may not be the smartest split.
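The descent and bound updates of Algorithm 1 can be sketched in Python as follows; the node layout and field names are our assumptions, not the authors' implementation, and the bucket-split step is omitted:

```python
class VPNode:
    """Internal nodes hold a vantage point and shell bounds; leaves hold a bucket."""
    def __init__(self, bucket=None):
        self.vp = None
        self.low_near = float("inf")
        self.low_far = 0.0
        self.high_near = float("inf")
        self.high_far = 0.0
        self.low_child = None
        self.high_child = None
        self.bucket = bucket  # a list for leaf nodes, None for internal nodes

def vp_insert(root, x, d):
    """Descend by the low/high split, widening the stored shell bounds
    along the way, then drop x into a leaf bucket (split step omitted)."""
    p = root
    while p.bucket is None:  # p is an internal node
        dist = d(x, p.vp)
        if dist < (p.low_far + p.high_near) / 2.0:
            p.low_far = max(dist, p.low_far)
            p.low_near = min(dist, p.low_near)
            p = p.low_child
        else:
            p.high_far = max(dist, p.high_far)
            p.high_near = min(dist, p.high_near)
            p = p.high_child
    p.bucket.append(x)
```

Because the descent only widens the bounds it passes through, every stored shell remains valid for later queries.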
Figure 3: Example of how the split can be improved (original split vs. better split), with the vantage point in black and other points sorted by distance to it. Colors correspond to Figure 2.
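The improved split of Figure 3 is chosen by minimizing the weighted variance of the two sides over all split positions. A minimal sketch of that scan, assuming the distances to the vantage point are already sorted (our own code, not the authors' implementation):

```python
def min_variance_split(dists):
    """Return the split index s minimizing
        s * var(dists[:s]) + (n - s) * var(dists[s:])
    for an ascending-sorted list of distances to the vantage point.
    Running sums make the scan O(n) after the O(n log n) sort.
    """
    n = len(dists)
    total = sum(dists)
    total_sq = sum(v * v for v in dists)
    best_s, best_score = None, float("inf")
    left = left_sq = 0.0
    for s in range(1, n):  # keep at least one point on each side
        left += dists[s - 1]
        left_sq += dists[s - 1] ** 2
        right, right_sq = total - left, total_sq - left_sq
        m = n - s
        # s * variance expands to sum(x^2) - sum(x)^2 / s on each side
        score = (left_sq - left * left / s) + (right_sq - right * right / m)
        if score < best_score:
            best_s, best_score = s, score
    return best_s
```

On two well-separated clusters of distances, the minimizer lands in the gap between them rather than at the median.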
Instead, we can use the distribution of points from the vantage point to choose a split that better bifurcates the data based on the distribution. An example of this is given in Figure 3, where the data may naturally form a binary split. This increases the gap between the low_far and high_near bounds, which then allows the search procedure to more easily prune one of the branches.

To do this quickly, so as to minimize any increase in construction time, we borrow from the CART algorithm used to construct a regression tree (Breiman et al., 1984). Given a set of n distances to the vantage-point, we find the split s that minimizes the weighted variance of the two sides

    arg min_s  s · σ²_{1:s} + (n − s) · σ²_{s:n}    (1)

where σ²_{s:n} indicates the variance of the points in the range [s, n) when sorted by distance to the vantage point. Because (1) can be solved with just two passes over the n points (Welford, 1962; Chan et al., 1983), we can solve this quickly with only an incremental increase in runtime.

The original VP-tree selects the median distance of all points from the vantage point. This requires n distance computations, and an O(n) quick-select search. Finding the split of minimum variance still requires n distance computations, so that cost remains unchanged. However, a sort of O(n log n) must be done to find the split of minimum variance.

3.3 Random Ball Cover

The Random Ball Cover (Cayton, 2012) (RBC) algorithm was originally proposed as an accelerating index that would make efficient use of many-core systems, such as GPUs. This was motivated by the euclidean distance metric, which can be computed with high efficiency when computing multiple distances simultaneously. This can be done by exploiting a decomposition of the euclidean distance into matrix operations, for which optimized BLAS routines are readily available.
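As an illustration of that decomposition (a sketch, not Cayton's implementation): since ||x − y||² = ||x||² + ||y||² − 2 x·y, all query-to-database distances reduce to a single matrix product that a BLAS routine can execute:

```python
import numpy as np

def pairwise_euclidean(X, Y):
    """Distances between every row of X and every row of Y via
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y; the dominant cost is
    the single matrix product X @ Y.T handled by BLAS."""
    sq = ((X * X).sum(axis=1)[:, None]
          + (Y * Y).sum(axis=1)[None, :]
          - 2.0 * (X @ Y.T))
    return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negative rounding errors
```

No comparable decomposition exists for an arbitrary metric, which is why this batching advantage does not carry over to our setting.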
To exploit batch processing while also pruning distances, the RBC approach organizes data into large groups and uses the triangle inequality sparingly to prune out whole groups at a time. Compared to the VP and Cover Tree, the RBC algorithm is unique in that it aims to answer queries in O(√n) time and perform construction in O(n√n) time.

The training procedure of the RBC algorithm is to randomly select O(√n) centers from the dataset, and denote that set of points as R. These are the R random balls of the algorithm. Each representative r_i ∈ R will own, or cover, all the datapoints for which it is the nearest representative (i.e., r_i = arg min_{r ∈ R} d(x, r), ∀ x ∈ S \ R), and this owned set is denoted as L_{r_i}. It is expected that each r_i will then own O(√n) datapoints. Querying is done first against the subset of points R, from which many of the representatives are pruned. Then a second query is done against the points owned by the non-pruned representatives. To do this pruning, we need the representatives to be sorted by their distance to the query point q. We will denote this as r_i^(q), which is the i'th nearest representative to q. Pruning for k nearest neighbor queries is then done using two bounds,

    d(q, r_i) < d(q, r_k^(q)) + ψ_{r_i}    (2)

    d(q, r_i) < 3 · d(q, r_k^(q))    (3)

where ψ_{r_i} = max_{x ∈ L_{r_i}} d(r_i, x) is the radius of each representative, such that all its datapoints fall within that radius. Each bound must be true for any r_i to have the k'th nearest neighbor to query q, and the overall procedure is given in Algorithm 2. Theoretically the RBC bounds are interesting in that they provide a small dependency on the expansion constant c of the data, where queries can be answered in O(c^{3/2} √n) time.
This is considerably smaller than the c^12 term in cover trees, but has the larger √n dependence on n instead of logarithmic. However, the RBC proof depends on setting the number of representatives |R| = O(c^{3/2} √n) as well, which we would not know in advance in practice. Instead we will use |R| = √n in all experiments.

Algorithm 2
Original RBC Search Procedure
Require:
Query q, desired number of neighbors k
  Compute sorted order r_i^(q) ∀ r ∈ R by d(r, q)
  FinalList ← ∅
  for all r_i ∈ R do
    if bounds (2) and (3) are True then
      FinalList ← FinalList ∪ L_{r_i}
  k-NN ← BruteForceSearch(q, R ∪ FinalList)    ▷ distances for R do not need to be re-computed
  return k-NN

If our goal was to build a static index, the random selection of R may lead to a sub-optimal selection. It is possible that different representatives will have widely varying numbers of members. For our goal of incrementally adding to an index, this stochastic construction becomes a benefit. Because the representatives are selected randomly without replacement, it is possible to incrementally add to the RBC index while maintaining the same quality of results.

Algorithm 3
Insert into RBC Index
Require:
RBC representatives R, associated lists L_r ∀ r ∈ R, and new datapoint x to add to the RBC.
  Compute sorted order r_i^(x) ∀ r ∈ R by d(r, x)
  L_{r_1^(x)} ← L_{r_1^(x)} ∪ x
  ψ_{r_1^(x)} ← max(d(r_1^(x), x), ψ_{r_1^(x)})    ▷ keep radius information correct
  if ⌈√n⌉² ≠ n then return    ▷ else, expand the R set
  select randomly a datapoint l_new from ⋃_{r∈R} L_r
  let r_old be the representative that owns l_new, i.e., l_new ∈ L_{r_old}
  L_{r_old} ← L_{r_old} \ l_new
  r_new ← l_new
  potentialChildren ← RadiusSearchRBC(r_new, max_{r∈R} ψ_r)
  L_{r_new} ← ∅
  R ← R ∪ r_new
  ψ_{r_new} ← 0
  for all y ∈ potentialChildren do
    let r_y be the representative that owns y
    if d(y, r_y) > d(y, r_new) then    ▷ change ownership
      L_{r_y} ← L_{r_y} \ y
      L_{r_new} ← L_{r_new} ∪ y
      ψ_{r_y} ← max_{z∈L_{r_y}} d(r_y, z)    ▷ update radius info
      ψ_{r_new} ← max(ψ_{r_new}, d(y, r_new))

The details of our approach are given in Algorithm 3. Whenever we add a new datapoint to the index, we find its representative and add it to the appropriate list L. This can be done in O(√n) time, consistent with the query time of RBC. Once the closest representative is found, the radius to the farthest point may need to be updated, which is trivial. For the majority (n − √n) of insertions, this is all the work that needs to be done.

For the remaining √n insertions, the total number of datapoints will reach a size such that we should have a new representative. The new representative will be selected randomly from all the points in S \ R. We can find all the datapoints that may belong to this new representative using a "range" or "radius" search. A radius search is given a query and radius, and returns all datapoints within the specified radius of the query. In this case we give the new representative as the query and specify the range as the maximum ψ_r in the RBC so far.
This is by definition the maximum distance of any point to its representative, so any point that will be owned by the new representative must have a smaller distance. In the worst case scenario, we cannot prune any points using a radius search. This means at most n other points must be considered. But since this scenario can only occur √n times, we maintain the same construction time complexity of O(n√n) in all cases. We can also state that this approach yields an amortized O*(√n) insertion time.

While the original RBC search is fast and efficient on GPUs and similar many-core machines, it is not as efficient for our use case. Our scenario of interleaved insertions and queries means we will be querying with only a few datapoints at a time. This means we will not obtain a large enough group of query points to obtain the batch and SIMD efficiencies that were the original goal of Cayton (2012). Further, when we consider arbitrary distance metrics, we cannot expect the same efficient method of grouping calculations as can be done with the euclidean distance. Thus we have developed an improved querying method for the RBC search to make it more efficient in our incremental insertion and querying scenario. Our improvements to the RBC search procedure can be broken down into three steps.

First, we modify the search to create the k-NN list incrementally as we visit each representative r ∈ R. In particular, we can improve the application of bound (2) by doing this. We note that in (2), the d(q, r_k^(q)) term serves as an upper bound on the distance to the k'th nearest neighbor. By building the k-NN list incrementally, we can instead use the current best candidate for the k'th nearest neighbor as a bound on the distance to the k'th nearest neighbor.
This works intuitively: the true k'th neighbor, if not yet found, must by definition have a smaller distance than our current candidate.

Second, when visiting the points owned by each representative, l ∈ L_r, we can apply this bound again and tighten it further. This is done by replacing the ψ_{r_i} term of (2) with the distance of l to its representative r. Since this distance d(l, r) had to be computed when building the RBC in the first place, these distances can simply be cached at construction, avoiding any additional overhead.

Third, to increase the likelihood of finding the k'th neighbor earlier in the process, we visit the representatives in sorted order by their distance to the query. Because our first modification tightens the bound as we find better k'th candidates, this will accelerate the rate at which we tighten the bound.

The complete updated procedure is given in Algorithm 4. A similar treatment can improve the RBC search procedure for range queries. We note that on lines 2 through 4, we add all the children points of the closest representative L_{r_1^(q)} unconditionally. This satisfies the requirements of the RBC search algorithm's correctness in the k nearest neighbor case, rather than just the one nearest neighbor case. We refer the reader to Cayton (2012) for details. The essence of its purpose is to pre-populate the k-NN list with values for the bounds checks done in lines 8 and 10.

The first step of our new algorithm must still compute the distances for each r_i, and |R| = √n. In addition, we add all the children of the closest representative r_1^(q), which is expected to own O(√n) points. Thus this modified RBC search is still an O(√n) search algorithm. Our work does not improve the algorithmic complexity, but does improve its effectiveness at pruning.
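Put together, the three modifications can be sketched as below. The container layout (parallel lists of owned points and their cached distances to the owning representative) is our assumption for illustration, not the paper's implementation:

```python
import heapq

def rbc_knn_improved(q, reps, lists, radii, cached, k, d):
    """Sketch of the modified RBC k-NN search: representatives are visited
    in sorted order, the current k'th candidate replaces the fixed
    d(q, r_k) bound of (2), and cached construction distances d(l, r)
    tighten the per-point test. Bound (3) is omitted for brevity; without
    it the search only prunes less, it never becomes incorrect."""
    order = sorted(range(len(reps)), key=lambda i: d(q, reps[i]))
    knn = []  # max-heap via negated distances, capped at k entries

    def offer(dist, point):
        if len(knn) < k:
            heapq.heappush(knn, (-dist, point))
        elif dist < -knn[0][0]:
            heapq.heapreplace(knn, (-dist, point))

    def kth():  # current k'th-best distance, +inf until k candidates seen
        return -knn[0][0] if len(knn) == k else float("inf")

    first = order[0]
    offer(d(q, reps[first]), reps[first])
    for p in lists[first]:  # children of the nearest representative, unconditionally
        offer(d(q, p), p)
    for i in order[1:]:  # remaining representatives, nearest first
        qr = d(q, reps[i])
        offer(qr, reps[i])
        if qr < kth() + radii[i]:  # representative-level pruning via psi_r
            for p, d_l_r in zip(lists[i], cached[i]):
                if qr < kth() + d_l_r:  # tightened per-point bound
                    offer(d(q, p), p)
    return sorted(-nd for nd, _ in knn)  # the k smallest distances found
```

Because kth() shrinks as better candidates are found, visiting representatives nearest-first lets the later, farther groups be rejected wholesale.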
4. Datasets and Methodology
We use a number of datasets and distance metrics to evaluate our changes and the efficiency of our incremental addition strategies. For all methods we have confirmed that the correct nearest neighbors are returned compared to a naive brute-force search. Our evaluation will

Algorithm 4
New RBC Search Procedure
Require:
Query q , desired number of neighbors k Compute sorted order r ( q ) i ∀ r ∈ R by d ( r, q ) k -NN ← { r ( q )1 } (cid:46) sorted list implicitly maintains max size of k for all l ∈ L r ( q )1 do (cid:46) Add the children of the nearest representative k -NN ← k -NN ∪ l for i ∈ . . . | R | do (cid:46) visit representatives in sorted order qr ← d ( q, r ( q ) i ) Add tuple r ( q ) i , d ( r ( q ) i , q ) to k -NN if qr < k -NN[k].dist + ψ r i and (3) are True then for all l ∈ L r ( q ) i do if qr < k -NN[k].dist + d ( l, r ( q ) i ) then (cid:46)d ( l, r ( q ) i ) is pre-computed Add tuple l , d ( l, q ) to k -NN return k -NN cover multiple aspects of performance, such as construction time, query time, and the impactof incremental insertions of index efficiency. We will use multiple values of k in the nearestneighbor search so that our results are relevant to multiple use-cases. Toward this end wewill also use multiple datasets and distance metrics to further validate our findings. The approach used in most prior works to evaluate metric indexes is to create the indexfrom all of the data, and then query each datapoint in the index search for the single nearestneighbor (Izbicki and Shelton, 2015). For consistency we replicate this experiment style, butdo not use every datapoint as a query point. This results in worst case O ( n ) runtime forsome of our tests, preventing us from comparing on our larger datasets. Since our interest isin if the index allows for faster queries, we can instead determined this the average pruningefficiency with extreme accuracy by using only small sample of query points. In tests usinga sample of 1000 points for testing, versus using all data points, we found no difference inconclusions or results . Thus we will use 1000 test points in all experiments. 
This will allow us to run any individual test in under a week, and evaluate the insertion-query performance in a more timely manner.
When using various datasets, if the dataset has a standard validation set, it will not be used. Instead, points from the training set will be used for querying. This is done for consistency, since not every dataset has a standard validation or testing set. Our experiments will be performed searching for the k nearest neighbors with k ∈ {1, 5, 25, 100}. Evaluating multiple values of k is often ignored in most works, which focus on the k = 1 case in their experiments (e.g. Izbicki and Shelton, 2015; Cayton, 2012; Yianilos, 1993), or test only a few small values of k (Beygelzimer et al., 2006). This is despite many applications, such as embeddings for visualization (Tarlow et al., 2013; Maaten and Hinton, 2008; van der Maaten, 2014; Tang et al., 2016), using values of k as large as 100. By testing a range of
values for k we can determine if one algorithm is uniformly better for all values of k, or if different algorithms have an advantage in one regime over the others.

4. The largest observed discrepancy was 0.3 percentage points.

To evaluate the impact of incremental index construction on the quality of the final index, each index will be constructed in three different ways. Differences in performance between these three versions of the index will indicate the relative impact that incremental insertions have.
1. Using the whole dataset and performing the classic batch construction method, by which we mean the original index construction process for each algorithm (referred to as batch construction).
2. Using half the dataset to construct an initial index using the classic batch method, and incrementally inserting the second half of the data (referred to as half-batch).
3. Constructing the entire dataset incrementally (referred to as incremental).
For these experiments, the Cover-tree is excluded, as its original batch construction is already incremental (though it does not support efficient queries between insertions). In our results we expect the RBC algorithm to have minimal change in performance, due to the stochastic nature of representative selection. The expected performance impact on the VP-tree is unknown, though we would expect the tree to perform best with batch construction, second best with half-batch construction, and worst when fully incremental. Results will consider the number of distance computations both including and excluding distances performed during index construction. We note that the runtime of all methods and tests correlates directly with the number of distance computations done for our code.
Comparing distance computations is preferred so that we observe the true impact of pruning, rather than the efficiency of micro-optimizations, and is thus comparable to implementations written in other languages.
We will also test the effectiveness of each method when interleaving queries and insertions. This will be evaluated in a manner analogous to common data structures, where we have different possible ratios of reads (queries) and writes (inserts).
Now that we have reviewed how we will evaluate our methods, we will list the datasets and distance metrics used in such evaluations. A summary is presented in Table 1. Datasets and distance metrics were selected to cover a wide range of data and metric types, include common baselines, and ensure that experiments would finish within a one-week execution window.
Our first three datasets all use the familiar euclidean distance (4). The first is the well known MNIST dataset (Lecun et al., 1998), a commonly used benchmark for machine learning in general. Due to its small size we also include a larger version of the dataset, MNIST8m, which contains 8 million points produced by random transformations of the original dataset (Loosli et al., 2007). We also evaluate the Forest Cover Type (Covtype) dataset (Blackard and Dean, 1999), which has historically been used for metric indexes.

d(x, y) = ‖x − y‖    (4)

Dataset                  Samples     Distance Metric
MNIST                    60,000      Euclidean
MNIST8m                  8,000,000   Euclidean
Covtype                  581,012     Euclidean
VxHeaven                 271,095     LZJD
VirusShare5m             5,000,000   LZJD
ILSVRC 2012 Validation   50,000      EMD
IMDB Movie Titles        143,337     Levenshtein
Table 1.
Datasets used in experiments, including the number of points in each dataset and the distance metric used.
Finding nearest neighbors and similar examples is important for malware analysis (Jang et al., 2011; Hu et al., 2009). The VxHeaven corpus has been widely used for research in malware analysis (vxh), and so we use it in our work for measuring the similarity of binaries. VxHeaven contains 271k binaries, but malware datasets are routinely reaching the hundreds of millions to billions of samples. For this reason we also select a random 5 million element set from the VirusShare corpus (Roberts, 2011), which shares real malware with interested researchers. As the distance metric for these datasets, we will use the Lempel-Ziv Jaccard Distance (LZJD) (Raff and Nicholas, 2017a), which was designed for measuring binary similarity and is based upon the Jaccard distance. LZJD uses the Lempel-Ziv algorithm to break a byte sequence up into a set of sub-sequences, and then uses the Jaccard distance (5) to measure the distance between these sets. Recent work has used LZJD for related tasks such as similarity digests for digital forensics, where prior tools could not be accelerated in the same manner since they lacked the distance metric properties (Raff and Nicholas, 2017b).

d(A, B) = 1 − |A ∩ B| / |A ∪ B|    (5)

One of the metrics measured in the original Cover-tree paper was a string edit distance (Beygelzimer et al., 2006). They compared to the dataset and methods used in Clarkson (2002); however, the available data contains only 200 test strings. Instead we use the Levenshtein edit distance on IMDB movie titles (Behm et al., 2011), which contains both longer strings and is three orders of magnitude larger.
The simplified Cover-tree paper evaluated a larger range of distance metrics (Izbicki and Shelton, 2015), including the Earth Mover's Distance (EMD) (Rubner et al., 2000). The EMD provides a distance measure between histograms, and was originally proposed for measuring the similarity of images.
We follow the same procedure as for the "thresholded" EMD (Pele and Werman, 2009), except we use the RGB color space. We use the 2012 validation set of the ImageNet challenge (Russakovsky et al., 2015) for this distance metric, as it is the most computationally demanding metric of the ones we evaluate in this work.
5. Our software did not support the LabCIE color space previously used, and we did not notice any significant difference in results for other color spaces.
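To make the Jaccard-based construction of Equation (5) concrete, a toy version of LZJD can be sketched as below. Note that the real LZJD hashes the sub-sequences and keeps only a min-wise sample of each set for efficiency, so this unsampled version is an illustration only:

```python
def lz_set(data: bytes) -> set:
    """Build a dictionary of sub-sequences via simplified LZ78-style parsing:
    each time the growing window is novel, record it and restart the window."""
    seen, start = set(), 0
    for end in range(1, len(data) + 1):
        sub = data[start:end]
        if sub not in seen:      # first time this sub-sequence occurs
            seen.add(sub)
            start = end          # begin the next sub-sequence
    return seen

def jaccard_distance(a: set, b: set) -> float:
    """Equation (5): d(A, B) = 1 - |A intersect B| / |A union B|."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def lzjd(x: bytes, y: bytes) -> float:
    """Toy LZJD: Jaccard distance between the two Lempel-Ziv sets."""
    return jaccard_distance(lz_set(x), lz_set(y))
```

Identical inputs yield distance 0, disjoint dictionaries yield distance 1, and the measure is symmetric, which is what makes the Jaccard construction a usable metric for the indexes studied here.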
5. Evaluation of Construction Time and Pruning Improvements
We first evaluate the impact of our changes to each of the three algorithms. For RBC andVP-trees, we have made alterations that aim to improve the ability of these algorithms toavoid unnecessary distance computations at query time. For the Cover-tree, we have madea modification that will negatively impact its ability to perform pruning, but will make itviable for interleaved insertions and queries. We will evaluate the impact of our changes onconstruction time, query efficiency under normal construction, and the impact incrementalconstruction has on the efficiency of the complete index.
To determine the impact of the incremental construction and our modifications, we will compare each algorithm in terms of the number of distance computations needed to construct the index. We will do this for all three construction options, batch, half-batch, and incremental, as discussed in section 4. The cost of constructing the indices in these three ways is shown in Figure 4. We note that there is no distinction between the Cover and Cover_B construction times, and that the Cover-tree is always incremental in construction. For this reason we show only one bar to represent Cover and Cover_B across all three construction scenarios, to avoid graph clutter.
Here we see two performance characteristics. On datasets like MNIST, where we use the euclidean distance, RBC is the slowest to construct. This is expected, as it also has the highest complexity, at O(n√n) time. We also note that the RBC radius search is not as efficient at pruning, and fails to do so on most datasets. Only on the datasets that are most accelerated, such as the Covtype dataset, does the RBC incremental construction avoid distance computations during construction. This empirically supports the theoretical justification that we maintain the same construction time for the RBC algorithm, as discussed in subsubsection 3.3.1.
The second slowest to construct is the Cover-tree, followed by the VP-trees, which are fastest. On the VxHeaven dataset, with the LZJD metric, the construction time of the Cover-tree degrades dramatically, using two orders of magnitude more distance computations than the RBC. We believe this performance degradation is an artifact of the expansion constant c that occurs when using the LZJD metric. The VP-tree's construction time has no dependence on c, and the RBC algorithm has only a small dependency on c compared to the Cover-tree's much stronger dependence. On the VirusShare5m dataset, the Cover-tree couldn't be constructed given over a month of compute time.
We also note that the Cover-tree had degraded construction performance on the IMDB Movies dataset using the Levenshtein distance. These results present a potential weakness in the Cover-tree algorithm.
Setting aside the performance behavior of the Cover-tree, both the RBC and VP-tree have more consistent performance across the various datasets. Of particular interest, the incremental construction procedure for the RBC results in almost no change in the number of distance computations needed to build the index. The radius search is rarely able to do any pruning for the RBC algorithm, and so the brute-force approach degrades to the same number of distance computations as the batch insertion. The Covtype dataset is the one for which each algorithm was able to do the most pruning, and thus shows the most pronounced effect of this.

6. The same cannot be said for wall clock time, which is expected.

Figure 4. Construction performance for each algorithm on each dataset. The y-axis represents the number of distance computations performed to build each index. Each algorithm is plotted three times, once using classic batch construction, half-batch, and incremental. The Cover-tree's construction algorithm is equivalent in all scenarios, so only one bar is shown.

The VP_MV variant of the VP-tree also matches the construction profile of the standard VP-tree on each dataset, with slightly increased or decreased computations depending on the dataset. This is to be expected, as the standard VP-tree always produces balanced splits during batch construction. Incremental construction can cause lopsided splits for both the VP and VP_MV trees, which results in a longer recurrence during construction, and thus increased construction time and distances. The VP_MV tree may also encourage such lopsided splits, increasing the occurrence of this behavior. Simultaneously, the incremental construction requires fewer distance computations to determine early splits, and so can result in fewer overall computations if the splits happen to come out near balanced. The data- and metric-dependent properties will determine which effect is stronger in a given case. The impact of incremental construction on the VP-trees is thus variable, and can increase or decrease construction time. In either direction, the change in VP construction time is minor relative to the costs for Cover-trees and the RBC algorithm.
Overall we can draw the following conclusions about construction time efficiency: 1) the VP-trees are fastest in all cases, and the proposed VP_MV variant has no detrimental impact; 2) the RBC algorithms are the most consistent, but often slowest, and the RBC_Imp modification has no detrimental impact; 3) the Cover-tree is not consistent in its performance relative to the other two algorithms, but when it works well, it is in the middle of the road.
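For orientation, the classic VP-tree that both VP variants build on can be sketched as follows. This is the textbook median-split structure with an API of our own devising, not the paper's VP_MV implementation:

```python
import heapq
import random

def build_vp(points, dist, rng):
    """Build a classic VP-tree node as (vantage, mu, near, far): a balanced
    median split on distance to a randomly chosen vantage point."""
    pts = list(points)
    if not pts:
        return None
    v = pts.pop(rng.randrange(len(pts)))  # choose a vantage point
    if not pts:
        return (v, 0.0, None, None)
    pairs = sorted((dist(v, p), p) for p in pts)
    half = len(pairs) // 2
    mu = pairs[half][0]  # split radius: near holds d <= mu, far holds d >= mu
    near = build_vp([p for _, p in pairs[:half]], dist, rng)
    far = build_vp([p for _, p in pairs[half:]], dist, rng)
    return (v, mu, near, far)

def vp_knn(root, q, dist, k):
    """Exact k-NN search with triangle-inequality pruning."""
    heap = []  # max-heap of (-distance, point)

    def visit(node):
        if node is None:
            return
        v, mu, near, far = node
        d = dist(q, v)
        if len(heap) < k:
            heapq.heappush(heap, (-d, v))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, v))
        for child in ((near, far) if d < mu else (far, near)):  # closer side first
            tau = -heap[0][0] if len(heap) == k else float("inf")
            # Near holds d(v,p) <= mu, so it can only help if d - mu < tau;
            # far holds d(v,p) >= mu, so it can only help if mu - d < tau.
            if child is near and d - mu < tau:
                visit(child)
            elif child is far and mu - d < tau:
                visit(child)

    visit(root)
    return sorted((-nd, p) for nd, p in heap)
```

The pruning tests are exactly the low/high bound checks discussed in the text: a branch is skipped only when the triangle inequality proves that none of its points can beat the current k'th-best distance.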
We now look at the impact of our changes to the three search procedures when querying an index built in the standard batch manner. This isolates the change in performance to only our modifications of the three algorithms. Our goal here is to show that RBC_Imp and VP_MV are improvements over the standard RBC and VP-tree methods. We also want to quantify the negative impact of using the looser bounds in Cover_B that allow for incremental insertion and querying, which is not easy to do with the standard simplified Cover-tree due to its use of the maxdist bound and potential restructuring on insertions (Izbicki and Shelton, 2015).

Figure 5. Number of distance computations needed as a function of the desired number of neighbors k. The y-axis is the ratio of distance computations compared to a brute-force search (shown at 1.0 as a dotted black line).

Considering only batch construction, we can see the query efficiency of these methods in Figure 5, where we look at the fraction of distance computations needed compared to a brute-force search. This figure factors in the distance computations needed during construction, so the query efficiency is with respect to the whole process.
We remind the reader that this plot is constructed from a sample of 1000 randomly selected query points, scaled to have the same weight as if all test points were used. That is to say, if a corpus has n data points, we compute the average number of distance computations from a sample of 1000 points. The total number of distance computations is then treated as this average times n. This closely mimics the results that would have been achieved by using all n points as queries, but keeps runtime manageable given our compute resources. In extended testing on corpora where it is feasible to compute this for all n points, just 100 samples reliably estimated the ratio to two significant figures, so our 1000-point estimates should allow us to reach the same conclusions with confidence.
One can see that for the RBC and VP-tree algorithms, our enhancements to the search procedure are effective. For the RBC algorithm in particular, more distance computations were done than a brute-force search in most cases, but RBC_Imp dramatically improves the competitiveness of the approach. This comes at a loss of compute efficiency when using the euclidean metric, which is where the RBC obtains its original speed improvements. But our work looks at the general efficiency of the RBC for arbitrary distance metrics, which may not have the same efficiency advantages when answering queries in batches.
In this respect the pruning improvements of RBC_Imp are dramatic, and important if the RBC algorithm is to be used.
The VP_MV reduces the number of computations needed compared to the standard VP-tree in all cases. The amount of improvement varies by dataset, ranging from almost no improvement to nearly an order of magnitude fewer distance computations on the Covtype dataset. Given these results, our choice to produce unbalanced splits during construction is empirically validated.
As expected, the Cover_B variant of the simplified Cover-tree had a detrimental impact on efficiency, as it relaxes the bound to the one used in the original Cover-tree work (Beygelzimer et al., 2006). Among all tests, the Cover_B tree required 1.6 to 6.7 times as many distance computations as the standard Cover-tree, with the exact values given in Table 2 for all tested values of k. The few extra distance computations spent determining the tighter bound clearly pay for themselves, accounting for a considerable portion of the simplified Cover-tree's improved performance.

Table 2.
For each dataset, this table shows the multiplier on the number of distance computations Cover_B had to perform compared to a normal Cover-tree.

k      MNIST   MNIST8m   ILSVRC   Covtype   IMDB   VxHeaven
1      1.57    6.73      2.07     2.27      1.70   0.97
5      1.38    5.71      1.96     2.16      1.44   0.98
25     1.25    2.75      1.81     1.97      1.29   0.98
100    1.16    2.44      1.67     1.73      1.20   0.98

While the Cover-tree was the most efficient at avoiding distance computations on the MNIST dataset, it is the worst performer by far on the VxHeaven dataset. The increased construction time results in the Cover-tree performing 20% more distance computations than would be necessary with the brute-force approach. We also see an interesting artifact: more distance computations were done on VxHeaven when using the tighter maxdist bound than with the looser Cover_B approach. This comes from the extra computations needed to obtain the maxdist bound in the first place, and indicates that more distance computations are being done to obtain that bound than are saved by more efficient pruning.

Figure 6.
Query performance on the VirusShare5m dataset.
We also note that the VxHeaven dataset, using the LZJD distance, had the worst query performance amongst all datasets, with LZJD barely managing to avoid 5% of the distance computations compared to a brute-force search. By testing on the larger VirusShare5m dataset, as seen in Figure 6, we can see that increasing the corpus size does lead to pruning efficiencies. While the Cover-tree couldn't be built on this corpus, both the RBC and VP algorithms are able to perform reasonably well. The VP_MV did best, avoiding between 57% and 40% of the distance computations a brute-force search would require.
Viewing these results as a whole, we would have to recommend the VP_MV algorithm as the best choice in terms of query efficiency. In all cases it either prunes the most distances for all values of k, or is a close second to the Cover-tree (which has an extreme failure case with LZJD).
For the last part of this section, we examine the impact on query pruning based on how the index was constructed. That is to say, does half-batch or incremental construction of the index negatively impact the ability to prune distance computations, and if so, by how much? Such evaluation will be shown only for the more efficient RBC_Imp and VP_MV algorithms that we will further evaluate in section 6. We do not consider the Cover-tree variants in this portion. As noted in subsection 3.1, the Cover-tree's construction is already incremental. Thus these indexes will be equivalent when given the same insertion ordering. The only change in Cover-tree efficiency would be from random variance caused by changes in insertion order.
The difference between the ratio of distance computations done for Half-Batch (H) and Incremental (I) index construction is shown in Figure 7. That is to say, if r_H = (distance computations with Half-Batch) / (distance computations with Brute Force), and r_B has the same definition but for the Batch construction, then the y-axis of the figure shows r_B − r_H. The same is also plotted for incremental construction, i.e., r_B − r_I.

Figure 7. Difference in the number of distance computations needed as a function of the desired number of neighbors k. The y-axis is the difference in the ratio of distance computations compared to a brute-force search. We note that the scale on the y-axis differs between the sub-figures, and the small scale indicates that incremental construction has little impact on query efficiency.

When this value is near zero, it means that both the Batch and the Half-Batch or Incremental construction approaches have avoided a similar number of distance computations.
We remind the reader that Half-Batch is where the index is constructed using the standard batch construction approach for the first n/2 data-points, and the remaining n/2 are inserted incrementally. Incremental construction builds the index from empty to full using only incremental insertions.
Positive values indicate an increase in the number of distance queries needed. Negative values indicate a reduction in the number of distance queries needed, and are generally an indication of problem variance. That is to say, when the difference in ratios goes negative, it is because the natural variance (caused by insertion order randomness) is greater than the impact of the incremental construction. Such scenarios would generally be considered favorable, as they indicate that our modifications have no particular positive or negative impact.
We first note a general pattern: the difference in query efficiency can go up or down with changes in the desired number of neighbors k. This will be an artifact of both the dataset and distance metric used, and highlights the importance of testing metric structures over a large range of k. Testing over a wide range of k has not historically been done in previous works, which usually perform only the 1-NN search.
In our results we can see that the RBC algorithm performs best in these tests.
The RBC_Imp approach's pruning ability is minimally impacted by changes in construction for all datasets and values of k. The largest increase is on MNIST for k = 1, where the Half-Batch insertion scenario increases from 59.4% to 60.6%, an increase of only 1.2 percentage points. It makes sense that the RBC_Imp approach would have a consistently minimal degradation in query efficiency, as the structure of the RBC is coarse, and our incremental insertion strategy closely matches the behavior of the batch creation strategy.
The VP_MV tree does not perform as well as the RBC_Imp, and we can see that incremental construction always has a larger, but still small, impact on its performance for all datasets. The only case where this exceeds a two percentage point difference is on the MNIST8m dataset, where a larger gap occurs for incremental and half-batch construction. The larger impact on the VP_MV's performance is understandable given that our insertion procedure does not have the same information available for choosing splits, which may cause sub-optimal choices.
Our expectation would be that the VP_MV's performance would degrade more when using incremental (I) insertion rather than half-batch (H), as the half-batch insertion gets to use more datapoints to estimate the split point for nodes higher up in the tree. Our results generally support this hypothesis, with VP_MV (I) causing more distance queries to be performed than the (H) case. However, for MNIST8m, VxHeaven, and ILSVRC, the performance gap is not that large across the tested values of k. This suggests that the loosened bounds during insertion may also be an issue impacting the efficiency after insertions. One possible way to reduce this impact would be to add multiple vantage points dynamically during insertion, to avoid impacting the existing low/high bounds of the VP-tree. Such Multi-Vantage-Point (MVP) trees have been explored previously (Bozkaya and Ozsoyoglu, 1999) in a batch construction context.
We leave research exploiting such extensions to future work.
Regarding the impact on query efficiency given incremental insertions, we can confidently state that the RBC approach is well suited to this part of the problem, with almost no negative impact on efficiency. The VP-tree does not fare quite as well, but is still more efficient than the RBC_Imp algorithm in all of these cases after construction from only incremental insertions.
Overall, we can draw some immediate conclusions with respect to our proposed changes to Cover-trees, VP-trees, and the RBC index. First, VP-trees in general strike a strong balance between construction time cost and query time efficiency across many datasets with differing metrics. For both the RBC and VP-tree, we can improve their query time efficiency across the board. These improvements come with minimal cost, and so we consider them exclusively in section 6, where we look at incremental insertions and querying. We also observe that the Cover-tree is significantly degraded at insertion/construction time when using the LZJD distance.
6. Evaluation of Incremental Insertion-Query Efficiency
At this point we have shown that RBC_Imp and VP_MV are improvements over the original RBC and VP-tree algorithms in terms of query efficiency, with no significant impact on construction time. We have also shown that the indexes they construct are still effective at pruning distance computations, which encourages their use. We can now evaluate their overall effectiveness when we interleave insertions and queries in a single system.
In this section we consider the case of evaluating each index in the context of incremental insertion and querying. Contrasting with the standard scenario, where we build an index and immediately query it (usually for k-nearest neighbor classification, or some similar purpose), we will be building an index and evaluating the number of distance computations performed after construction. This scenario corresponds to many realistic use cases, where a large training set is deployed for use, and new data is added to the index over time.
Given a dataset with n items in it, our evaluation procedure will consider r queries (or "reads") and w insertions (or "writes") to the index. In the naive case, where we perform brute-force search, there is no cost to writing to the index, only when we perform a query. This brute-force approach also represents our baseline for the maximum number of distance computations needed to answer the queries.
Similar to data structures for storing and accessing data and concurrency tools, we may also explore differing ratios of reads to writes. In our experiments we evaluated insert/query ratios from 100:1 to 1:100. In all cases, we found that the most challenging scenario was when we had 100 insertions for each query. This is not surprising, as all of our data structures have a non-zero cost for insertions, which in the case of RBC and Cover-trees can be quite significant. Thus, below we will only present results for the case where we have 100 insertions for each query, and our tests will be limited to 1000 insertions due to runtime constraints.
We construct each initial index on half of the data points, using the batch construction method.
For the Cover-tree, only Cover_B produces reasonable insertion/query performance, as the maxdist bound can't be maintained when re-balancing occurs. Using the original loose bound causes a considerable reduction in efficiency at query time. By recording the multiplicative difference between the tighter-bound Cover-tree and the original looser bound of Cover_B in Table 2, we can plot the performance of an ideal Cover-tree as a function of Cover_B. This gives us a measure of the best possible performance of the Cover-tree in this scenario, as it ignores all overheads of any potential scheme for selectively updating the Cover-tree bound as items are inserted that would cause re-balancing. We will indicate this ideal Cover-tree as Cover_I.
The results of our methods are presented in Figure 8. Amongst the RBC_Imp, VP_MV, and Cover_B algorithms, the VP_MV dominates all other approaches. It successfully avoids the most total distance computations to answer nearest neighbor queries for all values of k on all datasets. This is not surprising given the cumulative results of section 5, which found the VP_MV to require the fewest distance computations during construction and to always be either the most efficient at avoiding distance computations, or close behind the Cover-tree approach.
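As a sketch of how such an interleaved workload can be instrumented, one can wrap the metric to count distance evaluations. The index API (`insert`, `knn`) and the harness below are hypothetical, not the benchmark code used in the paper:

```python
class CountingMetric:
    """Wrap a metric so every distance evaluation is counted; the count is
    the quantity compared across indexes in this kind of experiment."""
    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.dist(a, b)

def interleaved_workload(index, stream, k, writes_per_read=100):
    """Insert writes_per_read points, then issue one k-NN query, repeating
    over the stream; assumes the index exposes insert() and knn()."""
    results = []
    for i, x in enumerate(stream, 1):
        index.insert(x)
        if i % writes_per_read == 0:
            results.append(index.knn(x, k))
    return results
```

Running this harness against a brute-force index gives the baseline count; running it against a metric index shows how many of those computations the index's pruning avoids.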
7. We allowed a maximum of one week of runtime for tests to complete in this scenario.
Figure 8.
Fraction of distance computations needed (relative to the naive approach) in the incremental scenario, with 100 insertions for every query. This does not include initial construction costs, only subsequent insertion costs.
Even if we could obtain the maxdist bound for free, we can see that the Cover_I approach is still not very competitive with the VP_MV tree. While Cover_I does have better performance than VP_MV on some datasets, it trails behind on Covtype by nearly an order of magnitude. This is especially concerning when we consider the failure of the Cover-trees to perform with the LZJD distance on VxHeaven and VirusShare5m. This variability in performance makes the Cover-tree less desirable to use for arbitrary distance metrics.
While the VP_MV appears to be the best overall fit for our task, we note that our RBC_Imp also makes a strong showing despite its O(√n) complexity target instead of O(log(n)). RBC_Imp consistently performs better than random guessing, which can't be said for the Cover-tree. On the more difficult datasets, it is often not far behind the VP_MV tree in performance, though it is an order of magnitude less efficient on the Covtype and ILSVRC datasets. The biggest weakness of the RBC approach is that the incremental insertions have an amortized cost, with the insertion time increasing dramatically every √n insertions in order to expand the representative set. If the number of insertions is known to be bounded, this may be an avoidable cost, thus increasing the RBC's practicality. We note as well that in the case of datasets stored in a distributed index across multiple servers, the RBC's coarse structure may allow for more efficient parallelization. This may be an important factor in future work when we consider datasets larger than what can be stored on a single machine.
While we have modified three algorithms for our scenario of incremental querying and insertion, we note that there is a further unexplored area for improvement in the read/write ratio.
In our case it was most challenging for all algorithms to handle more "writes" per "read", as each insertion required multiple distance computations, while the insertions did not dramatically change the performance at query time. This is in part because we have modified existing algorithms to support this scenario, and so the performance when interleaving insertions and queries closely follows the performance when we evaluate queries with the construction cost included, as we did in section 5.

Of the algorithms we have tested, the VP_MV performs best, with the lowest construction time, and it is almost always the fastest at query time. This is also in the context of evaluation in a single-threaded scenario. When we consider a multi-threaded scenario, the VP_MV can utilize multiple threads for index construction using the batch-construction approach. However, insertion of a single data point cannot easily be parallelized. The Cover-tree shares this challenge.

Our RBC_Imp approach presents a potential advantage over both of these algorithms when we consider the multi-threaded or distributed scenario. As a consequence of how the RBC algorithm achieves its O(√n) insertion and query time, we can readily parallelize line 1 of Algorithm 3 on up to √p processors, requiring only a reduce operation to determine which processor had the closest representative. It may then be more practical than the VP_MV approach for extremely large indexes if sufficient compute resources are available. The downside to the RBC algorithm comes when the representative set must be increased, requiring more work and presenting an insertion cost that will periodically spike.
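As a concrete illustration of this reduce step, the sketch below shards a representative set across a thread pool, has each worker scan its shard for the closest representative, and reduces to the global winner. The function names and the flat list-of-shards layout are our own illustration, not the structure used in Algorithm 3; a distributed deployment would replace the thread pool with cross-server communication.

```python
from multiprocessing.dummy import Pool  # thread pool; a cluster would use RPC/MPI instead


def nearest_representative(shards, x_q, dist):
    """Find the representative closest to query x_q.

    shards: one list of representatives per worker.
    dist:   the metric d(a, b).
    Returns (distance, representative).
    """
    def local_best(shard):
        # Each worker scans only its own shard of the representatives.
        return min((dist(r, x_q), r) for r in shard)

    with Pool(len(shards)) as pool:   # one worker per shard
        local = pool.map(local_best, shards)
    return min(local)                 # the reduce step: closest representative overall
```

Since each worker only touches its own shard, the per-query work drops roughly in proportion to the number of workers, at the cost of one small reduce at the end.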
This periodic spike could be remedied by amortizing the cost of increasing the representative set across the preceding insertions, but we leave this to future work, as we must consider the real-world efficiency of an implementation to determine how practical a solution it would be.

In future work we hope to develop new algorithms that are specifically designed for incremental insertion and querying. We note two potential high-level strategies by which one may develop methods that perform better for read- and write-heavy use cases. We consider these beyond the scope of our current work, which looks at modifying existing algorithms, but they may be fruitful inspiration for specialized methods.

When we have multiple data points inserted before each query, it may become possible to use the index itself to accelerate the insertion process. Say that there will be a set of Z points inserted into the index at a time. We can cluster the members of Z by their density/closeness, and insert each cluster together as a group. One option may be to find the medoid of the group and its radius, which can then be used as a proxy point that represents the group as a whole. One could then insert the sub-groups into the index with a reduced number of distance computations if the triangle inequality can be used to determine that all members of the group belong in the same region of the index. The group may then be dissolved as such macro-level pruning becomes impossible, or reduced into smaller sub-groups to continue the process. The dual-tree query approach (Curtin and Ram, 2014), at a high level, presents a similar strategy for efficiently answering multiple queries at a time.

Another scenario is that insertions into the index will be relatively rare, compared to the number of nearest neighbor queries given to the index. In this case it may be desirable to have the query process itself build and restructure the tree.
This notion is in a similar spirit to splay trees and the union-find algorithm (Tarjan and van Leeuwen, 1984; Tarjan, 1975; Hopcroft and Ullman, 1973). Insertions to the dataset would be placed in a convenient location, and their first distances computed only when a new query is given. Say that x_i was a previously inserted point. Once we have a new query x_q, the distance to the query is obtained for x_i and for x_q's nearest neighbors. If d(x_i, x_q) ≈ c · d(x_q, x^(k)), where x^(k) is x_q's k'th nearest neighbor and c is some constant, we can then infer that x_i should be placed in a similar location in the index. As multiple insertions are performed, we can use these distances with respect to the query to determine which points are related and should be kept close in the index.
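A minimal sketch of this deferred-placement idea follows, assuming a flat list of (point, region) pairs stands in for a real tree and that adopting the nearest neighbor's region is an acceptable placement; all names here are hypothetical illustrations, not part of any algorithm in this paper.

```python
def query_and_place(index, pending, x_q, k, dist, c=2.0):
    """Answer a k-NN query, then adopt pending points near the query's region.

    index:   list of (point, region_id) pairs standing in for a real tree.
    pending: points inserted since the last query, not yet placed.
    dist:    the metric d(a, b).
    """
    # Brute-force k-NN over the indexed points (a real index would prune).
    ranked = sorted(index, key=lambda pr: dist(pr[0], x_q))
    neighbors = ranked[:k]
    d_k = dist(neighbors[-1][0], x_q)  # distance to the k-th neighbor

    still_pending = []
    for x_i in pending:
        # Reuse the query's distances: if d(x_i, x_q) <= c * d(x_q, x^(k)),
        # x_i belongs in the same region of the index as x_q's neighbors.
        if dist(x_i, x_q) <= c * d_k:
            index.append((x_i, neighbors[0][1]))  # adopt nearest neighbor's region
        else:
            still_pending.append(x_i)             # keep waiting for a closer query
    pending[:] = still_pending
    return [p for p, _ in neighbors]
```

Pending points far from every query simply wait, so placement work is only spent where queries show the index is actually being read.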
7. Conclusions and Future Work
We have now evaluated and improved three different algorithms, Cover-trees, Vantage-Point trees, and Random Ball Covers, for the use case of incremental insertions and querying. We have significantly improved the query efficiency of the latter two with our new RBC_Imp and VP_MV variants, and introduced schemes to incrementally add to these collections. Evaluation of all these methods was done with a number of datasets of varying sizes using four different distance metrics. In doing so, we can conclude that the VP_MV tree provides the best overall performance for our task. It requires the fewest distance computations during construction, is consistently one of the fastest at query time, and this balance produces the best overall results when interleaving insertions and queries.

While already successful, the VP_MV tree still has room for improvement. It has the highest degradation in performance from insertions, which could perhaps be remedied by a smarter update algorithm or the use of multiple vantage points. While the Cover_B algorithm could be improved by obtaining a better alternative bound than maxdist, it appears that obtaining a computationally cheaper version of the maxdist bound itself is not sufficient to remedy the performance gap when using the LZJD distance.

References
VX Heaven. URL https://vxheaven.org/.
A. Behm, C. Li, and M. J. Carey. Answering Approximate String Queries on Large Data Sets Using External Memory. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, pages 888–899, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-1-4244-8959-6. doi: 10.1109/ICDE.2011.5767856. URL http://dx.doi.org/10.1109/ICDE.2011.5767856.
J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 18(9):509–517, 9 1975. ISSN 0001-0782. doi: 10.1145/361002.361007. URL http://doi.acm.org/10.1145/361002.361007.
A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International Conference on Machine Learning, pages 97–104, New York, 2006. ACM.
E. Biçici and D. Yuret. Locally scaled density based clustering. In B. Beliczynski, A. Dzielinski, M. Iwanowski, and B. Ribeiro, editors, Adaptive and Natural Computing Algorithms, pages 739–748, Warsaw, Poland, 2007. Springer-Verlag.
J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999. ISSN 0168-1699. doi: 10.1016/S0168-1699(99)00046-0.
T. Bozkaya and M. Ozsoyoglu. Indexing large metric spaces for similarity search queries.
ACM Transactions on Database Systems (TODS), 24(3):361–404, 1999. URL http://dl.acm.org/citation.cfm?id=328959.
L. Breiman, J. Friedman, C. J. Stone, and R. Olshen. Classification and Regression Trees. CRC Press, 1984.
R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-Based Clustering Based on Hierarchical Density Estimates. In J. Pei, V. Tseng, L. Cao, H. Motoda, and G. Xu, editors, Advances in Knowledge Discovery and Data Mining, pages 160–172. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-37455-5. doi: 10.1007/978-3-642-37456-2_14. URL http://link.springer.com/10.1007/978-3-642-37456-2_14.
L. Cayton. Accelerating Nearest Neighbor Search on Manycore Systems. IEEE 26th International Parallel and Distributed Processing Symposium, pages 402–413, 5 2012. doi: 10.1109/IPDPS.2012.45. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6267877.
T. F. Chan, G. H. Golub, and R. J. LeVeque. Algorithms for Computing the Sample Variance: Analysis and Recommendations. The American Statistician, 37(3):242, 8 1983. ISSN 0003-1305. doi: 10.2307/2683386.
K. L. Clarkson. Nearest neighbor searching in metric spaces: Experimental results for sb(S). Technical report, Bell Laboratories, Lucent Technologies, New Jersey, 2002.
R. R. Curtin and P. Ram. Dual-tree Fast Exact Max-kernel Search. Statistical Analysis and Data Mining, 7(4):229–253, 8 2014. ISSN 1932-1864. doi: 10.1002/sam.11218. URL http://dx.doi.org/10.1002/sam.11218.
R. A. Finkel and J. L. Bentley. Quad Trees a Data Structure for Retrieval on Composite Keys.
Acta Informatica, 4(1):1–9, 3 1974. ISSN 0001-5903. doi: 10.1007/BF00288933. URL http://dx.doi.org/10.1007/BF00288933.
K. Fukunaga and P. Narendra. A Branch and Bound Algorithm for Computing k-Nearest Neighbors. IEEE Transactions on Computers, C-24(7):750–753, 1975. ISSN 0018-9340. doi: 10.1109/T-C.1975.224297. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1672890.
F. Gieseke, J. Heinermann, C. Oancea, and C. Igel. Buffer K-d Trees: Processing Massive Nearest Neighbor Queries on GPUs. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML ’14, pages I–172–I–180. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3044826.
R. Gove, J. Saxe, S. Gold, A. Long, and G. Bergamo. SEEM: A Scalable Visualization for Comparing Multiple Large Sets of Attributes for Malware Analysis. In Proceedings of the Eleventh Workshop on Visualization for Cyber Security, VizSec ’14, pages 72–79, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2826-5. doi: 10.1145/2671491.2671496. URL http://doi.acm.org/10.1145/2671491.2671496.
A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, New York, NY, 1984. ACM. ISBN 0897911288. URL http://dl.acm.org/citation.cfm?id=602266.
J. E. Hopcroft and J. D. Ullman. Set Merging Algorithms. SIAM Journal on Computing, 2(4):294–303, 12 1973. ISSN 0097-5397. doi: 10.1137/0202024. URL http://epubs.siam.org/doi/10.1137/0202024.
X. Hu, T.-c. Chiueh, and K. G. Shin. Large-scale Malware Indexing Using Function-call Graphs. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS ’09, pages 611–620, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-894-0. doi: 10.1145/1653662.1653736. URL http://doi.acm.org/10.1145/1653662.1653736.
X. Hu, K. G. Shin, S. Bhatkar, and K. Griffin. MutantX-S: Scalable Malware Clustering Based on Static Features. In
Presented as part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 187–198, San Jose, CA, 2013. USENIX. ISBN 978-1-931971-01-0.
M. Izbicki and C. R. Shelton. Faster Cover Trees. In Proceedings of the Thirty-Second International Conference on Machine Learning, volume 37, 2015.
J. Jang, D. Brumley, and S. Venkataraman. BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis. In Proceedings of the 18th ACM conference on Computer and communications security - CCS, pages 309–320, New York, New York, USA, 2011. ACM Press. ISBN 9781450309486. doi: 10.1145/2046707.2046742. URL http://dl.acm.org/citation.cfm?doid=2046707.2046742.
I. Kamel and C. Faloutsos. Hilbert R-tree: An Improved R-tree Using Fractals. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 500–509, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. ISBN 1-55860-153-8. URL http://dl.acm.org/citation.cfm?id=645920.673001.
T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002. ISSN 0162-8828. doi: 10.1109/TPAMI.2002.1017616.
D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-restricted Metrics. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 741–750, New York, NY, USA, 2002. ACM. ISBN 1-58113-495-9. doi: 10.1145/509907.510013. URL http://doi.acm.org/10.1145/509907.510013.
J. Kim, S.-G. Kim, and B. Nam. Parallel multi-dimensional range query processing with R-trees on GPU. Journal of Parallel and Distributed Computing, 73(8):1195–1207, 2013. ISSN 0743-7315. doi: 10.1016/j.jpdc.2013.03.015.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 11 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
B. Li, K. Roundy, C. Gates, and Y. Vorobeychik. Large-Scale Identification of Malicious Singleton Files. In Proceedings of the Seventh ACM Conference on Data and Application Security and Privacy, CODASPY ’17, 2017.
K. Li and J. Malik. Fast k-Nearest Neighbour Search via Dynamic Continuous Indexing. In Proceedings of The 33rd International Conference on Machine Learning, pages 671–679, 2016. URL http://arxiv.org/abs/1512.00442.
S. Li and N. Amenta. Brute-Force k-Nearest Neighbors Search on the GPU. In Proceedings of the 8th International Conference on Similarity Search and Applications - Volume 9371, SISAP 2015, pages 259–270, New York, NY, USA, 2015. Springer-Verlag New York, Inc. ISBN 978-3-319-25086-1. doi: 10.1007/978-3-319-25087-8_25. URL http://dx.doi.org/10.1007/978-3-319-25087-8_25.
S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 3 1982. ISSN 0018-9448. doi: 10.1109/TIT.1982.1056489. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1056489.
G. Loosli, S. Canu, and L. Bottou. Training Invariant Support Vector Machines using Selective Sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA, 2007. URL http://leon.bottou.org/papers/loosli-canu-bottou-2006.
L. V. D. Maaten and G. Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
A. Mohaisen and O. Alrawi. Unveiling Zeus: Automated Classification of Malware Samples. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion, pages 829–832, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2038-2. doi: 10.1145/2487788.2488056. URL http://doi.acm.org/10.1145/2487788.2488056.
K. Narayan, A. Punjani, and P. Abbeel. Alpha-Beta Divergences Discover Micro and Macro Structures in Data. In
Proceedings of The 32nd International Conference on Machine Learning, pages 796–804, 2015.
S. M. Omohundro. Five Balltree Construction Algorithms. Technical report, International Computer Science Institute, 1989.
O. Pele and M. Werman. Fast and robust Earth Mover’s Distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467. IEEE, 9 2009. ISBN 978-1-4244-4420-5. doi: 10.1109/ICCV.2009.5459199. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5459199.
E. Raff and C. Nicholas. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17, pages 1007–1015, New York, New York, USA, 2017a. ACM Press. ISBN 9781450348874. doi: 10.1145/3097983.3098111. URL http://dl.acm.org/citation.cfm?doid=3097983.3098111.
E. Raff and C. K. Nicholas. Lempel-Ziv Jaccard Distance, an Effective Alternative to Ssdeep and Sdhash. arXiv preprint arXiv:1708.03346, 8 2017b. URL https://arxiv.org/abs/1708.03346.
J.-M. Roberts. Virus Share, 2011. URL https://virusshare.com/.
Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2):99–121, 2000. ISSN 0920-5691. doi: 10.1023/A:1026543900054. URL http://link.springer.com/10.1023/A:1026543900054.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
E. C. Spafford. Is Anti-virus Really Dead? Computers & Security, 44:iv, 2014. ISSN 0167-4048. doi: http://dx.doi.org/10.1016/S0167-4048(14)00082-0.
J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing Large-scale and High-dimensional Data. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, pages 287–297, Republic and Canton of Geneva, Switzerland, 2016. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-4143-1. doi: 10.1145/2872427.2883041. URL http://dx.doi.org/10.1145/2872427.2883041.
R. E. Tarjan. Efficiency of a Good But Not Linear Set Union Algorithm.
Journal of the ACM (JACM), 22(2):215–225, 4 1975. ISSN 0004-5411. doi: 10.1145/321879.321884. URL http://doi.acm.org/10.1145/321879.321884.
R. E. Tarjan and J. van Leeuwen. Worst-case Analysis of Set Union Algorithms. Journal of the ACM (JACM), 31(2):245–281, 3 1984. ISSN 0004-5411. doi: 10.1145/62.2160. URL http://doi.acm.org/10.1145/62.2160.
D. Tarlow, K. Swersky, L. Charlin, I. Sutskever, and R. Zemel. Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 199–207, Atlanta, Georgia, USA, 2013. PMLR. URL http://proceedings.mlr.press/v28/tarlow13.html.
J. K. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 11 1991a. ISSN 0020-0190. doi: 10.1016/0020-0190(91)90074-R. URL http://linkinghub.elsevier.com/retrieve/pii/002001909190074R.
J. K. Uhlmann. Implementing Metric Trees to Satisfy General Proximity / Similarity Queries. Technical report, Naval Research Laboratory, Washington, D.C., 1991b.
L. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014. URL http://jmlr.org/papers/v15/vandermaaten14a.html.
A. Walenstein, M. Venable, M. Hayes, C. Thompson, and A. Lakhotia. Exploiting similarity between variants to defeat malware. In Proc. BlackHat DC Conf, 2007.
B. P. Welford. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3):419, 8 1962. ISSN 0040-1706. doi: 10.2307/1266577.
P. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In
Proceedings of the fourth annual ACM-SIAM Symposium on Discrete Algorithms, pages 311–321. Society for Industrial and Applied Mathematics, 1993. URL http://dl.acm.org/citation.cfm?id=313789.

Appendix A. Corrections to Simplified Cover Tree
We encountered two difficulties in replicating the simplified cover tree results of Izbicki and Shelton (2015). We detail these two issues and their remediations in this section for completeness and reproducibility. In the algorithm descriptions below we use the same terminology and description as the algorithm's original paper, but note our changes in green.

We now review some of the properties needed to understand our corrections. The simplest such property is that each node p in the Cover-tree has an associated level l, which we can obtain as l = level(p). Each child c_p of p must also satisfy the property that level(p) = level(c_p) + 1.

Using a node's level, we can define its coverdist as coverdist(p) = 2^level(p). Each child c_p of p will satisfy the covering invariant property, d(c_p, p) ≤ coverdist(p), ∀ c_p ∈ children(p).

We also must make use of the maxdist bound discussed in subsection 3.1, which we make more explicit as: maxdist(p) = max_{d_p ∈ descendants(p)} d(d_p, p). This is the maximum distance from a node p to any descendant node of p. If p is a leaf node, meaning it has no children, then maxdist(p) = 0.

A.1 Nearest Neighbor Correction
We present the revised nearest neighbor search procedure for the simplified Cover-tree in Algorithm 5. The green d(x, q) term was originally presented as d(y, q). We show that this is not correct using a simple counterexample with scalar node values and the Euclidean distance.

Algorithm 5
Cover Tree Find Nearest Neighbor
Require: cover tree p, query point x, nearest neighbor so far y
  if d(p, x) < d(y, x) then
    y ← p
  for each child q of p sorted by distance to x do
    if d(y, x) > d(x, q) − maxdist(q) then    ▷ Original paper used d(y, q)
      y ← findNearestNeighbor(q, x, y)
  return y

Consider the Cover-tree with root α, which stores the value 5. α has one child, β, which has the value −2. This is the whole tree. We would begin on line one of the algorithm, with p ← α, and we will use a query point x with a value of 0. d(p, x) is 5, and we have no nearest neighbor so far, so y ← p (which is α) becomes the nearest neighbor so far.

We will obtain q ← β, as it is the only child of α, which leads us to evaluate the original expression d(y, x) > d(y, q) − maxdist(q), where d(y, x) = 5, d(y, q) = 5 − (−2) = 7, and maxdist(q) = 0.

Because 5 > 7 is false, the if statement fails, and we then break from the loop, returning y as the nearest neighbor to x with a distance of 5. But x's value is 0 and β's is −2, which is a distance of only two away.

A.2 Insertion Correction
We also provide a correction to the insertion procedure of the simplified Cover-tree. Our fixed version is presented in Algorithm 6, with the green text indicating statements added to the algorithm.

The issue with the original procedure occurs when an outlier x is inserted into the index, whose distance to every point in the dataset is larger than the largest pairwise distance between any two points in the existing Cover-tree. This is because coverdist(p) ≥ maxdist(p) in all cases. If x is farther away than the maximum pairwise distance, then the simple bound on line four may be true for all points in a valid cover tree. This means the loop will never exit, and will simply continue re-structuring the tree in search of a non-existent node that can satisfy the loop condition.

We fix this by keeping track of the points visited in the tree, and only looping while there is a potential candidate remaining. If no such candidate exists because we have visited all possible leaf nodes, the loop must exit so that the outlier may be inserted as the new root of the tree.
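The visited-set logic can be sketched as follows before giving the formal pseudocode; the Node class, the covdist function passed in by the caller, and the helper names are our own illustration of the termination fix, not the paper's implementation.

```python
class Node:
    def __init__(self, point, level, children=None):
        self.point, self.level = point, level
        self.children = children or []

    def descendants(self):
        # Yield every node below self, depth-first.
        for c in self.children:
            yield c
            yield from c.descendants()


def remove_any_leaf(p, z):
    """Detach and return a leaf of p that is not already in z, else None."""
    for node in [p, *p.descendants()]:
        for c in list(node.children):
            if not c.children and c not in z:
                node.children.remove(c)
                return c
    return None


def reroot_or_give_up(p, x, covdist, dist):
    """The corrected loop: stop once every candidate leaf has been tried."""
    z = set()  # leaves already promoted to the root
    while dist(p.point, x) > covdist(p) and sum(1 for _ in p.descendants()) > len(z):
        q = remove_any_leaf(p, z)
        if q is None:          # every remaining candidate was already visited
            break
        z.add(q)
        q.children = [p]       # q becomes the new root with p as its only child
        p = q
    if dist(p.point, x) > covdist(p):
        return Node(x, p.level + 1, [p])   # the outlier becomes the new root
    return p
```

With a covdist that never grows (modeling the pathological case described above), the loop exits after each leaf has been promoted once, and the outlier becomes the new root instead of the loop running forever.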
Algorithm 6
Simplified Cover Tree Insertion
procedure insert(cover tree p, data point x)
  if d(p, x) > covdist(p) then
    z ← ∅
    while d(p, x) > covdist(p) and |descendants(p)| > |z| do
      Remove any leaf q from p \ z, and add q to z
      p′ ← tree with root q and p as only child
      p ← p′
    return tree with x as root and p as only child
  return insert_(p, x)

procedure insert_(cover tree p, data point x)
  for all q ∈ children(p) do
    if d(q, x) ≤ covdist(q) then
      q′ ← insert_(q, x)
      p′ ← p with child q replaced with q′
      return p′
  return p with x added as a child

From a practical implementation perspective, we note two additional choices. First, rather than attempt to remove leaf nodes in the form specified above, it is easier to define a specific leaf-removal order and leaf-insertion order. For example, if one always removes the least recently added leaf node, we will obtain a consistent ordering of the leaf nodes as we iterate line four of the algorithm. This makes it easy to use simple cycle detection to determine that all possible children have been visited, and then to escape the loop when this occurs.

To speed up the insertion of outlier points, we also note that the covering invariant can be used to catch extreme outliers. If d(p, x) > 2 · covdist(p), then we can skip the loop entirely and proceed directly to line eight of the algorithm. This bound is easy to see, as covdist(p) ≥ maxdist(p). Assume that there exists a descendant point γ that is maximally far from p. Let ζ be the point maximally far from γ, and let d(γ, ζ) be the maximal pairwise distance over all points in the Cover-tree. Direct application of the triangle inequality gives us

d(γ, ζ) ≤ d(γ, p) + d(p, ζ)

This bounds the distance between these points by their distance to the root. The covering invariant tells us that coverdist(p) ≥ maxdist(p).
Therefore it must be the case that

d(γ, ζ) ≤ coverdist(p) + coverdist(p)

which reduces to the bound d(γ, ζ) ≤ 2 · coverdist(p). Thus if a new point violates this bound, we know that no point in the whole tree can satisfy the loop on line 4.
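As a sanity check on the A.1 counterexample, the two pruning rules can be compared directly on the scalar tree described there (root α = 5, child β = −2, query x = 0); the helper below is our own brute-force rendering of one loop iteration of Algorithm 5, not a full Cover-tree implementation.

```python
def find_nn(root, children, x, bound):
    """One pass of Algorithm 5's loop on scalar values.

    children: list of (value, maxdist) pairs for the root's children.
    bound:    the pruning term, either d(y, q) (original) or d(x, q) (fixed).
    """
    y = root                                    # best-so-far starts at the root
    for q, maxdist_q in sorted(children, key=lambda c: abs(c[0] - x)):
        if abs(y - x) > bound(x, y, q) - maxdist_q:
            if abs(q - x) < abs(y - x):         # leaf case: recursion bottoms out
                y = q
    return y


children = [(-2.0, 0.0)]    # beta stores -2 and is a leaf, so maxdist(beta) = 0

# Original rule d(y, q): 5 > 7 - 0 is false, so beta is pruned and 5 is returned.
wrong = find_nn(5.0, children, 0.0, bound=lambda x, y, q: abs(y - q))
# Corrected rule d(x, q): 5 > 2 - 0 is true, so beta is visited and -2 returned.
right = find_nn(5.0, children, 0.0, bound=lambda x, y, q: abs(x - q))
```

The original rule misses the true nearest neighbor β exactly as the counterexample predicts, while the corrected rule finds it.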