Publications


Featured research published by Leonid Boytsov.


Software: Practice and Experience | 2015

Decoding billions of integers per second through vectorization

Daniel Lemire; Leonid Boytsov

In many important applications, such as search engines and relational database systems, data are stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce the costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and single-instruction, multiple-data (SIMD) instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128* that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128* saves up to 2 bits/int. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.
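The binary-packing idea at the heart of schemes like SIMD-BP128 can be sketched in scalar form. This is purely illustrative (the real scheme packs 128-integer blocks with SSE instructions and a far more careful layout); the function names below are hypothetical, not from the paper's code:

```python
def pack_block(values, bits):
    """Pack integers into one bitstream at a fixed bit width per value."""
    buf = 0
    for i, v in enumerate(values):
        assert v < (1 << bits), "value does not fit in the chosen width"
        buf |= v << (i * bits)
    return buf

def unpack_block(buf, bits, count):
    """Decode `count` fixed-width integers from the packed bitstream."""
    mask = (1 << bits) - 1
    return [(buf >> (i * bits)) & mask for i in range(count)]

def bit_width(values):
    """Binary packing picks the smallest width that fits the block maximum."""
    return max(v.bit_length() for v in values) or 1

# Typical use: delta-encode sorted doc IDs first, then pack the small gaps.
deltas = [3, 1, 4, 1, 5, 9, 2, 6]
b = bit_width(deltas)            # 4 bits per gap suffice here
packed = pack_block(deltas, b)
assert unpack_block(packed, b, len(deltas)) == deltas
```

Because gaps between sorted integers are usually small, the per-block width `b` is far below 32, which is where the "bits/int" savings quoted above come from.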


Meeting of the Association for Computational Linguistics | 2014

Metaphor Detection with Cross-Lingual Model Transfer

Yulia Tsvetkov; Leonid Boytsov; Anatole Gershman; Eric Nyberg; Chris Dyer

We show that it is possible to reliably discriminate whether a syntactic construction is meant literally or metaphorically using lexical semantic features of the words that participate in the construction. Our model is constructed using English resources, and we obtain state-of-the-art performance relative to previous work in this language. Using a model transfer approach by pivoting through a bilingual dictionary, we show our model can identify metaphoric expressions in other languages. We provide results on three new test sets in Spanish, Farsi, and Russian. The results support the hypothesis that metaphors are conceptual, rather than lexical, in nature.


Software: Practice and Experience | 2016

SIMD compression and the intersection of sorted integers

Daniel Lemire; Leonid Boytsov; Nathan Kurz

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple-data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded 32-bit integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32-bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.
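SIMD GALLOPING builds on classic galloping (exponential) search for intersecting sorted lists of very different lengths. A scalar sketch, with hypothetical function names, looks roughly like this:

```python
from bisect import bisect_left

def gallop(arr, lo, target):
    """Exponential search: first index >= target, starting from position lo.
    Step sizes double until we overshoot, then binary-search the bracket."""
    step, hi = 1, lo
    while hi < len(arr) and arr[hi] < target:
        lo = hi
        hi += step
        step *= 2
    return bisect_left(arr, target, lo, min(hi + 1, len(arr)))

def intersect_galloping(small, large):
    """Intersect two sorted lists by galloping through the larger one."""
    out, pos = [], 0
    for x in small:
        pos = gallop(large, pos, x)
        if pos < len(large) and large[pos] == x:
            out.append(x)
    return out
```

The SIMD variant in the paper replaces the scalar comparisons with instructions that compare four 32-bit integers at once; the control flow above is only the skeleton.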


Similarity Search and Applications | 2013

Engineering Efficient and Effective Non-metric Space Library

Leonid Boytsov; Bilegsaikhan Naidan

We present a new similarity search library and discuss a variety of design and performance issues related to its development. We adopt the position that engineering is as important as algorithm design and pursue the goal of producing realistic benchmarks. To this end, we pay attention to various performance aspects and utilize modern hardware, which provides a high degree of parallelization. Because we focus on realistic measurements, the performance of a method should not be measured merely by the number of distance computations performed: other costs, such as computing a cheaper distance function that approximates the original one, are oftentimes substantial. The paper includes preliminary experimental results that support this point of view. Rather than looking for the best method, we want to ensure that the library implements competitive baselines that can be useful for future work.


Very Large Data Bases | 2015

Permutation search methods are efficient, yet faster search is possible

Bilegsaikhan Naidan; Leonid Boytsov; Eric Nyberg

We survey permutation-based methods for approximate k-nearest neighbor search. In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point. Such ranked lists are called permutations. The underpinning assumption is that, for both metric and non-metric spaces, the distance between permutations is a good proxy for the distance between original points. Thus, it should be possible to efficiently retrieve most true nearest neighbors by examining only a tiny subset of data points whose permutations are similar to the permutation of a query. We further test this assumption by carrying out an extensive experimental evaluation where permutation methods are pitted against state-of-the-art benchmarks (the multi-probe LSH, the VP-tree, and proximity-graph based retrieval) on a variety of realistically large data sets from the image and textual domains. The focus is on high-accuracy retrieval methods for generic spaces. Additionally, we assume that both data and indices are stored in main memory. We find permutation methods to be reasonably efficient and describe a setup where these methods are most useful. To ease reproducibility, we make our software and data sets publicly available.
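The permutation representation described above can be sketched as follows. The names are illustrative; real implementations truncate permutations to a prefix and use optimized rank-correlation measures:

```python
def permutation(point, pivots, dist):
    """Rank pivots by distance to the point; the rank vector is the 'permutation'."""
    order = sorted(range(len(pivots)), key=lambda i: dist(point, pivots[i]))
    ranks = [0] * len(pivots)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def footrule(p, q):
    """Spearman footrule: L1 distance between two rank vectors.
    Small footrule suggests the underlying points are close."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

A search then keeps the data points whose permutations have the smallest footrule to the query's permutation and re-checks only those candidates with the true (possibly expensive, possibly non-metric) distance.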


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Deciding on an adjustment for multiplicity in IR experiments

Leonid Boytsov; Anna Belova; Peter H. Westfall

We evaluate statistical inference procedures for small-scale IR experiments that involve multiple comparisons against the baseline. These procedures adjust for multiple comparisons by ensuring that the probability of observing at least one false positive in the experiment is below a given threshold. We use only publicly available test collections and make our software available for download. In particular, we employ the TREC runs and runs constructed from the Microsoft learning-to-rank (MSLR) data set. Our focus is on non-parametric statistical procedures that include the Holm-Bonferroni adjustment of the permutation test p-values, the MaxT permutation test, and the permutation-based closed testing. In TREC-based simulations, these procedures retain from 66% to 92% of individually significant results (i.e., those obtained without taking other comparisons into account). Similar retention rates are observed in the MSLR simulations. For the largest evaluated query set size (i.e., 6400), procedures that adjust for multiplicity find at most 5% fewer true differences compared to unadjusted tests. At the same time, unadjusted tests produce many more false positives.
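For reference, the Holm-Bonferroni step-down adjustment mentioned above can be sketched as follows (an illustrative sketch, not the authors' code):

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down procedure: controls the family-wise error
    rate (probability of at least one false positive) at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for step, i in enumerate(order):
        # The smallest p-value is compared to alpha/m, the next to
        # alpha/(m-1), and so on.
        if pvalues[i] <= alpha / (m - step):
            rejected[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return rejected
```

With `alpha=0.05` and p-values `[0.01, 0.04, 0.03, 0.005]`, only 0.005 and 0.01 survive the adjustment, even though all four would be individually significant: this is exactly the loss of individually significant results that the retention rates above quantify.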


Conference on Information and Knowledge Management | 2016

Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

Leonid Boytsov; David Novak; Yury Malkov; Eric Nyberg

Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
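The exact brute-force k-NN candidate generator that the paper uses as a slow baseline might look like the sketch below over sparse term-weight vectors. This is purely illustrative: the names are hypothetical, plain cosine stands in for the paper's richer similarity that models term associations, and the paper's contribution is precisely an approximate index that avoids this full scan:

```python
import heapq
import math

def knn_candidates(query_vec, doc_vecs, k):
    """Exact k-NN: score every document against the query, keep the top k.
    Vectors are sparse dicts mapping term -> weight."""
    def cosine(a, b):
        num = sum(w * a[t] for t, w in b.items() if t in a)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    scored = ((cosine(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return heapq.nlargest(k, scored)  # [(score, doc_id), ...] best first
```

Replacing this O(collection size) scan with an approximate k-NN structure is what yields the near two-orders-of-magnitude speedup reported above.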


Similarity Search and Applications | 2012

Super-Linear Indices for Approximate Dictionary Searching

Leonid Boytsov

We present experimental analysis of approximate search algorithms that involve indexing of deletion neighborhoods. These methods require huge indices whose sizes grow exponentially with respect to the maximum allowable number of errors k. Despite extraordinary space requirements, the super-linear indices are of great interest, because they provide some of the shortest retrieval times. A straightforward implementation that creates a hash index directly over residual strings (obtained by deletions from dictionary words) is not space efficient. Rather than memorizing complete residual strings, we record only deleted characters and their respective positions. These data are indexed using a perfect hash function computed for a set of residual dictionary strings [2]. We carry out an experimental evaluation of this approach against several well-known benchmarks (including FastSS, which stores residual strings directly [3]). Experiments show that our implementation has a comparable or superior performance to that of the fastest benchmarks. At the same time, our implementation requires 4-8 times less space as compared to FastSS.
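The deletion-neighborhood indexing that FastSS-style methods use, and whose residual strings the paper compresses into deleted characters plus positions, can be sketched as follows (hypothetical helper names):

```python
from itertools import combinations

def deletion_neighborhood(word, k):
    """All residual strings obtained by deleting up to k characters."""
    out = set()
    for d in range(k + 1):
        for idx in combinations(range(len(word)), d):
            out.add("".join(c for i, c in enumerate(word) if i not in idx))
    return out

def build_index(dictionary, k):
    """Map each residual string to the dictionary words that produce it.
    Index size grows exponentially in k -- hence 'super-linear' indices."""
    index = {}
    for w in dictionary:
        for r in deletion_neighborhood(w, k):
            index.setdefault(r, set()).add(w)
    return index

def lookup(index, query, k):
    """Candidate words: any dictionary word sharing a residual with the query."""
    cands = set()
    for r in deletion_neighborhood(query, k):
        cands |= index.get(r, set())
    return cands
```

Sharing a residual is a necessary but not sufficient condition for being within edit distance k, so candidates still need verification with the true edit distance: looking up "cats" with k=1 returns both "cat" (distance 1) and "cart" (distance 2, via the shared residual "cat").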


Neural Information Processing Systems | 2013

Learning to Prune in Metric and Non-Metric Spaces

Leonid Boytsov; Bilegsaikhan Naidan


Text REtrieval Conference (TREC) | 2011

Evaluating Learning-to-Rank Methods in the Web Track Adhoc Task.

Leonid Boytsov; Anna Belova

Collaboration


Dive into Leonid Boytsov's collaborations.

Top Co-Authors

Bilegsaikhan Naidan, Norwegian University of Science and Technology
Eric Nyberg, Carnegie Mellon University
Alkesh Patel, Carnegie Mellon University
Anatole Gershman, Carnegie Mellon University
Chris Dyer, Carnegie Mellon University
Di Wang, Carnegie Mellon University
Jun Araki, Carnegie Mellon University
Teruko Mitamura, Carnegie Mellon University