Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Daniel Lemire is active.

Publication


Featured research published by Daniel Lemire.


Pattern Recognition | 2009

Faster retrieval with a two-pass dynamic-time-warping lower bound

Daniel Lemire

The dynamic time warping (DTW) is a popular similarity measure between time series. The DTW fails to satisfy the triangle inequality and its computation requires quadratic time. Hence, to find closest neighbors quickly, we use bounding techniques. We can avoid most DTW computations with an inexpensive lower bound (LB_Keogh). We compare LB_Keogh with a tighter lower bound (LB_Improved). We find that LB_Improved-based search is faster. As an example, our approach is 2-3 times faster on random-walk and shape time series.
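The envelope idea behind LB_Keogh can be sketched in a few lines. This is an illustrative scalar version, assuming equal-length series and a warping window of radius r; the function name and the squared-l2 formulation are ours, not necessarily the paper's exact definitions.

```python
def lb_keogh(query, candidate, r):
    """Sum of squared distances from `candidate` to the envelope of
    `query` over a warping window of radius r: a cheap lower bound
    on the (squared) DTW distance, usable to prune candidates."""
    n = len(query)
    total = 0.0
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):min(n, i + r + 1)]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2   # above the upper envelope
        elif c < lo:
            total += (c - lo) ** 2   # below the lower envelope
    return total
```

Because the bound never exceeds the true DTW cost, a candidate whose bound already exceeds the best distance found so far can be discarded without computing DTW at all.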


Software: Practice and Experience | 2015

Decoding billions of integers per second through vectorization

Daniel Lemire; Leonid Boytsov

In many important applications—such as search engines and relational database systems—data are stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and single‐instruction, multiple‐data (SIMD) instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD‐BP128⋆ that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint‐G8IU and PFOR). At the same time, SIMD‐BP128⋆ saves up to 2 bits/int. For even better compression, we propose another new vectorized scheme (SIMD‐FastPFOR) that has a compression ratio within 10% of a state‐of‐the‐art scheme (Simple‐8b) while being two times faster during decoding.
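The binary-packing idea underlying schemes like SIMD-BP128 is to store each block of integers with just enough bits for the block's largest value. The sketch below is a scalar illustration with hypothetical function names; it ignores the 128-integer block layout and the vector registers the real scheme relies on.

```python
def pack_block(values):
    """Pack a block of non-negative integers using the block's
    maximum bit width b; every value gets a fixed b-bit slot."""
    b = max(v.bit_length() for v in values) or 1
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * b)          # concatenate the b-bit slots
    return b, word

def unpack_block(b, word, count):
    """Recover `count` integers from their packed representation."""
    mask = (1 << b) - 1
    return [(word >> (i * b)) & mask for i in range(count)]
```

Decoding fixed-width slots is branch-free, which is what makes the vectorized variant so fast in practice.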


Data & Knowledge Engineering | 2010

Sorting improves word-aligned bitmap indexes

Daniel Lemire; Owen Kaser; Kamel Aouiche

Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate row-reordering heuristics. Simply permuting the columns of the table can increase the sorting efficiency by 40%. Secondary contributions include efficient algorithms to construct and aggregate bitmaps. The effect of word length is also reviewed by constructing 16-bit, 32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are slightly faster than 32-bit indexes despite being nearly twice as large.
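Why sorting helps: RLE-based formats such as WAH pay per run of identical bits, and sorting the rows groups equal bits into long runs. A minimal run counter over toy bit lists (not actual WAH words) illustrates the effect:

```python
def rle_runs(bits):
    """Number of runs in a bit sequence: the quantity run-length
    encoding pays for, so fewer runs means a smaller bitmap."""
    if not bits:
        return 0
    return 1 + sum(1 for prev, cur in zip(bits, bits[1:]) if cur != prev)

# one bitmap column before and after sorting the rows
unsorted_column = [0, 1, 0, 1, 0, 1]     # alternating rows: many runs
sorted_column = sorted(unsorted_column)  # grouped rows: few runs
```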


Information Retrieval | 2005

Scale and Translation Invariant Collaborative Filtering Systems

Daniel Lemire

Collaborative filtering systems are prediction algorithms over sparse data sets of user preferences. We modify a wide range of state-of-the-art collaborative filtering systems to make them scale and translation invariant and generally improve their accuracy without increasing their computational cost. Using the EachMovie and the Jester data sets, we show that learning-free constant time scale and translation invariant schemes outperform other learning-free constant time schemes by at least 3% and perform as well as expensive memory-based schemes (within 4%). Over the Jester data set, we show that a scale and translation invariant Eigentaste algorithm outperforms Eigentaste 2.0 by 20%. These results suggest that scale and translation invariance is a desirable property.
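One simple way to obtain scale and translation invariance is to map each user's ratings to z-scores before predicting. This is a minimal sketch of the idea, not the paper's exact schemes; the function names are ours.

```python
import statistics

def normalize(ratings):
    """Map a user's ratings to z-scores: users who rate on shifted
    or stretched scales become directly comparable."""
    mu = statistics.mean(ratings.values())
    sd = statistics.pstdev(ratings.values()) or 1.0
    return {item: (r - mu) / sd for item, r in ratings.items()}, mu, sd

def predict(user_ratings, item, others):
    """Average the other users' normalized ratings of `item`, then
    map the result back onto the target user's own rating scale."""
    _, mu, sd = normalize(user_ratings)
    zs = [normalize(o)[0][item] for o in others if item in o]
    return mu + sd * statistics.mean(zs)
```

Adding a constant to every rating of any user, or multiplying them by a positive constant, leaves the predictions (on the target user's scale) unchanged.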


SIAM International Conference on Data Mining | 2005

A Better Alternative to Piecewise Linear Time Series Segmentation

Daniel Lemire

Time series are difficult to monitor, summarize and predict. Segmentation organizes time series into a few intervals having uniform characteristics (flatness, linearity, modality, monotonicity and so on). For scalability, we require fast linear time algorithms. The popular piecewise linear model can determine where the data goes up or down and at what rate. Unfortunately, when the data does not follow a linear model, the computation of the local slope creates overfitting. We propose an adaptive time series model where the polynomial degree of each interval varies (constant, linear and so on). Given a number of regressors, the cost of each interval is its polynomial degree plus one: constant intervals cost 1 regressor, linear intervals cost 2 regressors, and so on. Our goal is to minimize the Euclidean (l_2) error for a given model complexity. Experimentally, we investigate the model where intervals can be either constant or linear. Over synthetic random walks, historical stock market prices, and electrocardiograms, the adaptive model provides a more accurate segmentation than the piecewise linear model without increasing the cross-validation error or the running time, while providing a richer vocabulary to applications. Implementation issues, such as numerical stability and real-world performance, are discussed.
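The constant-or-linear model can be sketched with a small dynamic program: every interval is fit by the cheaper of a constant (1 regressor) or a straight line (2 regressors), and split points are chosen to minimize total squared error within a regressor budget. This is an unoptimized illustration, not the paper's linear-time algorithm.

```python
def fit_error(xs, ys, degree):
    """Sum of squared residuals of the best constant (degree 0) or
    straight-line (degree 1) fit on one interval."""
    n = len(ys)
    if degree == 0:
        m = sum(ys) / n
        return sum((y - m) ** 2 for y in ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def segment(ys, budget):
    """Least achievable total squared error when segmenting `ys`
    into constant (cost 1) or linear (cost 2) intervals using at
    most `budget` regressors (dynamic programming over splits)."""
    n = len(ys)
    xs = list(range(n))
    INF = float("inf")
    # best[i][k] = least error covering ys[:i] with k regressors spent
    best = [[INF] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            for deg, cost in ((0, 1), (1, 2)):
                err = fit_error(xs[j:i], ys[j:i], deg)
                for k in range(cost, budget + 1):
                    if best[j][k - cost] + err < best[i][k]:
                        best[i][k] = best[j][k - cost] + err
    return min(best[n])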


Interactive Technology and Smart Education | 2005

Collaborative filtering and inference rules for context‐aware learning object recommendation

Daniel Lemire; Harold Boley; Sean McGrath; Marcel Ball

Learning objects strive for reusability in e‐Learning to reduce cost and allow personalization of content. We show why learning objects require adapted Information Retrieval systems. In the spirit of the Semantic Web, we discuss the semantic description, discovery, and composition of learning objects. As part of our project, we tag learning objects with both objective (e.g., title, date, and author) and subjective (e.g., quality and relevance) metadata. We present the RACOFI (Rule‐Applying Collaborative Filtering) Composer prototype with its novel combination of two libraries and their associated engines: a collaborative filtering system and an inference rule system. We developed RACOFI to generate context‐aware recommendation lists. Context is handled by multidimensional predictions produced from a database‐driven scalable collaborative filtering algorithm. Rules are then applied to the predictions to customize the recommendations according to user profiles. The RACOFI Composer architecture has been developed…
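The two-stage architecture (predict, then apply rules) can be sketched as a post-filter over collaborative-filtering scores. The rule representation here, (condition, boost) pairs, is entirely hypothetical; RACOFI's actual inference rule engine is far richer.

```python
def recommend(predictions, rules, profile, k=3):
    """Post-filter collaborative-filtering scores with rules: each
    rule is a (condition, boost) pair evaluated against the user
    profile, and the top-k adjusted items are returned."""
    scored = {}
    for item, score in predictions.items():
        for condition, boost in rules:
            if condition(item, profile):
                score += boost
        scored[item] = score
    return sorted(scored, key=scored.get, reverse=True)[:k]
```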


Journal of the Association for Information Science and Technology | 2015

Measuring academic influence: Not all citations are equal

Xiaodan Zhu; Peter D. Turney; Daniel Lemire; André Vellino

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence‐primed h‐index (the hip‐index). Unlike the conventional h‐index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip‐index is a better indicator of researcher performance than the conventional h‐index.
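The contrast between the two indices can be sketched directly. The input format for the hip-index below (a list of per-paper mention counts) is our own simplification of the idea that citations are weighted by how often the reference is mentioned.

```python
def h_index(citation_counts):
    """Classic h-index: the largest h such that h papers
    each have at least h citations."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for rank, c in enumerate(counts, 1) if c >= rank)

def hip_index(papers_mentions):
    """hip-index sketch: each citation contributes its in-body
    mention count rather than 1, and the h computation runs over
    those weighted totals (input format is hypothetical)."""
    weighted = [sum(mentions) for mentions in papers_mentions]
    return h_index(weighted)
```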


Software: Practice and Experience | 2016

Better bitmap performance with Roaring bitmaps

Samy Chambi; Daniel Lemire; Owen Kaser; Robert Godin

Bitmap indexes are commonly used in databases and search engines. By exploiting bit‐level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus, we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run‐length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high‐performance RLE‐based bitmap encoding techniques: the Word Aligned Hybrid compression scheme and the Compressed ‘n’ Composable Integer Set. On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2×) and (2) are faster than the compressed alternatives (up to 900× faster for intersections). Our results challenge the view that RLE‐based bitmap compression is best.
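The core design decision in Roaring is per-chunk container selection: each 2^16-value chunk is stored as a sorted array of 16-bit values when sparse, and as a 65536-bit bitmap once it holds more than 4096 values (the point where the bitmap wins on space). A minimal sketch, with hypothetical function names:

```python
import bisect

def make_container(values16):
    """Choose a Roaring-style container for one chunk of 16-bit
    values: a sorted array while the chunk holds at most 4096
    values, a 65536-bit bitmap once it is denser."""
    if len(values16) <= 4096:
        return ("array", sorted(values16))
    bits = bytearray(8192)               # 65536 bits
    for v in values16:
        bits[v >> 3] |= 1 << (v & 7)
    return ("bitmap", bits)

def container_contains(container, v):
    """Membership test: binary search in an array container,
    direct bit probe in a bitmap container."""
    kind, data = container
    if kind == "array":
        i = bisect.bisect_left(data, v)
        return i < len(data) and data[i] == v
    return bool(data[v >> 3] & (1 << (v & 7)))
```

Both container kinds cap a chunk at 8 KiB, which is what bounds the worst-case space per chunk.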


Software: Practice and Experience | 2016

SIMD compression and the intersection of sorted integers

Daniel Lemire; Leonid Boytsov; Nathan Kurz

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single‐instruction, multiple data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4‐BP128‐D4 scheme uses as little as 0.7 CPU cycles per decoded 32‐bit integer while still providing state‐of‐the‐art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD GALLOPING algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32‐bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million‐query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state‐of‐the‐art approach.
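The galloping idea can be shown in its scalar form: for each element of the shorter list, double the probe step through the longer list until overshooting, then finish with a binary search. This is the classic scalar analogue; the paper's contribution is doing the comparisons four at a time with SIMD registers, which this sketch does not attempt.

```python
import bisect

def galloping_intersection(small, large):
    """Intersect two sorted lists, galloping through `large` for
    each element of `small` (scalar analogue of SIMD Galloping)."""
    out, pos = [], 0
    for x in small:
        step = 1
        # gallop: grow the step until the probe reaches or passes x
        while pos + step < len(large) and large[pos + step] < x:
            step *= 2
        # binary search inside the last bracketed window
        pos = bisect.bisect_left(large, x, pos, min(pos + step + 1, len(large)))
        if pos < len(large) and large[pos] == x:
            out.append(x)
    return out
```

Galloping pays off when one list is much shorter than the other, which is common for conjunctive queries over posting lists.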


Advanced Data Analysis and Classification | 2012

Time series classification by class-specific Mahalanobis distance measures

Zoltán Prekopcsák; Daniel Lemire

To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately, for time series data, the covariance matrix often has low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking or limit the matrix to its diagonal. We review these alternatives and benchmark them against competitive methods such as the related Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time Warping (DTW) distance. As we expected, we find that the DTW is superior, but the Mahalanobis distance measures are one to two orders of magnitude faster. To get the best results with Mahalanobis distance measures, we recommend learning one distance measure per class using either covariance shrinking or the diagonal approach.
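The diagonal variant can be sketched compactly: per-dimension variances stand in for the full inverse covariance, which sidesteps the rank deficiency. This illustrative classifier compares a series to class means rather than to all training neighbors; the function names and simplification are ours.

```python
import math

def diagonal_mahalanobis(x, mean, var):
    """Mahalanobis distance with the covariance restricted to its
    diagonal: each dimension is scaled by its own variance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))

def classify(x, class_stats):
    """Assign x to the class with the smallest class-specific
    diagonal Mahalanobis distance to its mean; `class_stats`
    maps label -> (mean vector, variance vector)."""
    return min(class_stats, key=lambda c: diagonal_mahalanobis(x, *class_stats[c]))
```

Learning one (mean, variance) pair per class is what "class-specific" means here: each class measures distance in its own scaled space.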

Collaboration


Dive into Daniel Lemire's collaborations.

Top Co-Authors


Owen Kaser

University of New Brunswick


Robert Godin

Université du Québec à Montréal


Martin Brooks

National Research Council


Nathan Kurz

Université du Québec


Samy Chambi

Université du Québec à Montréal


Jing Li

Concordia University


Antonio Badia

University of Louisville


Leonid Boytsov

Carnegie Mellon University
