David C. Anastasiu | Researchain

Publication

Featured research published by David C. Anastasiu.

International Conference on Data Engineering | 2014

L2AP: Fast cosine similarity search with prefix L-2 norm bounds

David C. Anastasiu; George Karypis

The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, often outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare it against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
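To make the Cauchy-Schwarz idea concrete, here is a minimal sketch of a prefix ℓ2-norm bound on dense, unit-normalized vectors. It is an illustration only, not the L2AP implementation: the function names (`normalize`, `l2_prefix_upper_bound`, `can_prune`) and the dense-list representation are assumptions made for this example. The dot product over features `p` onward is computed exactly, and the unprocessed prefix contributes at most the product of the two prefix norms.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit l2 norm (hypothetical helper)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(x, y):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


def l2_prefix_upper_bound(x, y, p):
    """Upper bound on dot(x, y) when only features p.. were processed exactly.

    By the Cauchy-Schwarz inequality,
    dot(x[:p], y[:p]) <= ||x[:p]|| * ||y[:p]||.
    """
    exact_suffix = dot(x[p:], y[p:])
    prefix_bound = math.sqrt(dot(x[:p], x[:p])) * math.sqrt(dot(y[:p], y[:p]))
    return exact_suffix + prefix_bound


def can_prune(x, y, p, threshold):
    """True if the pair cannot reach the similarity threshold."""
    return l2_prefix_upper_bound(x, y, p) < threshold


x = normalize([3.0, 0.0, 4.0, 0.0])
y = normalize([0.0, 5.0, 0.0, 12.0])
print(can_prune(x, y, p=2, threshold=0.5))
```

When the bound falls below the threshold, the remaining prefix features never need to be read, which is the kind of saving in full dot-product computations the abstract refers to.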

Conference on Information and Knowledge Management | 2011

A framework for personalized and collaborative clustering of search results

David C. Anastasiu; Byron J. Gao; David Buttler

How to organize and present search results plays a critical role in the utility of search engines. Due to the unprecedented scale of the Web and diversity of search results, the common strategy of ranked lists has become increasingly inadequate, and clustering has been considered as a promising alternative. Clustering divides a long list of disparate search results into a few topic-coherent clusters, allowing the user to quickly locate relevant results by topic navigation. While many clustering algorithms have been proposed that innovate on the automatic clustering procedure, we introduce ClusteringWiki, the first prototype and framework for personalized clustering that allows direct user editing of the clustering results. Through a Wiki interface, the user can edit and annotate the membership, structure and labels of clusters for a personalized presentation. In addition, the edits and annotations can be shared among users as a mass-collaborative way of improving search result organization and search engine utility.

Conference on Information and Knowledge Management | 2015

L2Knng: Fast Exact K-Nearest Neighbor Graph Construction with L2-Norm Pruning

David C. Anastasiu; George Karypis

The k-nearest neighbor graph is often used as a building block in information retrieval, clustering, online advertising, and recommender systems algorithms. The complexity of constructing the exact k-nearest neighbor graph is quadratic in the number of objects that are compared, and most existing methods solve the problem approximately. We present L2Knng, an efficient algorithm that finds the exact cosine similarity k-nearest neighbor graph for a set of sparse high-dimensional objects. Our algorithm quickly builds an approximate solution to the problem, identifying many of the most similar neighbors, and then uses theoretical bounds on the similarity of two vectors, based on the L2-norm of part of the vectors, to find each object's exact k-neighborhood. We perform an extensive evaluation of our algorithm, comparing against both exact and approximate baselines, and demonstrate the efficiency of our method across a variety of real-world datasets and neighborhood sizes. Our approximate and exact L2Knng variants compute the k-nearest neighbor graph up to an order of magnitude faster than their respective baselines.

World Wide Web | 2013

A novel two-box search paradigm for query disambiguation

David C. Anastasiu; Byron J. Gao; Xing Jiang; George Karypis

Precision-oriented search results such as those typically returned by the major search engines are vulnerable to issues of polysemy. When the same term refers to different things, the dominant sense is preferred in the rankings of search results. In this paper, we propose a novel two-box technique in the context of Web search that utilizes contextual terms provided by users for query disambiguation, making it possible to prefer other senses without altering the original query. A prototype system, Bobo, has been implemented. In Bobo, contextual terms are used to capture domain knowledge from users, help estimate relevance of search results, and route them towards a user-intended domain. A major advantage of Bobo is that a wide range of domain knowledge can be effectively utilized, where helpful contextual terms do not even need to co-occur with query terms on any page. We have extensively evaluated the performance of Bobo on benchmark datasets, and the results demonstrate the utility and effectiveness of our approach.

Frequent Pattern Mining | 2014

Big Data Frequent Pattern Mining

David C. Anastasiu; Jeremy Iverson; Shaden Smith; George Karypis

Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Irregular Applications: Architectures and Algorithms | 2016

Fast parallel cosine K-nearest neighbor graph construction

David C. Anastasiu; George Karypis

The k-nearest neighbor graph is an important structure in many data mining methods for clustering, advertising, recommender systems, and outlier detection. Constructing the graph requires computing up to n² similarities for a set of n objects. This has led researchers to seek approximate methods, which find many but not all of the nearest neighbors. In contrast, we leverage shared memory parallelism and recent advances in similarity joins to solve the problem exactly, via a filtering-based approach. Our method considers all pairs of potential neighbors but quickly filters those that could not be a part of the k-nearest neighbor graph, based on similarity upper bound estimates. We evaluated our solution on several real-world datasets and found that, using 16 threads, our method achieves up to 12.9× speedup over our exact baseline and is sometimes faster even than approximate methods. Moreover, an approximate version of our method is up to 21.7× more efficient than the best approximate state-of-the-art baseline at similar high recall.
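The filtering idea can be sketched with a serial, brute-force cosine k-nearest neighbor graph that skips a pair whenever a cheap upper bound shows it cannot enter the current neighborhood. This is only an illustration under stated assumptions, not the method from the paper: the bound used here, dot(x, y) ≤ ‖x‖₁ · ‖y‖∞ (Hölder's inequality), is swapped in for simplicity, there is no parallelism, and `knn_graph` and `normalize` are hypothetical names. Input vectors are assumed to be non-zero.

```python
import heapq
import math


def normalize(v):
    """Scale a non-zero vector to unit l2 norm (hypothetical helper)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def knn_graph(vectors, k):
    """Exact cosine k-NN graph: one list of (similarity, neighbor id) per row."""
    rows = [normalize(v) for v in vectors]
    l1 = [sum(abs(a) for a in r) for r in rows]     # l1 norm of each row
    linf = [max(abs(a) for a in r) for r in rows]   # largest absolute value
    graph = []
    for i, x in enumerate(rows):
        heap = []  # min-heap holding the best k (similarity, id) pairs so far
        for j, y in enumerate(rows):
            if i == j:
                continue
            if len(heap) == k:
                # Cheap upper bound on dot(x, y); skip the full dot product
                # if the pair cannot beat the current k-th best similarity.
                bound = min(l1[i] * linf[j], l1[j] * linf[i])
                if bound <= heap[0][0]:
                    continue
            sim = sum(a * b for a, b in zip(x, y))
            if len(heap) < k:
                heapq.heappush(heap, (sim, j))
            elif sim > heap[0][0]:
                heapq.heapreplace(heap, (sim, j))
        graph.append(sorted(heap, reverse=True))
    return graph


graph = knn_graph([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]], k=1)
print([neighbors[0][1] for neighbors in graph])
```

Because a pair is skipped only when its upper bound cannot exceed the current k-th best similarity, the result matches the unfiltered brute-force graph (with the same tie handling) while avoiding some full similarity computations.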

Irregular Applications: Architectures and Algorithms | 2015

pL2AP: fast parallel cosine similarity search

David C. Anastasiu; George Karypis

Solving the AllPairs similarity search problem entails finding all pairs of vectors in a high dimensional sparse dataset that have a similarity value higher than a given threshold. The output from this problem is a crucial component in many real-world applications, such as clustering, online advertising, recommender systems, near-duplicate document detection, and query refinement. A number of serial algorithms have been proposed that solve the problem by pruning many of the possible similarity candidates for each query object, after accessing only a few of their non-zero values. The pruning process results in unpredictable memory access patterns that can reduce search efficiency. In this context, we introduce pL2AP, which efficiently solves the AllPairs cosine similarity search problem in a multi-core environment. Our method uses a number of cache-tiling optimizations, combined with fine-grained dynamically balanced parallel tasks, to solve the problem 1.5× to 238× faster than existing parallel baselines on datasets with hundreds of millions of non-zeros.

IEEE International Conference on Data Science and Advanced Analytics | 2016

Efficient Identification of Tanimoto Nearest Neighbors

David C. Anastasiu; George Karypis

Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds.
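For readers unfamiliar with the measure, the Tanimoto (extended Jaccard) coefficient of two real-valued vectors is their dot product divided by the sum of their squared norms minus the dot product. The sketch below only computes the coefficient and runs a naive threshold search; it does not reproduce the TAPNN filtering bounds, and the names `tanimoto` and `neighbors_above` are assumptions made for this example.

```python
def tanimoto(x, y):
    """Tanimoto coefficient: <x, y> / (||x||^2 + ||y||^2 - <x, y>)."""
    d = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return d / (nx + ny - d)


def neighbors_above(query, vectors, threshold):
    """Naive search: ids of vectors with Tanimoto similarity >= threshold."""
    return [i for i, v in enumerate(vectors) if tanimoto(query, v) >= threshold]


data = [[1, 1, 0], [1, 0, 1], [0, 0, 1]]
print(tanimoto(data[0], data[1]))
print(neighbors_above([1, 1, 0], data, threshold=0.5))
```

For binary vectors the coefficient reduces to the ordinary Jaccard index: the two example vectors share one of three distinct features, giving 1/3.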

Conference on Information and Knowledge Management | 2018

Data Structure for Efficient Line of Sight Queries

Swapnil Gaikwad; Melody Moh; David C. Anastasiu

Given the great amounts of data being transmitted between devices in the 21st century, existing channels of wireless communication are getting congested. In the wireless space, the focus up to now has been on the microwave frequency range. An alternative for high-speed medium- and long-range communication is the millimeter wave spectrum, which is most effectively used through point-to-point links. In this paper, we develop and compare methods for verifying the Line of Sight (LOS) constraint between two points in a city. To be useful for online wireless network planning systems, the methods must be able to process terabytes of 3D city geolocation data and provide answers in milliseconds. We evaluate our methods using data for the city of San Jose, a major metropolitan area in Silicon Valley, California. Our results indicate that our Hierarchical Polygon Aggregation (HPA) method is able to achieve millisecond-level query times with very little loss of precision.

Archive | 2017

Cosine Approximate Nearest Neighbors

David C. Anastasiu

Cosine similarity graph construction, or all-pairs similarity search, is an important kernel in many data mining and machine learning methods. Building the graph is a difficult task. Intuitively, up to n² pairs of objects must be compared to solve the problem for a set of n objects. For large object sets, approximate solutions to this problem have been proposed that address the complexity of the task by retrieving most, but not necessarily all, of the nearest neighbors. We propose a new approximate graph construction method that combines properties of the object vectors to effectively select fewer comparison candidates, namely those that are likely to be neighbors. In addition, our method incorporates recently developed filtering strategies to quickly eliminate unpromising comparison candidates, which leads to fewer overall similarity computations and increased efficiency. We compare our method against several state-of-the-art approximate and exact baselines on six real-world datasets. Our results show that our approach offers a good tradeoff between efficiency and effectiveness, with a 35.81-fold efficiency improvement over the best alternative at 0.9 recall.
