Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Aapo Kyrola is active.

Publication


Featured research published by Aapo Kyrola.


Very Large Data Bases (VLDB) | 2012

Distributed GraphLab: a framework for machine learning and data mining in the cloud

Yucheng Low; Danny Bickson; Joseph E. Gonzalez; Carlos Guestrin; Aapo Kyrola; Joseph M. Hellerstein

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
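The graph-parallel style the abstract describes centers on per-vertex update functions whose writes become visible to later updates. A minimal in-memory sketch of that pattern, using PageRank as a toy example, is below; the function and variable names are illustrative, not GraphLab's actual API, and the sequential loop stands in for the parallel scheduler.

```python
# Toy sketch of a GraphLab-style vertex update: each update recomputes one
# vertex's value from its in-neighbors, and updates are applied in place so
# later updates in the same sweep observe them (asynchronous semantics).
# Illustrative names only, not GraphLab's real API.

def pagerank_update(vertex, graph, values, damping=0.85):
    """Recompute one vertex's rank from its in-neighbors' current values."""
    total = sum(values[u] / len(graph[u]) for u in graph if vertex in graph[u])
    return (1 - damping) + damping * total

def run_async(graph, iterations=20):
    values = {v: 1.0 for v in graph}
    for _ in range(iterations):
        for v in graph:  # sequential stand-in for the parallel scheduler
            values[v] = pagerank_update(v, graph, values)  # visible immediately
    return values

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}  # a 3-cycle
ranks = run_async(graph)
```

On the symmetric 3-cycle every rank converges to 1.0; the distributed system adds consistency enforcement, pipelined locking, and Chandy-Lamport snapshots on top of this basic update model.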


European Conference on Computer Systems (EuroSys) | 2012

Kineograph: taking the pulse of a fast-changing and connected world

Raymond Cheng; Ji Hong; Aapo Kyrola; Youshan Miao; Xuetian Weng; Ming Wu; Fan Yang; Lidong Zhou; Feng Zhao; Enhong Chen

Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms to extract timely insights from the fast-changing graph structure. To accommodate graph-mining algorithms that assume a static underlying graph, Kineograph creates a series of consistent snapshots, using a novel and efficient epoch commit protocol. To keep up with continuous updates on the graph, Kineograph includes an incremental graph-computation engine. We have developed three applications on top of Kineograph to analyze Twitter data: user ranking, approximate shortest paths, and controversial topic detection. For these applications, Kineograph takes a live Twitter data feed and maintains a graph of edges between all users and hashtags. Our evaluation shows that with 40 machines processing 100K tweets per second, Kineograph is able to continuously compute global properties, such as user ranks, with less than 2.5-minute timeliness guarantees. This rate of traffic is more than 10 times the reported peak rate of Twitter as of October 2011.
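The epoch-based snapshotting idea in the abstract can be modeled compactly: streaming updates are buffered, and sealing an epoch folds them into an immutable snapshot that static graph-mining algorithms can safely read. The sketch below is an illustrative in-memory model, not Kineograph's actual epoch commit protocol; all class and method names are assumptions.

```python
# Toy model of epoch-based snapshotting in the spirit of Kineograph:
# incoming edge updates accumulate per epoch, and sealing an epoch
# produces a frozen copy of the graph that later updates cannot change.
from collections import defaultdict
from copy import deepcopy

class EpochGraph:
    def __init__(self):
        self.live = defaultdict(set)  # continuously updated adjacency
        self.pending = []             # updates buffered for the current epoch
        self.snapshots = []           # one frozen adjacency per sealed epoch

    def add_edge(self, src, dst):
        self.pending.append((src, dst))

    def seal_epoch(self):
        """Commit pending updates and freeze a consistent snapshot."""
        for src, dst in self.pending:
            self.live[src].add(dst)
        self.pending.clear()
        self.snapshots.append(deepcopy(self.live))
        return len(self.snapshots) - 1  # epoch id

g = EpochGraph()
g.add_edge("user1", "#topic")
e0 = g.seal_epoch()
g.add_edge("user2", "#topic")
e1 = g.seal_epoch()
# the epoch-0 snapshot is unaffected by the later update
```

A mining algorithm handed `g.snapshots[e0]` sees a static graph even while the live graph keeps changing, which is the property the series of consistent snapshots provides.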


ACM Symposium on Parallel Algorithms and Architectures (SPAA) | 2012

Brief announcement: the problem based benchmark suite

Julian Shun; Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Aapo Kyrola; Harsha Vardhan Simhadri; Kanat Tangwongsan

This announcement describes the problem based benchmark suite (PBBS). PBBS is a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems. Each benchmark is defined concretely in terms of a problem specification and a set of input distributions. No requirements are made in terms of algorithmic approach, programming language, or machine architecture. The goal of the benchmarks is not only to compare runtimes, but also to be able to compare code and other aspects of an implementation (e.g., portability, robustness, determinism, and generality). As such the code for an implementation of a benchmark is as important as its runtime, and the public PBBS repository will include both code and performance results. The benchmarks are designed to make it easy for others to try their own implementations, or to add new benchmark problems. Each benchmark problem includes the problem specification, the specification of input and output file formats, default input generators, test codes that check the correctness of the output for a given input, driver code that can be linked with implementations, a baseline sequential implementation, a baseline multicore implementation, and scripts for running timings (and checks) and outputting the results in a standard format. The current suite includes the following problems: integer sort, comparison sort, remove duplicates, dictionary, breadth first search, spanning forest, minimum spanning forest, maximal independent set, maximal matching, K-nearest neighbors, Delaunay triangulation, convex hull, suffix arrays, n-body, and ray casting. For each problem, we report the performance of our baseline multicore implementation on a 40-core machine.
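The per-benchmark structure described above — a problem specification, a default input generator, a driver that can be linked with any implementation, and a checker that validates output for a given input — can be sketched in miniature. The code below uses comparison sort as the problem; all names are illustrative and are not taken from the PBBS repository.

```python
# Minimal sketch of the PBBS benchmark shape for one problem (comparison
# sort): input generator, output checker, and a timed driver that any
# implementation can be plugged into. Illustrative only, not PBBS code.
import random
import time

def generate_input(n, seed=42):
    """Default input distribution: n uniform random integers."""
    rng = random.Random(seed)
    return [rng.randint(0, n) for _ in range(n)]

def check_output(inp, out):
    """Problem specification: output is a sorted permutation of the input."""
    return sorted(inp) == out

def run_benchmark(implementation, n=10_000):
    data = generate_input(n)
    start = time.perf_counter()
    result = implementation(list(data))  # copy so the input stays reusable
    elapsed = time.perf_counter() - start
    assert check_output(data, result), "implementation failed the checker"
    return elapsed

elapsed = run_benchmark(sorted)  # baseline sequential implementation
```

Because the driver constrains only the problem, not the approach, any sorting implementation (sequential, multicore, or in another language via bindings) can be timed and checked against the same specification, which is the suite's stated design goal.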


Conference on Recommender Systems (RecSys) | 2013

DrunkardMob: billions of random walks on just a PC

Aapo Kyrola

Random walks on graphs are a staple of many ranking and recommendation algorithms. Simulating random walks on a graph which fits in memory is trivial, but massive graphs pose a problem: the latency of following walks across the network in a cluster, or of loading nodes from disk on demand, renders basic random walk simulation unbearably inefficient. In this work we propose DrunkardMob, a new algorithm for simulating hundreds of millions, or even billions, of random walks on massive graphs, on just a single PC or laptop. Instead of simulating one walk at a time, it processes millions of them in parallel, in batches. Based on DrunkardMob and GraphChi, we further propose a framework for easily expressing scalable algorithms based on graph walks.
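The batching idea can be illustrated in memory: rather than following one walk hop by hop (random access per hop), keep every walk's current position, bucket the walks by vertex, and advance them all by one hop during a single sequential pass over the graph. This sketch is an assumption-laden toy, not DrunkardMob itself, which works out-of-core on top of GraphChi.

```python
# Toy sketch of batched random-walk simulation: each pass scans the
# vertices once and advances every walk currently parked at a vertex by
# one hop, so a vertex's edge list is touched once per pass instead of
# once per walk per hop. Illustrative only.
import random

def batch_random_walks(graph, starts, hops, seed=0):
    rng = random.Random(seed)
    positions = list(starts)  # one current vertex per walk
    for _ in range(hops):
        # bucket walk ids by current vertex, then do one pass over the graph
        at = {}
        for i, pos in enumerate(positions):
            at.setdefault(pos, []).append(i)
        for v in graph:  # sequential scan: v's edges loaded once per pass
            for i in at.get(v, []):
                if graph[v]:
                    positions[i] = rng.choice(graph[v])
    return positions

graph = {0: [1], 1: [2], 2: [0]}  # directed 3-cycle
final = batch_random_walks(graph, starts=[0] * 4, hops=3)
```

On the 3-cycle every walk deterministically returns to vertex 0 after three hops; the point of the structure is that the outer loop's cost is one graph scan per hop, independent of how many walks are in flight.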


Symposium on Experimental and Efficient Algorithms (SEA) | 2014

Beyond Synchronous: New Techniques for External-Memory Graph Connectivity and Minimum Spanning Forest

Aapo Kyrola; Julian Shun; Guy E. Blelloch

GraphChi [16] is a recent high-performance system for external memory (disk-based) graph computations. It uses the Parallel Sliding Windows (PSW) algorithm which is based on the so-called Gauss-Seidel type of iterative computation, in which updates to values are immediately visible within the iteration. In contrast, previous external memory graph algorithms are based on the synchronous model where computation can only observe values from previous iterations. In this work, we study implementations of connected components and minimum spanning forest on PSW and show that they have a competitive I/O bound of O(sort(E)log(V/M)) and also work well in practice. We also show that our MSF implementation is competitive with a specialized algorithm proposed by Dementiev et al. [10] while being much simpler.
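The Gauss-Seidel distinction the abstract draws — updates visible within the same iteration, versus the synchronous model's previous-iteration values — is easy to see on connected components via minimum-label propagation. The sketch below is an in-memory illustration of that semantics only; PSW itself is an external-memory algorithm, and the names here are not from GraphChi.

```python
# Toy Gauss-Seidel-style connected components: each vertex takes the
# minimum label among itself and its neighbors, and new labels are
# visible immediately within the same sweep, so minima propagate along a
# path in a single sweep instead of one hop per iteration.

def cc_gauss_seidel(edges, n):
    labels = list(range(n))
    changed, sweeps = True, 0
    while changed:
        changed, sweeps = False, sweeps + 1
        for u, v in edges:  # in-place updates: visible within this sweep
            m = min(labels[u], labels[v])
            if labels[u] != m or labels[v] != m:
                labels[u] = labels[v] = m
                changed = True
    return labels, sweeps

# Path graph 0-1-2-3-4: label 0 reaches every vertex in the first sweep,
# and the second sweep only confirms convergence.
edges = [(i, i + 1) for i in range(4)]
labels, sweeps = cc_gauss_seidel(edges, 5)
```

Under the synchronous model the same path would need a number of iterations proportional to its length, which is the gap the paper's I/O bound of O(sort(E) log(V/M)) exploits.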


Parallel Computing | 2016

Experimental Analysis of Space-Bounded Schedulers

Harsha Vardhan Simhadri; Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Aapo Kyrola

The running time of nested parallel programs on shared-memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processor cores on the machine. Recent work proposed “space-bounded schedulers” for scheduling such programs on the multilevel cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, which can result in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice. In this article, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a head-to-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel® Cilk™ Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon® 7560 comparing work stealing, hierarchy-minded work stealing, and two variants of space-bounded schedulers on both divide-and-conquer microbenchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25% to 65% for most of the benchmarks, but incur up to 27% additional scheduler and load-imbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements varying the available bandwidth per core (the “bandwidth gap”) and show up to 50% improvements in the running times of kernels as this gap increases fourfold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees) and explore implementation tradeoffs.


Operating Systems Design and Implementation (OSDI) | 2012

GraphChi: large-scale graph computation on just a PC

Aapo Kyrola; Guy E. Blelloch; Carlos Guestrin


Uncertainty in Artificial Intelligence (UAI) | 2010

GraphLab: a new framework for parallel machine learning

Yucheng Low; Joseph E. Gonzalez; Aapo Kyrola; Danny Bickson; Carlos Guestrin; Joseph M. Hellerstein


Archive | 2010

GraphLab: A New Parallel Framework for Machine Learning

Yucheng Low; Joseph E. Gonzalez; Aapo Kyrola; Danny Bickson; Carlos Guestrin; Joseph M. Hellerstein


International Conference on Machine Learning (ICML) | 2011

Parallel Coordinate Descent for L1-Regularized Loss Minimization

Aapo Kyrola; Danny Bickson; Carlos Guestrin; Joseph K. Bradley

Collaboration


Dive into Aapo Kyrola's collaborations.

Top Co-Authors


Danny Bickson

Carnegie Mellon University


Guy E. Blelloch

Carnegie Mellon University


Yucheng Low

Carnegie Mellon University


Julian Shun

University of California
