Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Julian Shun is active.

Publication


Featured research published by Julian Shun.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

Ligra: a lightweight graph processing framework for shared memory

Julian Shun; Guy E. Blelloch

There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
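
For intuition, here is a minimal serial sketch of the two-routine pattern the abstract describes: a map over the out-edges of a frontier and a map over a vertex subset, used to express BFS. The names and signatures below are illustrative, not Ligra's actual API, and the real framework runs these maps in parallel (using compare-and-swap inside the edge function) and automatically switches between sparse and dense frontier representations, none of which is shown here.

```cpp
// Minimal, serial sketch of an edgeMap/vertexMap-style interface (illustrative
// names, not Ligra's actual API) and a BFS written on top of it.
#include <cstdio>
#include <vector>
#include <functional>

using Graph = std::vector<std::vector<int>>;   // adjacency lists
using VertexSubset = std::vector<int>;         // a set of vertex ids

// Apply F to every edge (u, v) with u in the frontier and C(v) true;
// collect the targets for which F returns true into the next frontier.
VertexSubset edgeMap(const Graph& G, const VertexSubset& frontier,
                     std::function<bool(int,int)> F,
                     std::function<bool(int)> C) {
  VertexSubset next;
  for (int u : frontier)
    for (int v : G[u])
      if (C(v) && F(u, v)) next.push_back(v);
  return next;
}

// Apply F to every vertex in the subset.
void vertexMap(const VertexSubset& S, std::function<void(int)> F) {
  for (int v : S) F(v);
}

int main() {
  // Small undirected example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
  Graph G = {{1,2}, {0,3}, {0,3}, {1,2,4}, {3}};
  std::vector<int> parent(G.size(), -1);

  int root = 0;
  parent[root] = root;
  VertexSubset frontier = {root};
  while (!frontier.empty()) {
    frontier = edgeMap(G, frontier,
        [&](int u, int v) { parent[v] = u; return true; },   // set parent
        [&](int v) { return parent[v] == -1; });              // only unvisited targets
  }
  for (size_t v = 0; v < G.size(); ++v)
    std::printf("parent[%zu] = %d\n", v, parent[v]);
  return 0;
}
```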


ACM Symposium on Parallel Algorithms and Architectures | 2012

Brief announcement: the problem based benchmark suite

Julian Shun; Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Aapo Kyrola; Harsha Vardhan Simhadri; Kanat Tangwongsan

This announcement describes the problem based benchmark suite (PBBS). PBBS is a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems. Each benchmark is defined concretely in terms of a problem specification and a set of input distributions. No requirements are made in terms of algorithmic approach, programming language, or machine architecture. The goal of the benchmarks is not only to compare runtimes, but also to be able to compare code and other aspects of an implementation (e.g., portability, robustness, determinism, and generality). As such the code for an implementation of a benchmark is as important as its runtime, and the public PBBS repository will include both code and performance results. The benchmarks are designed to make it easy for others to try their own implementations, or to add new benchmark problems. Each benchmark problem includes the problem specification, the specification of input and output file formats, default input generators, test codes that check the correctness of the output for a given input, driver code that can be linked with implementations, a baseline sequential implementation, a baseline multicore implementation, and scripts for running timings (and checks) and outputting the results in a standard format. The current suite includes the following problems: integer sort, comparison sort, remove duplicates, dictionary, breadth first search, spanning forest, minimum spanning forest, maximal independent set, maximal matching, K-nearest neighbors, Delaunay triangulation, convex hull, suffix arrays, n-body, and ray casting. For each problem, we report the performance of our baseline multicore implementation on a 40-core machine.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Internally deterministic parallel algorithms can be fast

Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Julian Shun

The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics---i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
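
As an illustration of the "deterministic reservations" approach named above, the sketch below simulates it serially on greedy maximal matching: in each round every live edge reserves both of its endpoints with a priority write (smallest edge index wins), and only an edge that holds both reservations commits, so the output matches the sequential greedy matching for the same edge order. This is a sketch under those assumptions, not the paper's code; the real implementation runs the reserve and commit phases in parallel and processes prefixes of the remaining iterates to control the work.

```cpp
// Serial simulation of reserve-and-commit rounds ("deterministic reservations")
// applied to greedy maximal matching. Illustrative sketch only.
#include <cstdio>
#include <vector>
#include <utility>
#include <algorithm>
#include <limits>

std::vector<int> greedy_matching_reservations(
    int n, const std::vector<std::pair<int,int>>& edges) {
  const int NONE = std::numeric_limits<int>::max();
  std::vector<bool> matched(n, false);
  std::vector<bool> live(edges.size(), true);
  std::vector<int> chosen;                       // indices of committed edges
  bool any_live = true;
  while (any_live) {
    // Reserve phase: each live edge writes its index to both endpoints;
    // the smallest index wins (simulating a parallel "writeMin").
    std::vector<int> reservation(n, NONE);
    any_live = false;
    for (size_t e = 0; e < edges.size(); ++e) {
      if (!live[e]) continue;
      auto [u, v] = edges[e];
      if (matched[u] || matched[v]) { live[e] = false; continue; }  // drops out
      reservation[u] = std::min(reservation[u], (int)e);
      reservation[v] = std::min(reservation[v], (int)e);
      any_live = true;
    }
    // Commit phase: an edge that won both of its endpoints joins the matching.
    for (size_t e = 0; e < edges.size(); ++e) {
      if (!live[e]) continue;
      auto [u, v] = edges[e];
      if (reservation[u] == (int)e && reservation[v] == (int)e) {
        matched[u] = matched[v] = true;
        chosen.push_back((int)e);
        live[e] = false;
      }
    }
  }
  return chosen;
}

int main() {
  // A path 0-1-2-3 plus the edge 1-4; sequential greedy picks edges 0 and 2.
  std::vector<std::pair<int,int>> edges = {{0,1},{1,2},{2,3},{1,4}};
  for (int e : greedy_matching_reservations(5, edges))
    std::printf("matched edge %d: (%d,%d)\n", e, edges[e].first, edges[e].second);
  return 0;
}
```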


Statistical Science | 2010

Connected Spatial Networks over Random Points and a Route-Length Statistic

David Aldous; Julian Shun

We review mathematically tractable models for connected networks on random points in the plane, emphasizing the class of proximity graphs which deserves to be better known to applied probabilists and statisticians. We introduce and motivate a particular statistic R measuring shortness of routes in a network. We illustrate, via Monte Carlo in part, the trade-off between normalized network length and R in a one-parameter family of proximity graphs. How close this family comes to the optimal trade-off over all possible networks remains an intriguing open question. The paper is a write-up of a talk developed by the first author during 2007-2009.


International Conference on Data Engineering | 2015

Multicore triangle computations without tuning

Julian Shun; Kanat Tangwongsan

Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs. This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges. Our algorithms are provably cache-friendly, easy to implement in a language that supports dynamic parallelism, such as Cilk Plus or OpenMP, and do not require parameter tuning. On a 40-core machine with two-way hyper-threading, our parallel exact global and local triangle counting algorithms obtain speedups of 17-50x on a set of real-world and synthetic graphs, and are faster than previous parallel exact triangle counting algorithms. We can compute the exact triangle count of the Yahoo Web graph (over 6 billion edges) in under 1.5 minutes. In addition, for approximate triangle counting, we are able to approximate the count for the Yahoo graph to within 99.6% accuracy in under 10 seconds, and for a given accuracy we are much faster than existing parallel approximate triangle counting implementations.
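
For context, the sketch below shows the standard serial degree-ordered intersection method for exact triangle counting, the basic pattern that multicore counters of this kind build on: rank vertices by degree, keep only edges toward higher-ranked neighbors, and intersect sorted neighbor lists. It is illustrative only and omits the paper's parallelization, cache analysis, and approximate counting.

```cpp
// Serial exact triangle counting by degree ordering and sorted-list merging.
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cstdint>

int64_t count_triangles(const std::vector<std::vector<int>>& adj) {
  const int n = (int)adj.size();
  // Rank vertices by (degree, id); each triangle is then counted exactly once
  // from its lowest-ranked vertex.
  std::vector<int> order(n), rank(n);
  for (int i = 0; i < n; ++i) order[i] = i;
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    if (adj[a].size() != adj[b].size()) return adj[a].size() < adj[b].size();
    return a < b;
  });
  for (int i = 0; i < n; ++i) rank[order[i]] = i;

  // Keep only neighbors with higher rank, sorted by rank.
  std::vector<std::vector<int>> up(n);
  for (int u = 0; u < n; ++u) {
    for (int v : adj[u])
      if (rank[v] > rank[u]) up[u].push_back(v);
    std::sort(up[u].begin(), up[u].end(),
              [&](int a, int b) { return rank[a] < rank[b]; });
  }

  // For every directed edge (u, v), count common higher-ranked neighbors.
  int64_t triangles = 0;
  for (int u = 0; u < n; ++u) {
    for (int v : up[u]) {
      size_t i = 0, j = 0;
      while (i < up[u].size() && j < up[v].size()) {
        int a = rank[up[u][i]], b = rank[up[v][j]];
        if (a == b) { ++triangles; ++i; ++j; }
        else if (a < b) ++i;
        else ++j;
      }
    }
  }
  return triangles;
}

int main() {
  // 4-clique on vertices 0..3 plus pendant vertex 4: expect 4 triangles.
  std::vector<std::vector<int>> adj = {{1,2,3}, {0,2,3}, {0,1,3}, {0,1,2,4}, {3}};
  std::printf("triangles = %lld\n", (long long)count_triangles(adj));
  return 0;
}
```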


Data Compression Conference | 2015

Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+

Julian Shun; Laxman Dhulipala; Guy E. Blelloch

We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs. We integrate the compression techniques into Ligra, a recent shared-memory graph processing system. This system, which we call Ligra+, is able to represent graphs using about half of the space for the uncompressed graphs on average. Furthermore, Ligra+ is slightly faster than Ligra on average on a 40-core machine with hyper-threading. Our experimental study shows that Ligra+ is able to process graphs using less memory, while performing as well as or faster than Ligra.
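
A generic sketch of the compression idea involved: store each sorted adjacency list as a first neighbor id followed by the differences between consecutive ids, each written with a variable-length byte code. The encoding below is a plain varint chosen for illustration; Ligra+'s actual byte and run-length codes, and its handling of the first edge, differ in detail.

```cpp
// Difference encoding of sorted adjacency lists with a 7-bits-per-byte varint.
#include <cstdio>
#include <vector>
#include <cstdint>
#include <algorithm>

// Encode an unsigned value 7 bits per byte; the high bit marks "more bytes follow".
static void encode_varint(uint32_t x, std::vector<uint8_t>& out) {
  while (x >= 0x80) { out.push_back((uint8_t)(x | 0x80)); x >>= 7; }
  out.push_back((uint8_t)x);
}

static uint32_t decode_varint(const std::vector<uint8_t>& in, size_t& pos) {
  uint32_t x = 0; int shift = 0;
  while (in[pos] & 0x80) { x |= (uint32_t)(in[pos++] & 0x7F) << shift; shift += 7; }
  x |= (uint32_t)in[pos++] << shift;
  return x;
}

// Compress a sorted adjacency list as degree, first id, then successive deltas.
std::vector<uint8_t> compress_list(std::vector<uint32_t> nbrs) {
  std::sort(nbrs.begin(), nbrs.end());
  std::vector<uint8_t> out;
  encode_varint((uint32_t)nbrs.size(), out);
  uint32_t prev = 0;
  for (size_t i = 0; i < nbrs.size(); ++i) {
    encode_varint(i == 0 ? nbrs[0] : nbrs[i] - prev, out);
    prev = nbrs[i];
  }
  return out;
}

// Decode on the fly, applying a function to each neighbor (the shape of an edge map).
template <class F>
void decode_list(const std::vector<uint8_t>& data, F f) {
  size_t pos = 0;
  uint32_t deg = decode_varint(data, pos), prev = 0;
  for (uint32_t i = 0; i < deg; ++i) {
    uint32_t v = (i == 0) ? decode_varint(data, pos) : prev + decode_varint(data, pos);
    f(v);
    prev = v;
  }
}

int main() {
  std::vector<uint32_t> nbrs = {5, 7, 8, 42, 1000, 1003};
  auto bytes = compress_list(nbrs);
  std::printf("%zu neighbors stored in %zu bytes\n", nbrs.size(), bytes.size());
  decode_list(bytes, [](uint32_t v) { std::printf("%u ", v); });
  std::printf("\n");
  return 0;
}
```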


ACM Symposium on Parallel Algorithms and Architectures | 2012

Greedy sequential maximal independent set and matching are parallel on average

Guy E. Blelloch; Jeremy T. Fineman; Julian Shun

The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate will only depend on a subset of the previous iterates (i.e. knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log^2 n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallel algorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism.
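
The sketch below simulates the round structure implied by this dependence argument: a vertex is decided exactly when all of its earlier-ordered neighbors are decided, so the result equals the sequential greedy MIS for the same ordering, and the number of rounds is the dependence length that the paper bounds. It is a serial illustration only, not the linear-work parallel algorithm, and it processes all undecided vertices each round rather than prefixes.

```cpp
// Round-based simulation of greedy MIS: decide a vertex once all of its
// earlier-ordered neighbors are decided. Output equals the sequential greedy
// MIS for the given priority order.
#include <cstdio>
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>

enum class State { Undecided, InMIS, Out };

std::vector<int> greedy_mis_rounds(const std::vector<std::vector<int>>& adj,
                                   const std::vector<int>& priority) {
  const int n = (int)adj.size();
  std::vector<State> state(n, State::Undecided);
  int rounds = 0;
  bool progress = true;
  while (progress) {
    progress = false;
    std::vector<State> next = state;      // all decisions in a round use the old state
    for (int v = 0; v < n; ++v) {
      if (state[v] != State::Undecided) continue;
      bool earlier_in_mis = false, earlier_undecided = false;
      for (int u : adj[v]) {
        if (priority[u] >= priority[v]) continue;   // only earlier neighbors matter
        if (state[u] == State::InMIS) earlier_in_mis = true;
        if (state[u] == State::Undecided) earlier_undecided = true;
      }
      if (earlier_in_mis) { next[v] = State::Out; progress = true; }
      else if (!earlier_undecided) { next[v] = State::InMIS; progress = true; }
    }
    state = next;
    if (progress) ++rounds;
  }
  std::printf("rounds (dependence length for this order): %d\n", rounds);
  std::vector<int> mis;
  for (int v = 0; v < n; ++v)
    if (state[v] == State::InMIS) mis.push_back(v);
  return mis;
}

int main() {
  // A 6-cycle with a randomly shuffled vertex ordering.
  std::vector<std::vector<int>> adj = {{1,5},{0,2},{1,3},{2,4},{3,5},{4,0}};
  std::vector<int> priority(adj.size());
  std::iota(priority.begin(), priority.end(), 0);
  std::mt19937 rng(1);
  std::shuffle(priority.begin(), priority.end(), rng);
  for (int v : greedy_mis_rounds(adj, priority)) std::printf("%d ", v);
  std::printf("\n");
  return 0;
}
```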


ACM Symposium on Parallel Algorithms and Architectures | 2014

A simple and practical linear-work parallel algorithm for connectivity

Julian Shun; Laxman Dhulipala; Guy E. Blelloch

Graph connectivity is a fundamental problem in computer science with many important applications. Sequentially, connectivity can be done in linear work easily using breadth-first search or depth-first search. There have been many parallel algorithms for connectivity; however, the simpler parallel algorithms require super-linear work, and the linear-work polylogarithmic-depth parallel algorithms are very complicated and not amenable to implementation. In this work, we address this gap by describing a simple and practical expected linear-work, polylogarithmic depth parallel algorithm for graph connectivity. Our algorithm is based on a recent parallel algorithm for generating low-diameter graph decompositions by Miller et al., which uses parallel breadth-first searches. We discuss a (modest) variant of their decomposition algorithm which preserves the theoretical complexity while leading to simpler and faster implementations. We experimentally compare the connectivity algorithms using both the original decomposition algorithm and our modified decomposition algorithm. We also experimentally compare against the fastest existing parallel connectivity implementations (which are not theoretically linear-work and polylogarithmic-depth) and show that our implementations are competitive for various input graphs. In addition, we compare our implementations to sequential connectivity algorithms and show that on 40 cores we achieve good speedup relative to the sequential implementations for many input graphs. We discuss the various optimizations used in our implementations and present an extensive experimental analysis of the performance. Our algorithm is the first parallel connectivity algorithm that is both theoretically and practically efficient.


ACM Symposium on Parallel Algorithms and Architectures | 2015

Sorting with Asymmetric Read and Write Costs

Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Yan Gu; Julian Shun

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in O(n) writes, O(n log n) reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges k > 1 for writing a block of size B to the secondary memory, and present variants of three EM sorting algorithms (multi-way merge sort, sample sort, and heap sort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly k block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield provably good bounds for parallel machines with private caches or with a shared cache, respectively.
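
Restating the abstract's PRAM-level bounds compactly, with a write/read cost ratio ω introduced here only for illustration (the abstract's k plays this role in the EM setting):

```latex
\[
  \text{Writes: } O(n), \qquad \text{Reads: } O(n \log n), \qquad \text{Depth: } O(\log n),
\]
\[
  \text{so with write/read cost ratio } \omega,\ \text{the total asymmetric cost of sorting is } O(n \log n + \omega\, n).
\]
```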


Parallel Computing | 2014

A simple parallel cartesian tree algorithm and its application to parallel suffix tree construction

Julian Shun; Guy E. Blelloch

We present a simple linear work and space, and polylogarithmic time parallel algorithm for generating multiway Cartesian trees. We show that bottom-up traversals of the multiway Cartesian tree on the interleaved suffix array and longest common prefix array of a string can be used to answer certain string queries. By adding downward pointers in the tree (e.g. using a hash table), we can also generate suffix trees from suffix arrays on arbitrary alphabets in the same bounds. In conjunction with parallel suffix array algorithms, such as the skew algorithm, this gives a rather simple linear work parallel, O(n^ε) time (0 < ε < 1), algorithm for generating suffix trees over an integer alphabet Σ ⊆ {1, ..., n}, where n is the length of the input string. It also gives a linear work parallel algorithm requiring O(log^2 n) time with high probability for constant-sized alphabets. More generally, given a sorted sequence of strings and the longest common prefix lengths between adjacent elements, the algorithm will generate a patricia tree (compacted trie) over the strings. Of independent interest, we describe a work-efficient parallel algorithm for solving the all nearest smaller values problem using Cartesian trees, which is much simpler than the work-efficient parallel algorithm described in previous work. We also present experimental results comparing the performance of the algorithm to existing sequential implementations and a second parallel algorithm that we implement. We present comparisons for the Cartesian tree algorithm on its own and for constructing a suffix tree. The results show that on a variety of strings our algorithm is competitive with the sequential version on a single processor and achieves good speedup on multiple processors. We present experiments for three applications that require only the Cartesian tree, and also for searching using the suffix tree.
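
For intuition about the structure being built, the sketch below is the standard sequential stack-based construction of a binary, min-rooted Cartesian tree over an array. It is not the paper's algorithm, which is parallel, builds multiway Cartesian trees (grouping equal values), and applies them to the interleaved suffix and LCP arrays; none of that is reproduced here.

```cpp
// Classic linear-time, stack-based construction of a min-rooted Cartesian tree.
#include <cstdio>
#include <vector>

struct Node { int parent = -1, left = -1, right = -1; };

// Builds the tree into t (parent/child indices) and returns the root index.
int cartesian_tree(const std::vector<int>& a, std::vector<Node>& t) {
  const int n = (int)a.size();
  t.assign(n, Node{});
  std::vector<int> stack;             // rightmost spine, values increasing upward
  for (int i = 0; i < n; ++i) {
    int last = -1;
    while (!stack.empty() && a[stack.back()] > a[i]) {
      last = stack.back();            // pop everything larger than a[i]
      stack.pop_back();
    }
    if (last != -1) { t[i].left = last; t[last].parent = i; }
    if (!stack.empty()) { t[stack.back()].right = i; t[i].parent = stack.back(); }
    stack.push_back(i);
  }
  return stack.front();               // bottom of the stack holds the minimum, i.e. the root
}

int main() {
  std::vector<int> lcp = {1, 0, 3, 2, 4};   // an LCP-like example array
  std::vector<Node> t;
  int root = cartesian_tree(lcp, t);
  std::printf("root = %d (value %d)\n", root, lcp[root]);
  for (int i = 0; i < (int)lcp.size(); ++i)
    std::printf("node %d: parent=%d left=%d right=%d\n",
                i, t[i].parent, t[i].left, t[i].right);
  return 0;
}
```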

Collaboration


Dive into Julian Shun's collaborations.

Top Co-Authors

Guy E. Blelloch

Carnegie Mellon University

Yan Gu

Carnegie Mellon University

Laxman Dhulipala

Carnegie Mellon University

Yihan Sun

Carnegie Mellon University

Farbod Roosta-Khorasani

University of British Columbia

Charles McGuffey

Carnegie Mellon University
