Oded Green
Georgia Institute of Technology
Publications
Featured research published by Oded Green.
International Conference on Supercomputing | 2012
Oded Green; Robert McColl; David A. Bader
Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when computation is performed using certain types of memory access patterns. Merging two sorted arrays is a useful primitive and a basic building block for numerous applications such as joining database queries, merging adjacency lists in graphs, and set intersection. An efficient parallel merging algorithm partitions the sorted input arrays into sets of non-overlapping sub-arrays that can be independently merged on multiple cores. For optimal performance, the partitioning should be done in parallel and should divide the input arrays such that each core receives an equal amount of data to merge. In this paper, we present an algorithm that partitions the workload equally amongst the GPU's Streaming Multiprocessors (SMs). Following this, we show how each SM performs a parallel merge and how the work is divided so that all of the GPU's Streaming Processors (SPs) are utilized. All stages of this algorithm are parallel, and the new algorithm makes good use of the GPU memory hierarchy. This approach achieves average speedups of 20X and 50X over a sequential merge on the x86 platform for integer and floating-point keys, respectively. Our implementation is 10X faster than the fast parallel merge supplied in the CUDA Thrust library.
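The core of this partitioning scheme is a binary search along each cross-diagonal of the merge grid: diagonal d crosses the merge path at the unique split (i, j), i + j = d, whose prefixes merge to the first d outputs. The sketch below is a minimal CPU illustration of that idea in the standard Merge Path formulation; the function names and the use of std::merge for the per-partition work are illustrative, not the paper's GPU code.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// For diagonal d (0 <= d <= |A|+|B|), find the split (i, j) with
// i + j == d such that merging A[0..i) with B[0..j) produces exactly
// the first d outputs. Binary search along the cross-diagonal.
std::pair<int, int> mergePathPartition(const std::vector<int>& A,
                                       const std::vector<int>& B, int d) {
    int lo = std::max(0, d - (int)B.size());
    int hi = std::min(d, (int)A.size());
    while (lo < hi) {
        int i = lo + (hi - lo) / 2;       // elements taken from A
        int j = d - i;                    // elements taken from B
        if (A[i] < B[j - 1]) lo = i + 1;  // split needs more of A
        else hi = i;
    }
    return {lo, d - lo};
}

int main() {
    std::vector<int> A = {1, 3, 5, 7, 9, 11};
    std::vector<int> B = {2, 4, 6, 8, 10, 12};
    const int total = (int)(A.size() + B.size());
    const int parts = 4;  // stand-ins for SMs/cores
    std::vector<int> out(total);
    // Each part merges an equal-sized, non-overlapping slice independently.
    for (int p = 0; p < parts; ++p) {
        auto [i0, j0] = mergePathPartition(A, B, p * total / parts);
        auto [i1, j1] = mergePathPartition(A, B, (p + 1) * total / parts);
        std::merge(A.begin() + i0, A.begin() + i1,
                   B.begin() + j0, B.begin() + j1,
                   out.begin() + p * total / parts);
    }
    for (int x : out) std::printf("%d ", x);
    return 0;
}
```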
Privacy, Security, Risk and Trust | 2012
Oded Green; Robert McColl; David A. Bader
Analysis of social networks is challenging due to the rapid changes in their members and relationships. In many cases it is impractical to recompute the metric of interest, so streaming algorithms are used to reduce the total runtime following modifications to the graph. Centrality is often used for determining the relative importance of a vertex or edge in a graph. The betweenness centrality of a vertex is the fraction of all shortest paths in the graph that pass through that vertex. Vertices with high betweenness centrality are usually key players in a social network or bottlenecks in a communication network. Evaluating the betweenness centrality of a graph G = (V, E) is computationally demanding, and the best known algorithm for unweighted graphs has an upper-bound time complexity of O(V² + VE). Consequently, it is desirable to avoid a full re-computation of betweenness centrality when a new edge is inserted into the graph. In this work, we give a novel algorithm that reduces computation for the insertion of an edge into the graph. This is the first algorithm for the computation of betweenness centrality in a streaming graph. While the upper-bound time complexity of the new algorithm is the same as that of the static graph algorithm, we show significant speedups for both synthetic and real graphs. For synthetic graphs the speedup varies with the type and size of the graph; for synthetic graphs with 16384 vertices the average speedup is between 100X and 400X. For five different real-world collaboration networks the average speedup per graph is in the range of 36X-148X.
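For context, the static O(V² + VE) baseline referenced above is Brandes' algorithm, which pairs a BFS from every source with a dependency back-propagation pass. The sketch below shows only that static baseline (it is not the paper's streaming algorithm, and the names are illustrative):

```cpp
#include <queue>
#include <stack>
#include <vector>

// Static Brandes betweenness centrality for an unweighted graph: the
// O(V^2 + VE) baseline that a streaming algorithm avoids re-running
// after every edge insertion. For undirected graphs, halve the result.
std::vector<double> brandes(const std::vector<std::vector<int>>& adj) {
    const int n = (int)adj.size();
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);
        std::vector<long long> sigma(n, 0);  // # shortest s-paths per vertex
        std::vector<int> dist(n, -1);
        std::vector<double> delta(n, 0.0);   // accumulated dependencies
        std::stack<int> order;               // vertices in BFS discovery order
        std::queue<int> q;
        sigma[s] = 1; dist[s] = 0; q.push(s);
        while (!q.empty()) {                 // BFS from source s
            int v = q.front(); q.pop();
            order.push(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) {
                    sigma[w] += sigma[v];
                    pred[w].push_back(v);
                }
            }
        }
        while (!order.empty()) {             // back-propagate dependencies
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];
        }
    }
    return bc;
}
```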
Irregular Applications: Architectures and Algorithms | 2014
Oded Green; Pavan Yalamanchili; Lluís-Miquel Munguía
Triangle counting in a graph is a building block for clustering coefficients, a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation of triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism: the first level partitions the vertices across the streaming multiprocessors on the GPU, and the second level parallelizes the work across the GPU's streaming processors, utilizing different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X-32X over a sequential CPU implementation.
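The primitive being parallelized is the intersection of sorted adjacency lists: every common neighbor of an edge's endpoints closes a triangle. Below is a minimal sequential sketch of that primitive, assuming sorted adjacency lists; it shows the work Intersect Path distributes across the GPU and is not the paper's kernel.

```cpp
#include <vector>

// Count triangles via two-pointer intersection of sorted adjacency
// lists: for each undirected edge (u, v), processed once with u < v,
// each common neighbor w closes the triangle {u, v, w}.
long long countTriangles(const std::vector<std::vector<int>>& adj) {
    long long count = 0;
    for (int u = 0; u < (int)adj.size(); ++u) {
        for (int v : adj[u]) {
            if (v <= u) continue;             // visit each edge once
            size_t i = 0, j = 0;              // two-pointer intersection
            while (i < adj[u].size() && j < adj[v].size()) {
                if (adj[u][i] < adj[v][j]) ++i;
                else if (adj[u][i] > adj[v][j]) ++j;
                else { ++count; ++i; ++j; }   // common neighbor found
            }
        }
    }
    // Each triangle is discovered once per each of its three edges.
    return count / 3;
}
```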
ACM Symposium on Parallel Algorithms and Architectures | 2015
Oded Green; Marat Dukhan; Richard W. Vuduc
This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementations that avoid branches. As a proof-of-concept, we devise such branch-avoiding implementations of the Shiloach-Vishkin algorithm for computing connected components. We evaluate these implementations on current x86 and ARM-based processors to show the efficacy of the approach. Our results suggest how both compiler writers and architects might exploit this insight to improve graph processing systems more broadly and create better systems for such problems.
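As a concrete illustration of what "branch-avoiding" means here, the sketch below contrasts a branchy hooking step with an arithmetic one in a Shiloach-Vishkin-style labeling update. It is a minimal sketch, assuming the compiler lowers the ternary to a conditional move; the function names are illustrative, not the paper's implementation.

```cpp
#include <cstdint>

// Branchy hooking: when edge endpoints' labels are effectively random,
// this conditional mispredicts often.
inline void hookBranchy(uint32_t* label, uint32_t u, uint32_t v) {
    if (label[v] < label[u]) label[u] = label[v];
}

// Branch-avoiding hooking: the same instructions execute regardless of
// the data, trading an unpredictable jump for a data dependency.
inline void hookBranchless(uint32_t* label, uint32_t u, uint32_t v) {
    uint32_t lu = label[u], lv = label[v];
    label[u] = lv < lu ? lv : lu;   // typically compiles to a cmov
}
```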
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2014
Oded Green; Lluís-Miquel Munguía; David A. Bader
Clustering coefficients is a building block in network science that offers insight into how tightly bound vertices are in a network. Effective and scalable parallelization of clustering coefficients requires load balancing amongst the cores. This is not easy to achieve, since many real-world networks are scale-free, which leads to some vertices requiring more attention than others. In this work we show two scalable approaches that load-balance the computation of clustering coefficients. The first method achieves optimal load balancing with an O(|E|) storage requirement. The second method has a lower storage requirement of O(|V|) at the cost of some imbalance. While both methods have a similar time complexity, they represent a tradeoff between maintaining a balanced workload and memory complexity. Using a 40-core system, we show that our load balancing techniques outperform the widely used, simple parallel approach by a factor of 3X-7.5X for real graphs and 1.5X-4X for random graphs. Further, we achieve a 25X-35X speedup over the sequential algorithm for most of the graphs.
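One way to approximate edge-proportional balance with only O(|V|) extra storage is to binary-search a degree prefix sum for per-thread vertex boundaries, so high-degree vertices do not skew the load. The sketch below is an illustrative approximation of that idea under these assumptions, not the paper's exact method.

```cpp
#include <algorithm>
#include <vector>

// Split vertices into P contiguous ranges carrying roughly equal
// numbers of edge endpoints. Storage beyond the input is O(|V|).
std::vector<int> partitionByEdges(const std::vector<int>& degree, int P) {
    const int n = (int)degree.size();
    std::vector<long long> prefix(n + 1, 0);       // prefix[v] = sum of degrees < v
    for (int v = 0; v < n; ++v) prefix[v + 1] = prefix[v] + degree[v];
    std::vector<int> bound(P + 1, n);
    bound[0] = 0;
    for (int p = 1; p < P; ++p) {
        long long target = prefix[n] * p / P;      // desired work boundary
        bound[p] = (int)(std::lower_bound(prefix.begin(), prefix.end(), target)
                         - prefix.begin());
    }
    return bound;  // thread p processes vertices [bound[p], bound[p+1])
}
```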
International Conference on Social Computing | 2013
Oded Green; David A. Bader
Clustering coefficients, also called triangle counting, is a widely-used graph analytic for measuring the closeness in which vertices cluster together. Intuitively, clustering coefficients can be thought of as the ratio of common friends versus all possible connections a person might have in a social network. The best known time complexity for computing clustering coefficients uses adjacency list intersection and is O(V · dmax²), where dmax is the size of the largest adjacency list over all the vertices in the graph. In this work, we show a novel approach for computing the clustering coefficients in undirected, unweighted graphs by exploiting a vertex cover, V̂ ⊆ V. This new approach reduces the number of times each triangle is counted by as much as a factor of three. The complexity of the new algorithm is O(V̂ · d̂max² + tVC), where d̂max is the size of the largest adjacency list in the vertex cover and tVC is the time needed to find the vertex cover. Even with a simple vertex cover algorithm, this can reduce the execution time by 10-30% while counting the exact number of triangles (3-circuits). We extend the use of the vertex cover to support counting squares (4-circuits) and clustering coefficients for dynamic graphs.
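The "simple vertex cover algorithm" mentioned above could be, for instance, the classic greedy 2-approximation that takes both endpoints of any uncovered edge; the abstract does not specify which cover is used, so the sketch below is only an assumed example of such a routine.

```cpp
#include <utility>
#include <vector>

// Greedy maximal-matching 2-approximation of a vertex cover: whenever
// an edge is not yet covered, add both of its endpoints. The returned
// cover V̂ touches every edge, so intersecting only cover vertices'
// lists still finds every triangle.
std::vector<bool> greedyVertexCover(int n,
        const std::vector<std::pair<int, int>>& edges) {
    std::vector<bool> inCover(n, false);
    for (auto [u, v] : edges) {
        if (!inCover[u] && !inCover[v]) {
            inCover[u] = inCover[v] = true;  // cover this edge's endpoints
        }
    }
    return inCover;
}
```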
IEEE High Performance Extreme Computing Conference | 2016
Oded Green; David A. Bader
cuSTINGER, a new graph data structure targeting NVIDIA GPUs, is designed for streaming graphs that evolve over time. cuSTINGER offers algorithm designers greater productivity and efficiency when implementing GPU-based analytics, relieving them of managing memory and data placement. In comparison with static graph data structures, which may require transferring the entire graph back and forth between the device and the host memories for each update or require reconstruction on the device, cuSTINGER only requires transferring the updates themselves, reducing the total amount of data transferred. cuSTINGER gives users the flexibility, based on application needs, to update the graph one edge at a time or through batch updates. cuSTINGER supports extremely high update rates: over 1 million updates per second for mid-size batches of 10k updates and 10 million updates per second for large batches with millions of updates.
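To make the update-batching idea concrete, here is a hypothetical host-side sketch; it is emphatically not cuSTINGER's actual API (the struct and function names are invented for illustration). The point is that only the small EdgeUpdate records cross the host-device boundary, never the full graph.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical batch-update record: insertions and deletions are
// described compactly, so a batch of B updates costs O(B) transfer
// instead of O(|V| + |E|) for shipping the whole graph.
struct EdgeUpdate {
    uint32_t src, dst;
    bool     insert;  // true = insertion, false = deletion
};

// Stand-in for the device-side apply step. A real GPU implementation
// would copy the batch to device memory and patch per-vertex edge
// blocks in place; here a host adjacency list plays the device's role.
void applyBatch(std::vector<std::vector<uint32_t>>& adj,
                const std::vector<EdgeUpdate>& batch) {
    for (const auto& u : batch) {
        if (u.insert) {
            adj[u.src].push_back(u.dst);
        } else {
            auto& lst = adj[u.src];           // swap-and-pop removal
            for (size_t i = 0; i < lst.size(); ++i)
                if (lst[i] == u.dst) { lst[i] = lst.back(); lst.pop_back(); break; }
        }
    }
}
```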
IEEE High Performance Extreme Computing Conference | 2017
Oded Green; James Fox; Euna Kim; Federico Busato; Nicola Bombieri; Kartik Lakhotia; Shijie Zhou; Shreyas G. Singapura; Hanqing Zeng; Rajgopal Kannan; Viktor K. Prasanna; David A. Bader
The k-truss of a graph is a subgraph in which each edge is tightly connected to the remaining elements of the k-truss, and it can represent an important community in the graph. Finding the k-truss of a graph can be done in polynomial time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and rerun static graph triangle counting in each iteration. In this work we present a novel algorithm for finding both the k-truss of a graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun static graph triangle counting after the removal of non-conforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by the removal. As our updates are local, we do only a fraction of the work compared to other algorithms. 2) Our algorithm is extremely scalable and is able to detect deleted triangles concurrently, in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.
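For reference, the recount-per-iteration baseline that the paper improves on looks roughly like the sketch below: repeatedly recompute each edge's triangle support and peel edges whose support falls below k - 2, until a fixed point. This is a naive illustration of that baseline, not the paper's dynamic algorithm.

```cpp
#include <set>
#include <utility>
#include <vector>

// Naive k-truss peeling: every surviving edge must participate in at
// least k - 2 triangles among surviving edges. Triangle support is
// fully recounted each round, which is the cost the dynamic
// formulation avoids by updating only the edges a removal affects.
std::set<std::pair<int, int>> kTruss(int n,
        std::set<std::pair<int, int>> edges, int k) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<std::set<int>> nbr(n);        // rebuild adjacency
        for (auto [u, v] : edges) { nbr[u].insert(v); nbr[v].insert(u); }
        std::set<std::pair<int, int>> keep;
        for (auto [u, v] : edges) {
            int support = 0;                      // common neighbors = triangles
            for (int w : nbr[u]) if (nbr[v].count(w)) ++support;
            if (support >= k - 2) keep.insert({u, v});
            else changed = true;                  // edge peeled this round
        }
        edges.swap(keep);
    }
    return edges;
}
```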
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale | 2016
Piyush Sao; Oded Green; Chirag Jain; Richard W. Vuduc
We present a new fault-tolerant algorithm for computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm based on the technique of label propagation (LP). To make the LP algorithm resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects whether it has reached an invalid state, given that it was previously in a known valid state, and, if so, restores itself to a valid state, assuming the availability of a selective guaranteed-reliable mode. Our self-correcting algorithm, FT-LP, has relatively small storage and computation overheads: in empirical tests on a variety of input graphs, we observe execution time overheads of 10-35% in FT-LP compared to LP, even at high fault rates, with the computation overhead increasing gracefully as fault rates increase.
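The sketch below illustrates the detect-and-restore pattern in its simplest form: run (possibly faulty) LP rounds, validate the fixed-point invariants in an assumed reliable mode, and fall back to a known valid state on failure. FT-LP restores selectively and with small overhead; this sketch restarts from the initial state for brevity and is illustrative, not the paper's code.

```cpp
#include <vector>

// Self-correcting label propagation for connected components (sketch).
// The propagation loop is assumed to run on hardware subject to
// transient faults; the short validation pass is assumed to run in a
// guaranteed-reliable mode.
std::vector<int> selfCorrectingCC(const std::vector<std::vector<int>>& adj) {
    const int n = (int)adj.size();
    std::vector<int> label(n);
    for (;;) {
        for (int v = 0; v < n; ++v) label[v] = v;  // known valid state
        bool changed = true;
        while (changed) {                          // possibly faulty LP rounds
            changed = false;
            for (int v = 0; v < n; ++v)
                for (int u : adj[v])
                    if (label[u] < label[v]) { label[v] = label[u]; changed = true; }
        }
        bool valid = true;                         // reliable invariant check
        for (int v = 0; v < n && valid; ++v) {
            if (label[v] > v) valid = false;       // labels may only shrink
            for (int u : adj[v])
                if (label[u] != label[v]) valid = false;  // fixed point broken
        }
        if (valid) return label;                   // else restore and retry
    }
}
```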
Irregular Applications: Architectures and Algorithms | 2014
Oded Green
Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors, as they are useful for many high performance computing applications. While it is well known that performance and branch prediction accuracy go hand in hand, they have not been studied in the context of merging. We thoroughly test merging using multiple input array sizes and value ranges, with the same code and compiler optimizations. As the number of possible keys increases, so does the number of branch mispredictions, resulting in reduced performance; the slowdown can be as much as 5X. We explain this phenomenon intuitively using the Merge Path visualization technique, and we support this visualization approach with modeling, thorough testing, and analysis on multiple systems.
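The data-dependent branch in question is the take-from-A-or-B decision inside the merge loop. The sketch below contrasts that branchy loop with a branch-avoiding variant that replaces the jump with arithmetic; it is a minimal illustration, assuming the compiler emits a conditional move for the select, and is not the paper's benchmark code.

```cpp
#include <cstdint>
#include <vector>

// Branchy merge: with few distinct key values the comparison is highly
// predictable, while random keys make it mispredict roughly half the
// time, which drives the measured slowdown.
void mergeBranchy(const std::vector<uint32_t>& a,
                  const std::vector<uint32_t>& b,
                  std::vector<uint32_t>& out) {
    size_t i = 0, j = 0, k = 0;
    out.resize(a.size() + b.size());
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) out[k++] = a[i++];  // data-dependent branch
        else              out[k++] = b[j++];
    }
    while (i < a.size()) out[k++] = a[i++];
    while (j < b.size()) out[k++] = b[j++];
}

// Branch-avoiding merge: index updates become arithmetic on the
// comparison result, so the hot loop carries no unpredictable jump.
void mergeBranchless(const std::vector<uint32_t>& a,
                     const std::vector<uint32_t>& b,
                     std::vector<uint32_t>& out) {
    size_t i = 0, j = 0, k = 0;
    out.resize(a.size() + b.size());
    while (i < a.size() && j < b.size()) {
        size_t takeA = (a[i] <= b[j]);        // 0 or 1
        out[k++] = takeA ? a[i] : b[j];       // typically a cmov/select
        i += takeA;
        j += 1 - takeA;
    }
    while (i < a.size()) out[k++] = a[i++];
    while (j < b.size()) out[k++] = b[j++];
}
```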