Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Benjamin Lipshitz is active.

Publications


Featured research published by Benjamin Lipshitz.


ACM Symposium on Parallel Algorithms and Architectures | 2013

Communication optimal parallel multiplication of sparse random matrices

Grey Ballard; Aydin Buluç; James Demmel; Laura Grigori; Benjamin Lipshitz; Oded Schwartz; Sivan Toledo

Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
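As a small illustration of the setting only (not the paper's parallel algorithm or its lower-bound analysis), the following Python sketch builds two sparse matrices with Erdős-Rényi nonzero structure ER(n, d/n) and multiplies them with SciPy; the parameters n and d follow the usual convention of about d nonzeros per row on average.

```python
import numpy as np
import scipy.sparse as sp

def erdos_renyi_sparse(n, d, rng):
    # Each of the n*n entries is nonzero independently with probability
    # d/n, giving about d nonzeros per row on average (the ER(n, d/n) model).
    return sp.random(n, n, density=d / n, format="csr", random_state=rng)

rng = np.random.default_rng(0)
n, d = 10_000, 8
A = erdos_renyi_sparse(n, d, rng)
B = erdos_renyi_sparse(n, d, rng)

C = A @ B  # sequential SpGEMM; the paper studies its parallel communication cost
# In this regime C has roughly d**2 * n nonzeros, far fewer than n**2.
print(A.nnz, B.nnz, C.nnz)
```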


International Parallel and Distributed Processing Symposium | 2013

Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication

James Demmel; David Eliahu; Armando Fox; Shoaib Kamil; Benjamin Lipshitz; Oded Schwartz; Omer Spillinger

Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache- and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
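Here is a minimal sequential Python sketch of the dimension-splitting recursion described above (split the largest of the three dimensions m, k, n in half and recurse). The base-case threshold is an assumption, and the BFS/DFS parallelization and data layout that make the actual algorithm communication-optimal are omitted.

```python
import numpy as np

def rmm(A, B, base=64):
    """Recursive rectangular matrix multiply: returns A @ B.

    At each step, split the largest of the three dimensions (m, k, n)
    in half; recurse until all dimensions fit the base case.
    """
    m, k = A.shape
    _, n = B.shape
    if max(m, k, n) <= base:
        return A @ B                       # base case: library multiply
    if m >= k and m >= n:                  # split the rows of A
        h = m // 2
        return np.vstack([rmm(A[:h], B, base), rmm(A[h:], B, base)])
    if n >= k:                             # split the columns of B
        h = n // 2
        return np.hstack([rmm(A, B[:, :h], base), rmm(A, B[:, h:], base)])
    h = k // 2                             # split the shared dimension k
    return rmm(A[:, :h], B[:h], base) + rmm(A[:, h:], B[h:], base)

# quick check on a deliberately skinny problem
A = np.random.rand(300, 70)
B = np.random.rand(70, 513)
assert np.allclose(rmm(A, B), A @ B)
```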


ACM Symposium on Parallel Algorithms and Architectures | 2012

Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds

Grey Ballard; James Demmel; Olga Holtz; Benjamin Lipshitz; Oded Schwartz

A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
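For context, the bandwidth-cost lower bounds involved can be written as follows. These are the standard forms from this line of work, restated from memory rather than quoted from the brief announcement; M denotes the local memory per processor and ω₀ = log₂ 7 the exponent of Strassen's algorithm.

```latex
% Lower bounds on words moved per processor for multiplying n x n matrices
% on P processors, each with local memory M.
W_{\text{classical}} = \Omega\!\left(\max\left\{\frac{n^{3}}{P\sqrt{M}},\ \frac{n^{2}}{P^{2/3}}\right\}\right),
\qquad
W_{\text{Strassen}} = \Omega\!\left(\max\left\{\frac{n^{\omega_{0}}}{P\,M^{\omega_{0}/2-1}},\ \frac{n^{2}}{P^{2/\omega_{0}}}\right\}\right).
% Perfect strong scaling is possible only while the memory-dependent
% (first) term dominates, i.e. up to
P = O\!\left(n^{3}/M^{3/2}\right) \text{ (classical)},
\qquad
P = O\!\left(n^{\omega_{0}}/M^{\omega_{0}/2}\right) \text{ (Strassen-based)}.
```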


International Parallel and Distributed Processing Symposium | 2013

Perfect Strong Scaling Using No Additional Energy

James Demmel; Andrew Gearhart; Benjamin Lipshitz; Oded Schwartz

Energy efficiency of computing devices has become a dominant area of research interest in recent years. Most previous work has focused on architectural techniques to improve power and energy efficiency; only a few consider saving energy at the algorithmic level. We prove that a region of perfect strong scaling in energy exists for matrix multiplication (classical and Strassen) and the direct n-body problem via the use of algorithms that use all available memory to replicate data. This means that we can increase the number of processors by some factor and decrease the runtime (both computation and communication) by the same factor, without changing the total energy use.
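A back-of-the-envelope version of the claim, under an assumed toy model (mine, not the paper's detailed energy model) in which total energy is processors times average per-processor power times time:

```latex
% Toy model: total energy = processors x average power per processor x time.
E(P) = P \,\rho\, T(P).
% In a perfect strong scaling range, T(cP) = T(P)/c, so scaling the machine
% by a factor c leaves the total energy unchanged:
E(cP) = cP \,\rho\, \frac{T(P)}{c} = P \,\rho\, T(P) = E(P).
```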


arXiv: Data Structures and Algorithms | 2012

Graph expansion analysis for communication costs of fast rectangular matrix multiplication

Grey Ballard; James Demmel; Olga Holtz; Benjamin Lipshitz; Oded Schwartz

Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example, to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal.
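As a reminder of the kind of quantity such an analysis works with (the notation is mine, not quoted from the paper), the small-set edge expansion of the computation graph G = (V, E) is:

```latex
% Edge expansion of G = (V, E) over vertex sets of size at most s:
h_{s}(G) = \min_{\substack{S \subseteq V \\ |S| \le s}}
           \frac{|E(S,\, V \setminus S)|}{|S|}.
% Bounds of this type relate the words a processor must communicate to the
% number of edges leaving any subset of the DAG it can compute locally.
```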


ACM Symposium on Parallel Algorithms and Architectures | 2013

Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Grey Ballard; James Demmel; Benjamin Lipshitz; Oded Schwartz; Sivan Toledo

High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a block-recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.
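Not the paper's morphing procedure itself, but a minimal Python sketch of the block-recursive (quadrant-by-quadrant, Morton-like) ordering that the morphing targets, for a 2^k by 2^k matrix; the function name and the power-of-two restriction are assumptions made for brevity.

```python
import numpy as np

def block_recursive_order(A):
    """Return the elements of a (2^k x 2^k) matrix A in block-recursive
    (quadrant-by-quadrant, Morton-like) order, as a 1-D array."""
    n = A.shape[0]
    if n == 1:
        return A.ravel()
    h = n // 2
    return np.concatenate([
        block_recursive_order(A[:h, :h]),  # top-left quadrant first
        block_recursive_order(A[:h, h:]),  # then top-right
        block_recursive_order(A[h:, :h]),  # then bottom-left
        block_recursive_order(A[h:, h:]),  # bottom-right last
    ])

A = np.arange(16).reshape(4, 4)
print(block_recursive_order(A))
# [ 0  1  4  5  2  3  6  7  8  9 12 13 10 11 14 15]
```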


SIAM Journal on Matrix Analysis and Applications | 2016

Improving the numerical stability of fast matrix multiplication

Grey Ballard; Austin R. Benson; Alex Druinsky; Benjamin Lipshitz; Oded Schwartz



First International Workshop on Communication Optimizations in HPC (COMHPC) | 2016

Network topologies and inevitable contention

Grey Ballard; James Demmel; Andrew Gearhart; Benjamin Lipshitz; Yishai Oltchik; Oded Schwartz; Sivan Toledo



ACM Symposium on Parallel Algorithms and Architectures | 2012

Communication-optimal parallel algorithm for Strassen's matrix multiplication

Grey Ballard; James Demmel; Olga Holtz; Benjamin Lipshitz; Oded Schwartz



IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Communication-avoiding parallel Strassen: implementation and performance

Benjamin Lipshitz; Grey Ballard; James Demmel; Oded Schwartz


Collaboration


Dive into Benjamin Lipshitz's collaborations.

Top Co-Authors

Oded Schwartz (University of California)
James Demmel (University of California)
Grey Ballard (Sandia National Laboratories)
Olga Holtz (University of California)
Alex Druinsky (Lawrence Berkeley National Laboratory)
Armando Fox (University of California)
David Eliahu (University of California)