Publication


Featured research published by Shaden Smith.


International Parallel and Distributed Processing Symposium | 2015

SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication

Shaden Smith; Niranjay Ravindran; Nicholas D. Sidiropoulos; George Karypis

Multi-dimensional arrays, or tensors, are increasingly found in fields such as signal processing and recommender systems. Real-world tensors can be enormous in size and often very sparse. There is a need for efficient, high-performance tools capable of processing the massive sparse tensors of today and the future. This paper introduces SPLATT, a C library with shared-memory parallelism for three-mode tensors. SPLATT contains algorithmic improvements over competing state-of-the-art tools for sparse tensor factorization. SPLATT has a fast, parallel method of multiplying a matricized tensor by a Khatri-Rao product, which is a key kernel in tensor factorization methods. SPLATT uses a novel data structure that exploits the sparsity patterns of tensors. This data structure has a small memory footprint similar to competing methods and allows for the computational improvements featured in our work. We also present a method of finding cache-friendly reorderings and utilizing them with a novel form of cache tiling. To our knowledge, this is the first work to investigate reordering and cache tiling in this context. SPLATT averages almost 30x speedup compared to our baseline when using 16 threads and reaches over 80x speedup on NELL-2.
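
For readers unfamiliar with the kernel named above, the matricized tensor times Khatri-Rao product (MTTKRP) has a very compact element-wise form for sparse tensors. The sketch below is a minimal NumPy version of that formulation on a coordinate-format toy tensor; it only illustrates what the kernel computes, not SPLATT's fiber-based algorithm, and all names and sizes are made up.

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, num_rows):
    """Element-wise MTTKRP for a sparse 3-mode tensor in coordinate format.

    For each nonzero X[i, j, k] = v, accumulate v * (B[j, :] * C[k, :])
    into row i of the output. SPLATT's contribution is computing this far
    more efficiently via fiber-level reuse, reordering, and cache tiling.
    """
    M = np.zeros((num_rows, B.shape[1]))
    for (i, j, k), v in zip(indices, values):
        M[i, :] += v * B[j, :] * C[k, :]
    return M

# Toy 3-mode tensor with four nonzeros; factors are random rank-4 matrices.
indices = [(0, 1, 2), (0, 0, 0), (1, 2, 1), (2, 1, 0)]
values = [1.0, 2.0, 3.0, 0.5]
rng = np.random.default_rng(0)
B, C = rng.random((3, 4)), rng.random((3, 4))
print(mttkrp_coo(indices, values, B, C, num_rows=3))
```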


Irregular Applications: Architectures and Algorithms | 2015

Tensor-matrix products with a compressed sparse tensor

Shaden Smith; George Karypis

The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer from high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.
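
The nesting idea behind a compressed fiber layout can be pictured as CSR applied recursively: an index shared by many nonzeros is stored once per group instead of once per nonzero. The sketch below is a dictionary-of-dictionaries stand-in for that idea (a real CSF structure uses flat pointer arrays, which this deliberately avoids for brevity); the data is made up.

```python
from collections import defaultdict

def build_fiber_tree(nonzeros):
    """Group COO nonzeros (i, j, k, v) into a nested fiber layout:
    mode-0 index -> mode-1 index -> list of (mode-2 index, value).

    An index that repeats across many nonzeros is stored once per group,
    which is where the savings over flat coordinate storage come from.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for i, j, k, v in nonzeros:
        tree[i][j].append((k, v))
    return tree

nonzeros = [(0, 1, 0, 2.0), (0, 1, 2, 1.0), (0, 3, 1, 3.0), (2, 1, 1, 0.5)]
for i, fibers in build_fiber_tree(nonzeros).items():
    for j, entries in fibers.items():
        print(f"fiber ({i}, {j}, :) ->", entries)
```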


Asilomar Conference on Signals, Systems and Computers | 2014

Memory-efficient parallel computation of tensor and matrix products for big tensor decomposition

Niranjay Ravindran; Nicholas D. Sidiropoulos; Shaden Smith; George Karypis

Low-rank tensor decomposition has many applications in signal processing and machine learning, and is becoming increasingly important for analyzing big data. A significant challenge is the computation of intermediate products, which can be much larger than the final result of the computation, or even the original tensor. We propose a scheme that allows memory-efficient in-place updates of intermediate matrices. Motivated by recent advances in big tensor decomposition from multiple compressed replicas, we also consider the related problem of memory-efficient tensor compression. The resulting algorithms can be parallelized, and can exploit but do not require sparsity.
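
The size gap between the intermediate product and the final result is easy to see with a small calculation. The sketch below forms the explicit Khatri-Rao intermediate for tiny matrices and then compares element counts for some illustrative (not paper-specific) dimensions; the in-place alternative the abstract refers to accumulates one scaled row per nonzero directly into the output instead.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Khatri-Rao product: (K x R) and (J x R) -> (K*J x R)."""
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

# The intermediate grows multiplicatively in the tensor dimensions ...
print(khatri_rao(np.ones((5, 2)), np.ones((4, 2))).shape)   # (20, 2)

# ... while the final MTTKRP result is only I x R. Illustrative sizes:
I, J, K, R = 1_000, 2_000, 3_000, 16
intermediate_values = J * K * R       # explicit Khatri-Rao intermediate
result_values = I * R                 # what the decomposition actually needs
print(f"intermediate is {intermediate_values / result_values:.0f}x larger")
# An in-place scheme instead accumulates one scaled row per nonzero
# directly into the I x R output and never forms the (J*K) x R matrix.
```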


Frequent Pattern Mining | 2014

Big Data Frequent Pattern Mining

David C. Anastasiu; Jeremy Iverson; Shaden Smith; George Karypis

Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.
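
As a toy illustration of the work-partitioning and aggregation theme discussed in the chapter, the sketch below performs one Apriori-style support-counting pass: the transaction database is split into chunks, each worker counts the candidate itemsets in its chunk, and the partial counts are merged. The candidates, transactions, and two-way split are all made up; this is not an algorithm from the chapter.

```python
from collections import Counter
from multiprocessing import Pool

CANDIDATES = [frozenset(c) for c in [("a", "b"), ("a", "c"), ("b", "c")]]

def count_chunk(transactions):
    """Count how many transactions in this chunk contain each candidate."""
    counts = Counter()
    for t in transactions:
        items = set(t)
        for cand in CANDIDATES:
            if cand <= items:
                counts[cand] += 1
    return counts

if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "b"), ("b", "c"), ("a", "c"), ("a", "b", "c")]
    chunks = [db[:3], db[3:]]                 # partition the work across workers
    with Pool(2) as pool:
        partials = pool.map(count_chunk, chunks)
    support = sum(partials, Counter())        # merge the partial counts
    for cand, count in support.items():
        print(sorted(cand), count)
```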


Very Large Data Bases | 2017

Bridging the gap between HPC and big data frameworks

Michael R. Anderson; Shaden Smith; Narayanan Sundaram; Mihai Capotă; Zheguang Zhao; Subramanya R. Dulloor; Nadathur Satish; Theodore L. Willke

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1-17.7x speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
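
The offload idea can be pictured as a simple handoff: materialize the partitioned data from the analytics side, run an MPI job over it, and read the result back. The sketch below shows only that generic pattern; the `mpi_binary` executable, file names, and pickle-based exchange are hypothetical placeholders, and this is not the integration mechanism built in the paper.

```python
import pickle
import subprocess
from pathlib import Path

def offload_to_mpi(partitions, workdir, mpi_binary, num_ranks):
    """Write each partition to shared storage, launch an MPI job over the
    files, and load the result the job writes back.

    `mpi_binary` is a hypothetical native executable whose ranks read
    `part_<rank>.pkl` from `workdir` and cooperatively write `result.pkl`.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    for rank, part in enumerate(partitions):
        with open(workdir / f"part_{rank}.pkl", "wb") as f:
            pickle.dump(part, f)

    # Hand the heavy computation to the MPI side, then collect its output.
    subprocess.run(["mpirun", "-n", str(num_ranks), mpi_binary, str(workdir)],
                   check=True)
    with open(workdir / "result.pkl", "rb") as f:
        return pickle.load(f)
```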


International Parallel and Distributed Processing Symposium | 2017

Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory

Shaden Smith; Jongsoo Park; George Karypis

HPC systems are increasingly used for data-intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To address these challenging demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified in Intel's recent Knights Landing many-core processor (KNL), which typically has 68 cores and 16GB of on-package multi-channel DRAM (MCDRAM). This work investigates how the novel architectural features offered by KNL can be used in the context of decomposing sparse, unstructured tensors using the canonical polyadic decomposition (CPD). The CPD is used extensively to analyze large multi-way datasets arising in various areas including precision healthcare, cybersecurity, and e-commerce. Towards this end, we (i) develop problem decompositions for the CPD which are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs; and (ii) explore the utilization of architectural features such as MCDRAM. Using one KNL processor, our algorithm achieves up to 1.8x speedup over a dual socket Intel Xeon system with 44 cores.
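
One ingredient of a decomposition that scales to hundreds of threads is giving each thread a contiguous block of output rows sized by nonzero count, so work is balanced and no two threads write the same rows. The sketch below shows only that generic partitioning idea on random data; it is not the paper's decomposition and says nothing about MCDRAM.

```python
import numpy as np

def balanced_row_blocks(row_of_nonzero, num_rows, num_threads):
    """Split the output rows into contiguous blocks holding roughly equal
    numbers of nonzeros, so each thread gets similar work and owns its
    own rows (no atomics needed on the accumulation)."""
    counts = np.bincount(row_of_nonzero, minlength=num_rows)
    cumulative = np.cumsum(counts)
    targets = cumulative[-1] * np.arange(1, num_threads + 1) / num_threads
    ends = np.searchsorted(cumulative, targets)          # last row of each block
    starts = np.concatenate(([0], ends[:-1] + 1))
    return list(zip(starts, ends))

rng = np.random.default_rng(1)
rows = rng.integers(0, 1000, size=100_000)   # mode-0 index of each nonzero
for t, (lo, hi) in enumerate(balanced_row_blocks(rows, 1000, 8)):
    print(f"thread {t}: rows {lo}..{hi}")
```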


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

An exploration of optimization algorithms for high performance tensor completion

Shaden Smith; Jongsoo Park; George Karypis

Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are an SGD algorithm which combines stratification with asynchronous communication, an ALS algorithm rich in level-3 BLAS routines, and a communication-efficient CCD++ algorithm. We evaluate our optimizations on a variety of real datasets using a modern supercomputer and demonstrate speedups through 1024 cores. These improvements effectively reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ will provide the lowest time-to-solution on large-scale distributed systems.
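
Of the three algorithms, SGD is the simplest to write down: for each observed entry, compute the prediction error and nudge the three corresponding factor rows. The sketch below is a minimal serial version on a toy tensor, without the stratification, asynchronous communication, or other optimizations studied in the paper; the rank, learning rate, and data are illustrative.

```python
import random
import numpy as np

def sgd_tensor_completion(observed, shape, rank=8, lr=0.1, reg=1e-3, epochs=200):
    """Plain SGD for 3-way tensor completion over observed entries only.

    observed: list of ((i, j, k), value) pairs.
    For each entry, the error of the current rank-`rank` model updates the
    three factor rows it touches.
    """
    rng = np.random.default_rng(0)
    A, B, C = (0.5 * rng.standard_normal((n, rank)) for n in shape)
    for _ in range(epochs):
        random.shuffle(observed)
        for (i, j, k), v in observed:
            err = v - np.dot(A[i] * B[j], C[k])
            a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
            A[i] += lr * (err * b * c - reg * a)
            B[j] += lr * (err * a * c - reg * b)
            C[k] += lr * (err * a * b - reg * c)
    return A, B, C

random.seed(1)
observed = [((0, 1, 2), 1.0), ((1, 1, 1), 2.0), ((3, 0, 4), -1.0), ((2, 2, 2), 0.5)]
A, B, C = sgd_tensor_completion(observed, shape=(5, 5, 5))
print("reconstructed X[0,1,2]:", np.dot(A[0] * B[1], C[2]))   # observed value: 1.0
```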


International Parallel and Distributed Processing Symposium | 2016

A Medium-Grained Algorithm for Sparse Tensor Factorization

Shaden Smith; George Karypis

Modeling multi-way data can be accomplished using tensors, which are data structures indexed along three or more dimensions. Tensors are increasingly used to analyze extremely large and sparse multi-way datasets in life sciences, engineering, and business. The canonical polyadic decomposition (CPD) is a popular tensor factorization for discovering latent features and is most commonly found via the method of alternating least squares (CPD-ALS). The computational time and memory required to compute the CPD limit the size and dimensionality of the tensors that can be solved on a typical workstation, making distributed solution approaches the only viable option. Most methods for distributed-memory systems have focused on distributing the tensor in a coarse-grained, one-dimensional fashion that prohibitively requires the dense matrix factors to be fully replicated on each node. Recent work overcomes this limitation by using a fine-grained decomposition of the tensor nonzeros, at the cost of computationally expensive hypergraph partitioning. To that effect, we present a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps. We use a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint. We theoretically analyze the scalability of the coarse-, medium-, and fine-grained decompositions and experimentally compare them across a variety of datasets. Experiments show that the medium-grained decomposition reduces communication volume by 36-90% compared to the coarse-grained decomposition, is 41-76x faster than a state-of-the-art MPI code, and is 1.5-5.0x faster than the fine-grained decomposition with 1024 cores.
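
The medium-grained idea can be pictured as laying a multi-dimensional grid of processes over the tensor: each mode's index range is chopped into contiguous chunks, and a nonzero is owned by the process whose grid cell contains its coordinate, so a process only needs the factor rows covering its chunks rather than full replicas. The sketch below shows just that owner computation with uniform chunking; the paper's actual partitioning and communication scheme is more involved.

```python
import numpy as np

def grid_owner(coords, dims, grid):
    """Map each nonzero's (i, j, k) coordinate to a rank in a
    grid[0] x grid[1] x grid[2] process grid using uniform chunking
    of every mode's index range."""
    coords = np.asarray(coords)
    cell = np.empty_like(coords)
    for m in range(3):
        chunk = -(-dims[m] // grid[m])           # ceiling division
        cell[:, m] = coords[:, m] // chunk
    # Linearize the 3-D grid coordinate into an MPI-style rank.
    return (cell[:, 0] * grid[1] + cell[:, 1]) * grid[2] + cell[:, 2]

coords = [(0, 5, 9), (7, 2, 3), (3, 3, 3), (9, 9, 0)]
print(grid_owner(coords, dims=(10, 10, 10), grid=(2, 2, 2)))
```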


International Conference on Parallel Processing | 2017

Constrained Tensor Factorization with Accelerated AO-ADMM

Shaden Smith; Alec Beri; George Karypis

Low-rank sparse tensor factorization is a popular tool for analyzing multi-way data and is used in domains such as recommender systems, precision healthcare, and cybersecurity. Imposing constraints on a factorization, such as non-negativity or sparsity, is a natural way of encoding prior knowledge of the multi-way data. While constrained factorizations are useful for practitioners, they can greatly increase factorization time due to slower convergence and computational overheads. Recently, a hybrid of alternating optimization and alternating direction method of multipliers (AO-ADMM) was shown to have both a high convergence rate and the ability to naturally incorporate a variety of popular constraints. In this work, we present a parallelization strategy and two approaches for accelerating AO-ADMM. By redefining the convergence criteria of the inner ADMM iterations, we are able to split the data in a way that not only accelerates the per-iteration convergence, but also speeds up the execution of the ADMM iterations due to efficient use of cache resources. Secondly, we develop a method of exploiting dynamic sparsity in the factors to speed up tensor-matrix kernels. These combined advancements achieve up to 8x speedup over the state-of-the-art on a variety of real-world sparse tensors.
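
For a non-negativity constraint, the inner ADMM step is compact: alternate a regularized least-squares solve against the Gram matrix, a projection onto the nonnegative orthant, and a dual update. The sketch below is a generic non-negative least-squares ADMM in that spirit, driven only by the Gram matrix and an MTTKRP-style right-hand side; it is not the accelerated, parallel variant from the paper, and the fixed iteration count and penalty parameter are illustrative.

```python
import numpy as np

def admm_nonneg_ls(F, G, rho=1.0, iters=50):
    """Solve min_H 0.5*||X - W H^T||_F^2 s.t. H >= 0 with ADMM, given only
    F = X^T W and the Gram matrix G = W^T W (in a CPD these come from the
    MTTKRP and a Hadamard product of factor Gram matrices, so X and W are
    never formed explicitly)."""
    m, r = F.shape
    H = np.zeros((m, r))                      # constrained variable
    U = np.zeros((m, r))                      # (scaled) dual variable
    L = np.linalg.cholesky(G + rho * np.eye(r))
    for _ in range(iters):
        # Least-squares step: H_tilde (G + rho*I) = F + rho*(H + U).
        rhs = (F + rho * (H + U)).T
        H_tilde = np.linalg.solve(L.T, np.linalg.solve(L, rhs)).T
        H = np.maximum(0.0, H_tilde - U)      # proximal step: project onto H >= 0
        U += H - H_tilde                      # dual update
    return H

# Toy check: recover a nonnegative H from exact data.
rng = np.random.default_rng(0)
W = rng.random((30, 4))
H_true = rng.random((6, 4))
X = W @ H_true.T
H_est = admm_nonneg_ls(F=X.T @ W, G=W.T @ W)
print(np.allclose(H_est, H_true, atol=1e-3))   # expect True on this easy instance
```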


IEEE High Performance Extreme Computing Conference | 2017

Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms

Ancy Sarah Tom; Narayanan Sundaram; Nesreen K. Ahmed; Shaden Smith; Stijn Eyerman; Midhunchandra Kodiyath; Ibrahim Hur; Fabrizio Petrini; George Karypis

The widespread use of graphs to model large scale real-world data brings with it the need for fast graph analytics. In this paper, we explore the problem of triangle counting, a fundamental graph-analytic operation, on shared-memory platforms. Existing triangle counting implementations do not effectively utilize the key characteristics of large sparse graphs for tuning their algorithms for performance. We explore such optimizations and develop faster serial and parallel variants of existing algorithms, which outperform the state-of-the-art on Intel manycore and multicore processors. Our algorithms achieve good strong scaling on many graphs with varying scale and degree distributions. Furthermore, we extend our optimizations to a well-known graph processing framework, GraphMat, and demonstrate their generality.
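
Triangle counting itself is easy to state: rank the vertices, keep each edge only in its low-to-high direction, and intersect forward neighbor sets per edge. The sketch below is the basic serial formulation with a degree-based ordering chosen for illustration; the paper's contribution lies in how such algorithms are tuned and parallelized for large sparse graphs.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles by intersecting 'forward' neighbor sets.

    Vertices are ranked by degree (ties broken by id); each undirected
    edge is kept only from its lower-ranked to its higher-ranked endpoint,
    so every triangle is counted exactly once."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    order = sorted(degree, key=lambda x: (degree[x], x))
    rank = {v: r for r, v in enumerate(order)}

    forward = {v: set() for v in degree}
    for u, v in edges:
        lo, hi = (u, v) if rank[u] < rank[v] else (v, u)
        forward[lo].add(hi)

    return sum(len(forward[u] & forward[w]) for u in forward for w in forward[u])

# A 4-clique (4 triangles) plus a pendant edge (no additional triangles).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(count_triangles(edges))   # 4
```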
