Publications


Featured research published by Michael J. Anderson.


Very Large Data Bases | 2015

GraphMat: high performance graph analytics made productive

Narayanan Sundaram; Nadathur Satish; Md. Mostofa Ali Patwary; Subramanya R. Dulloor; Michael J. Anderson; Satya Gautam Vadlamudi; Dipankar Das; Pradeep Dubey

Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat functions by taking vertex programs and mapping them to high performance sparse matrix operations in the backend. We thus get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is a single-node multicore graph framework written in C++ which has enabled us to write a diverse set of graph algorithms with the same effort as other vertex programming frameworks. GraphMat performs 1.1-7X faster than high performance frameworks such as GraphLab, CombBLAS and Galois. GraphMat also matches the performance of MapGraph, a GPU-based graph framework, despite running on a CPU platform with significantly lower compute and bandwidth resources. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is within 1.2X of native, hand-optimized code on a variety of graph algorithms. Since GraphMat performance depends mainly on a few scalable and well-understood sparse matrix operations, it can naturally benefit from the trend of increasing parallelism in future hardware.
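
The core idea, that one vertex-program iteration corresponds to a sparse matrix-vector product over the (transposed) adjacency matrix, can be sketched in a few lines. The snippet below illustrates that mapping with PageRank in plain SciPy; it is not GraphMat's C++ API, and the toy graph and damping factor are made up for illustration.

```python
# Minimal sketch of the vertex-program-to-SpMV mapping described in the abstract.
import numpy as np
import scipy.sparse as sp

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]    # toy directed edge list
n = 4
rows, cols = zip(*edges)
A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

out_deg = np.asarray(A.sum(axis=1)).ravel()
out_deg[out_deg == 0] = 1.0                          # guard against sink vertices
d = 0.85
ranks = np.full(n, 1.0 / n)

for _ in range(20):
    # One PageRank iteration: "scatter" rank/out_degree along out-edges and
    # "gather" by summing, i.e. a single SpMV over the transposed adjacency matrix.
    ranks = (1 - d) / n + d * (A.T @ (ranks / out_deg))

print(ranks)
```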


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Md. Mostofa Ali Patwary; Nadathur Satish; Narayanan Sundaram; Jongsoo Park; Michael J. Anderson; Satya Gautam Vadlamudi; Dipankar Das; Sergey G. Pudov; Vadim O. Pirogov; Pradeep Dubey

Sparse matrix-matrix multiplication (SpGEMM) is a key kernel in many High Performance Computing applications, such as algebraic multigrid solvers and graph analytics. Optimizing SpGEMM on modern processors is challenging due to random data accesses, poor data locality, and load imbalance during computation. In this work, we investigate different partitioning techniques, cache optimizations (using dense arrays instead of hash tables), and dynamic load balancing on SpGEMM using a diverse set of real-world and synthetic datasets. We demonstrate that our implementation outperforms the state of the art on Intel® Xeon® processors. We are up to 3.8X faster than the Intel® Math Kernel Library (MKL) and up to 257X faster than CombBLAS. We also outperform the best published GPU implementation of SpGEMM on the NVIDIA GTX Titan and the AMD Radeon HD 7970 by up to 7.3X and 4.5X, respectively, on their published datasets. We demonstrate good multi-core scalability (geomean speedup of 18.2X using 28 threads), compared to MKL, which achieves 7.5X scaling on 28 threads.
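
The cache optimization mentioned in the abstract, replacing a hash-table accumulator with a dense array in row-wise (Gustavson-style) SpGEMM, can be illustrated at small scale. The sketch below is a plain Python rendition for clarity only, not the paper's optimized multicore implementation, and the test matrices are randomly generated.

```python
# Row-wise SpGEMM with a dense accumulator per output row (sketch, not the paper's code).
import numpy as np
import scipy.sparse as sp

def spgemm_dense_accum(A: sp.csr_matrix, B: sp.csr_matrix) -> sp.csr_matrix:
    n = A.shape[0]
    m = B.shape[1]
    accum = np.zeros(m)                       # dense accumulator, reused for every row of C
    flags = np.zeros(m, dtype=bool)           # which columns were touched in this row
    indptr, indices, data = [0], [], []
    for i in range(n):
        touched = []
        for jj in range(A.indptr[i], A.indptr[i + 1]):
            j, a_ij = A.indices[jj], A.data[jj]
            for kk in range(B.indptr[j], B.indptr[j + 1]):
                col = B.indices[kk]
                if not flags[col]:
                    flags[col] = True
                    touched.append(col)
                accum[col] += a_ij * B.data[kk]
        for col in sorted(touched):           # emit row i of C, then reset the scratch arrays
            indices.append(col)
            data.append(accum[col])
            accum[col] = 0.0
            flags[col] = False
        indptr.append(len(indices))
    return sp.csr_matrix((data, indices, indptr), shape=(n, m))

A = sp.random(50, 60, density=0.05, format="csr", random_state=0)
B = sp.random(60, 40, density=0.05, format="csr", random_state=1)
assert np.allclose(spgemm_dense_accum(A, B).toarray(), (A @ B).toarray())
```

The dense accumulator turns the per-row merge into simple array indexing, which is the locality-friendly behavior the abstract contrasts with hash-table lookups.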


Very Large Data Bases | 2016

LDBC graphalytics: a benchmark for large-scale graph analysis on parallel and distributed platforms

Alexandru Iosup; Tim Hegeman; Wing Lung Ngai; Stijn Heldens; Arnau Prat-Pérez; Thomas Manhardt; Hassan Chafi; Mihai Capotă; Narayanan Sundaram; Michael J. Anderson; Ilie Gabriel Tănase; Yinglong Xia; Lifeng Nai; Peter A. Boncz

In this paper we introduce LDBC Graphalytics, a new industrial-grade benchmark for graph analysis platforms. It consists of six deterministic algorithms, standard datasets, synthetic dataset generators, and reference output, that enable the objective comparison of graph analysis platforms. Its test harness produces deep metrics that quantify multiple kinds of system scalability, such as horizontal/vertical and weak/strong, and of robustness, such as failures and performance variability. The benchmark comes with open-source software for generating data and monitoring performance. We describe and analyze six implementations of the benchmark (three from the community, three from the industry), providing insights into the strengths and weaknesses of the platforms. Key to our contribution, vendors perform the tuning and benchmarking of their platforms.
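
As a rough illustration of the scalability metrics the abstract mentions, the sketch below derives strong-scaling speedup/efficiency and weak-scaling efficiency from measured runtimes. The formulas are the standard ones; the runtimes are invented, and this is not the Graphalytics harness itself.

```python
# Hedged sketch: deriving strong/weak scaling figures from runtime measurements.

def strong_scaling(runtimes):
    """runtimes: {node_count: seconds} measured on a FIXED dataset size."""
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    return {n: {"speedup": base_time / t,
                "efficiency": (base_time / t) * base_nodes / n}
            for n, t in runtimes.items()}

def weak_scaling(runtimes):
    """runtimes: {node_count: seconds} where dataset size grows with node count."""
    base_time = runtimes[min(runtimes)]
    return {n: {"efficiency": base_time / t} for n, t in runtimes.items()}

print(strong_scaling({1: 100.0, 2: 55.0, 4: 30.0, 8: 18.0}))   # made-up numbers
print(weak_scaling({1: 100.0, 2: 105.0, 4: 112.0, 8: 125.0}))  # made-up numbers
```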


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

BD-CATS: big data clustering at trillion particle scale

Md. Mostofa Ali Patwary; Surendra Byna; Nadathur Satish; Narayanan Sundaram; Zarija Lukić; V. Roytershteyn; Michael J. Anderson; Yushu Yao; Prabhat; Pradeep Dubey

Modern cosmology and plasma physics codes are now capable of simulating trillions of particles on petascale systems. Each timestep output from such simulations is on the order of tens of TBs. Summarizing and analyzing raw particle data is challenging, and scientists often focus on density structures, whether in real 3D space or in a high-dimensional phase space. In this work, we develop a highly scalable version of the clustering algorithm DBSCAN and apply it to the largest datasets produced by state-of-the-art codes. Our system, called BD-CATS, is the first capable of performing end-to-end analysis at trillion-particle scale, including loading the data, geometric partitioning, computing kd-trees, performing clustering analysis, and storing the results. We show analysis of 1.4 trillion particles from a plasma physics simulation and of a 10,240³-particle cosmological simulation, utilizing ~100,000 cores in 30 minutes. BD-CATS is helping infer the mechanisms behind particle acceleration in plasma physics and holds promise for qualitatively superior clustering in cosmology. Both of these results were previously intractable at the trillion-particle scale.
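
At toy scale, the two central stages named in the abstract, building a kd-tree index and running DBSCAN over it, look roughly like the sketch below (single node, plain NumPy/SciPy). It omits everything that makes BD-CATS hard: parallel I/O, geometric partitioning across ~100,000 cores, and distributed cluster merging. The eps/min_pts parameters and the synthetic points are illustrative.

```python
# Minimal single-node DBSCAN over a kd-tree index (sketch, not BD-CATS).
import numpy as np
from collections import deque
from scipy.spatial import cKDTree

def dbscan(points, eps, min_pts):
    tree = cKDTree(points)                            # kd-tree stage
    neighbors = tree.query_ball_point(points, r=eps)  # eps-neighborhood of every point
    labels = np.full(len(points), -1)                 # -1 = unassigned / noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                                  # already labeled, or not a core point
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:                                  # expand the cluster through core points
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (200, 3)), rng.normal(1.0, 0.1, (200, 3))])
print(np.unique(dbscan(pts, eps=0.15, min_pts=10), return_counts=True))
```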


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors

Yida Wang; Michael J. Anderson; Jonathan D. Cohen; Alexander Heinecke; Kai Li; Nadathur Satish; Narayanan Sundaram; Nicholas B. Turk-Browne; Theodore L. Willke

Full correlation matrix analysis (FCMA) is an unbiased approach for exhaustively studying interactions among brain regions in functional magnetic resonance imaging (fMRI) data from human participants. In order to answer neuroscientific questions efficiently, we are developing a closed-loop analysis system with FCMA on a cluster of nodes with Intel® Xeon Phi™ coprocessors. Here we propose several ideas for data-driven algorithmic modification to improve the performance on the coprocessor. Our experiments with real datasets show that the optimized single-node code runs 5x-16x faster than the baseline implementation using the well-known Intel® MKL and LibSVM libraries, and that the cluster implementation achieves near linear speedup on 5760 cores.
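
For readers unfamiliar with FCMA, the quantity being computed is the matrix of Pearson correlations between every pair of voxel time series within an epoch, which reduces to a single matrix multiply after z-scoring. The sketch below shows only that mathematical core with made-up dimensions; the paper's contribution is making this computation (and the downstream analysis) fast on Xeon Phi clusters.

```python
# The "full correlation matrix" as one normalized matrix multiply (sketch only).
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_timepoints = 1000, 12               # one epoch, toy sizes
epoch = rng.standard_normal((n_voxels, n_timepoints))

z = epoch - epoch.mean(axis=1, keepdims=True)   # center each voxel's time series
z /= np.linalg.norm(z, axis=1, keepdims=True)   # scale each series to unit norm
corr = z @ z.T                                  # all voxel-pair correlations at once

assert corr.shape == (n_voxels, n_voxels)
assert np.allclose(corr, np.corrcoef(epoch))
```

Casting the correlation as a matrix product is also what lets such a computation ride on optimized GEMM libraries.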


International Conference on Big Data | 2016

Enabling factor analysis on thousand-subject neuroimaging datasets

Michael J. Anderson; Mihai Capotă; Javier S. Turek; Xia Zhu; Theodore L. Willke; Yida Wang; Po-Hsuan Chen; Jeremy R. Manning; Peter J. Ramadge; Kenneth A. Norman

The scale of functional magnetic resonance imaging data is rapidly increasing as large multi-subject datasets become widely available and high-resolution scanners are adopted. The inherent low dimensionality of the information in this data has led neuroscientists to consider factor analysis methods to extract and analyze the underlying brain activity. In this work, we consider two recent multi-subject factor analysis methods: the Shared Response Model and Hierarchical Topographic Factor Analysis. We perform analytical, algorithmic, and code optimization to enable multi-node parallel implementations to scale. Single-node improvements result in 99x and 2062x speedups on the two methods and enable the processing of larger datasets. Our distributed implementations show strong scaling of 3.3x and 5.5x, respectively, with 20 nodes on real datasets. We demonstrate weak scaling on a synthetic dataset with 1024 subjects, equivalent in size to the largest fMRI dataset collected to date, on up to 1024 nodes and 32,768 cores.
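
A minimal sketch of a deterministic Shared Response Model fit may help make the abstract concrete: each subject's voxels-by-time matrix X_i is approximated as W_i S with orthonormal W_i, alternating an orthogonal Procrustes update of W_i with an averaging update of S. This shows only the mathematical core under toy sizes, not the paper's optimized distributed implementation, and HTFA is not shown.

```python
# Simplified deterministic SRM: alternate Procrustes updates of W_i with averaging of S.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels, n_timepoints, k = 5, 300, 100, 10
X = [rng.standard_normal((n_voxels, n_timepoints)) for _ in range(n_subjects)]

S = rng.standard_normal((k, n_timepoints))             # shared response (k x time)
W = [None] * n_subjects                                 # per-subject orthonormal maps
for _ in range(10):                                     # alternating updates
    for i in range(n_subjects):
        U, _, Vt = np.linalg.svd(X[i] @ S.T, full_matrices=False)
        W[i] = U @ Vt                                   # orthogonal Procrustes step
    S = sum(W[i].T @ X[i] for i in range(n_subjects)) / n_subjects

print([w.shape for w in W], S.shape)                    # (voxels x k) maps, shared S
```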


International Conference on Big Data | 2016

Real-time full correlation matrix analysis of fMRI data

Yida Wang; Bryn Keller; Mihai Capotă; Michael J. Anderson; Narayanan Sundaram; Jonathan D. Cohen; Kai Li; Nicholas B. Turk-Browne; Theodore L. Willke

Real-time functional magnetic resonance imaging (rtfMRI) is an emerging approach for studying the functioning of the human brain. Computational challenges combined with high data velocity have to this point restricted rtfMRI analyses to studying regions of the brain independently. However, given that neural processing is accomplished via functional interactions among brain regions, neuroscience could stand to benefit from rtfMRI analyses of full-brain interactions. In this paper, we extend such an offline analysis method, full correlation matrix analysis (FCMA), to enable its use in rtfMRI studies. Specifically, we introduce algorithms capable of processing real-time data for all stages of the FCMA machine learning workflow: incremental feature selection, model updating, and real-time classification. We also present an actor-model based distributed system designed to support FCMA and other rtfMRI analysis methods. Experiments show that our system successfully analyzes a stream of brain volumes and returns neurofeedback with less than 180 ms of lag. Our real-time FCMA implementation provides the same accuracy as an optimized offline FCMA toolbox while running 3.6–6.2x faster.
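
One way to picture the streaming setting is that correlation sufficient statistics can be updated per incoming volume, so the full correlation matrix can be refreshed without re-reading past data. The sketch below illustrates that idea in plain NumPy; it is not the paper's incremental feature selection, model updating, or actor-model system, and the class name and sizes are invented for illustration.

```python
# Streaming update of the sufficient statistics for a full correlation matrix (sketch only).
import numpy as np

class StreamingCorrelation:
    def __init__(self, n_voxels):
        self.n = 0
        self.s1 = np.zeros(n_voxels)                  # running sum of volumes
        self.s2 = np.zeros((n_voxels, n_voxels))      # running sum of outer products

    def update(self, volume):                         # one new brain volume (1-D, n_voxels)
        self.n += 1
        self.s1 += volume
        self.s2 += np.outer(volume, volume)

    def correlation(self):
        mean = self.s1 / self.n
        cov = self.s2 / self.n - np.outer(mean, mean)
        std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
        return cov / np.outer(std, std)

rng = np.random.default_rng(0)
data = rng.standard_normal((20, 50))                  # 20 streamed volumes, 50 voxels
sc = StreamingCorrelation(50)
for vol in data:
    sc.update(vol)
assert np.allclose(sc.correlation(), np.corrcoef(data.T), atol=1e-8)
```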


International Conference on Learning Representations | 2016

BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies

Shihao Ji; S. V. N. Vishwanathan; Nadathur Satish; Michael J. Anderson; Pradeep Dubey


arXiv: Distributed, Parallel, and Cluster Computing | 2018

ISA Mapper: A Compute and Hardware Agnostic Deep Learning Compiler

Matthew Sotoudeh; Anand Venkat; Michael J. Anderson; Evangelos Georganas; Alexander Heinecke; Jason Knight


Archive | 2017

Hardware content-associative data structure for acceleration of set operations

Michael J. Anderson; Sheng R. Li; Jongsoo Park; Mostofa Ali Patwary; Nadathur Satish; Mikhail Smelyanskiy; Narayanan Sundaram
