
Publication


Featured research published by Maryam Mehri Dehnavi.


IEEE Transactions on Magnetics | 2010

Finite-Element Sparse Matrix Vector Multiplication on Graphic Processing Units

Maryam Mehri Dehnavi; David M. Fernández; Dennis D. Giannacopoulos

A wide class of finite-element (FE) electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture, and programmability makes them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
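The SMVM kernel at the heart of this work can be sketched in a few lines. The following is an illustrative CSR-format version in plain Python; the paper's GPU algorithm and data layout are not shown, and the function name is ours:

```python
# Minimal CSR sparse matrix-vector multiply (SpMV / SMVM), illustrative only.

def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for A stored in compressed sparse row (CSR) format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):  # on a GPU, rows (or row chunks) map to threads
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 7.0]
```

The irregular, input-dependent inner loop bounds are exactly what makes this kernel hard to map efficiently onto GPU hardware.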


International Parallel and Distributed Processing Symposium (IPDPS) | 2014

MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures

Yang You; Shuaiwen Leon Song; Haohuan Fu; Andres Marquez; Maryam Mehri Dehnavi; Kevin J. Barker; Kirk W. Cameron; Amanda Randles; Guangwen Yang

Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In recent years, SVM has been adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies, which can make runtime training possible but also form a barrier to efficient parallel SVM design. To address these challenges, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as Intel Ivy Bridge CPUs and the Intel Xeon Phi co-processor (MIC). We propose several novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures; these can also serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM run on a top-of-the-line NVIDIA K20x GPU, the performance of MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis focusing on Ivy Bridge CPUs, MIC, and GPUs, and provide insights on how to select the most suitable architectures for specific algorithms and input data patterns.
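As a rough illustration of the kind of hot loop such parallel SVM implementations target, here is a plain-Python sketch of evaluating one row of an RBF kernel matrix, which typically dominates SMO-style SVM training. The function name, the choice of kernel, and the data are our assumptions, not details taken from the paper:

```python
# Illustrative sketch: computing K(x_i, x_j) for one sample against all
# others. On multi-core and many-core hardware this row maps naturally onto
# SIMD lanes and threads, which is the kind of parallelism MIC-SVM exploits.
import math

def rbf_kernel_row(X, i, gamma=0.5):
    """Row i of the RBF kernel matrix: exp(-gamma * ||x_i - x_j||^2)."""
    xi = X[i]
    return [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
            for xj in X]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
row = rbf_kernel_row(X, 0)  # [1.0, exp(-0.5), exp(-2.0)]
```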


IEEE Conference on Electromagnetic Field Computation (CEFC) | 2010

Enhancing the performance of conjugate gradient solvers on graphic processing units

Maryam Mehri Dehnavi; David M. Fernández; Dennis D. Giannacopoulos

A study of the fundamental obstacles to accelerating the preconditioned conjugate gradient (PCG) method on modern graphic processing units (GPUs) is presented, and several techniques are proposed to enhance its performance over previous work, independent of the GPU generation and the matrix sparsity pattern. The proposed enhancements increase the performance of PCG by up to 23 times compared to vector-optimized PCG results on modern CPUs and up to 3.4 times compared to previous GPU results.
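For reference, the PCG algorithm being accelerated can be sketched as follows. This minimal Jacobi-preconditioned version in plain Python shows the kernels involved (matrix-vector products, dot products, vector updates) but none of the GPU-specific optimizations the paper proposes; dense lists stand in for sparse storage:

```python
# A minimal Jacobi-preconditioned conjugate gradient solver for an SPD
# system A x = b. Illustrative sketch, not the paper's implementation.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                      # residual (x starts at 0)
    M_inv = [1.0 / A[i][i] for i in range(n)]     # Jacobi preconditioner
    z = [mi * ri for mi, ri in zip(M_inv, r)]
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [mi * ri for mi, ri in zip(M_inv, r)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b)  # exact solution is [1/11, 7/11]
```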


IEEE Transactions on Parallel and Distributed Systems | 2013

Parallel Sparse Approximate Inverse Preconditioning on Graphic Processing Units

Maryam Mehri Dehnavi; David M. Fernández; Jean-Luc Gaudiot; Dennis D. Giannacopoulos

Accelerating numerical algorithms for solving sparse linear systems on parallel architectures has attracted the attention of many researchers due to their applicability to many engineering and scientific problems. The solution of sparse systems often dominates the overall execution time of such problems and is mainly solved by iterative methods. Preconditioners are used to accelerate the convergence rate of these solvers and reduce the total execution time. Sparse approximate inverse (SAI) preconditioners are a popular class of preconditioners designed to improve the condition number of large sparse matrices. We propose a GPU accelerated SAI preconditioning technique called GSAI, which parallelizes the computation of this preconditioner on NVIDIA graphic cards. The preconditioner is then used to enhance the convergence rate of the BiConjugate Gradient Stabilized (BiCGStab) iterative solver on the GPU. The SAI preconditioner is generated on average 28 and 23 times faster on the NVIDIA GTX480 and TESLA M2070 graphic cards, respectively, compared to ParaSails (a popular implementation of SAI preconditioners on CPU) single processor/core results. The proposed GSAI technique computes the SAI preconditioner in approximately the same time as ParaSails generates the same preconditioner on 16 AMD Opteron 252 processors.


IEEE Transactions on Magnetics | 2012

Alternate Parallel Processing Approach for FEM

David M. Fernández; Maryam Mehri Dehnavi; Warren J. Gross; Dennis D. Giannacopoulos

In this work we present an alternate way to formulate the finite element method (FEM) for parallel processing, based on the solution of single mesh elements, called FEM-SES. The key idea is to decouple the solution of a single element from that of the whole mesh, thus exposing parallelism at the element level. Individual element solutions are then superimposed node-wise using a weighted sum over concurrent nodes. A classic 2-D electrostatic problem is used to validate the proposed method, obtaining accurate results. Results show that the number of iterations of the proposed FEM-SES method scales sublinearly with the number of unknowns. Two generations of CUDA-enabled NVIDIA GPUs were used to implement the FEM-SES method, and the execution times were compared to the classic FEM, showing important performance benefits.


Brazilian Conference on Intelligent Systems | 2014

Designing a Heuristic Cross-Architecture Combination for Breadth-First Search

Yang You; David A. Bader; Maryam Mehri Dehnavi

Breadth-First Search (BFS) is widely used in real-world applications including computational biology, social networks, and electronic design automation. The most effective BFS approach has been shown to be a combination of top-down and bottom-up approaches. Such hybrid techniques need to identify a switching point, which is conventionally found through expensive trial-and-error and exhaustive search routines. We present an adaptive method based on regression analysis that enables dynamic switching at runtime with little overhead. We improve the performance of our method by exploiting popular heterogeneous platforms and efficiently designing the approach for a given architecture. A 155x speedup is achieved over the standard top-down approach on GPUs. Our approach is the first to combine top-down and bottom-up across different architectures. Unlike combination on a single architecture, a mistuned switching point may significantly decrease the performance of a cross-architecture combination. Our adaptive method can predict the switching point with high accuracy, leading to a 695x speedup compared to the worst switching point.
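The top-down/bottom-up combination can be sketched as follows. This illustrative version uses a fixed frontier-size threshold in place of the paper's regression-based switching-point prediction, and assumes an undirected graph so the bottom-up parent check can reuse the same adjacency lists:

```python
# Direction-optimizing BFS sketch: top-down while the frontier is small,
# bottom-up once it grows past a (here fixed) threshold.

def hybrid_bfs(adj, source, switch_frac=0.3):
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        if len(frontier) < switch_frac * n:
            # top-down: each frontier vertex relaxes its outgoing edges
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level
                        nxt.append(v)
        else:
            # bottom-up: each unvisited vertex looks for a frontier parent
            in_frontier = set(frontier)
            nxt = []
            for v in range(n):
                if dist[v] == -1 and any(u in in_frontier for u in adj[v]):
                    dist[v] = level
                    nxt.append(v)
        frontier = nxt
    return dist

# undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(hybrid_bfs(adj, 0))  # [0, 1, 1, 2, 3]
```

Picking `switch_frac` poorly is exactly the mistuned-switching-point problem the abstract describes; the paper replaces this constant with a runtime prediction.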


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil

Yang You; Haohuan Fu; Shuaiwen Leon Song; Maryam Mehri Dehnavi; Lin Gan; Xiaomeng Huang; Guangwen Yang

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits their performance and power efficiency. In this paper, we accelerate the forward-modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPUs, NVIDIA Kepler K20x GPUs, and the Intel Xeon Phi co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance. Although our stencil with 114 component variables poses great challenges for performance optimization, and its low ratio of computation to memory access makes it hard to fully exploit the evaluated architectures, we manage to achieve performance efficiencies ranging from 4.73% to 20.02% of the theoretical peak. We also conduct a cross-platform performance and power analysis (focusing on the Kepler GPU and MIC), and the results could serve as insights for users selecting the most suitable accelerators for their targeted applications.
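As a much-simplified illustration of an iterative stencil loop (a 7-point 3D stencil rather than the paper's 114-variable Lax-Wendroff correction stencil), the following sketch shows the memory-bound loop structure being optimized: each output point touches only a handful of neighbors, so very little arithmetic is done per value loaded:

```python
# One sweep of a 7-point 3D stencil on an n x n x n grid (illustrative only).

def stencil_step(u, c=0.1):
    n = len(u)
    out = [[[u[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                # each point reads its six face neighbors and itself
                out[i][j][k] = u[i][j][k] + c * (
                    u[i-1][j][k] + u[i+1][j][k] +
                    u[i][j-1][k] + u[i][j+1][k] +
                    u[i][j][k-1] + u[i][j][k+1] - 6 * u[i][j][k])
    return out

n = 4
u = [[[0.0] * n for _ in range(n)] for _ in range(n)]
u[1][1][1] = 1.0                  # a single point source
u = stencil_step(u)
print(u[1][1][1], u[2][1][1])     # 0.4 0.1
```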


international conference on cluster computing | 2017

A Unified Optimization Approach for Sparse Tensor Operations on GPUs

Bangtian Liu; Chengyao Wen; Anand D. Sarwate; Maryam Mehri Dehnavi

Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on many-core processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make such implementations challenging. We leverage the fact that sparse tensor operations share similar computation patterns to propose a unified tensor representation called F-COO. Combined with GPU-specific optimizations, F-COO provides highly-optimized implementations of sparse tensor computations on GPUs. The performance of the proposed unified approach is demonstrated for tensor-based kernels such as the Sparse Matricized Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix Multiply (SpTTM), and is used in tensor decomposition algorithms. Compared to state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
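The MTTKRP computation on COO-stored tensor entries can be sketched as below; F-COO adds flags and compression on top of this basic coordinate-list pattern, which is not shown here. For a 3-way tensor X and factor matrices B and C, the kernel accumulates M[i, r] += X[i, j, k] * B[j, r] * C[k, r]:

```python
# Plain-COO MTTKRP sketch (mode 1). Names and the tiny example are ours.

def coo_mttkrp(coords, vals, B, C, n_i, rank):
    M = [[0.0] * rank for _ in range(n_i)]
    for (i, j, k), v in zip(coords, vals):   # one pass over the nonzeros
        for r in range(rank):
            M[i][r] += v * B[j][r] * C[k][r]
    return M

coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]   # nonzero coordinates
vals = [2.0, 3.0, 4.0]                       # nonzero values
B = [[1.0, 2.0], [0.5, 1.0]]                 # 2 x rank factor matrix
C = [[1.0, 1.0], [2.0, 0.5]]
M = coo_mttkrp(coords, vals, B, C, n_i=2, rank=2)
print(M)  # [[5.0, 5.5], [8.0, 4.0]]
```

On a GPU, nonzeros are partitioned across threads and the accumulation into M becomes the main source of irregular, conflicting writes that representations like F-COO are designed to manage.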


Computer Physics Communications | 2015

Parallel finite element technique using Gaussian belief propagation

Yousef El-Kurdi; Maryam Mehri Dehnavi; Warren J. Gross; Dennis D. Giannacopoulos

The computational efficiency of Finite Element Methods (FEMs) on parallel architectures is severely limited by conventional sparse iterative solvers. Conventional solvers are based on a sequence of global algebraic operations that limits their parallel efficiency. Traditionally, sophisticated programming techniques tailored to specific CPU architectures are used to improve the poor performance of sparse algebraic kernels. The introduced FEM Multigrid Gaussian Belief Propagation (FMGaBP) algorithm is a novel technique that eliminates all global algebraic operations and sparse data structures. The algorithm is based on reformulating the FEM into a distributed variational inference problem on graphical models. We present new formulations for FMGaBP, which enhance its computation and communication complexities. A Helmholtz problem is used to validate the FMGaBP formulation for 2D, 3D and higher FEM degrees. Implementation techniques for multicore architectures that exploit the parallel features of FMGaBP are presented, showing speedups compared to open-source libraries, specifically deal.II and Trilinos. FMGaBP is also implemented on many-core architectures in this work; speedups of 4.8x, 2.3x and 1.5x are achieved on an NVIDIA Tesla C2075 compared to the parallel CPU implementation of FMGaBP on dual-core, quad-core and 12-core CPUs respectively.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

Sympiler: transforming sparse matrix codes by decoupling symbolic analysis

Kazem Cheshmi; Shoaib Kamil; Michelle Mills Strout; Maryam Mehri Dehnavi

Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile time and to apply inspector-guided transformations that enable applying low-level transformations to sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8× and 1.5× respectively.
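The decoupling idea can be illustrated on a sparse lower-triangular solve, one of the motivating sparse kernels for this line of work: a symbolic phase computes once which unknowns can become nonzero (the reach set of b's nonzeros in L's dependency graph), and the numeric phase iterates only over that set. This plain-Python sketch reuses the analysis result directly rather than generating specialized low-level code as Sympiler does; all names are illustrative:

```python
# Symbolic/numeric decoupling sketch for solving L x = b with sparse b.

def symbolic_reach(L_cols, b_nonzeros):
    """Reach set: all unknowns that can become nonzero during the solve.
    L_cols[j] lists the rows i > j with L[i][j] != 0 (edges j -> i)."""
    reached, stack = set(), list(b_nonzeros)
    while stack:
        j = stack.pop()
        if j in reached:
            continue
        reached.add(j)
        stack.extend(L_cols[j])
    return sorted(reached)

def numeric_solve(L, L_cols, b, reach):
    """Forward substitution visiting only the symbolically reachable rows."""
    x = b[:]
    for j in reach:
        x[j] /= L[j][j]
        for i in L_cols[j]:
            x[i] -= L[i][j] * x[j]
    return x

# L = [[2,0,0],[1,1,0],[0,3,4]]; b has a single nonzero in position 0
L = [[2.0, 0, 0], [1.0, 1.0, 0], [0, 3.0, 4.0]]
L_cols = {0: [1], 1: [2], 2: []}
reach = symbolic_reach(L_cols, [0])                      # [0, 1, 2]
print(numeric_solve(L, L_cols, [4.0, 0.0, 0.0], reach))  # [2.0, -2.0, 1.5]
```

When the sparsity pattern is fixed across many solves, the symbolic phase runs once while the numeric phase runs repeatedly, which is the property Sympiler exploits at compile time.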

Collaboration


Dive into Maryam Mehri Dehnavi's collaborations.

Top Co-Authors


James Demmel

University of California
