Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ali Jannesari is active.

Publication


Featured research published by Ali Jannesari.


International Parallel and Distributed Processing Symposium | 2015

An Efficient Data-Dependence Profiler for Sequential and Parallel Programs

Zhen Li; Ali Jannesari; Felix Wolf

Extracting data dependences from programs serves as the foundation of many program analysis and transformation methods, including automatic parallelization, runtime scheduling, and performance tuning. To obtain data dependences, more and more related tools are adopting profiling approaches because they can track dynamically allocated memory, pointers, and array indices. However, dependence profiling suffers from high runtime and space overhead. To lower the overhead, earlier dependence profiling techniques exploit features of the specific program analyses they are designed for. As a result, every program analysis tool in need of data-dependence information requires its own customized profiler. In this paper, we present an efficient and at the same time generic data-dependence profiler that can be used as a uniform basis for different dependence-based program analyses. Its lock-free parallel design reduces the runtime overhead to around 86× on average. Moreover, signature-based memory management adjusts space requirements to practical needs. Finally, to support analyses and tuning approaches for parallel programs such as communication pattern detection, our profiler produces detailed dependence records not only for sequential but also for multi-threaded code.
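The profiler described above is a lock-free parallel tool; as a much simplified, single-threaded illustration of what dependence profiling means, the following Python sketch derives read-after-write (RAW) dependences from a toy memory-access trace. The trace format and function name are mine, not the paper's.

```python
from collections import defaultdict

def profile_dependences(trace):
    """Toy data-dependence profiler: given a trace of
    (line, op, addr) events, report read-after-write (RAW)
    dependences between source lines."""
    last_write = {}          # addr -> line of the last write
    deps = defaultdict(set)  # sink line -> {source lines}
    for line, op, addr in trace:
        if op == "r" and addr in last_write:
            deps[line].add(last_write[addr])
        elif op == "w":
            last_write[addr] = line
    return deps

# addr 0x10 written at line 1, read at line 2 -> RAW dependence 2 <- 1
trace = [(1, "w", 0x10), (2, "r", 0x10), (3, "w", 0x20)]
print(dict(profile_dependences(trace)))  # {2: {1}}
```

A real profiler additionally tracks write-after-read and write-after-write dependences and compresses the address map (the paper uses signatures), but the core bookkeeping is of this shape.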


IEEE Transactions on Parallel and Distributed Systems | 2014

Library-Independent Data Race Detection

Ali Jannesari; Walter F. Tichy

Data races are a common problem on shared-memory parallel computers, including multicores. Analysis programs called race detectors help find and eliminate them. However, current race detectors are geared for specific concurrency libraries. When programmers use libraries unknown to a given detector, the detector becomes useless or requires extensive reprogramming. We introduce a new synchronization detection mechanism that is independent of concurrency libraries. It dynamically detects synchronization constructs based on a characteristic code pattern. The approach is non-intrusive and applicable to various concurrency libraries. Experimental results confirm that the approach identifies synchronizations and detects data races regardless of the concurrency libraries involved. With this mechanism, race detectors can be written once and need not be adapted to particular libraries.
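The characteristic code pattern the detector recognizes is, roughly, a thread spinning on a synchronization variable. The paper's detection works at the instruction level with a richer pattern; the following toy heuristic over a read trace conveys only the flavor, with an invented trace format and threshold.

```python
def find_spin_waits(trace, min_reads=3):
    """Heuristic sketch: flag addresses that a thread reads
    repeatedly with the same value until the value finally
    changes -- the signature of a spin-wait on a lock word."""
    spins = set()
    run_addr, run_val, run_len = None, None, 0
    for op, addr, val in trace:
        if op == "r" and addr == run_addr:
            if val == run_val:
                run_len += 1           # still spinning on the same value
            else:
                if run_len >= min_reads:
                    spins.add(addr)    # value flipped after a long spin
                run_addr, run_len = None, 0
        elif op == "r":
            run_addr, run_val, run_len = addr, val, 1
        else:
            run_addr, run_len = None, 0
    return spins

# four reads of 0 at addr 64, then the flag flips -> spin-wait detected
print(find_spin_waits([("r", 64, 0)] * 4 + [("r", 64, 1)]))  # {64}
```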


Journal of Systems and Software | 2016

Unveiling parallelization opportunities in sequential programs

Zhen Li; Rohit Atre; Zia Ul Huda; Ali Jannesari; Felix Wolf

Highlights: detects both loop and task parallelism in a single tool; identifies parallelism based on the concept of computational units (CUs); ranks the most promising parallelization targets; keeps time and memory overhead low enough to deal with real-world applications. The stagnation of single-core performance leaves application developers with software parallelism as the only option to further benefit from Moore's Law. However, in view of the complexity of writing parallel programs, the parallelization of myriads of sequential legacy programs presents a serious economic challenge. A key task in this process is the identification of suitable parallelization targets in the source code. In this paper, we present an approach to automatically identify potential parallelism in sequential programs of realistic size. In comparison to earlier approaches, our work combines a unique set of features that make it superior in terms of functionality: It not only (i) detects available parallelism with high accuracy but also (ii) identifies the parts of the code that can run in parallel, even if they are spread widely across the code, (iii) ranks parallelization opportunities according to the speedup expected for the entire program, while (iv) maintaining competitive overhead both in terms of time and memory.
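The paper uses its own speedup model for ranking; as a simple illustration of the idea, an Amdahl's-law estimate already orders candidate regions by expected whole-program benefit. Region names, runtime fractions, and the worker count below are assumptions.

```python
def rank_targets(regions, workers=4):
    """Rank candidate regions by the whole-program speedup
    Amdahl's law predicts if only that region is parallelized.
    `regions` maps a region name to its fraction of total runtime."""
    def speedup(p):
        return 1.0 / ((1.0 - p) + p / workers)
    return sorted(((speedup(p), name) for name, p in regions.items()),
                  reverse=True)

# a loop taking 60% of runtime beats one taking 25%
ranked = rank_targets({"loop_a": 0.60, "loop_b": 0.25, "task_c": 0.10})
print([name for _, name in ranked])  # ['loop_a', 'loop_b', 'task_c']
```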


International Conference on Parallel and Distributed Systems | 2013

Detecting Correlation Violations and Data Races by Inferring Non-deterministic Reads

Ali Jannesari; Nico Koprowski; Jochen Schimmel; Felix Wolf; Walter F. Tichy

With the introduction of multicore systems and parallel programs, concurrency bugs have become more common. A notorious class of these bugs are data races that violate correlations between variables. This happens, for example, when the programmer does not update correlated variables atomically, which is needed to maintain their semantic relationship.
The detection of such races is challenging because correlations among variables usually escape traditional race detectors, which are oblivious to semantic relationships. In this paper, we present an effective method for dynamically identifying correlated variables together with a race detector based on the notion of non-deterministic reads that identifies malicious data races on correlated variables. In eight programs and 190 micro-benchmarks, we found more than 100 races that were overlooked by other race detectors. Furthermore, we identified about 300 variable correlations that were violated by these races.
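The actual detector infers correlations dynamically and builds on non-deterministic reads; the sketch below only illustrates the kind of violation being hunted, namely a correlated pair of variables not updated within the same atomic region. The trace format and names are invented.

```python
def violated_correlations(writes, correlated):
    """Toy check: for each correlated variable pair, flag any
    atomic region that writes one variable of the pair but not
    the other. `writes` is a list of (region_id, vars_written)."""
    violations = set()
    for a, b in correlated:
        for region_id, written in writes:
            if (a in written) != (b in written):
                violations.add((a, b))   # pair updated non-atomically
    return violations

# x and y are correlated; region 2 updates x alone -> violation
writes = [(1, {"x", "y"}), (2, {"x"}), (3, {"z"})]
print(violated_correlations(writes, [("x", "y")]))  # {('x', 'y')}
```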


Computing Frontiers | 2017

Boda: A Holistic Approach for Implementing Neural Network Computations

Matthew W. Moskewicz; Ali Jannesari; Kurt Keutzer

Neural networks (NNs) are currently a very popular topic in machine learning for both research and practice. GPUs are the dominant computing platform for research efforts and are also gaining popularity as a deployment platform for applications such as autonomous vehicles. As a result, GPU vendors such as NVIDIA have spent enormous effort to write special-purpose NN libraries. On other hardware targets, especially mobile GPUs, such vendor libraries are not generally available. Thus, the development of portable, open, high-performance, energy-efficient GPU code for NN operations would enable broader deployment of NN-based algorithms. A root problem is that high-efficiency GPU programming suffers from high complexity, low productivity, and low portability. To address this, this work presents a framework to enable productive, high-efficiency GPU programming for NN computations across hardware platforms and programming models. In particular, the framework provides specific support for metaprogramming and autotuning of operations over ND-Arrays. To show the correctness and value of our framework and approach, we implement a selection of NN operations, covering the core operations needed for deploying three common image-processing neural networks. We target three different hardware platforms: NVIDIA, AMD, and Qualcomm GPUs. On NVIDIA GPUs, we show both portability between OpenCL and CUDA as well as competitive performance compared to the vendor library. On Qualcomm GPUs, we show that our framework enables productive development of target-specific optimizations, and achieves reasonable absolute performance. Finally, on AMD GPUs, we show initial results that indicate our framework can yield reasonable performance on a new platform with minimal effort.
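Boda's metaprogramming and autotuning machinery targets GPU kernels; stripped to its essence, the autotuning loop is exhaustive timing over a tuning-parameter space. A minimal sketch, with an invented toy kernel and parameter space:

```python
import itertools
import time

def autotune(kernel, param_space, args):
    """Time `kernel` under every combination of tuning parameters
    (e.g. tile sizes) and return the fastest configuration."""
    best_time, best_params = float("inf"), None
    for combo in itertools.product(*param_space.values()):
        params = dict(zip(param_space, combo))
        start = time.perf_counter()
        kernel(*args, **params)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_time, best_params = elapsed, params
    return best_params

# toy "kernel": sum a list in chunks of `tile` elements
def kernel(data, tile):
    return [sum(data[i:i + tile]) for i in range(0, len(data), tile)]

print(autotune(kernel, {"tile": [8, 64, 512]}, ([1] * 4096,)))
```

Real autotuners prune the space and repeat each measurement to reduce noise; this sketch omits both for brevity.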


Parallel, Distributed and Network-Based Processing | 2016

Improving Performance of Transactional Applications through Adaptive Transactional Memory

Thireshan Jeyakumaran; Ehsan Atoofian; Yang Xiao; Zhen Li; Ali Jannesari

Transactional memory (TM) has become progressively widespread, especially with hardware transactional memory implementations becoming increasingly available. In this paper, we focus on Restricted Transactional Memory (RTM) in Intel's Haswell processor and show that the performance of RTM varies across applications. While RTM enhances the performance of some applications relative to software transactional memory (STM), in others it degrades performance. We exploit this variability and present an adaptive system, a static approach that switches between HTM and STM at transaction granularity. By incorporating a decision-tree prediction module, we are able to predict the optimal TM system for a given transaction based on its characteristics. Our adaptive system supports both HTM and STM with the aim of increasing an application's performance. We show that our adaptive system has an average overall speedup of 20.82% over both TM systems.
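The actual predictor is a decision tree trained on transaction characteristics; a hand-written rule of the same shape might look like the following. The feature names and thresholds are invented for illustration, not taken from the paper.

```python
def choose_tm(tx, max_footprint=64, max_conflict_rate=0.2):
    """Toy stand-in for the decision-tree predictor: pick HTM for
    small, low-conflict transactions (which rarely abort in
    hardware) and fall back to STM otherwise."""
    small = tx["footprint"] <= max_footprint        # cache lines touched
    calm = tx["conflict_rate"] <= max_conflict_rate # observed abort ratio
    return "HTM" if small and calm else "STM"

print(choose_tm({"footprint": 12, "conflict_rate": 0.05}))   # HTM
print(choose_tm({"footprint": 512, "conflict_rate": 0.05}))  # STM
```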


International Parallel and Distributed Processing Symposium | 2016

Automatic Parallel Pattern Detection in the Algorithm Structure Design Space

Zia Ul Huda; Rohit Atre; Ali Jannesari; Felix Wolf

Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to the support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.


International Conference on Parallel Processing | 2015

Characterizing Loop-Level Communication Patterns in Shared Memory

Arya Mazaheri; Ali Jannesari; Abdolreza Mirzaei; Felix Wolf

Communication patterns extracted from parallel programs can provide a valuable source of information for parallel pattern detection, application auto-tuning, and runtime workload scheduling on heterogeneous systems. Once identified, such patterns can help find the most promising optimizations. Communication patterns can be detected using different methods, including sandbox simulation, memory profiling, and hardware counter analysis. However, these analyses usually suffer from high runtime and memory overhead, necessitating a trade-off between accuracy and resource consumption. More importantly, none of the existing methods exploit fine-grained communication patterns on the level of individual code regions. In this paper, we present an efficient tool based on the DiscoPoP profiler that characterizes the communication pattern of every hotspot in a shared-memory application. With the aid of static and dynamic code analysis, it produces a nested structure of communication patterns based on the program's loops. By employing asymmetric signature memory, the runtime overhead stays around 225× while the required amount of memory remains fixed. In comparison with other profilers, the proposed method is efficient enough to be used with real-world applications.
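In shared memory, "communication" between threads is implicit: one thread reads what another wrote. The tool above records this per loop with low overhead; the following sketch shows the basic derivation from a memory trace to a thread-to-thread communication matrix (trace format and names are mine):

```python
from collections import Counter

def communication_matrix(events, n_threads):
    """Sketch: derive thread-to-thread communication from a memory
    trace. A read by thread t of an address last written by thread
    u (u != t) counts as one unit of communication u -> t."""
    last_writer = {}
    comm = Counter()
    for tid, op, addr in events:
        if op == "r":
            u = last_writer.get(addr)
            if u is not None and u != tid:
                comm[(u, tid)] += 1
        else:
            last_writer[addr] = tid
    return [[comm[(u, t)] for t in range(n_threads)]
            for u in range(n_threads)]

# thread 0 writes addr 1; thread 1 reads it twice -> 2 units 0 -> 1
events = [(0, "w", 1), (1, "r", 1), (1, "r", 1), (0, "r", 1)]
print(communication_matrix(events, 2))  # [[0, 2], [0, 0]]
```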


Concurrency and Computation: Practice and Experience | 2018

Improving Performance of Transactional Memory through Machine Learning

Yang Xiao; Thireshan Jeyakumaran; Ehsan Atoofian; Ali Jannesari

Transactional memory (TM) is a programming paradigm that facilitates parallel programming for multi-core processors. In the last few years, some chip manufacturers have provided hardware support for TM to reduce the runtime overhead of Software Transactional Memory (STM). In this work, we offer two optimization techniques for TMs. The first technique focuses on Restricted Transactional Memory (RTM) in Intel's Haswell processor and shows that while RTM improves performance over STM in some applications, in others it falls behind STM. We exploit this variability and propose an adaptive technique that switches between RTM and STM, statically. The second technique focuses on the overhead of TM and enhances the speed of the adaptive system. In particular, we focus on the size of transactions and improve performance by changing the transaction size. Optimizing the transaction size manually is a time-consuming process and requires significant software engineering effort. We use a combination of Linear Regression (LR) and a decision tree to decide on the transaction size automatically. We evaluate our optimization techniques using a set of benchmarks from the NAS, DiscoPoP, and STAMP benchmark suites. Our experimental results reveal that our optimization techniques are able to improve the performance of TM programs by 9% and energy-delay by 15%, on average.


ACM Symposium on Parallel Algorithms and Architectures | 2017

Brief Announcement: Meeting the Challenges of Parallelizing Sequential Programs

Rohit Atre; Ali Jannesari; Felix Wolf

Discovering which code sections in a sequential program can be made to run in parallel is the first step in parallelizing it, and programmers routinely struggle in this step. Most of the current parallelism discovery techniques focus on specific language constructs while trying to identify such code sections. In contrast, we propose to concentrate on the computations performed by a program. In our approach, a program is treated as a collection of computations communicating with one another using a number of variables. Each computation is represented as a Computational Unit (CU). A CU contains the inputs and outputs of a computation, and the three phases of a computation: read, compute, and write. Based on the notion of CUs, we present a unified framework to identify both loop and task parallelism in sequential programs.
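As a rough model of what a CU captures, and of the dependence test it enables, consider the following sketch. All names here are illustrative; this is not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ComputationalUnit:
    """Illustrative model of a CU: the variables a computation
    reads as input and the variables it writes as output."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def can_run_in_parallel(a, b):
    """Two CUs may run in parallel if neither touches what the
    other writes, i.e. there is no data dependence between them."""
    return (a.writes.isdisjoint(b.reads | b.writes) and
            b.writes.isdisjoint(a.reads))

u = ComputationalUnit("u", reads={"x"}, writes={"y"})
v = ComputationalUnit("v", reads={"x"}, writes={"z"})
print(can_run_in_parallel(u, v))  # True: both only share the read of x
```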

Collaboration


Dive into Ali Jannesari's collaborations.

Top Co-Authors

Felix Wolf, Technische Universität Darmstadt
Rohit Atre, RWTH Aachen University
Walter F. Tichy, Karlsruhe Institute of Technology
Zhen Li, Technische Universität Darmstadt
Zia Ul Huda, Technische Universität Darmstadt
Kurt Keutzer, University of California
Arya Mazaheri, Technische Universität Darmstadt