Matin Hashemi

University of California, Davis

Publication

Featured research published by Matin Hashemi.

acm multimedia | 2016

CNNdroid: GPU-Accelerated Execution of Trained Deep Convolutional Neural Networks on Android

Seyyed Salar Latifi Oskouei; Hossein Bakhshi Golestani; Matin Hashemi; Soheil Ghiasi

Many mobile applications running on smartphones and wearable devices would potentially benefit from the accuracy and scalability of deep CNN-based machine learning algorithms. However, performance and energy consumption limitations make the execution of such computationally intensive algorithms on mobile devices prohibitive. We present a GPU-accelerated library, dubbed CNNdroid [1], for execution of trained deep CNNs on Android-based mobile devices. Empirical evaluations show that CNNdroid achieves up to 60X speedup and 130X energy saving on current mobile devices. The CNNdroid open source library is available for download at https://github.com/ENCP/CNNdroid

ACM Transactions on Embedded Computing Systems | 2009

Throughput-driven synthesis of embedded software for pipelined execution on multicore architectures

Matin Hashemi; Soheil Ghiasi

We present a methodology for pipelined software synthesis of streaming applications. First, we develop a versatile task assignment algorithm capable of optimizing realistically-arbitrary cost functions for two cores. The algorithm is exact (i.e., theoretically optimal), in contrast to existing heuristics. Second, our approximation technique provides an adjustable knob to trade solution quality for algorithm runtime and memory. Third, we develop a recursive heuristic for more cores. FPGA-based emulated experiments validate our theoretical results. The exact algorithm yields a 1.7× throughput improvement. The approximation method offers a range of tradeoff points (e.g., 3× faster with 20× less memory) while degrading the throughput by only 1% to 5%.

symposium on application specific processors | 2008

System-Level Performance Estimation for Application-Specific MPSoC Interconnect Synthesis

Po-Kuan Huang; Matin Hashemi; Soheil Ghiasi

We present a framework for development of streaming applications as concurrent software modules running on multiprocessor systems-on-chip (MPSoC). We propose an iterative design space exploration mechanism to customize the MPSoC architecture for given applications. Central to the exploration engine is our system-level performance estimation methodology, which both quickly and accurately determines the quality of candidate architectures. We implemented a number of streaming applications on candidate architectures that were emulated on an FPGA. Hardware measurements show that our system-level performance estimation method incurs only 15% error in predicting application throughput. More importantly, it always correctly guides design space exploration by achieving 100% fidelity in quality-ranking candidate architectures. Compared to behavioral simulation of compiled code, our system-level estimator runs more than 12 times faster and requires 7 times less memory.
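
The abstract does not define how error and fidelity are computed. A minimal sketch, assuming fidelity means the fraction of candidate pairs whose relative ranking by estimated throughput matches their ranking by measured throughput, and that error is the worst-case relative deviation, could look like this. The function names and the throughput numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def ranking_fidelity(estimated, measured):
    """Fraction of candidate pairs ordered the same way by estimate and measurement."""
    pairs = list(combinations(range(len(estimated)), 2))
    if not pairs:
        return 1.0  # zero or one candidate: nothing can be mis-ranked
    agree = 0
    for i, j in pairs:
        est_diff = estimated[i] - estimated[j]
        meas_diff = measured[i] - measured[j]
        # A pair agrees when both differences have the same sign (or both are ties).
        if est_diff * meas_diff > 0 or (est_diff == 0 and meas_diff == 0):
            agree += 1
    return agree / len(pairs)


def max_relative_error(estimated, measured):
    """Worst-case relative deviation of the estimate from the measurement."""
    return max(abs(e - m) / m for e, m in zip(estimated, measured))


# Made-up throughput values for four candidate architectures (illustration only).
estimated = [100.0, 80.0, 120.0, 60.0]
measured = [110.0, 90.0, 125.0, 70.0]

print(ranking_fidelity(estimated, measured))
print(max_relative_error(estimated, measured))
```

Under this reading, an estimator can be off in absolute terms and still steer the exploration correctly, as long as it ranks the candidates in the right order.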

languages, compilers, and tools for embedded systems | 2012

FORMLESS: scalable utilization of embedded manycores in streaming applications

Matin Hashemi; Mohammad H. Foroozannejad; Soheil Ghiasi; Christoph Etzel

Variants of dataflow specification models are widely used to synthesize streaming applications for distributed-memory parallel processors. We argue that the current practice of specifying streaming applications using rigid dataflow models implicitly prohibits a number of platform-oriented optimizations and hence limits portability and scalability with respect to the number of processors. We motivate Functionally-cOnsistent stRucturally-MalLEable Streaming Specification, dubbed FORMLESS, which refers to raising the abstraction level beyond fixed-structure dataflow to address its portability and scalability limitations. To demonstrate the potential of the idea, we develop a design space exploration scheme to customize the application specification to better fit the target platform. Experiments with several common streaming case studies demonstrate improved portability and scalability over conventional dataflow specification models, and confirm the effectiveness of our approach.

ACM Transactions on Embedded Computing Systems | 2013

Throughput-memory footprint trade-off in synthesis of streaming software on embedded multiprocessors

Matin Hashemi; Mohammad H. Foroozannejad; Soheil Ghiasi

We study the trade-off between throughput and memory footprint of embedded software that is synthesized from acyclic static dataflow (task graph) specifications targeting distributed memory multiprocessors. We identify iteration overlapping as a knob in the synthesis process by which one can trade application throughput for its memory requirement. Given an initial processor assignment and non-overlapped task schedule, we formally present underlying properties of the problem, such as constraints on a valid iteration overlapping, maximum possible throughput, and minimum memory footprint. Moreover, we develop an effective algorithm for generation of a rich set of design points that provide a range of trade-off options. Experimental results on a number of applications and architectures validate the effectiveness of our approach.

languages, compilers, and tools for embedded systems | 2007

Joint throughput and energy optimization for pipelined execution of embedded streaming applications

Po-Kuan Huang; Matin Hashemi; Soheil Ghiasi

We present a methodology for synthesizing streaming applications, modeled as task graphs, for pipelined execution on multi-core architectures. We develop a task graph extraction and characterization framework that accurately determines the structure, computation, and communication characteristics of the application task graph from its specification in C. Furthermore, we develop a provably optimal algorithm that jointly balances the workload assigned to each core and minimizes inter-core communication traffic. Experimental results show that our versatile method improves the throughput of streaming applications significantly under a variety of hardware configurations.

languages, compilers, and tools for embedded systems | 2015

Implementation-Aware Model Analysis: The Case of Buffer-Throughput Tradeoff in Streaming Applications

Kamyar Mirzazad Barijough; Matin Hashemi; Volodymyr Khibin; Soheil Ghiasi

Models of computation abstract away a number of implementation details in favor of well-defined semantics. While this has unquestionable benefits, we argue that analysis of models solely based on operational semantics (implementation-oblivious analysis) is unfit to drive implementation design space exploration. Specifically, we study the tradeoff between buffer size and streaming throughput in applications modeled as synchronous data flow (SDF) graphs. We demonstrate the inherent inaccuracy of the implementation-oblivious approach, which only considers SDF operational semantics. We propose a rigorous transformation, which equips the state-of-the-art buffer-throughput tradeoff analysis technique with implementation awareness. Extensive empirical evaluation shows that our approach results in significantly more accurate estimates of streaming throughput at the model level, while running two orders of magnitude faster than cycle-accurate simulation of implementations.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

Time-Scalable Mapping for Circuit-Switched GALS Chip Multiprocessor Platforms

Mohammad H. Foroozannejad; Matin Hashemi; Alireza Mahini; Bevan M. Baas; Soheil Ghiasi

We study the problem of mapping concurrent tasks of an application to cores of a chip multiprocessor that utilizes circuit-switched interconnect and globally asynchronous locally synchronous (GALS) clocking domains. We develop a configurable algorithm that naturally handles a number of practical requirements, such as architectural features of the target platform, core failures, and hardware accelerators, and in addition is scalable to a large number of tasks and cores. Experiments with several real-life applications show that our algorithm outperforms manual mapping, integer linear programming-based mapping after ten days of solver run time, and a recent packet-switched network-on-chip-based task mapper. Through the last comparison, we underscore the unique requirements of task mapping for circuit-switched GALS architectures.

design, automation, and test in europe | 2008

Exact and approximate task assignment algorithms for pipelined software synthesis

Matin Hashemi; Soheil Ghiasi

Pipelined execution of streaming applications enables processing of high-throughput data under performance constraints. We present an integrated approach to synthesizing pipelined software for dual-core architectures. We target streaming applications modeled as task graphs that are amenable to static analysis. We develop a versatile task assignment algorithm that considers the combined effect of workload imbalance between processors and inter-processor communication. Our technique, which runs in pseudo-linear time, provably maximizes application throughput. Furthermore, we develop an approximation algorithm for task assignment whose complexity is strictly polynomial. It provides the designer with an adjustable knob to controllably trade solution quality for algorithm runtime and memory requirement. Empirical throughput measurements using an FPGA-based dual-core system validate our theoretical results. Our exact algorithm consistently outperforms a recent competitor. Compared to exact task assignment, the approximate method runs about 3 times faster, requires about 20 times less memory, and results in only 1% to 5% throughput loss.
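
To make the trade-off between workload imbalance and inter-processor communication concrete, the following sketch exhaustively searches all two-core assignments of a small task graph. It is not the pseudo-linear exact algorithm or the approximation described in the abstract. The brute-force search runs in exponential time, and the cost model (pipeline period = heavier core load plus total cross-core communication) is an assumption made for illustration. The function name and the example numbers are hypothetical.

```python
from itertools import product


def best_two_core_assignment(work, edges):
    """Brute-force search over all assignments of tasks to cores 0 and 1.

    work:  list of per-task compute costs.
    edges: dict mapping (producer, consumer) task indices to a communication cost.
    Returns (period, assignment) with the smallest assumed pipeline period.
    """
    best = None
    for assign in product((0, 1), repeat=len(work)):
        load = [0, 0]
        for task, core in enumerate(assign):
            load[core] += work[task]
        # Only edges whose endpoints sit on different cores cost communication.
        comm = sum(c for (u, v), c in edges.items() if assign[u] != assign[v])
        # Assumed cost model: the bottleneck core also pays for all transfers.
        period = max(load) + comm
        if best is None or period < best[0]:
            best = (period, assign)
    return best


# Hypothetical four-task chain: 0 -> 1 -> 2 -> 3.
work = [4, 3, 2, 5]
edges = {(0, 1): 1, (1, 2): 4, (2, 3): 1}

period, assignment = best_two_core_assignment(work, edges)
print(period, assignment)  # 10 (0, 0, 0, 1)
```

In this example the perfectly balanced split (tasks 0 and 1 on one core, 2 and 3 on the other) cuts the expensive edge between tasks 1 and 2 and ends up with period 11, so the less balanced assignment with cheaper communication wins. Throughput under this model is the reciprocal of the period.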

Archive | 2017

Throughput-Driven Parallel Embedded Software Synthesis from Synchronous Dataflow Models: Caveats and Remedies

Matin Hashemi; Kamyar Mirzazad Barijough; Soheil Ghiasi

Synchronous dataflow (SDF) graphs are often the computational model of choice for specification, analysis, and automated synthesis of parallel streaming kernels targeting embedded multiprocessor system-on-a-chip (MPSoC) platforms. We discuss several limitations of the SDF graphs in the context of conventional parallel software synthesis methodologies, and highlight the associated degradation in analysis accuracy and performance of the synthesized software. Subsequently, we propose several extensions to the strict notion of SDF graph model that address the identified issues. We present extensive empirical evaluations, which underscore the model limitations and the effectiveness of our approach.
