Zain Ul-Abdin
Halmstad University
Publication
Featured research published by Zain Ul-Abdin.
asilomar conference on signals, systems and computers | 2014
Andreas Olofsson; Tomas Nordström; Zain Ul-Abdin
In this paper we introduce Epiphany as a high-performance, energy-efficient manycore architecture suitable for real-time embedded systems. This scalable architecture supports floating-point operations in hardware and achieves 50 GFLOPS/W in 28 nm technology, making it suitable for high-performance streaming applications like radio base stations and radar signal processing. Through an efficient 2D mesh Network-on-Chip and a distributed shared memory model, the architecture is scalable to thousands of cores on a single chip. An Epiphany-based open source computer named Parallella was launched in 2012 through Kickstarter crowdfunding and has now shipped to thousands of customers around the world.
embedded and real-time computing systems and applications | 2014
Süleyman Savas; Essayas Gebrewahid; Zain Ul-Abdin; Tomas Nordström; Mingkun Yang
Today computer architectures are shifting from single core to manycores for several reasons, including performance demands and power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to efficient development of applications. Hence there is a need to raise the abstraction level of development techniques for manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages, and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support intercore communication. The code generation can target multiple architectures, but the results presented in this paper are focused on Adapteva's manycore architecture, Epiphany. We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations, we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3× down to a factor of only 1.3×.
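For readers unfamiliar with the benchmark named above, the following is a minimal reference sketch of a 2D-IDCT computed by row-column decomposition. It is not the paper's CAL or C code; the 8×8 block size, the orthonormal scaling, and the names `idct_1d` and `idct_2d` are assumptions made for illustration only.

```python
import math

N = 8  # block size assumed for this sketch; the abstract does not state it


def idct_1d(coeffs):
    """Inverse of the orthonormal DCT-II for one row or column."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for u in range(n):
            # Orthonormal scaling: the DC term uses sqrt(1/n), the others sqrt(2/n).
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(total)
    return out


def idct_2d(block):
    """2D-IDCT via separability: 1D transform on every row, then on every column."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    # cols holds transformed columns; transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


# Usage: a block with only a DC coefficient decodes to a constant block.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
```

The row pass and the column pass are independent per row and per column, which is the kind of parallelism a dataflow description can expose; how the paper maps this onto cores is not specified in the abstract.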
international parallel and distributed processing symposium | 2012
Zain Ul-Abdin; Essayas Gebrewahid; Bertil Svensson
With the advent of many core architectures comprising hundreds of processing elements, fault management has become a major challenge. We present an approach that uses the occam-pi language to manage the fault recovery mechanism on a new many core architecture, the Platform 2012 (P2012). The approach is made possible by extending our previously developed compiler framework to compile occam-pi implementations to the P2012 architecture. We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We demonstrate the applicability of the approach by an experimental case study, in which the DCT algorithm is implemented on a set of four processing elements. During run-time, some of the tasks are then relocated from assumed faulty processing elements to the faultless ones by means of dynamic reconfiguration of the hardware. The working of the demonstrator and the simulation results illustrate not only the feasibility of the approach but also how the use of higher-level abstractions simplifies the fault handling.
signal processing systems | 2017
Zain Ul-Abdin; Mingkun Yang
The successful realization of next-generation radar systems places high performance demands on the signal processing chain. Among these are advanced Active Electronically Scanned Array (AESA) radars, in which complex calculations are to be performed on huge sets of data in real time. Manycore architectures are designed to provide the flexibility and high performance essential for such streaming applications. This paper deals with the implementation of compute-intensive parts of the AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our findings in terms of performance and productivity gains achieved in this case study. The comparison of the performance results with the reference sequential implementations executing on a state-of-the-art embedded processor shows that we are able to achieve a speedup of 1.6x to 4.4x by using only 10 cores of Epiphany.
parallel, distributed and network-based processing | 2016
Benard Xypolitidis; Rudin Shabani; Satej V. Khandeparkar; Zain Ul-Abdin; Süleyman Savas; Tomas Nordström
Today many high-performance embedded processors already contain multiple processor cores, and we see heterogeneous manycore architectures being proposed. It is therefore very desirable to have a fast way to explore various heterogeneous architectures through an architectural design space exploration tool, giving the designer the option to explore design alternatives before the physical implementation. In this paper, we have extended Heracles, a design space exploration tool for (homogeneous) manycore architectures, to incorporate different types of processing cores, and thus allow us to model heterogeneity. Our tool, called the Heterogeneous Heracles System (HHS), can include OpenRISC cores in addition to the already supported MIPS core. The new tool retains the possibility available in Heracles to perform register transfer level (RTL) simulations of each explored architecture in Verilog as well as synthesizing it to field-programmable gate arrays (FPGAs). To facilitate the exploration of heterogeneous architectures, we have also extended the graphical user interface (GUI) to support heterogeneity. This GUI provides options to configure the types of core, core settings, memory system and network topology. Some initial results on FPGA utilization are presented from synthesizing both homogeneous and heterogeneous manycore architectures, as well as some benchmark results from both simulated and synthesized architectures.
field-programmable logic and applications | 2009
Zain Ul-Abdin
Embedded signal processing systems are facing the challenges of increased computational demands. Reconfigurable architectures, which can be configured to form application-specific hardware, offer not only a high degree of parallelism but also the possibility to dynamically allocate the resources during run-time. This allows the user to implement applications which are otherwise too large to be handled by a particular device. The reconfigurable computing devices have evolved over the years from the gate-level arrays to a more coarse-grained composition of highly optimized functional blocks or even program controlled processing elements, which are operated in a coordinated manner to improve performance and energy efficiency.
LICENTIATE THESIS: High-Level Programming of Coarse-Grained Reconfigurable Architectures
Abstract: The design of high-performance embedded systems for signal processing applications is facing the challenges of not only increased computational demands but also increased demands for adaptability to future functional requirements for these applications. A category of reconfigurable architectures consisting of program controlled processing units is gaining attention as a means to cope with these problems. This thesis focuses on the programming aspects for such coarse-grained reconfigurable computing devices and the relevant computation models capable of exposing different kinds of parallelism inherent in the application. Thus these computation models can be adopted for expressing computations intended for such machines in order to achieve better performance. The field of coarse-grained reconfigurable architectures is first explored, and based on the architectural variations in terms of granularity, reconfigurability, and interconnection networks, coarse-grained reconfigurable architectures are classified into four categories. The categories are hybrid architectures, arrays of functional units, arrays of processors, and arrays of soft processors. A study of the programming methodologies used for these different architectures reveals that programming techniques based on computation models such as Stream processing, Communicating Sequential Processes (CSP), and Kahn Process Networks (KPN) are gaining wider acceptance. Our hypothesis is that the use of languages based on an appropriate model of computation enhances productivity and allows better use of resources without compromising performance. As a proof of concept, experimental studies are performed based on CSP as a selected model of computation. The first study involves a concurrent language, which is used to generate descriptions for FPGAs. In the second, an approach of compiling a CSP-based language, occam-pi, to a reconfigurable processor array is evaluated. The method is based on developing a compiler backend for generating native code for the target architecture.
The first computers | 2018
Süleyman Savas; Zain Ul-Abdin; Tomas Nordström
The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. T ...
ieee computer society annual symposium on vlsi | 2017
Süleyman Savas; Erik Hertz; Tomas Nordström; Zain Ul-Abdin
This paper proposes a novel method for performing division on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented with and without pipeline stages individually and synthesized while targeting a Xilinx Ultrascale FPGA. The implementations show better resource usage and latency results when compared to other implementations based on different methods. In terms of throughput, the proposed method outperforms most of the other works; however, some Altera FPGAs achieve a higher clock rate due to the differences in the DSP slice multiplier design. Due to its small size, low latency and high throughput, the presented floating-point division unit is suitable for high-performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator.
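The "inverter followed by a multiplier" structure described above can be illustrated in software. The sketch below is not the paper's method: it swaps in plain per-segment quadratic (Lagrange) interpolation for the reciprocal and does not implement Parabolic Synthesis. The segment count and the names `SEGMENTS`, `reciprocal_mantissa` and `approx_divide` are hypothetical, and it works on Python floats rather than on binary32 bit patterns.

```python
import math

SEGMENTS = 16  # hypothetical number of interpolation segments


def reciprocal_mantissa(m):
    """Approximate 1/m for m in [0.5, 1) with a quadratic per segment."""
    width = 0.5 / SEGMENTS
    idx = min(int((m - 0.5) / width), SEGMENTS - 1)
    x0 = 0.5 + idx * width
    x1 = x0 + width / 2
    x2 = x0 + width
    # In hardware these three samples would be precomputed table entries.
    y0, y1, y2 = 1.0 / x0, 1.0 / x1, 1.0 / x2
    # Second-degree Lagrange interpolation through the three samples.
    l0 = (m - x1) * (m - x2) / ((x0 - x1) * (x0 - x2))
    l1 = (m - x0) * (m - x2) / ((x1 - x0) * (x1 - x2))
    l2 = (m - x0) * (m - x1) / ((x2 - x0) * (x2 - x1))
    return y0 * l0 + y1 * l1 + y2 * l2


def approx_divide(a, b):
    """Compute a / b as a * (1 / b): invert the divisor, then multiply."""
    if b == 0.0:
        raise ZeroDivisionError("division by zero")
    m, e = math.frexp(abs(b))  # abs(b) == m * 2**e with m in [0.5, 1)
    inv = math.ldexp(reciprocal_mantissa(m), -e)  # 1/abs(b), exponent handled exactly
    return a * math.copysign(inv, b)


# Usage: the approximation error comes only from the mantissa reciprocal.
q = approx_divide(1.0, 3.0)
```

Splitting the divisor into mantissa and exponent keeps the approximated function on a fixed, narrow interval, which is why a small table of segments suffices in this sketch.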
field programmable gate arrays | 2016
Zain Ul-Abdin; Bertil Svensson
The future trend in microprocessors for the more advanced embedded systems is focusing on massively parallel reconfigurable architectures, consisting of heterogeneous ensembles of hundreds of processing elements communicating over a reconfigurable interconnection network. However, the mastering of low-level microarchitectural details involved in the programming of such massively parallel platforms becomes too cumbersome, which limits their adoption in many applications. Thus, there is a dire need for an approach to produce high-performance scalable implementations that harness the computational resources of the emerging reconfigurable platforms. This article addresses the grand challenge of accessibility of these diverse reconfigurable platforms by suggesting the use of a high-level language, occam-pi, and developing a complete design flow for building, compiling, and generating machine code for heterogeneous coarse-grained hardware. We have evaluated the approach by implementing complex industrial case studies and three common signal processing algorithms. The results of the implemented case studies suggest that the occam-pi language-based approach, because of its well-defined semantics for expressing concurrency and reconfigurability, simplifies the development of applications employing runtime reconfigurable devices. The associated compiler framework ensures portability as well as the performance benefits across heterogeneous platforms.
acm sigplan symposium on principles and practice of parallel programming | 2016
Essayas Gebrewahid; Mehmet Ali Arslan; Andreas Karlsson; Zain Ul-Abdin
With the arrival of heterogeneous manycores comprising various features to support task, data and instruction-level parallelism, developing applications that take full advantage of the hardware's parallel features has become a major challenge. In this paper, we present an extension to our CAL compilation framework (CAL2Many) that supports data parallelism in the CAL Actor Language. Our compilation framework makes it possible to program architectures with SIMD support using a high-level language and provides efficient code generation. We support general SIMD instructions, but the code generation backend is currently implemented for two custom architectures, namely ePUMA and EIT. Our experiments were carried out for these two custom SIMD processor architectures using two applications. The experiments show the possibility of achieving performance comparable to hand-written machine code with much less programming effort.