Jonathan Parri
University of Ottawa
Publications
Featured research published by Jonathan Parri.
International Conference on Hardware/Software Codesign and System Synthesis | 2013
Wei Wang; Miodrag Bolic; Jonathan Parri
In this paper we present pvFPGA, the first system design solution for virtualizing an FPGA-based hardware accelerator on the x86 platform. Our design adopts the Xen virtual machine monitor (VMM) to build a paravirtualized environment, and a Xilinx Virtex-6 as an FPGA accelerator. The accelerator communicates with the x86 server via PCI Express (PCIe). In comparison to recent accelerator virtualization solutions, which primarily intercept and redirect API calls to the hosted or privileged domain's user space, pvFPGA virtualizes an FPGA accelerator directly at the lower device-driver level. This gives rise to higher efficiency and lower overhead. In pvFPGA, each unprivileged domain allocates a shared data pool for both user–kernel and inter-domain data transfer. In addition, we propose a new component, the coprovisor, which enables multiple domains to simultaneously access an FPGA accelerator. The experimental results have shown that 1) pvFPGA achieves close-to-zero overhead compared to accessing the FPGA accelerator without the VMM layer, 2) the FPGA accelerator is successfully shared by multiple domains, and 3) different maximum data transfer bandwidths can be allotted to different domains by regulating the size of the shared data pool at split-driver loading time.
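The coprovisor's role of multiplexing several domains onto one accelerator can be illustrated with a toy scheduler. This is a minimal sketch under assumptions: the class name, the per-domain queues, and the round-robin policy are illustrative, not the paper's actual split-driver implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy coprovisor: multiplexes per-domain request queues onto one accelerator. */
public class Coprovisor {
    private final List<Queue<Integer>> domains = new ArrayList<>();

    /** Registers a new unprivileged domain; returns its index. */
    public int addDomain() { domains.add(new ArrayDeque<>()); return domains.size() - 1; }

    /** A domain submits a request (identified by an integer) for the accelerator. */
    public void submit(int domain, int requestId) { domains.get(domain).add(requestId); }

    /** Round-robin drain: one request per domain per pass, so no domain starves. */
    public List<Integer> drain() {
        List<Integer> order = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Queue<Integer> q : domains) {
                Integer r = q.poll();
                if (r != null) { order.add(r); progress = true; }
            }
        }
        return order;
    }
}
```

With two domains where the first submits requests 1 and 2 and the second submits request 3, the drain order is 1, 3, 2: the second domain is served before the first domain's backlog.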
ACM Queue | 2011
Jonathan Parri; Daniel Shapiro; Miodrag Bolic; Voicu Groza
Exposing SIMD units within interpreted languages could simplify programs and unleash floods of untapped processor power.
Symposium on Applied Computational Intelligence and Informatics | 2011
Daniel Shapiro; Jonathan Parri; John-Marc Desmarais; Voicu Groza; Miodrag Bolic
Customized application-specific processors called ASIPs are becoming commonplace in contemporary embedded system designs. Neural networks are an interesting application for which an ASIP can be tailored to increase performance, lower power consumption and/or increase throughput. Here, both the bidirectional associative memory and Hopfield auto-associative memory networks are run through an automated instruction-set identification algorithm to identify and select custom instruction candidates suitable for neural network applications. Clusters of neural networks are highly parallel, and it is therefore interesting to consider a homogeneous multiprocessor composed of ASIPs. The two legacy neural network applications showed an 18–120% improvement with automatic hardware/software partitioning for a uniprocessor ASIP. However, because pointers and function calls did not resolve to hardware, the acceleration was concentrated in the network initialization part of the code.
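The Hopfield auto-associative memory mentioned above can be sketched in a few lines. This is the generic textbook formulation (Hebbian storage, synchronous sign-threshold recall), not the paper's benchmark code, and the iteration cap is an arbitrary choice.

```java
/** Minimal Hopfield auto-associative memory over ±1 states. */
public class Hopfield {
    private final int n;
    private final int[][] w;  // symmetric weight matrix, zero diagonal

    public Hopfield(int n) { this.n = n; this.w = new int[n][n]; }

    /** Hebbian rule: w[i][j] += p[i]*p[j] for each stored pattern p (entries ±1). */
    public void store(int[] p) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) w[i][j] += p[i] * p[j];
    }

    /** Synchronous sign-threshold updates until a fixed point (or iteration cap). */
    public int[] recall(int[] state) {
        int[] s = state.clone();
        for (int iter = 0; iter < 10; iter++) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                int h = 0;
                for (int j = 0; j < n; j++) h += w[i][j] * s[j];
                next[i] = h >= 0 ? 1 : -1;
            }
            if (java.util.Arrays.equals(next, s)) break;
            s = next;
        }
        return s;
    }
}
```

Storing the pattern (1, −1, 1, −1) and recalling from the corrupted input (1, −1, 1, 1) recovers the stored pattern; the inner weighted-sum loop is exactly the kind of kernel a custom instruction could accelerate.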
Symposium on Applied Computational Intelligence and Informatics | 2011
Jonathan Parri; Miodrag Bolic; Voicu Groza
Traditionally, common processor augmentation solutions have involved either the addition of coprocessors or the datapath integration of custom instructions within extensible processors as Instruction Set Extensions (ISE). Rarely is the hybrid option of using both techniques explored. Much research already exists on identifying and selecting custom hardware blocks through hardware/software partitioning techniques, but the question remains of how best to use this hardware within a system where both coprocessors and datapath augmentations are possible. This paper looks to extend existing ISE algorithms, which provide custom hardware as dataflow graphs (DFG), by placing that hardware appropriately within a hybrid System-on-Chip (SoC) using standard combinatorial optimization techniques. A combinatorial model is presented to address this placement issue and is applied to two well-known kernel programs. We further show that such standard techniques can execute within a reasonable time frame, alleviating the need for heuristics.
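The placement question can be illustrated as a small combinatorial search: each custom-hardware candidate is either dropped, placed as a datapath ISE, or placed as a coprocessor, subject to an area budget. The exhaustive recursion below is a stand-in for the paper's combinatorial model, with made-up gain and area numbers; it is a sketch of the problem shape, not the actual formulation.

```java
/** Toy hybrid placement: maximize speedup under an area budget, where each
 *  DFG candidate may be dropped, made a datapath ISE, or made a coprocessor. */
public class Placement {
    /** gain[c][0]/area[c][0]: candidate c as an ISE; index 1: as a coprocessor. */
    public static int best(int[][] gain, int[][] area, int budget) {
        return search(gain, area, budget, 0);
    }

    private static int search(int[][] g, int[][] a, int budget, int c) {
        if (c == g.length) return 0;
        int best = search(g, a, budget, c + 1);            // option: drop candidate c
        for (int opt = 0; opt < 2; opt++)                  // option: ISE or coprocessor
            if (a[c][opt] <= budget)
                best = Math.max(best, g[c][opt] + search(g, a, budget - a[c][opt], c + 1));
        return best;
    }
}
```

For two candidates with gains {5 as ISE, 8 as coprocessor} and {4, 6}, areas {2, 5} and {1, 3}, and budget 6, the optimum is 12 (first candidate as a coprocessor, second as an ISE); an ILP solver would reach the same answer without enumerating every assignment.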
Canadian Conference on Electrical and Computer Engineering | 2010
Jonathan Parri; John-Marc Desmarais; Daniel Shapiro; Miodrag Bolic; Voicu Groza
The use of Single Instruction Multiple Data (SIMD) operations can be instrumental in meeting the needs of high performance computations. Most languages, including C/C++, give a user the power to directly exploit this hardware and inherent parallelism. We have created a retargetable native SIMD library which Java programmers are now able to use to directly access SIMD intrinsics including MMX, SSE1, SSE2 and SSE3 through prescribed Java methods in an API. This API gives users direct control over their high-performance computations instead of solely relying on the SIMD optimizations of the Java Virtual Machine (JVM), or relying on a GPU which must send and receive the data from the CPU. Through the use of this Java API and the included backing library, substantial performance gains can be achieved on large and complex vector operations. We show an example for which the API obtains a 2x to 3x speedup for both small and large data sets as compared to solely relying on the SIMD optimizations in the JVM.
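The shape of such an API can be sketched in pure Java. The class and method names below are hypothetical stand-ins for the paper's library, and the plain loop stands in for the JNI dispatch to SSE intrinsics that the real backing library would perform.

```java
/** Hypothetical sketch of a vector-add method in the style of the paper's API. */
public class SimdSketch {
    /** Element-wise addition; the native backend would process 4 floats per SSE op. */
    public static float[] vecAdd(float[] x, float[] y) {
        if (x.length != y.length) throw new IllegalArgumentException("length mismatch");
        float[] out = new float[x.length];
        // A real implementation would hand x and y to native code in one JNI call,
        // amortizing the crossing cost over the whole vector; here a scalar loop
        // stands in for the intrinsic.
        for (int i = 0; i < x.length; i++) out[i] = x[i] + y[i];
        return out;
    }
}
```

The key design point is granularity: one API call per whole-vector operation keeps the JNI boundary cost small relative to the work done on the native side.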
Canadian Conference on Electrical and Computer Engineering | 2014
Iype P. Joseph; Jonathan Parri; Yu Wang; Miodrag Bolic; Amir Rajabzadeh; Voicu Groza
GPUs and multicore CPUs are becoming common in today's embedded world of tablets and smartphones. With CPUs and GPUs getting more complex, maximizing hardware utilization and minimizing energy consumption are becoming problematic. The challenges faced in GPGPU computing on embedded platforms differ from their desktop counterparts due to memory and computational limitations. This study evaluates the advantages of offloading Java applications to an embedded GPU. By employing two approaches, namely the Java Native Interface (JNI-OpenCL) and Java bindings for OpenCL (JOCL), we allowed programmers to program an embedded GPU from Java. Experiments were conducted on a Freescale i.MX6Q SabreLite board, which contains a quad-core ARM Cortex-A9 CPU and a Vivante GC2000 GPU supporting the OpenCL 1.1 Embedded Profile. The results show up to an eight-times increase in performance while consuming only one-third of the energy compared to the CPU-only version of the Java program. This paper demonstrates the performance and energy benefits achieved by offloading Java programs onto an embedded GPU. To the best of our knowledge, this is the first work involving Java acceleration on embedded GPUs.
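The reported numbers combine multiplicatively: roughly 8× the throughput at roughly 1/3 the energy means about 24× more work per joule. A one-line helper makes that back-of-envelope arithmetic explicit (the method name is illustrative; the figures are the abstract's headline numbers, not a general claim).

```java
/** Back-of-envelope energy-efficiency gain from a speedup and an energy ratio. */
public class EnergyEfficiency {
    /** (work/time gain) divided by (fraction of energy consumed) = work-per-joule gain. */
    public static double perfPerJouleGain(double speedup, double energyRatio) {
        return speedup / energyRatio;
    }
}
```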
Symposium on Applied Computational Intelligence and Informatics | 2011
A. Ayala; H. Osman; Daniel Shapiro; John-Marc Desmarais; Jonathan Parri; Miodrag Bolic; Voicu Groza
Backtracking algorithms are used to methodically and exhaustively search a solution space for an optimal solution to a given problem. A classic example of a backtracking algorithm is finding all solutions to the problem of placing N queens on an N × N chess board such that no two queens attack each other. This paper demonstrates a methodology for rewriting this backtracking algorithm to take advantage of multi-core computing resources. We accelerated a sequential version of the N-queens problem on x86 and PPC64 architectures. Using problem sizes between 13 and 17, we observed an average speedup of 3.24 on x86 and 9.24 on PPC64.
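The parallelization strategy can be sketched directly: each first-row column choice roots an independent subtree, so the subtrees can be counted on separate cores and summed. This bitmask version is a generic formulation of that idea, not the paper's code, and it counts solutions rather than enumerating them.

```java
import java.util.stream.IntStream;

/** Parallel N-queens: the first-row choice partitions the search space. */
public class NQueens {
    /** Bitmask backtracking over the remaining rows. */
    static long solve(int n, int row, long cols, long diagL, long diagR) {
        if (row == n) return 1;
        long count = 0;
        long free = ~(cols | diagL | diagR) & ((1L << n) - 1);  // columns still safe
        while (free != 0) {
            long bit = free & -free;  // lowest safe column
            free -= bit;
            count += solve(n, row + 1, cols | bit, (diagL | bit) << 1, (diagR | bit) >>> 1);
        }
        return count;
    }

    /** Top-level split: one independent task per first-row placement. */
    public static long count(int n) {
        return IntStream.range(0, n).parallel()
                .mapToLong(c -> solve(n, 1, 1L << c, (1L << c) << 1, (1L << c) >>> 1))
                .sum();
    }
}
```

Because the subtrees share no mutable state, no locking is needed; load imbalance between subtrees is the main limit on the achievable speedup.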
International Journal of High Performance Computing and Networking | 2017
Wei Wang; Miodrag Bolic; Jonathan Parri
This paper presents an ameliorated design of pvFPGA, a novel system design solution for virtualising an FPGA-based hardware accelerator through a virtual machine monitor (VMM). The accelerator design on the FPGA can be used to accelerate various applications, regardless of their computation latencies. In the implementation, we adopt the Xen VMM to build a paravirtualised environment, and a Xilinx Virtex-6 as an FPGA accelerator. Data are transferred between the x86 server and the FPGA accelerator through direct memory access (DMA), and a streaming pipeline technique is adopted to improve the efficiency of data transfer. Several solutions to streaming pipeline hazards are discussed in this paper. In addition, we propose a technique, hyper-requesting, which enables portions of two requests bidding for different accelerator applications to be processed on the FPGA accelerator simultaneously through DMA context switches, achieving request-level parallelism. The experimental results show that hyper-requesting reduces request turnaround time by up to 80%.
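The effect of hyper-requesting on turnaround can be illustrated with a toy chunk schedule: instead of the second request waiting for the first to drain completely, the DMA channel alternates between the two requests' chunks. The model below is an illustrative simplification (uniform chunk costs, strict alternation), not the paper's DMA context-switch mechanism.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of hyper-requesting: chunks of two requests share the DMA channel. */
public class HyperRequest {
    /** Returns the request IDs (0 or 1) in the order their chunks occupy the channel. */
    public static List<Integer> interleave(int chunksA, int chunksB) {
        List<Integer> schedule = new ArrayList<>();
        int a = 0, b = 0;
        while (a < chunksA || b < chunksB) {
            if (a < chunksA) { schedule.add(0); a++; }  // context-switch to request A
            if (b < chunksB) { schedule.add(1); b++; }  // context-switch to request B
        }
        return schedule;
    }
}
```

With three chunks for request A and two for request B, the schedule is A, B, A, B, A: request B finishes after slot 4 instead of waiting for all of A, which is the turnaround reduction the technique targets.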
Archive | 2012
Jonathan Parri; John-Marc Desmarais; Daniel Shapiro; Miodrag Bolic; Voicu Groza
Software/hardware codesign is a complex research problem that has been slowly making headway into industry-ready system design products. Recent advances have shown the viability of this direction within the design space exploration scope, especially with regard to rapid development cycles. Here, we explore the hardware/software codesign landscape in the artificial neural network problem space. Automated tools requiring minimal technical expertise, from Altera and Tensilica, are examined alongside newer advances from the hardware/software codesign research domain. The design space exploration options discussed here aim to achieve better software/hardware partitions using instruction-set extensions and coprocessors. As neural networks continue to find usage in embedded systems, it has become imperative to optimize their implementation efficiently within a short development cycle. Modest speedups can be easily achieved with these automated hardware/software codesign tools on the benchmarks examined.
Archive | 2013
Miodrag Bolic; Jonathan Parri; Wei Wang