Anthony Brandon | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anthony Brandon is active.

Explore More

Publication

Featured researches published by Anthony Brandon.

design, automation, and test in europe | 2013

Support for dynamic issue width in VLIW processors using generic binaries

Anthony Brandon; Stephan Wong

Different applications exhibit different behavior that cannot be optimally captured by a fixed organization of a VLIW processor. However, through exploitation of reconfigurable hardware we can optimize the organization when running different applications. In this paper, we propose a novel way to execute the same binary on different issue-width processors without much hardware modifications. We propose to change the compiler and assembler to ensure correct results. Our experiments show an average slowdown of around 1.3× when compared to binaries compiled for specific issue-widths. This can be further improved to less than 1.09× on average with additional compiler optimizations. Even though the flexibility comes at a price, it can be exploited for many other purposes, such as: dynamic performance/energy trade-off and energy-saving mechanisms, dynamic hardware sharing, and dynamic code insertion for hardware fault detection mechanisms.

field-programmable logic and applications | 2010

General Purpose Computing with Reconfigurable Acceleration

Anthony Brandon; Ioannis Sourdis; Georgi Gaydadjiev

In this paper we describe a new generic approach for accelerating software functions using a reconfigurable device connected through a high-speed link to a general purpose system. As opposed to related ISA extension approaches, we insert system calls to the original program at hand to control the reconfigurable accelerator. The reconfigurable device is controlled by the host through a device driver, and initiates communication by raising interrupts; it further has direct accesses to the main memory (DMA) operating in the virtual address space. To do so, the reconfigurable device supports address translation, memory protection and paging, while the driver serves the device interrupts, and ensures that shared data in the host-cache remain coherent. The system is implemented in a machine which provides a Hyper Transport bus connecting a Xilinx Virtex4-100 FPGA.

reconfigurable computing and fpgas | 2015

Multiple contexts in a multi-ported VLIW register file implementation

Joost Hoozemans; Jens Johansen; Jeroen van Straten; Anthony Brandon; Stephan Wong

The register file is an expensive component in the design of any processor, especially, when considering the additional ports that are needed to support multiple datapaths within a wide-issue VLIW processor. In a recent work, these additional resources were used to dynamically reconfigure the register file to support a dynamically reconfigurable VLIW core. The design can be perceived as a single 8-issue, two 4-issue, or four 2-issue VLIW cores. Consequently, the multi-ported design can operate in different modes, namely as one, two, or four register files, respectively, corresponding to the active number of cores. The implementation of the register file design on FPGAs using Block RAMs still results in unused resources due to the coarseness of the Block RAMs. In this paper, we propose to re-purpose these unused BRAM resources to additionally support multiple contexts next to earliermentioned modes. In this manner, the 8-issue, 4-issue, and 2- issue cores have access to 4, 2, and 1 contexts, respectively. Consequently, we can avoid saving and restoring of the task states in a multi-task environment, turning context switching from a traditionally time-consuming event to an almost instantaneous event. The advantage of this is the reduction of interrupt latency and task switching latency, which are important in real-time and embedded systems. Our results show that our technique can improve interrupt latency by a factor of 17.4x compared to using a software register spill routine, depending on the behavior of the memory system. Likewise, the task switching time can be improved by 6.7x.

design, automation, and test in europe | 2016

Run-time phase prediction for a reconfigurable VLIW processor

Qi Guo; Anderson Luiz Sartor; Anthony Brandon; Antonio Carlos Schneider Beck; Xuehai Zhou; Stephan Wong

It is well-known that different applications exhibit varying amounts of ILP. Execution of these applications on the same fixed-width VLIW processor will result (1) in wasted energy due to underutilized resources if the issue-width of the processor is larger than the inherent ILP; or alternatively, (2) in lower performance if the issue-width is smaller than the inherent ILP. Moreover, even within a single application distinct phases can be observed with varying ILP and therefore changing resource requirements. With this in mind, we designed the ρ-VEX processor, which is a VLIW processor that can change its issue-width at run-time. In this paper, we propose a novel scheme to dynamically (i.e., at run-time) optimize the resource utilization by predicting and matching the number of active data-paths for each application phase. The purpose is to achieve low energy consumption for applications with low ILP, and high performance for applications with high ILP, on a single VLIW processor design. We prototyped the ρ-VEX processor on an FPGA and obtained the dynamic traces of applications running on top of a Linux port. Our results show that it is possible in some cases to achieve the performance of an 8-issue core with 10% lower energy consumption, while in others we achieve the energy consumption of a 2-issue core with close to 20% lower execution time.

field-programmable technology | 2011

Reconfigurable acceleration and dynamic partial self-reconfiguration in general purpose computing

Ioannis Sourdis; Abhijit Nandy; Venkatasubramanian Viswanathan; Anthony Brandon; Dimitris Theodoropoulos; Georgi Gaydadjiev

In this paper, we describe a generic approach for integrating a dynamically reconfigurable device into a general purpose system interconnected with a high-speed link. The system can dynamically install and execute hardware instances of functions to accelerate parts of a given software code. The hardware descriptions of the functions (bitstreams) are inserted into the executable binary running on the system. Our compiler further inserts system-calls to the software code to control the reconfigurable device. Thereby, the general purpose host-processor of the system manages the hardware reconfiguration and execution through a Linux device driver. The device has direct access to the main memory (DMA) operating in the virtual address space; it further supports memory mapped IO for data and control, and is able to raise and handle interrupts for synchronization. The above system is implemented on a general purpose machine providing a HyperTransport bus to connect a Xilinx Virtex4–100 FPGA, an AMD Opteron-244, and 1 GB of DDR main memory. We evaluate our proposal using a secure audio processing application. We accelerate in hardware the Audio processing kernel as well as the subsequent AES encryption function via dynamic partial self-reconfiguration. The proposed system achieves a 12× speedup over a software for the application at hand.

automation, robotics and control systems | 2017

Exploring ILP and TLP on a Polymorphic VLIW Processor

Anthony Brandon; Joost Hoozemans; Jeroen van Straten; Stephan Wong

In today’s computing environments, the concurrent execution of multiple applications/threads is common and multi-cores are very well-suited to handle such workloads. However, they suffer from the fact that any mismatch between the application’s inherent instruction-level parallelism (ILP) and the core’s parallelism leads to unused resources or loss in performance. An accepted solution is to include several types of cores and match them dynamically depending on the performance needs of the application. This approach becomes less efficient when the number of cores does not match the number of parallel threads. Furthermore, the heterogeneity of (fixed) cores cannot be increased indefinitely as it would result in even higher degrees of mismatching and increased movement of instruction and data streams. In this paper, we are proposing a polymorphic processor, based on VLIW architectures, that can adapt its issue-width during runtime. By design, the processor can be perceived as a single wide core (8-issue VLIW) or two medium-wide cores (4-issue) or four small cores (2-issue) that can run high-ILP/low DLP, medium-ILP/medium DLP, and low-ILP/high-DLP applications, respectively. Furthermore, we are executing one single generic binary while performing these reconfigurations. In order to show the effectiveness of our approach, we synthesized different versions of the core to represent fixed heterogeneous cores and compared them to the dynamic implementation of the core. Our experiments show that the dynamically adaptive solution performs on average \(7\%\) faster and uses \(5\%\) less area than a platform which consists of fixed cores with \(1.5\times \) as many datapaths.

reconfigurable computing and fpgas | 2015

A sparse VLIW instruction encoding scheme compatible with generic binaries

Anthony Brandon; Joost Hoozemans; Jeroen van Straten; Arthur Francisco Lorenzon; Anderson Luiz Sartor; Antonio Carlos Schneider Beck; Stephan Wong

Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherent lowpower consumption as the instruction scheduling is performed by the compiler instead by sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized/wasted and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. Therefore, the formation of cores can result in issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptability and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains as the generic binary scheme even slightly further increases the number NOPS. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains its code compatibility characteristic allowing it to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a nondynamic/ static VLIW core that incorporates a similar stop-bit technique for its code compression. We demonstrate, while maintaining code compatibility on top of a flexible dynamic VLIW processor, that the code size can be significantly reduced (up to 80%) resulting in energy savings, and that the performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times as small), which will further help in decreasing energy consumption.

international conference on industrial informatics | 2013

Embedded reconfigurable computing: the ERA approach

Georgios Keramidas; Stephan Wong; Fakhar Anjam; Anthony Brandon; Roel Seedorf; Claudio Scordino; Luigi Carro; Debora Matos; Roberto Giorgi; Stamatis Kavvadias; Sally A. McKee; Bhavishya Goel; Vasileios Spiliopoulos

The growing complexity and diversity of embedded systems-combined with continuing demands for higher performance and lower power consumption-places increasing pressure on embedded platforms designers. The target of the ERA project is to offer a holistic, multi-dimensional methodology to address these problems in a unified framework exploiting the inter- and intra-synergism between the reconfigurable hardware (core, memory, and network resources), the reconfigurable software (compiler and tools), and the run-time system. Starting from the hardware level, we design our platform via a structured approach that allows integration of reconfigurable computing elements, network fabrics, and memory hierarchy components. These hardware elements can adapt their composition, organization, and even instruction-set architectures to exploit tradeoffs in performance and power. Appropriate hardware resources can be selected both statically at design time and dynamically at run time. Hardware details are exposed to our custom operating system, our custom runtime system, and our adaptive compiler, and are even visible all the way up to the application level. The design philosophy followed in the ERA project proved efficient enough not only to enable a better choice of power/performance trade-offs but also to support fast platform prototyping of high-efficiency embedded system designs. In this paper, we present a brief overview of the design approach, the major outcomes, and the lessons learned in the ERA project.

applied reconfigurable computing | 2017

Improving the Performance of Adaptive Cache in Reconfigurable VLIW Processor

Sensen Hu; Anthony Brandon; Qi Guo; Yizhuo Wang

In this paper, we study the impact of cache reconfiguration on the cache misses when the issue-width of a VLIW processor is changed. We clearly note here that our investigation pertains the local temporal effects of the cache resizing and how we counteract the negative impact of cache misses in such resizing instances. We propose a novel reconfigurable d-cache framework that can dynamically adapt its least recently used (LRU) replacement policy without much hardware overhead. We demonstrate that using our adaptive d-cache, it ensures a smooth cache performance from one cache size to the other. This approach is orthogonal to future research in cache resizing for such architectures that take into account energy consumption and performance of the overall application.

international conference on industrial informatics | 2011

Early results from ERA — Embedded Reconfigurable Architectures

Stephan Wong; Anthony Brandon; Fakhar Anjam; Roel Seedorf; Roberto Giorgi; Zhibin Yu; Nikola Puzovic; Sally A. McKee; Magnus Själander; Luigi Carro; Georgios Keramidas

Explore More