Stephan Wong
Delft University of Technology
Publication
Featured research published by Stephan Wong.
IEEE Transactions on Computers | 2004
Stamatis Vassiliadis; Stephan Wong; Georgi Gaydadjiev; Koen Bertels; Georgi Kuzmanov; Elena Moscu Panainte
In this paper, we present a polymorphic processor paradigm incorporating both general-purpose and custom computing processing. The proposal incorporates an arbitrary number of programmable units, exposes the hardware to the programmers/designers, and allows them to modify and extend the processor functionality at will. To achieve the previously stated attributes, we present a new programming paradigm, a new instruction set architecture, a microcode-based microarchitecture, and a compiler methodology. The programming paradigm, in contrast with the conventional programming paradigms, allows general-purpose conventional code and hardware descriptions to coexist in a program. In our proposal, for a given instruction set architecture, a one-time instruction set extension of eight instructions is sufficient to implement the reconfigurable functionality of the processor. We propose a microarchitecture based on reconfigurable hardware emulation to allow high-speed reconfiguration and execution. To prove the viability of the proposal, we experimented with the MPEG-2 encoder and decoder and a Xilinx Virtex II Pro FPGA. We have implemented three operations: SAD, DCT, and IDCT. The overall attainable application speedup for the MPEG-2 encoder and decoder is between 2.64 and 3.18 and between 1.56 and 1.94, respectively, representing between 93 percent and 98 percent of the theoretically obtainable speedups.
field-programmable logic and applications | 2005
Ioannis Sourdis; Dionisios N. Pnevmatikatos; Stephan Wong; Stamatis Vassiliadis
In this paper, we consider scanning and analyzing packets in order to detect hazardous contents using pattern matching. We introduce a hardware perfect-hashing technique to access the memory that contains the matching patterns. A subsequent simple comparison between incoming data and memory output determines the match. We implement our scheme in reconfigurable hardware and show that we can achieve a throughput between 1.7 and 5.7 Gbps requiring only a few tens of FPGA memory blocks and 0.30 to 0.57 logic cells per matching character. We also show that our designs achieve at least 30% better efficiency compared to previous work, measured in throughput per area required per matching character.
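The matching idea in the abstract above (a collision-free hash selects one memory entry, then a single comparison confirms the match) can be illustrated in software. The following is a minimal sketch under stated assumptions, not the hardware technique from the paper: it assumes all patterns have the same length, and it finds a collision-free hash by brute-force search over a seed for a salted SHA-256. The names `build_perfect_hash`, `scan`, and `_slot` are hypothetical.

```python
import hashlib


def _slot(data, seed, size):
    """Map a byte string to a table slot using a seeded hash (illustrative choice)."""
    digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
    return int.from_bytes(digest[:8], "big") % size


def build_perfect_hash(patterns, max_seed=100_000):
    """Search for a seed that sends every pattern to its own slot."""
    patterns = list(dict.fromkeys(patterns))  # drop duplicates, keep order
    size = 2 * len(patterns)                  # spare slots make a seed easy to find
    for seed in range(max_seed):
        table = [None] * size
        for pattern in patterns:
            slot = _slot(pattern, seed, size)
            if table[slot] is not None:
                break                         # collision: try the next seed
            table[slot] = pattern
        else:
            return seed, table                # no collisions for this seed
    raise RuntimeError("no collision-free seed found")


def scan(payload, seed, table, length):
    """Report (offset, pattern) for every window that equals a stored pattern."""
    hits = []
    for offset in range(len(payload) - length + 1):
        window = payload[offset:offset + length]
        # One memory access selected by the hash, then one exact comparison.
        if table[_slot(window, seed, len(table))] == window:
            hits.append((offset, window))
    return hits


if __name__ == "__main__":
    seed, table = build_perfect_hash([b"EVIL", b"WORM", b"HACK"])
    print(scan(b"xxWORMyyHACK", seed, table, 4))  # [(2, b'WORM'), (8, b'HACK')]
```

Because the final comparison is exact, a window that hashes to an occupied slot without being a stored pattern is rejected, so the sketch reports no false positives.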
Proceedings. 28th Euromicro Conference | 2002
Stephan Wong; Stamatis Vassiliadis; Sorin Cotofana
In this paper we propose a new hardware unit that performs a 16×1 SAD operation. The hardware unit is intended to augment a general-purpose core. Further we show that the 16×1 SAD implementation used can be easily extended to perform the 16×16 SAD operation, which is commonly used in many multimedia standards, including MPEG-1 and MPEG-2. We have chosen to implement the 16×1 SAD operation in field-programmable gate arrays (FPGA), because it provides increased flexibility, sufficient performance, and faster design times. We performed simulations to validate the functionality of the 16×1 SAD implementation using the MAX+plus II (version 9.23 BASELINE) software from Altera and synthesis using the FPGA Express (version 3.4) software from Synopsys. Targeting Altera's FLEX20KE family, synthesis of our 16×1 SAD unit produced the following results for area and clock frequency: 1699 look-up tables (LUT) and 197 MHz, respectively.
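For reference, the sum of absolute differences (SAD) named in the abstract above can be written in a few lines of software. This is a minimal sketch, assuming pixel values held in plain Python lists; the function names `sad_16x1` and `sad_16x16` are illustrative and do not come from the paper. It only shows how a 16×16 SAD can be composed from sixteen 16×1 row SADs, not how the hardware unit is organized.

```python
def sad_16x1(row_a, row_b):
    """Sum of absolute differences over one row of 16 pixels."""
    if len(row_a) != 16 or len(row_b) != 16:
        raise ValueError("each row must hold exactly 16 pixels")
    return sum(abs(a - b) for a, b in zip(row_a, row_b))


def sad_16x16(block_a, block_b):
    """16x16 SAD built by accumulating sixteen 16x1 row SADs."""
    if len(block_a) != 16 or len(block_b) != 16:
        raise ValueError("each block must hold exactly 16 rows")
    return sum(sad_16x1(ra, rb) for ra, rb in zip(block_a, block_b))


if __name__ == "__main__":
    current = [[10] * 16 for _ in range(16)]
    reference = [[7] * 16 for _ in range(16)]
    print(sad_16x1(current[0], reference[0]))  # 16 pixels * |10 - 7| = 48
    print(sad_16x16(current, reference))       # 16 rows * 48 = 768
```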
field-programmable logic and applications | 2001
Stamatis Vassiliadis; Stephan Wong; Sorin Cotofana
In this paper, we introduce the MOLEN ρµ-coded processor which comprises hardwired and microcoded reconfigurable units. At the expense of three new instructions, the proposed mechanisms allow instructions, entire pieces of code, or their combination to execute in a reconfigurable manner. The reconfiguration of the hardware and the execution on the reconfigured hardware are performed by ρ-microcode (an extension of the classical microcode to allow reconfiguration capabilities). We include fixed and pageable microcode hardware features to extend the flexibility and improve the performance. The scheme allows partial reconfiguration and includes caching mechanisms for nonfrequently used reconfiguration and execution microcode. Using simulations, we establish the performance potential of the proposed processor assuming the JPEG and MPEG-2 benchmarks, the ALTERA APEX20K boards for the implementation, and a hardwired superscalar processor. After implementation, cycle time estimations and normalization, our simulations indicate that the execution cycles of the superscalar machine can be reduced by 30% for the JPEG benchmark and by 32% for the MPEG-2 benchmark using the proposed processor organization.
field-programmable technology | 2008
Stephan Wong; T. van As; G. Brown
This paper presents the architectural design of a reconfigurable and extensible very long instruction word (VLIW) processor. In addition to architectural extensibility, our processor also supports reconfigurable operations. Furthermore, we present an application development framework to optimally exploit the freedom of reconfigurable operations. Because our processor is based on the VEX ISA, we already have a good compiler which is able to deal with ISA extensibility and reconfigurable operations. Our results show that different configurations of our processor lead to considerable cycle count reductions for a selected benchmark application.
international conference on networks | 2007
Mahmood Ahmadi; Stephan Wong
Within packet processing systems, lengthy memory accesses greatly reduce performance. To overcome this limitation, network processors use many different techniques, e.g., multi-level memory hierarchies, special hardware architectures, and hardware threading. In this paper, we introduce a multi-level memory hierarchy and a special hardware cache architecture for counting Bloom filters that is utilized by network processors and packet processing applications such as packet classification and distributed web caching systems. Based on the value of the counters in the counting Bloom filter, a multi-level cache architecture called the cache counting Bloom filter (CCBF) is presented and analyzed. The results show that the proposed cache architecture decreases the number of memory accesses by at least 51.3% when compared to a standard Bloom filter.
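The counting Bloom filter that the abstract above builds on can be sketched as follows. This is a minimal illustration of the plain data structure only; it does not model the multi-level cache architecture (CCBF) proposed in the paper. The class name, the default sizes, and the use of salted SHA-256 to derive the hash indexes are assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose cells are counters, so items can also be removed."""

    def __init__(self, num_counters=1024, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _indexes(self, item):
        # Derive one index per hash function by salting SHA-256 with the
        # hash number (an illustrative choice, not taken from the paper).
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item):
        # Refuse to remove items that look absent, so counters never go negative.
        if item not in self:
            raise KeyError(item)
        for idx in self._indexes(item):
            self.counters[idx] -= 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


if __name__ == "__main__":
    cbf = CountingBloomFilter()
    cbf.add("10.0.0.1")
    print("10.0.0.1" in cbf)  # True
    cbf.remove("10.0.0.1")
    print("10.0.0.1" in cbf)  # False
```

Each membership query touches one counter per hash function, which is the memory-access cost that a cache in front of the counters would aim to reduce.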
design, automation, and test in europe | 2010
Stephan Wong; Fakhar Anjam; Faisal Nadeem
This paper presents dynamic reconfiguration of a register file of a Very Long Instruction Word (VLIW) processor implemented on an FPGA. We developed an open-source reconfigurable and parameterizable VLIW processor core based on the VLIW Example (VEX) Instruction Set Architecture (ISA), capable of supporting reconfigurable operations as well. The VEX architecture supports up to 64 multiported shared registers in a register file for a single cluster VLIW processor. This register file accounts for a considerable amount of area in terms of slices when the VLIW processor is implemented on an FPGA. Our processor design supports dynamic partial reconfiguration allowing the creation of dedicated register file sizes for different applications. Therefore, valuable area can be freed and utilized for other implementations running on the same FPGA when the full register file size is not needed. Our design requires 924 slices on a Xilinx Virtex-II Pro device for dynamically placing a chunk of 8 registers, and places registers in multiples of 8 registers to simplify the design. Consequently, when 64 registers are not needed at all times, the area utilization can be reduced during run-time.
IEEE Micro | 2003
Stamatis Vassiliadis; Stephan Wong; Sorin Cotofana
Microcode is an important innovation in computer engineering. The authors discuss the evolution of microcode from its introduction to its decline and to its likely resurgence in custom computing machines. Furthermore, they present a microcoded machine augmented with field-programmable gate arrays (FPGAs) and provide experimental evidence that it can substantially increase the performance of some media benchmarks.
design, automation, and test in europe | 2013
Anthony Brandon; Stephan Wong
Different applications exhibit different behavior that cannot be optimally captured by a fixed organization of a VLIW processor. However, through exploitation of reconfigurable hardware we can optimize the organization when running different applications. In this paper, we propose a novel way to execute the same binary on different issue-width processors without many hardware modifications. We propose to change the compiler and assembler to ensure correct results. Our experiments show an average slowdown of around 1.3× when compared to binaries compiled for specific issue-widths. This can be further improved to less than 1.09× on average with additional compiler optimizations. Even though the flexibility comes at a price, it can be exploited for many other purposes, such as dynamic performance/energy trade-off and energy-saving mechanisms, dynamic hardware sharing, and dynamic code insertion for hardware fault detection mechanisms.
design, automation, and test in europe | 2011
Luca Sterpone; Luigi Carro; Debora Matos; Stephan Wong; F. Fakhar
Power consumption is dramatically increasing for Static Random Access Memory Field Programmable Gate Arrays (SRAM-FPGAs); therefore, lower power FPGA circuitry and new CAD tools are needed. Clock-gating methodologies have been applied in low power FPGA designs with only minor success in reducing the total average power consumption. In this paper, we developed a new structural clock-gating technique based on internal partial reconfiguration and topological modifications. The solution is based on the dynamic partial reconfiguration of the configuration memory frames related to the clock routing resources. For a set of design cases, figures of static and dynamic power consumption were obtained. The analyses have been performed on a synchronous FIFO and on an r-VEX VLIW processor. The experimental results show that the efficiency in the total average power consumption ranges from about 28% to 39% with respect to standard clock-gating approaches. Besides, the proposed method is not intrusive, and presents a very limited cost in terms of area overhead.