Mihai Sima
University of Victoria
Publication
Featured research published by Mihai Sima.
field programmable logic and applications | 2002
Mihai Sima; Stamatis Vassiliadis; Sorin Cotofana; Jos van Eijndhoven; Kees A. Vissers
The ability to provide a hardware platform that can be customized on a per-application basis under software control has established Reconfigurable Computing (RC) as a new computing paradigm. A machine employing the RC paradigm is referred to as a Field-Programmable Custom Computing Machine (FCCM). So far, FCCMs have been classified according to implementation criteria. Since the previous classifications do not reveal the entire meaning of the RC paradigm, we propose to classify FCCMs according to architectural criteria. To analyze the phenomena inside FCCMs, we introduce a formalism based on microcode, in which any custom operation performed by a field-programmed computing facility is executed as a microprogram with two basic stages: SET CONFIGURATION and EXECUTE CUSTOM OPERATION. Based on the SET/EXECUTE formalism, we then propose an architecture-based taxonomy of FCCMs.
field-programmable custom computing machines | 2001
Mihai Sima; Sorin Cotofana; J.T.J. van Eijndhoven; Stamatis Vassiliadis; Kornelis Antonius Vissers
This paper presents an experiment which aims to assess the potential impact on performance yielded by augmenting a TriMedia/CPU64 processor with a reconfigurable core. We first propose the skeleton of an extension of the TriMedia/CPU64 architecture, which consists of a Reconfigurable Functional Unit (RFU) and the associated instructions. Then, we address the computation of the 8×8 IDCT on such an extended TriMedia and propose a scheme to implement the 1-D IDCT operation on the RFU. When implemented on an ACEX EP1K100 FPGA from Altera, the proposed 1-D IDCT exhibits a latency of 16 and a recovery of 2 TriMedia (200 MHz) cycles, and occupies 42% of the device. By configuring the 1-D IDCT computing facility on the RFU at application load-time, a 2-D IDCT including all overheads can be computed with a throughput of 1/32 IDCT/cycle. This is an improvement of more than 40% over the standard TriMedia/CPU64.
reconfigurable computing and fpgas | 2011
Dongdong Chen; Mihai Sima
This paper presents a parallel architecture of a QR decomposition systolic array based on the Givens rotations algorithm on FPGA. The proposed architecture adopts a direct mapping onto 21 fixed-point CORDIC-based processing units that can compute the QR decomposition of a 4×4 real matrix. In order to achieve a comprehensive resource and performance evaluation, the computational error, the resources utilized, and the speed achieved on a Virtex5 XC5VTX150T FPGA are evaluated for different precisions of the intermediate word lengths. The evaluation results show that the proposed systolic array 1) satisfies 99.9% correct 4×4 QR decomposition for the 2^-13 accuracy requirement when the word length of the data path is larger than 25 bits, and 2) occupies about 2,810 (13%) slices and achieves about 2.06 M updates/sec when running at the maximum frequency of 111 MHz.
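As a software illustration of the Givens rotations algorithm named in the abstract above, the following is a minimal floating-point sketch. It is not the systolic array or the fixed-point CORDIC datapath described in the paper; the function name `givens_qr` and the row-by-row elimination order are assumptions made for illustration only.

```python
import math

def givens_qr(a):
    """Return (q, r) for a square matrix a, with q orthogonal,
    r upper triangular, and q * r equal to a (up to rounding)."""
    n = len(a)
    r = [row[:] for row in a]  # working copy that becomes R
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):                  # column being cleared
        for i in range(n - 1, j, -1):   # zero r[i][j] using row i-1
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue                # nothing to rotate
            c, s = x / h, y / h
            for k in range(n):
                # Apply the rotation G to rows i-1 and i of R.
                r[i - 1][k], r[i][k] = (c * r[i - 1][k] + s * r[i][k],
                                        -s * r[i - 1][k] + c * r[i][k])
                # Accumulate Q <- Q * G^T on columns i-1 and i.
                q[k][i - 1], q[k][i] = (c * q[k][i - 1] + s * q[k][i],
                                        -s * q[k][i - 1] + c * q[k][i])
    return q, r
```

Each rotation angle is chosen so that one sub-diagonal entry becomes zero; in a hardware setting, both the angle computation and the row rotation are the kind of operation CORDIC units can perform without multipliers.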
international conference on computer design | 2002
Jari Nikara; Stamatis Vassiliadis; Jarmo Takala; Mihai Sima; Petri Liuha
In this paper a parallel Variable-Length Decoding (VLD) scheme is introduced. The scheme is capable of decoding all the codewords in an N-bit buffer whose accumulated codelength is at most N. The proposed method partially breaks the recursive dependency related to the MPEG-2 VLD. All possible codewords in the buffer are detected in parallel and the sum of the codelengths is provided to the external shifter aligning the variable-length coded input stream for a new decoding cycle. Two length detection mechanisms are proposed: the first approach determines the length in a parallel/serial fashion and the second uses a new device denoted as MultiplexedAdd. In order to prove feasibility and determine the limiting factors of our proposal, the parallel/serial codeword detector with 32-bit input has been described in behavioral non-optimized VHDL and mapped onto Altera's ACEX EP1K100 FPGA. The implemented prototype exhibits a latency of 110 ns and uses 32% of the logic cells of the device. When applied to MPEG-2 standard benchmark scenes, on average 3.5 symbols are decoded per cycle.
international conference on computer design | 2001
Mihai Sima; Sorin Cotofana; S. Vassiliadis; J.T.J. van Eijndhoven; Kornelis Antonius Vissers
This paper describes an experiment which aims to reveal the potential impact on performance yielded by augmenting a TriMedia/CPU64 processor with a multiple-context FPGA core. We first propose an extension of the TriMedia/CPU64 architecture, which consists of a reconfigurable functional unit and its associated instructions. Then, we address the decoding of variable-length codes on such an extended TriMedia and describe the architecture and FPGA implementation of a variable-length decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA, the proposed VLD exhibits a latency of 7 cycles. Preliminary results indicate that by configuring each of the VLD and 1-D IDCT (which is described elsewhere) facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform macroblock parsing followed by pel reconstruction with an improvement of 20-25% over the standard TriMedia.
pacific rim conference on communications, computers and signal processing | 2005
Michael McGuire; Mihai Sima
This paper addresses the design of low-order Kalman filters to estimate radio channels with Rayleigh fading. Rayleigh fading cannot be perfectly modelled with any finite-order auto-regressive (AR) process. Previously, only first- and second-order Kalman filters were used for channel estimation since higher-order Kalman filters were found not to improve accuracy significantly. This is due to mismatches between the statistics of the AR models of the Kalman filters and the true Rayleigh fading. In this paper, the coefficients of the AR models for the Kalman filter are calculated by solving for the minimum square error solution of an over-determined linear system. The AR models generated have statistics closely matching the Rayleigh fading process. The Kalman filter using these AR models can accurately estimate the Rayleigh fading process. The accuracy of the new Kalman filters is demonstrated in the tracking of simulated Rayleigh fading processes of different bandwidths.
wireless and mobile computing, networking and communications | 2007
Mihai Sima; Murugappan Senthilvelan; Daniel Iancu; C. John Glossner; Mayan Moudgill; Michael J. Schulte
We present a software approach for MIMO-OFDM wireless communication technology. We first show that complex matrix operations like singular-value decomposition (SVD), diagonalization, triangularization, etc., can be executed efficiently in software using a combination of CORDIC and unitary rotation algorithms in a multithreaded SIMD processor. We then investigate and analyze the transformation of a MIMO-OFDM channel into multiple independent SISO-OFDM channels by means of the SVD. The algorithms are implemented on the Sandblaster processor. The numerical results indicate that the CORDIC-augmented processor reduces the computing time by more than 47% over the standard Sandblaster processor when converting a 4-by-4 MIMO-OFDM channel into four SISO-OFDM channels. The technique is applicable to emerging wireless communication protocols, such as WiMAX and Wi-Fi, and provides the flexibility required to adapt to continually changing and evolving standards without the need for expensive hardware redesigns and respins.
pacific rim conference on communications, computers and signal processing | 2011
Dian-Marie Ross; Scott Miller; Mihai Sima; Curran Crawford
The CORDIC algorithm performs vector rotations in a number of coordinate systems, providing cost-effective solutions for many application domains today, including intelligent guidance systems, wireless communications, and digital signal processing. CORDIC is sequential; thus hardware solutions are needed to improve the computation speed. We analyze three implementation schemes for CORDIC on FPGAs in terms of area, delay, and scalability. Extensive simulations have been carried out and a complete set of numerical figures is provided. We conclude the paper by proposing a number of design rules for implementing CORDIC on commercial reconfigurable devices based on use-case and speed and/or area requirements.
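To make the sequential nature of CORDIC mentioned above concrete, here is a minimal floating-point sketch of the rotation mode in circular coordinates. It does not correspond to any of the three FPGA implementation schemes analyzed in the paper; the function name `cordic_rotate` and the default of 24 iterations are assumptions for illustration.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate the vector (x, y) by `angle` radians with circular-mode
    CORDIC. Converges for |angle| up to roughly 1.74 rad."""
    # Elementary rotation angles atan(2^-i), one per iteration.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; collect the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual
        # In hardware the 2^-i factors are plain right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, y / gain  # compensate the accumulated gain
```

Every iteration depends on the `x`, `y`, and `z` values produced by the previous one, which is the data dependency that makes the algorithm sequential. Calling `cordic_rotate(1.0, 0.0, t)` approximates `(cos t, sin t)` for `t` inside the convergence range.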
field-programmable logic and applications | 2011
Fatemeh Eslami; Mihai Sima
The FPGA interconnection network is implemented using nMOS pass transistor multiplexers. Since the threshold voltage drop across an nMOS device degrades the high logic value, the pMOS transistors of the downstream buffers do not turn fully off, so this approach suffers from static power consumption due to short-circuit currents and from reduced noise margins. The standard pMOS transistor pull-up in the active feedback of an inverter reduces the static power consumption but degrades the switching time and/or dynamic power consumption. We propose to use capacitive boosting to increase the gate voltage of the pass transistors, thus driving the multiplexer output to the full high voltage level at a faster rate than the pMOS pull-up alone can. This way, the signal transitions are accelerated and the short-circuit currents of the downstream buffers are cut off. The simulations carried out with Cadence indicate a reduction of at least 10% in propagation delay for the proposed circuit versus the standard one across 180 nm, 130 nm, 90 nm, and 65 nm technologies.
digital systems design | 2007
Scott Miller; Mihai Sima; Michael McGuire
Programmable routing and logic in field-programmable gate arrays are implemented using nMOS pass transistors. Since the threshold voltage drop across an nMOS device degrades the high logic value, causing the pMOS transistor of the downstream buffer to not turn fully off, this approach suffers from static power consumption and reduced noise margins. The standard pMOS transistor pull-up in an active feedback of an inverter reduces the static power consumption, but degrades the switching time and/or active power consumption. We propose a circuit technique to build level-restoring buffers, which improves the propagation delay or active power consumption at a tiny area penalty. Our main idea is to replicate the nMOS element of the downstream buffer, where each replica is driven by a signal that originates from earlier stages of the nMOS-tree multiplexer. This way, when passing high logic values, signals from earlier stages directly drive the downstream buffer improving the delay or the slope of the transition edge. The passing of low logic values is still performed in the original way by the nMOS tree and the pMOS element of the downstream buffer. The simulations indicate an average improvement of the composite metric area-delay-energy product of 25% versus the standard approach across 180 nm, 130 nm, and 90 nm technologies.