Oana Boncalo

Information Technology University

Publication

Featured research published by Oana Boncalo.

IEEE Transactions on Circuits and Systems II: Express Briefs | 2010

Design Issues and Implementations for Floating-Point Divide–Add Fused

Alexandru Amaricai; Mircea Vladutiu; Oana Boncalo

This brief presents a dedicated unit for the combined operation of floating-point (FP) division followed by addition/subtraction: the divide-add fused (DAF) operation. The goal of this unit is to increase the performance and accuracy of applications where this combined operation is frequent, such as the interval Newton's method or polynomial approximation. The proposed DAF unit has an architecture similar to that of FP multiply-accumulate units. The main difference is the divider, which is implemented using digit-recurrence algorithms. An important design tradeoff in DAF is the number of required quotient bits. We present the impact of the adopted number of quotient bits on accuracy, cost, and performance. Consequently, two implementations are proposed: one pro-accuracy and one pro-performance. We show that the proposed implementations are more accurate than the solution based on two distinct units: an FP divider and an FP adder. The implementation targeting lower latency presents the best cost-performance tradeoff.
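
The accuracy argument for fusing the two operations can be illustrated in software. The sketch below is not the DAF unit from the brief: it compares an idealized fused operation (exact quotient and sum, rounded once) against the two-unit approach (quotient rounded, then sum rounded again). The function names are hypothetical, and the idealized version assumes all quotient bits are available, which the hardware deliberately trades off.

```python
from fractions import Fraction


def divide_then_add(a: float, b: float, c: float) -> float:
    """Two separate units: the quotient is rounded, then the sum is rounded again."""
    return a / b + c


def divide_add_fused_ideal(a: float, b: float, c: float) -> float:
    """Idealized fused operation (assumption): exact a/b + c, rounded once."""
    exact = Fraction(a) / Fraction(b) + Fraction(c)
    # Converting a Fraction to float rounds to the nearest double.
    return float(exact)


def abs_error(result: float, a: float, b: float, c: float) -> Fraction:
    """Exact absolute error of `result` with respect to the true a/b + c."""
    exact = Fraction(a) / Fraction(b) + Fraction(c)
    return abs(Fraction(result) - exact)


if __name__ == "__main__":
    for a, b, c in [(1.0, 3.0, 1e-3), (2.0, 7.0, -0.25), (1.0, 10.0, 0.2)]:
        two_step = abs_error(divide_then_add(a, b, c), a, b, c)
        fused = abs_error(divide_add_fused_ideal(a, b, c), a, b, c)
        print(a, b, c, float(two_step), float(fused))
```

Because the idealized fused result is the double nearest to the exact value, its error can never exceed that of the two-step computation for the same operands.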

International New Circuits and Systems Conference | 2015

FPGA design of high throughput LDPC decoder based on imprecise Offset Min-Sum decoding

Truong Nguyen-Ly; Khoa Le; Fakhreddine Ghaffari; Alexandru Amaricai; Oana Boncalo; Valentin Savin; David Declercq

This paper first proposes two new LDPC decoding algorithms that may be seen as imprecise versions of Offset Min-Sum (OMS) decoding: the Partially OMS, which performs the offset correction only partially, and the Imprecise Partially OMS, which introduces a further level of impreciseness in the check-node processing unit. We show that they allow a significant reduction in memory (25% with respect to the baseline) and interconnect, and we further propose a cost-efficient check-node unit architecture, yielding a cost reduction of 56% with respect to the baseline. We then implement FPGA-based layered decoder architectures using the proposed algorithms as decoding kernels, for a (3, 6)-regular Quasi-Cyclic LDPC code of length 1296 bits, and evaluate them in terms of cost, throughput and decoding performance. Implementation results on a Xilinx Virtex 6 FPGA device show that they can achieve a throughput between 1.95 and 2.41 Gbps for 20 decoding iterations (48% to 83% increase with respect to OMS), while providing decoding performance close to that of the OMS decoder, despite the impreciseness introduced in the processing units.
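
For orientation, the baseline that both proposed algorithms approximate is the standard OMS check-node update. The sketch below shows that baseline only; it does not reproduce the Partially OMS or Imprecise Partially OMS variants, whose details are not given in the abstract. The function name and the offset value of 0.5 are assumptions chosen for illustration.

```python
def oms_check_node(llrs, offset=0.5):
    """Standard Offset Min-Sum check-node update (baseline, not the paper's variants).

    `llrs` is a list of incoming variable-to-check messages (at least two).
    For each edge, the outgoing message combines all *other* incoming messages:
    the sign is the product of their signs, and the magnitude is their minimum
    absolute value reduced by `offset` and clamped at zero.
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1
        for value in others:
            if value < 0:
                sign = -sign
        magnitude = max(min(abs(value) for value in others) - offset, 0.0)
        out.append(sign * magnitude)
    return out


if __name__ == "__main__":
    # A check node of a (3, 6)-regular code has six incoming messages.
    print(oms_check_node([4.0, 1.0, -2.0, 3.0, -5.0, 2.5]))
```

Hardware implementations usually avoid the per-edge loop by tracking only the two smallest magnitudes and the overall sign; the loop form here is kept for readability.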

Field Programmable Logic and Applications | 2014

An FPGA sliding window-based architecture Harris corner detector

Alexandru Amaricai; Constantina-Elena Gavriliu; Oana Boncalo

This paper proposes an FPGA implementation of the Harris corner algorithm based on a sliding processing window. The Harris corner detector is one of the most frequently used pre-processing methods for a wide variety of image processing algorithms, such as feature detection, motion tracking, image registration, etc. It relies on a series of sequential steps, each processing the image output by the previous step. The purpose of the sliding window is to avoid storing intermediate results of the processing stages in the FPGA's external memory, and to avoid using large line buffers, which are typically implemented with BRAM blocks. As a result, the entire processing pipeline benefits from data locality. Implementation results for Virtex5 and Spartan-6 devices show that the proposed solution has very good performance (more than 130 fps for 1280×720 images on Xilinx Spartan-6) with significantly less BRAM usage than other approaches.
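
The sequential steps mentioned above (gradients, windowed sums of gradient products, corner response) can be sketched as a small software reference. This is not the paper's sliding-window hardware pipeline: it is a plain-Python illustration that assumes central-difference gradients, a uniform 3×3 summation window instead of a Gaussian, and the conventional constant k = 0.04. The function name is hypothetical.

```python
def harris_response(img, k=0.04):
    """Harris corner response for a 2D list of grey levels.

    Border pixels are left at zero. Positive values suggest corners,
    negative values edges, and values near zero flat regions.
    """
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]

    # Step 1: image gradients by central differences.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2

    # Steps 2 and 3: sum gradient products over a 3x3 window, then
    # compute R = det(M) - k * trace(M)^2 for the structure tensor M.
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            resp[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return resp


if __name__ == "__main__":
    # 8x8 test image with a bright square in the bottom-right quadrant.
    image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(8)] for y in range(8)]
    r = harris_response(image)
    print(r[4][4], r[6][4], r[1][1])  # corner, edge, flat region
```

Each output pixel depends only on a small neighbourhood of the previous stage, which is the data locality a sliding-window architecture can exploit.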

Digital Systems Design | 2014

Probabilistic Gate Level Fault Modeling for Near and Sub-Threshold CMOS Circuits

Alexandru Amaricai; Sergiu Nimara; Oana Boncalo; Jiaoyan Chen; Emanuel M. Popovici

This paper presents gate level delay dependent probabilistic fault models for CMOS circuits operating at sub-threshold and near-threshold supply voltages. A bottom-up approach has been employed: SPICE simulations have been used to derive higher level error models implemented using Verilog HDL. HSPICE Monte-Carlo simulations show that the delay dependent probabilistic nature of these faults is due to the process-voltage-temperature (PVT) variations which affect circuits operating at very low supply voltages. For gate level error analysis, mutant based simulated fault injection (SFI) techniques have been employed for reliability analysis of combinational netlists. Four types of gate level fault models, with different accuracies, are proposed. Our findings show that the proposed SFI method presents a 2X-5X simulation time overhead compared to the simulation of the gold circuit. With respect to SPICE analysis, the proposed method requires three orders of magnitude less simulation time.
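
The idea of mutant based fault injection can be sketched in a few lines. The paper implements its models in Verilog HDL and makes the fault probability delay dependent; the Python sketch below is a simplification that assumes a fixed output-flip probability per gate, and all names in it are hypothetical.

```python
import random


def nand(a, b):
    """Fault-free (gold) two-input NAND gate."""
    return 1 - (a & b)


def make_mutant(gate, p_fault, rng):
    """Wrap a gate so its output is inverted with probability p_fault."""
    def mutant(a, b):
        out = gate(a, b)
        return out ^ 1 if rng.random() < p_fault else out
    return mutant


def xor_netlist(a, b, gate):
    """XOR built from four NAND gates; `gate` is the gold gate or a mutant."""
    n1 = gate(a, b)
    return gate(gate(a, n1), gate(b, n1))


def error_rate(p_fault, trials=2000, seed=0):
    """Fraction of random input vectors whose output differs from the gold circuit."""
    rng = random.Random(seed)
    mutant = make_mutant(nand, p_fault, rng)
    errors = 0
    for _ in range(trials):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        if xor_netlist(a, b, mutant) != xor_netlist(a, b, nand):
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05):
        print(p, error_rate(p))
```

Replacing each gate by a mutant keeps the netlist structure unchanged, which is why this style of injection costs only a modest slowdown relative to simulating the gold circuit.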

Design and Diagnostics of Electronic Circuits and Systems | 2007

Design of Addition and Multiplication Units for High Performance Interval Arithmetic Processor

Alexandru Amaricai; Mircea Vladutiu; Lucian Prodan; Mihai Udrescu; Oana Boncalo

This paper proposes a new approach for the optimization process of the interval addition and multiplication floating point units. For the interval addition/subtraction, an adder exploiting the parallelism of the double path adder structure is used. The two floating point additions needed are performed simultaneously on different data paths. Therefore, the performance of the proposed adder can be the same as that of two individual floating point adders, but with a much reduced cost overhead. Regarding the interval multiplication, a multiplier architecture was designed, in order to be suitable for pipelined structures. It consists of a floating point multiplier which computes two results for the same operation (rounded differently), and of two floating point comparators. In terms of performance, the proposed multiplier unit presents half of the performance of a conventional floating point multiplier. This is not a drawback, if we consider the fact that interval multiplication requires four floating point operations and six comparisons. This paper shows that interval arithmetic can be efficiently implemented in terms of performance and cost.
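
The operation counts quoted above follow from the textbook definitions of interval addition and multiplication, sketched below. This is a software illustration only, not the proposed hardware: intervals are hypothetical `(lo, hi)` tuples, and because Python does not expose directed rounding modes, each bound is pushed outward by one unit in the last place with `math.nextafter` (Python 3.9+) as a conservative stand-in.

```python
import math


def interval_add(x, y):
    """[x] + [y]: two floating-point additions, one per bound."""
    lo = x[0] + y[0]
    hi = x[1] + y[1]
    # Outward widening stands in for round-down / round-up (assumption).
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))


def interval_mul(x, y):
    """[x] * [y]: four products, then min and max over the four candidates.

    Finding the min and the max of four values takes three comparisons
    each, i.e. six comparisons in total.
    """
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    lo = min(products)
    hi = max(products)
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))


if __name__ == "__main__":
    print(interval_add((1.0, 2.0), (0.1, 0.2)))
    print(interval_mul((-1.0, 2.0), (3.0, 4.0)))
```

A unit with true directed rounding would produce tighter bounds than this widening, since it rounds each bound only as far as needed.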

Annual Simulation Symposium | 2007

Using Simulated Fault Injection for Fault Tolerance Assessment of Quantum Circuits

Oana Boncalo; Mihai Udrescu; Lucian Prodan; Mircea Vladutiu; Alexandru Amaricai

This paper addresses the problem of evaluating the fault tolerance algorithms and methodologies (FTAMs) designed for quantum systems, by adopting the simulated fault injection methodology from classical computation. Due to their wide spectrum of applications (including quantum circuit simulation) and hierarchical features, HDLs were employed for performing fault injection, as prescribed by the guidelines of the QUERIST project. At the same time, the injection techniques taken from classical circuit simulation had to be adapted to the requirements of quantum computation, including the specific quantum error models. The experimental simulated fault injection campaigns are thoroughly described along with the experimental results, which confirm the analytical expectations.

Field Programmable Logic and Applications | 2014

Cost-efficient FPGA layered LDPC decoder with serial AP-LLR processing

Oana Boncalo; Alexandru Amaricai; Andrei Hera; Valentin Savin

This paper proposes an FPGA based layered architecture for a quasi-cyclic (QC) irregular LDPC decoder. Our approach is based on merging variable and check node processing into a single variable-check node (VCN) unit. Layer message computation is done using a parallel scheme with a number of VCNs equal to the expansion factor of the QC matrix. The proposed architecture is characterized by the serial processing of the a posteriori LLRs by an FPGA specific, high frequency VCN unit implementation using ROM memories. In our approach, data conversions as well as additions and comparators are replaced by look-up tables implemented using distributed RAM. In addition, other techniques, such as efficient packing of LLR messages, check-node message compression, and the configurable port width of the FPGA's BRAM, are used to reduce BRAM block utilization. Throughput is increased by techniques such as pipelining, parallel processing of multiple VCNs, and a relatively high working frequency. Implementation results for the WiMAX (1152, 2304) QC irregular LDPC code indicate that the proposed architecture uses up to 3x fewer slices and up to one order of magnitude fewer BRAM blocks than other approaches, while maintaining a throughput of several hundred Mbps (800 Mbps coded bits). We achieved this without sacrificing flexibility; the design can therefore be easily adapted to accommodate different code rates.

Design and Diagnostics of Electronic Circuits and Systems | 2016

FPGA architecture of multi-codeword LDPC decoder with efficient BRAM utilization

Sergiu Nimara; Oana Boncalo; Alexandru Amaricai; Mircea Popa

The implementation of Quasi-Cyclic (QC) Low Density Parity-Check (LDPC) decoders on FPGA devices has attracted great interest both in wireless communication and in error correction for Flash memories. This paper presents an FPGA flooded LDPC decoder which uses multiple codeword processing for efficient memory utilization. It is based on a partially parallel implementation, which relies on memory blocks for message passing between the processing units. We obtain efficient memory utilization by packing multiple messages corresponding to multiple codewords into the same Block RAM word. The increase in throughput is linear with the number of processed codewords. The proposed LDPC decoder can process up to 9 codewords in parallel, for 4-bit message quantization, or up to 12 codewords, for 3-bit message quantization, without introducing significant memory overhead.
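
The packing scheme can be illustrated with simple bit manipulation. Note that 9 × 4 bits and 12 × 3 bits both come to 36 bits; the sketch below assumes a 36-bit memory word and two's-complement messages, neither of which is stated in the abstract, and the function names are hypothetical.

```python
def pack_messages(msgs, bits):
    """Pack one message per codeword into a single integer word.

    Message i occupies bits [i*bits, (i+1)*bits) of the word and is stored
    in two's complement, so each value must fit in `bits` signed bits.
    """
    mask = (1 << bits) - 1
    word = 0
    for i, m in enumerate(msgs):
        word |= (m & mask) << (i * bits)
    return word


def unpack_messages(word, bits, count):
    """Inverse of pack_messages: recover `count` signed messages."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    out = []
    for i in range(count):
        raw = (word >> (i * bits)) & mask
        out.append(raw - (1 << bits) if raw & sign_bit else raw)
    return out


if __name__ == "__main__":
    four_bit = [7, -8, 0, -1, 3, -3, 5, -5, 1]                 # 9 codewords, 4-bit messages
    three_bit = [3, -4, 0, -1, 1, 2, -2, -3, 3, 0, 1, -1]      # 12 codewords, 3-bit messages
    for msgs, bits in ((four_bit, 4), (three_bit, 3)):
        word = pack_messages(msgs, bits)
        print(f"{word:09x}", unpack_messages(word, bits, len(msgs)) == msgs)
```

A single read then delivers the same message position for every codeword at once, which is consistent with throughput scaling linearly with the number of codewords.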

NORCHIP | 2014

Cost effective FPGA probabilistic fault emulation

Oana Boncalo; Alexandru Amaricai; Christian Spagnol; Emanuel M. Popovici

This paper presents a cost effective FPGA fault emulation technique for probabilistic errors. The problem it addresses is how to efficiently inject faults in many locations within a circuit under test. For this purpose, the proposed emulated fault injection (EFI) components are a trade-off between the desire for speed/performance and the inherent physical device limitations of the FPGA. The proposed method also allows the best option for this trade-off to be explored with minimal effort. The solution is flexible enough to handle different EFI architectures, which can be selected with minor code changes. An analysis of the overhead introduced by the EFI components when varying the number of fault locations is provided. Furthermore, this paper presents a case study of two ISCAS benchmark circuits in order to test our methodologies and to highlight the differences between combinational and sequential circuits. It is shown that the number of fault locations can be increased more than 20 times with an overhead similar to that of other state-of-the-art methods reported in the literature.

Conference on Ph.D. Research in Microelectronics and Electronics | 2009

Design of floating point units for interval arithmetic

Alexandru Amaricai; Mircea Vladutiu; Oana Boncalo

In this paper, hardware units for interval addition, multiplication and divide-add fused are presented. For interval addition, a new double path adder architecture is presented, which exploits the parallel structure of the double path adder. For multiplication, the proposed architecture is based on a dual result multiplier (a floating point multiplication unit that produces two differently rounded results for the same pair of operands) and two floating point comparators. The goal of the divide-add fused unit is to increase the performance of the interval Newton's method. An algorithm and an architecture for this operation, inspired by the ones used for multiply-add fused, are proposed.
