Normand Bélanger
École Polytechnique de Montréal
Publication
Featured research published by Normand Bélanger.
IEEE Transactions on Circuits and Systems | 2010
Syed Rafay Hasan; Normand Bélanger; Yvon Savaria; M.O. Ahmad
This paper characterizes the potentially catastrophic effect of crosstalk glitches on representative circuit implementations of two widely used asynchronous protocols. It is demonstrated that crosstalk glitches can induce false events, which can undesirably propagate into asynchronous interface circuits and may cause system failure. Conventionally, a circuit designer can observe glitch propagation (GP) due to aggressor-to-quiet-line crosstalk (AQX) in asynchronous handshake schemes only through circuit-level analysis or simulation. In this paper, circuit-level analysis is first performed to prove that even optimized conventional asynchronous circuits allow crosstalk glitches produced over moderate-length interconnects (1.5 mm) to propagate. This is a precursor to more problematic crosstalk glitch occurrences as technologies scale further. To warn digital designers of GP due to AQX, a novel modeling technique is proposed. This modeling method works at the logic level to facilitate asserting asynchronous interface robustness to crosstalk glitches. The model can accurately identify the possibility of intrinsic (to the asynchronous interface) crosstalk GP in asynchronous circuits at the logic level and, hence, provides a foundation to formally verify such circuits. To our knowledge, this is the first work on modeling GP due to AQX at the logic level for asynchronous circuits.
signal processing systems | 2006
Nicolas Beucher; Normand Bélanger; Yvon Savaria; Guy Bois
This paper describes an application-specific instruction set for a configurable processor to accelerate motion-compensated frame rate conversion (MC-FRC) algorithms based on block motion estimation (BME). The proposed instruction set is generic enough to support many MC-FRC algorithms. The performance gain obtained from this instruction set is described and explained. The new instructions are used to implement two BME algorithms: the full search (FS) algorithm and the one-dimensional full search (ODFS) algorithm. The obtained acceleration factor is about one hundred in the case of the FS algorithm and about forty in the case of ODFS. This paper describes the new instruction set and explains these results by describing more precisely how the acceleration is performed. This paper also shows that the acceleration reached can lead to real-time performance on video streams that have a bandwidth similar to NTSC using currently available processors.
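As a point of reference for the full search (FS) algorithm mentioned in this abstract, the following is a minimal software sketch of exhaustive block matching. It is not the paper's instruction set or implementation: the function names, the sum-of-absolute-differences (SAD) cost, and the frame layout (lists of rows of integer pixel values) are all assumptions chosen for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy).
    SAD is an assumed cost function, not one named in the abstract."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Test every displacement within +/- search_range pixels and return
    (best_dx, best_dy, best_cost) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if bx + dx < 0 or bx + dx + n > width:
                continue
            if by + dy < 0 or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The two nested displacement loops around the two nested pixel loops show why the exhaustive search is costly in plain software, and why the inner SAD computation is a natural candidate for parallel, specialized instructions.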
signal processing systems | 2009
Nicolas Beucher; Normand Bélanger; Yvon Savaria; Guy Bois
This paper describes an application-specific instruction set for a configurable processor to accelerate motion-compensated frame rate conversion (MC-FRC) algorithms based on block motion estimation (BME). The paper shows that the key to achieving very high performance when creating new instructions is to leverage, at the same time, parallel computations, data reuse, and efficient cache use. This is supported by concrete examples that demonstrate how it can be done in the case of the two algorithms considered. The new instructions are used to implement two BME algorithms: one implements the full search (FS) block matching algorithm (BMA), while the other implements the one-dimensional full search (ODFS) BMA. The obtained acceleration factors exceed one hundred for the MC-FRC algorithm embedding the FS algorithm and twenty for the ODFS algorithm. The results show that such global acceleration is the consequence of combining parallel computations, data reuse, and efficient cache use, not of any one of them alone.
IEEE Transactions on Communications | 1994
Normand Bélanger; David Haccoun; Yvon Savaria
The Zigangirov-Jelinek (stack) algorithm allows decoding convolutional codes with a small computational effort compared to the optimum Viterbi algorithm. However, it suffers from a variability of that computational effort that is highly undesirable. The paper describes an architecture that implements a multiple-path-like stack algorithm for reducing this variability. This architecture is organized as a linear structure comprising special processors for extending tree nodes, called extenders, and priority stacks for storing nodes in sorted metric order. The architecture is shown to have a good potential for reducing the computational variability without adding much overhead to the system. Simulations have shown that this architecture effectively reduces computational variability as the number of processors increases, even for a relatively large number of extenders. Simulations run for up to 16 extenders have also shown that using 4 to 16 extenders is a good choice. The architecture is also shown to reduce computational variability like the multiple path algorithm does, while having a better time performance.
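To make the basic stack algorithm concrete, here is a minimal single-path sketch in Python. It is not the multi-extender architecture the abstract describes. The rate-1/2 code with octal generators (7, 5), the binary symmetric channel with crossover probability `p`, and all function names are assumptions for illustration. The "stack" is a priority queue ordered by the Fano metric; each iteration pops the best node and pushes its two extensions.

```python
import heapq
import math


def encode_step(state, u):
    """One step of the assumed rate-1/2 (7, 5) encoder.
    `state` holds the two previous input bits (s1 most recent)."""
    s1, s2 = state
    return (u ^ s1 ^ s2, u ^ s2), (u, s1)


def encode(bits):
    """Encode a list of input bits, starting from the all-zero state."""
    state, out = (0, 0), []
    for u in bits:
        pair, state = encode_step(state, u)
        out.extend(pair)
    return out


def stack_decode(received, n_bits, p=0.05):
    """Basic stack decoding: repeatedly extend the best-metric node."""
    rate = 0.5
    good = math.log2(2 * (1 - p)) - rate  # Fano metric, matching bit
    bad = math.log2(2 * p) - rate         # Fano metric, mismatching bit
    # heapq is a min-heap, so store the negated metric to pop the best node.
    heap = [(0.0, (), (0, 0))]
    while heap:
        neg_metric, path, state = heapq.heappop(heap)
        depth = len(path)
        if depth == n_bits:
            return list(path)
        r = received[2 * depth:2 * depth + 2]
        for u in (0, 1):
            out, nxt = encode_step(state, u)
            branch = sum(good if o == x else bad for o, x in zip(out, r))
            heapq.heappush(heap, (neg_metric - branch, path + (u,), nxt))
    return None
```

The number of loop iterations depends on the noise pattern, which is the computational variability the abstract refers to: a channel error lowers the metric of the correct path, so the decoder may explore several wrong nodes before returning to it.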
Proceedings of the first edition workshop on High performance and programmable networking | 2013
Thibaut Stimpfling; Yvon Savaria; Andre Beliveau; Normand Bélanger; Omar Cherkaoui
Packet classification remains a hot research topic, as it is a fundamental function in telecommunication networks, which are now facing new challenges. Due to the emergence of new standards such as OpenFlow, packet classification algorithms have to be reconsidered to effectively support classification over more than 5 fields. In this paper, we analyze the performance offered by EffiCuts in the context of OpenFlow. We extend the EffiCuts algorithm for the OpenFlow context by proposing three improvements: optimization of the leaf data set size, enhancements to the heuristic used to compute the number of cuts, and utilization of an adaptive grouping factor. These extensions provide gains in many contexts, but they were tailored for the OpenFlow context. When used in this context, it is shown using suitable benchmarks that they reduce the number of memory accesses by a factor of 2 on average, while decreasing the size of the data structure by about 35%.
Integration | 2011
Syed Rafay Hasan; Normand Bélanger; Yvon Savaria; M. Omair Ahmad
High-performance clocking of intellectual property (IP) modules, within a skew budget, is becoming difficult in deep sub-micron technologies. In this work, we propose a novel and all-digital synchronous design method for point-to-point communications, using two stages of interfacing registers and locally delayed clock with phase adjustments. This design is free from synchronizers and clock-data mismatch problems. Moreover, communicating modules run at frequencies which are virtually independent of the clock skew. We also provide a comprehensive case-wise mathematical analysis to facilitate design automation for synthesizing such designs as standard cells. An overall improvement in skew tolerance of up to n times (where n is the number of registers used), when compared to conventional designs, is achieved when the skew orientation is known and n/2 times if the skew orientation is unknown. Improvement in skew tolerance is validated using gate-level simulations with the 0.18 μm TSMC CMOS technology. A prototype implementation of the proposed design using a Virtex-II Pro FPGA from Xilinx validates the claim that such designs allow a fast module to communicate with a slow module without constraining their frequencies.
signal processing systems | 2007
Mame Maria Mbaye; Normand Bélanger; Yvon Savaria; Samuel Pierre
Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm’s critical loops by transforming them according to a new intermediate representation. This representation is optimized and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups of up to 18 on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18 and 59%.
international conference on electronics, circuits, and systems | 2007
Nicolas Beucher; Normand Bélanger; Yvon Savaria; Guy Bois
This paper proposes an FPGA-based methodology to assess the energy efficiency of application-specific processors (ASIPs). This methodology is applied to a video processing algorithm, motion-compensated frame rate conversion (MC-FRC). Previous work has shown that designing a specific instruction set can enhance performance with a speed-up of more than 80-fold. The purpose of this work is to quantify the energy efficiency of the resulting accelerated processor. This efficiency is evaluated by estimating the power and energy consumption of the processor and of the ASIP when running the algorithm. The results obtained show that the ASIP is more energy efficient than the standard processor by a factor of at least 40. This paper describes the methodology used to compute the power and energy consumption and explains the results through a more detailed analysis of the power and energy consumption.
2006 IEEE North-East Workshop on Circuits and Systems | 2006
Normand Bélanger; Yvon Savaria
This paper investigates the integration of a 64-bit LNS arithmetic unit into a conventional microprocessor. The goals are to devise an LNS unit that can be faster than an FPU for a broad range of applications, and to minimize the added hardware. Two ways of implementing the logarithmic sum and difference functions are studied. One way uses higher-order Taylor series implemented by look-up tables and interpolation, while the other is based on a CORDIC engine. It is shown that a look-up table based implementation is fairly competitive with a floating-point unit in terms of clock rate, overall latency and repeat rate, at the expense of some cache pressure, while the CORDIC-based implementation is fast, has a repeat rate of one clock cycle, and supports complex operations, but at the cost of a higher gate count.
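For readers unfamiliar with logarithmic number system (LNS) arithmetic, the sketch below illustrates why the logarithmic sum function matters. It assumes positive values stored as base-2 logarithms and evaluates the sum function exactly with `math.log2`; a hardware unit such as the one studied here would approximate it (for example with tables or CORDIC). The function names are hypothetical, and signs, zero, and the difference function are left out.

```python
import math


def lns_sum_function(d):
    """s(d) = log2(1 + 2**d), evaluated here for d <= 0.
    Computed exactly in software; hardware would approximate it."""
    return math.log2(1.0 + 2.0 ** d)


def lns_mul(a, b):
    """Multiplication of two LNS values is a plain addition of logs."""
    return a + b


def lns_add(a, b):
    """Addition uses log2(2**a + 2**b) = max(a, b) + s(min(a, b) - max(a, b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + lns_sum_function(lo - hi)


# Usage: 3 + 5 and 3 * 5 computed on the logarithms of the operands.
log3, log5 = math.log2(3.0), math.log2(5.0)
total = 2.0 ** lns_add(log3, log5)    # close to 8.0
product = 2.0 ** lns_mul(log3, log5)  # close to 15.0
```

Multiplication is trivial in this representation, so the cost of an LNS unit is concentrated in evaluating the sum and difference functions, which is the part the abstract compares across implementations.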
IEEE Transactions on Circuits and Systems | 2010
Syed Rafay Hasan; Normand Bélanger; Yvon Savaria; M.O. Ahmad
This paper proposes a novel solution to make asynchronous handshake interfaces tolerant to crosstalk-glitch propagation (GP). This study leverages our recently proposed crosstalk-glitch modeling technique for asynchronous systems, which models the crosstalk GP in asynchronous interfaces. In this paper, a novel set of solutions to quench the GP is proposed. This set of solutions is called crosstalk-glitch gating, as the quenching is obtained by using carefully generated control signals. A step-by-step method is recommended to apply crosstalk-glitch gating to conventional asynchronous handshake interfaces. Transistor-level electrical simulations demonstrate that our method effectively generates crosstalk-glitch gate-controlling signals. In comparison with contemporary signal-integrity management techniques, our gating methodology can be applied to fewer nodes and earlier in the design cycle.