P. Glenn Gulak
University of Toronto
Publication
Featured research published by P. Glenn Gulak.
IEEE Transactions on Communications | 1994
Javan Erfanian; Subbarayan Pasupathy; P. Glenn Gulak
The problem of practical realization of the optimal fixed-delay symbol-by-symbol detection algorithm, which is optimum in the sense of minimizing the symbol error probability given a delay constraint D, is investigated. A fully parallel structure is developed, and through systematic reformulations of the algorithm, the computational requirements are reduced considerably. In addition, the problems associated with a large dynamic range, such as overflow (or underflow), are (practically) removed. A number of approximations are applied to this simplified parallel symbol (SPS) detector that lead to the derivation of suboptimal detectors. One such suboptimal detector is shown to be the same as the minimum-metric Viterbi detector. A brief comparison of the SPS detector and the Viterbi detector shows that the former has slightly better performance at low values of signal-to-noise ratio (SNR) and the latter performs a smaller number of computations, particularly at higher values of SNR; otherwise, the two detectors are comparable in performance and complexity.
Analog Integrated Circuits and Signal Processing | 1998
Dean R. D’Mello; P. Glenn Gulak
The drive towards shorter design cycles for analog integrated circuits has given impetus to several developments in the area of Field-Programmable Analog Arrays (FPAAs). Various approaches have been taken in implementing structural and parametric programmability of analog circuits. Recent extensions of this work have married FPAAs to their digital counterparts (FPGAs) along with data conversion interfaces, to form Field-Programmable Mixed-Signal Arrays (FPMAs). This survey paper reviews work to date in the area of programmable analog and mixed-signal circuits. The body of work reviewed includes university and industrial research, commercial products and patents. A time-line of important achievements in the area is drawn, the status of various activities is summarized, and some directions for future research are suggested.
IEEE Transactions on Circuits and Systems | 2010
Ritu Raj Singh; Derek Ho; Alireza Nilchi; P. Glenn Gulak; Patrick Yau; Roman Genov
A hybrid CMOS/thin-film microsystem for fluorescence contact imaging is presented. The microsystem integrates a high-performance optical interference filter and a 128 × 128 pixel active pixel sensor fabricated in a standard 0.35-μm CMOS technology. The thin-film filter has an optical density greater than 6.0 at the wavelength of interest, providing adequate excitation rejection to the 532-nm solid-state laser. Microsystem performance is experimentally validated by imaging spots of Cyanine-3 fluorophore, conventionally used in DNA detection. The emission intensity as a function of fluorophore concentration is measured with an estimated sensitivity of 5000 fluorophores/μm². A human DNA microarray has been imaged with the sensor prototype.
International Symposium on Turbo Codes and Related Topics | 2005
David Gnaedig; Emmanuel Boutillon; Michel Jezequel; Vincent C. Gaudet; P. Glenn Gulak
The main problem with the hardware implementation of turbo codes is the lack of parallelism in the MAP-based decoding algorithm. This paper proposes to overcome this problem by using a new family of turbo codes called Multiple Slice Turbo Codes. This family is based on two ideas: the encoding of each dimension with P independent tail-biting codes and a constrained interleaver structure that allows the parallel decoding of the P independent codewords in each dimension. The construction of the constituent codes and the optimization of the interleaver are described. A high degree of parallelism is obtained with equivalent or better performance than the DVB-RCS turbo code. For very high throughput applications, the parallel architecture decreases both decoding latency and hardware complexity compared to the classical serial architecture, which requires memory duplication.
International Symposium on Circuits and Systems | 2008
Mahdi Shabany; P. Glenn Gulak
An efficient lattice-reduction (LR) aided implementation of the K-best algorithm is proposed for the general infinite lattice detection problem, which is realized with about 80% less complexity than currently reported architectures. The saving in complexity is achieved by the introduction of an on-demand candidate generation scheme along with a distributed sorting scheme. The proposed scheme does not require any a priori knowledge of the candidate displacement as it expands the candidates using the Schnorr-Euchner method. It is scalable in terms of the number of transmit antennas and its complexity grows sub-linearly with the constellation order. The parallelism intrinsic to the algorithm makes it suitable for the pipelined VLSI implementation.
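The Schnorr-Euchner ordering mentioned above can be illustrated in one dimension: starting from the integer nearest to an unconstrained estimate, candidates are visited in a zigzag so that each new candidate is no closer to the estimate than the previous one. The sketch below is only an illustration under that simplifying assumption; the function name `schnorr_euchner` and the scalar integer lattice are hypothetical choices, not the paper's hardware scheme.

```python
import math
from itertools import islice


def schnorr_euchner(center):
    """Yield integers in order of non-decreasing distance from `center`.

    Hypothetical one-dimensional illustration of zigzag enumeration.
    """
    first = math.floor(center + 0.5)  # nearest integer to the estimate
    yield first
    # Take the first step toward the side on which the estimate lies.
    step = 1 if center >= first else -1
    current = first
    offset = 1
    while True:
        current += step * offset  # jump across the centre, one unit further out
        yield current
        step = -step
        offset += 1


# First five candidates around an estimate of 2.3.
print(list(islice(schnorr_euchner(2.3), 5)))
```

Because candidates are produced lazily, a detector built along these lines could request the next-best child of a node only when it is needed, which is the general idea behind on-demand expansion.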
International Symposium on Circuits and Systems | 2010
Dimpesh Patel; Vadim Smolyakov; Mahdi Shabany; P. Glenn Gulak
This paper presents a VLSI architecture of a novel soft-output K-Best MIMO detector. The proposed detector attains low computational complexity using three improvement ideas: relevant discarded paths selection, last stage on-demand expansion, and relaxed LLR computation. A deeply pipelined architecture for a soft-output MIMO detector is implemented for a 4×4 64-QAM MIMO system realizing a peak throughput of 655 Mbps, while consuming 174K gates and 195 mW in 0.13 μm CMOS. Synthesis results in 65 nm CMOS show the potential to support a sustained throughput of up to 2 Gbps, achieving the data rates envisioned by the emerging IEEE 802.16m and LTE-Advanced wireless standards.
International Symposium on Circuits and Systems | 2009
Dimpesh Patel; Mahdi Shabany; P. Glenn Gulak
QR decomposition (QRD) is an essential signal processing task for many MIMO signal detection schemes. However, decomposition of complex MIMO channel matrices with large dimensions leads to high computational complexity, and hence results in either large core area or low throughput. Moreover, mobile communication applications that involve fast-varying channels require QR decomposition with low processing latency. In this paper, we propose a hybrid QRD scheme that uses a combination of multi-dimensional Givens rotations, Householder transformations and the conventional two-dimensional (2D) Givens rotations to both reduce the overall computational complexity and achieve higher execution parallelism. To demonstrate the effectiveness of the proposed QRD scheme, a novel pipelined architecture is presented that uses unrolled pipelined CORDIC processors iteratively to maximize throughput and resource utilization, while minimizing the gate count. The architectures of the main data processing modules, namely the 2D, Householder 3D and 4D/2D configurable pipelined CORDIC processors, are also presented. Synthesis results for a 4×4 MIMO detector in a 0.13 μm CMOS process indicate that this QRD design computes a 4×4 complex R matrix and four updated 4×1 complex symbol vectors every 40 cycles, at a clock frequency of 270 MHz, and requires 36K gates. The proposed design achieves the lowest processing time and the highest throughput reported to date for the same framework.
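For readers unfamiliar with the conventional 2D Givens rotation referred to above, the following minimal sketch shows how a sequence of plane rotations zeroes the entries below the diagonal one at a time. It assumes real-valued matrices and floating-point arithmetic; the function name `givens_qr` is hypothetical, and the sketch does not reproduce the paper's hybrid scheme, its complex-valued processing, or its CORDIC hardware.

```python
import math


def givens_qr(a):
    """Return (q, r) with a = q * r, using 2D Givens rotations.

    Assumes `a` is a list of rows with real entries (illustrative only).
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for col in range(n):
        # Work upward from the bottom row, zeroing one entry per rotation.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            norm = math.hypot(x, y)
            if norm == 0.0:
                continue  # both entries are already zero
            c, s = x / norm, y / norm
            # Rotate rows (row-1, row) of r; this zeroes r[row][col].
            for k in range(n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v
            # Apply the transposed rotation to the columns of q so that
            # the product q * r stays equal to a.
            for k in range(m):
                u, v = q[k][row - 1], q[k][row]
                q[k][row - 1] = c * u + s * v
                q[k][row] = -s * u + c * v
    return q, r


q, r = givens_qr([[2.0, -1.0, 3.0], [4.0, 1.0, -2.0], [-2.0, 5.0, 1.0]])
```

Each rotation touches only two rows, which is why rotations acting on disjoint row pairs can run in parallel in hardware.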
IEEE Transactions on Biomedical Circuits and Systems | 2015
Meysam Zargham; P. Glenn Gulak
Delivering milliwatts of wireless power at centimeter distances is advantageous to many existing and emerging biomedical applications. It is highly desirable to fully integrate the receiver on a single chip in standard CMOS with no additional post-processing steps or external components. This paper presents a 2 × 2.18 mm² on-chip wireless power transfer (WPT) receiver (Rx) coil fabricated in 0.13 μm CMOS. The WPT system utilizes a 14.5 × 14.5 mm² transmitter (Tx) coil that is fabricated on a standard FR4 substrate. The on-chip power harvester demonstrates a peak WPT efficiency of -18.47 dB, -20.96 dB and -20.15 dB at 10 mm of separation through air, bovine muscle and 0.2 molar NaCl, respectively. The achieved efficiency enables the delivery of milliwatts of power to application circuits while staying below safe power density and electromagnetic (EM) exposure limits.
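To put the reported efficiencies in linear terms, the short sketch below converts each dB figure to a percentage. It assumes the efficiencies are power ratios, so that a value in dB equals 10·log10(P_out/P_in); the helper name `db_to_ratio` is hypothetical.

```python
def db_to_ratio(db):
    """Convert a power ratio in dB to a linear ratio (assumes 10*log10 scaling)."""
    return 10 ** (db / 10)


# Peak WPT efficiencies quoted in the abstract, at 10 mm separation.
reported = [("air", -18.47), ("bovine muscle", -20.96), ("0.2 molar NaCl", -20.15)]
for medium, eff_db in reported:
    print(f"{medium}: {db_to_ratio(eff_db) * 100:.2f}% of transmitted power")
```

Under that assumption the three figures correspond to roughly 1.4%, 0.8% and 1.0% of the transmitted power, so delivering milliwatts to the chip implies transmit power on the order of a few hundred milliwatts.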
Integration | 1995
Kenneth J. Schultz; P. Glenn Gulak
Content addressable memories (CAMs) have significantly lower capacities than RAMs. Following a summary of large-capacity CAM applications and a brief tutorial look at CAM operation, this paper reviews the sources of this capacity disadvantage: comparator area overhead and difficulty implementing two-dimensional decoding. Past attempts at achieving higher CAM density and capacity are reviewed, and advantages and disadvantages of each are discussed qualitatively. Architectures are divided into the broad classes of serial and fully parallel. The former include bit-serial, Orthogonal-RAM-based, insertion-memory, word-serial, multiport, vector-centered, pattern-addressable memory, and systolic associative memory. The latter include standard architectures, post-encoding, and pre-classification. A taxonomy, providing the first structured comparison of existing techniques, is presented. Thereafter, four architectures (two serial and two fully parallel) are quantitatively analyzed in terms of delay, area, and power, and the cost-performance measures area × delay and power × delay. The fully parallel architectures, despite their high cost, produce superior cost-performance results.
IEEE Transactions on Very Large Scale Integration Systems | 2013
Mahdi Shabany; Ameer Youssef; P. Glenn Gulak
This paper presents the first silicon-proven implementation of a lattice reduction (LR) algorithm, which achieves maximum likelihood diversity. The implementation is based on a novel hardware-optimized variant of the Lenstra, Lenstra, and Lovász (LLL) algorithm, which significantly reduces complexity by replacing all the computationally intensive LLL operations (multiplication, division, and square root) with low-complexity additions and comparisons. The proposed VLSI design utilizes a pipelined architecture that produces an LR-reduced matrix set every 40 cycles, which is a 60% reduction compared to current state-of-the-art LR field-programmable gate array implementations. The 0.13-μm CMOS LR core presented in this paper achieves a clock rate of 352 MHz, and thus is capable of sustaining a throughput of 880 Mb/s for 64-QAM multiple-input-multiple-output detection with superior performance while dissipating 59.4 mW at a 1.32 V supply.