Dheeraj Sreedhar
IBM
Publications
Featured research published by Dheeraj Sreedhar.
IEEE Wireless Communications Letters | 2013
Karthik Muralidhar; Dheeraj Sreedhar
Vector state-scalar observation (VSSO) Kalman channel estimators for doubly-selective OFDM (DS-OFDM) systems achieve complexity savings of 90%, with no loss of performance, compared to vector state-vector observation (VSVO) Kalman estimators. In this letter, we present the VSSO Kalman channel estimator for doubly-selective multiple-input multiple-output OFDM (DS-MIMO-OFDM) systems. Unlike the VSSO estimator in a DS-OFDM system, where all the pilot symbols have the same value, the pilot symbols must be designed in a specific way to make a VSSO estimator feasible for a DS-MIMO-OFDM system. We derive a sufficient condition, called the VSSO Kalman-filter condition (VSSO-KFC), that the pilot-pattern design must satisfy for the VSSO Kalman estimator to be feasible. We comment on, and compare, the proposed pilot pattern with Barhumi's and Dai's pilot patterns. A new scheme is introduced, based on the proposed pilot pattern and the VSSO Kalman filter, that increases spectral efficiency compared to a conventional system.
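To illustrate where the complexity savings come from, here is a minimal, generic sketch of a Kalman measurement update with a scalar observation: the innovation covariance is a scalar, so computing the gain needs a division rather than a matrix inverse. This is textbook notation, not the estimator derived in the letter.

```python
import numpy as np

def kalman_scalar_update(x, P, y, h, r):
    """One Kalman measurement update with a *scalar* observation y = h^T x + v.

    The innovation variance h^T P h + r is a scalar, so the gain needs a
    division instead of a matrix inverse -- the source of VSSO's savings
    over vector-observation (VSVO) updates. Generic textbook form only.
    """
    s = h @ P @ h + r              # scalar innovation variance
    K = (P @ h) / s                # gain: no matrix inversion required
    x = x + K * (y - h @ x)        # state update
    P = P - np.outer(K, h) @ P     # covariance update: (I - K h^T) P
    return x, P
```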
International Conference on Supercomputing | 2018
Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Prakash Murali; Shivmaran S. Pandian; Yogish Sabharwal; Dheeraj Sreedhar
The Tucker decomposition generalizes the notion of the singular value decomposition (SVD) to tensors, the higher-dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed-memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph-based scheme typically results in faster HOOI execution, its complexity means that the time taken to determine the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor, so the scheme achieves better overall HOOI execution time. Our experimental evaluation on large real-life tensors (with up to 4 billion elements) shows that the scheme outperforms the prior schemes on HOOI execution time by a factor of up to 3x. At the same time, its distribution time is comparable to that of the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.
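As a rough illustration of what a "lightweight, real-time" distribution looks like, the sketch below hashes each nonzero's coordinates to a rank in a single O(nnz) pass, balancing per-rank nonzero counts (and hence FLOPs) in expectation. This is a deliberately naive stand-in for the family of lightweight schemes, not the scheme proposed in the paper.

```python
import numpy as np

def lite_distribute(coords, nprocs, seed=0):
    """Toy lightweight distribution of a sparse tensor's nonzeros.

    coords : (nnz, ndim) integer array of nonzero indices.
    Returns an (nnz,) array assigning each nonzero to a rank in [0, nprocs).
    One streaming pass over the nonzeros; illustrative stand-in only.
    """
    rng = np.random.default_rng(seed)
    # Random per-mode mixing keys make the hash robust to skewed indices.
    keys = rng.integers(1, 2**31 - 1, size=coords.shape[1])
    return (coords @ keys) % nprocs
```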
International Parallel and Distributed Processing Symposium | 2017
Venkatesan T. Chakaravarthy; Jee W. Choi; Douglas J. Joseph; Xing Liu; Prakash Murali; Yogish Sabharwal; Dheeraj Sreedhar
The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (higher-order orthogonal iteration) procedure, in which the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner and design strategies that are optimal under these two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reductions in load and volume compared to prior heuristics, and up to a 7x speed-up in overall running time.
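For readers unfamiliar with the procedure, a minimal single-node HOOI sketch for dense tensors follows, with the tensor-times-matrix (TTM) chain as the core routine; the distributed implementation studied in the paper parallelizes exactly this computation. Function names here are illustrative, not the paper's API.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def ttm(T, M, mode):
    """Tensor-times-matrix along `mode`: contract T's `mode` axis with M's columns."""
    out = np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hooi(T, ranks, iters=10):
    """Plain HOOI for a dense tensor; returns core G and factor matrices U."""
    N = T.ndim
    # HOSVD initialization: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(N)]
    for _ in range(iters):
        for n in range(N):
            # TTM chain: contract T with every factor except mode n.
            Y = T
            for m in range(N):
                if m != n:
                    Y = ttm(Y, U[m].T, m)
            # Refresh U[n] from the mode-n unfolding of the shrunken tensor.
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    G = T
    for n in range(N):
        G = ttm(G, U[n].T, n)
    return G, U
```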
International Conference on Acoustics, Speech, and Signal Processing | 2013
Karthik Muralidhar; Dheeraj Sreedhar
Recently, we presented a low-complexity vector state-scalar observation (VSSO) Kalman channel estimator for doubly-selective OFDM systems in [1]. In [1], we derived the decoupling equations, whereby a received pilot symbol vector is decoupled into L scalars, L being the number of multipaths of the channel. This decoupling concept formed the basis of the VSSO Kalman channel estimator. However, in [1] we considered only one observed subcarrier from each pilot cluster. In this paper, we consider more than one observed pilot subcarrier from each pilot cluster and work out a more general form of the decoupling equation. This paves the way to a more general form of the VSSO Kalman estimator. Performance and complexity results are compared against an existing vector state-vector observation (VSVO) Kalman estimator. Our proposed VSSO method achieves the same performance as the existing VSVO method [2] while yielding more than 90% complexity savings. Results are also presented for a practical system, a digital video broadcasting (DVB-H) system.
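A standard Kalman-filter identity gives the flavor of this generalization: when the measurement noise is uncorrelated across the observed pilot subcarriers, a vector of q pilot observations can be absorbed as q sequential scalar updates, preserving the inversion-free structure of the earlier sketch. This is the generic identity only, not the paper's decoupling derivation.

```python
import numpy as np

def kalman_multi_pilot_update(x, P, ys, H, rs):
    """Absorb q pilot observations as q sequential scalar updates.

    Valid when measurement noise is uncorrelated across observations;
    equivalent to a single vector-observation update but needs no matrix
    inverse. Generic Kalman identity; illustrative names throughout.
    """
    for y, h, r in zip(ys, H, rs):   # each row h of H is one observation vector
        s = h @ P @ h + r            # scalar innovation variance
        K = (P @ h) / s
        x = x + K * (y - h @ x)
        P = P - np.outer(K, h) @ P
    return x, P
```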
International Conference on Acoustics, Speech, and Signal Processing | 2013
Dheeraj Sreedhar; Jeff H. Derby; Augusto Vega; B. Rogers; Charles Luther Johnson; Robert K. Montoye
The high-speed uplink packet access (HSUPA) wireless standard requires extremely high-performance signal processing in the baseband receiver, the most challenging component being the chip-rate rake receiver. In this paper we describe architectural enhancements to IBM's PowerEN processor that enable it to support the computational requirements of the rake receiver in a fully programmable and scalable fashion. A key feature of these enhancements is a bank-based very large register file with embedded single instruction multiple data (SIMD) support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank. This overcomes the limitation on the number of register-file ports while enabling a high degree of parallelism. We show that these enhancements enable the integration of multi-sector HSUPA G-RAKE receivers on a single processor.
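The chip-rate kernel being accelerated is, in essence, per-finger despreading followed by maximal-ratio combining. The toy numpy version below shows that structure; it is a generic textbook rake kernel under simplified assumptions (single code, integer chip delays), not PowerEN code.

```python
import numpy as np

def rake_combine(rx, pn, delays, gains, sf=4):
    """Toy rake receiver: despread each multipath finger, then combine.

    rx     : received chip stream (complex)
    pn     : spreading sequence, one period per symbol (length sf)
    delays : per-finger chip delays (ints)
    gains  : per-finger channel estimates (complex)
    """
    n_sym = (len(rx) - max(delays)) // sf
    out = np.zeros(n_sym, dtype=complex)
    for d, g in zip(delays, gains):
        for s in range(n_sym):
            chips = rx[d + s * sf : d + (s + 1) * sf]
            # Despread this finger and maximal-ratio combine across fingers.
            out[s] += np.conj(g) * np.dot(chips, np.conj(pn))
    return out / sf
```

Each finger's per-symbol work is an independent dot product over chips, which is the kind of data parallelism that maps naturally onto bank-attached SIMD computation elements.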
IET Communications | 2013
Karthik Muralidhar; Dheeraj Sreedhar
Equations (17), (18) and (19) are incorrectly given in the above study. In this study, we give the correct versions of those equations.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Dheeraj Sreedhar; Jeff H. Derby; Robert K. Montoye; Charles Luther Johnson
Dense matrix-matrix multiplication is an important kernel in many high-performance computing applications, including the emerging deep-neural-network-based cognitive computing applications. Graphics processing units (GPUs) have been very successful at handling dense matrix-matrix multiplication in a variety of applications. However, recent research has shown that GPUs are very inefficient in using the available compute resources on the silicon for matrix multiplication, in terms of utilization of peak floating-point operations per second (FLOPS). In this paper, we show that an architecture with a large register file supported by “indirection” can utilize the floating-point computing resources on the processor much more efficiently. A key feature of our proposed in-line accelerator is a bank-based very large register file with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register-file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly parallel, super-wide SIMD device. We show that we can achieve more than 25% better performance than the best known results for matrix multiplication using GPUs, while using far fewer floating-point computing units and hence less silicon area and power. We also show that the architecture blends well with the Strassen and Winograd matrix multiplication algorithms. We optimize the selective data parallelism that the LCEs enable for these algorithms and study the area-performance trade-offs.
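For reference, the Strassen recursion mentioned above trades 8 half-size multiplications for 7 at the cost of extra additions. A standard sketch, assuming square matrices whose size is a power of two (pad otherwise), follows; this is the classical algorithm, not the paper's LCE-optimized variant.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Classical Strassen recursion: 7 sub-multiplies instead of 8.
    Falls back to the ordinary product at small sizes."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The extra matrix additions are cheap relative to the saved multiply, which is why the recursion pays off once the blocks are large enough to amortize them.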
Archive | 2014
Dheeraj Sreedhar; Robert K. Montoye; Jeffrey Haskell Derby
Archive | 2015
Jeffrey Haskell Derby; Charles Luther Johnson; Robert K. Montoye; Dheeraj Sreedhar; Steven Paul Vanderwiel
Archive | 2014
Jeffrey Haskell Derby; Robert K. Montoye; Dheeraj Sreedhar