Niraj Nandan
Texas Instruments
Publication
Featured research published by Niraj Nandan.
international conference on consumer electronics berlin | 2013
Mihir Mody; Niraj Nandan; Tamama Hideo
In-loop filtering in HEVC/H.265 is one of the most computation-intensive decoder blocks, accounting for around 15-20% of overall decoding complexity. Loop filtering in HEVC is more sophisticated than in H.264 owing to the introduction of the Sample Adaptive Offset (SAO) filter in addition to the de-blocking filter. In this paper, a very high-performance and area-efficient VLSI architecture is proposed for the HEVC decoder, which supports 4K@60fps for next-generation Ultra HDTV at a 200 MHz clock. The design can process a Largest Coding Unit (LCU) of size 64×64 in less than 1200 cycles, with performance scaling down directly with LCU size. The architecture consists of LCU-level pipelining across de-blocking and SAO filtering, with four- and three-stage internal pipelines within each block. It proposes fully on-the-fly filtering that avoids external memory bandwidth, a custom filtering order and scan, 4×4 block-based processing, and a FIFO-based asynchronous architecture to achieve high performance. The final design in a 28nm CMOS process is expected to take around 0.2 mm2 after actual place and route. The proposed design is capable of handling 4K@60fps and is fully compliant with the HEVC video standard specification, including handling of corner conditions such as slice and tile processing via a generic region definition.
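As a rough sanity check (my own arithmetic, not taken from the paper), the quoted 1200-cycle LCU budget can be compared against the 200 MHz clock and the 4K@60fps target:

/* Back-of-the-envelope check: does a 1200-cycle budget per 64x64 LCU at
 * 200 MHz sustain 4K (3840x2160) at 60 fps? Numbers are from the abstract;
 * the rounding of partial border LCUs is my assumption. */
#include <stdio.h>

int main(void)
{
    const int width = 3840, height = 2160, fps = 60;
    const int lcu = 64;
    const long cycles_per_lcu = 1200;
    const double clock_hz = 200e6;

    /* Round frame dimensions up to whole LCUs for the border rows/columns. */
    long lcus_per_frame = (long)((width + lcu - 1) / lcu) * ((height + lcu - 1) / lcu);
    double cycles_per_sec = (double)lcus_per_frame * fps * cycles_per_lcu;

    printf("LCUs per frame      : %ld\n", lcus_per_frame);
    printf("Required cycles/sec : %.1f M\n", cycles_per_sec / 1e6);
    printf("Available cycles/sec: %.1f M\n", clock_hz / 1e6);
    printf("Utilization         : %.1f %%\n", 100.0 * cycles_per_sec / clock_hz);
    return 0;
}

With 2040 LCUs per 4K frame the budget works out to roughly 147M cycles/sec, comfortably inside the 200 MHz clock, which is consistent with the claimed headroom.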
advances in computing and communications | 2014
Mihir Mody; Hetul Sanghvi; Niraj Nandan; Shashank Dabral; Rajasekhar Allu; Dharmendra Soni; Sunil Sah; Gayathri Seshadri; Prashant Karandikar
The Imaging Sub-System (ISS) enables capturing photographs or live video from raw image sensors. It consists of a set of sensor interfaces and a cascaded set of algorithms that improve image/video quality. This paper illustrates a typical imaging sub-system architecture consisting of a sensor front end, an Image Signal Processor (ISP) and an Image Co-processor (sIMCOP). Here we describe the ISS developed by Texas Instruments (TI) for the OMAP 5432 processor. The solution is flexible enough to interface with various kinds of image sensors and provides hooks to tune visual quality for specific customers as well as end applications. It is also flexible in providing options to enable customized data flows based on actual algorithm needs. The overall solution runs at a high throughput of 1 pixel/clock cycle to enable full HD video at high visual quality.
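A minimal sketch of the "customized data flow" idea, with hypothetical stage names that are not TI's actual ISP modules: each stage in the cascade can be bypassed or retained per use case, while the hardware still streams one pixel per clock.

/* Illustrative cascaded-ISP model (hypothetical stages, not the OMAP 5432 API):
 * each stage can be enabled or bypassed to build a customized data flow. */
#include <stdint.h>
#include <stdio.h>

typedef uint16_t pixel_t;                      /* one raw sensor sample */
typedef pixel_t (*isp_stage_fn)(pixel_t in);

typedef struct {
    const char  *name;
    isp_stage_fn fn;
    int          enabled;                      /* hook for customized flows */
} isp_stage;

static pixel_t defect_correct(pixel_t p) { return p; }      /* placeholder */
static pixel_t lens_shading  (pixel_t p) { return p; }      /* placeholder */
static pixel_t tone_map      (pixel_t p) { return p >> 2; } /* placeholder; real ISPs use a LUT */

int main(void)
{
    isp_stage pipeline[] = {
        { "defect-correction", defect_correct, 1 },
        { "lens-shading",      lens_shading,   1 },
        { "tone-map",          tone_map,       1 },
    };

    pixel_t p = 2048;                          /* conceptually, one pixel per clock */
    for (unsigned i = 0; i < sizeof pipeline / sizeof pipeline[0]; ++i)
        if (pipeline[i].enabled)
            p = pipeline[i].fn(p);

    printf("processed pixel = %u\n", p);
    return 0;
}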
ieee international conference on image information processing | 2013
Niraj Nandan; Mihir Mody
There is a continuous thrust toward improved and innovative video solutions to facilitate video conferencing, video surveillance, transcoding, streaming video and many more customer-centric applications. Increasing frame rates and frame sizes demand high-performance hardware accelerators (HWA) that enable efficient 16×16-pixel macroblock-level (MB) pipelining inside the video processing engine (IVAHD). The in-loop de-blocking filter of the H.264 codec reduces blocking artifacts in each MB and is very demanding in terms of cycles and resources (memory accesses and memory storage). Removal of blocking artifacts caused by block-based video codecs takes around 20-25% of overall decoder complexity in the current generation of standards (H.264), and this trend will continue in H.265. The higher adaptability of the filter process, smaller block sizes (4×4), motion-vector (MV) dependent boundary strength (BS) computation for each edge of a 4×4 block, a predefined filtering order (vertical edges followed by horizontal edges), and loading of current and neighbor MB pixels require a large number of accesses to the shared memory of IVAHD (SL2), high processing cycle counts and a large internal pixel buffer (IPB). This paper discusses a novel approach to the loop filter (LPF) operation that overcomes these barriers and enables IVAHD to reach a 240fps frame rate in full HD processing of the H.264 codec with leadership area and power. The final design in a 28nm CMOS process is expected to take around 0.10 mm2 after actual place and route (consisting of 220 KGates with 5 KB of internal memory). The proposed design is capable of handling 4K@60fps and is scalable to support the H.265 in-loop de-blocking filter.
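For context, the boundary strength (BS) derivation mentioned above is per 4×4 edge and depends on coding mode, residuals, references and motion vectors. A simplified, single-reference-list sketch (not the complete H.264 spec logic, and not the paper's hardware) of that decision:

/* Simplified per-4x4-edge boundary strength derivation, for illustration only. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int is_intra;        /* block coded in intra mode            */
    int has_coeffs;      /* block has non-zero residual coeffs   */
    int ref_idx;         /* reference picture index              */
    int mv_x, mv_y;      /* motion vector in quarter-pel units   */
} blk4x4;

/* p and q are the 4x4 blocks on either side of the edge. */
static int boundary_strength(const blk4x4 *p, const blk4x4 *q, int is_mb_edge)
{
    if (p->is_intra || q->is_intra)
        return is_mb_edge ? 4 : 3;
    if (p->has_coeffs || q->has_coeffs)
        return 2;
    if (p->ref_idx != q->ref_idx ||
        abs(p->mv_x - q->mv_x) >= 4 || abs(p->mv_y - q->mv_y) >= 4)
        return 1;
    return 0;                        /* BS == 0: edge is not filtered */
}

int main(void)
{
    blk4x4 p = { 0, 1, 0, 8, 0 }, q = { 0, 0, 0, 4, 0 };
    printf("BS = %d\n", boundary_strength(&p, &q, 0));   /* 2: p has coefficients */
    return 0;
}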
international symposium on circuits and systems | 2014
Hetul Sanghvi; Mihir Mody; Niraj Nandan; Mahesh Mehendale; Subrangshu Das; Dipan Kumar Mandal; Pavan Shastry
Video codec standards like H.264 and HEVC are driving the need for high computation and high memory bandwidth in current SOCs. On the other hand, portable devices like smartphones and tablets are driving the need to reduce power consumption for enhanced battery life. In this paper, we present a scalable H.264 Ultra-HD video codec engine that dissipates 9 mW of decode and 18 mW of encode power (for a typical HP H.264 1080p30 bit-stream) in 28 nm low power process technology node using various low power optimization techniques across architecture, design, circuit, software and systems.
international conference on consumer electronics | 2015
Mihir Mody; Niraj Nandan; Hetul Sanghvi; Rajasekhar Allu; Rajat Sagar
Wide Dynamic Range (WDR) enables capturing images with extreme light variations within them. Typically, image sensor manufacturers enable WDR using proprietary techniques covering both the physical capture method and the transmission format. This poses an extreme challenge for the interfacing Image Signal Processor (ISP), as complexity grows significantly to support customized processing and data flows for each WDR technique. This paper summarizes various WDR techniques from multiple sensor manufacturers from an ISP standpoint and proposes a flexible architecture to support WDR data flows and processing using software and hardware. The given solution can be mapped to silicon area savings by a factor of N (the number of WDR techniques to be supported) and of 33% for the rest of the ISP, with additional external memory bandwidth, enabling a universal WDR receiver.
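One common sensor-side WDR transport trick is companding, where wide-range data is compressed through a piecewise-linear curve before transmission; a universal receiver can undo it with a programmable knee table instead of hard-wiring one curve per sensor. The sketch below uses hypothetical knee points, not any vendor's actual curve:

/* Piecewise-linear decompanding sketch (hypothetical 3-segment curve expanding
 * 12-bit companded samples back toward a wider linear range). */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t in_knee, out_knee, slope; } pwl_seg;

static const pwl_seg curve[] = {
    {    0,     0,   1 },   /* linear near black      */
    { 2048,  2048,  16 },   /* mid-tones expanded 16x */
    { 3072, 18432, 512 },   /* highlights expanded    */
};

static uint32_t decompand(uint32_t x)
{
    int i = sizeof curve / sizeof curve[0] - 1;
    while (i > 0 && x < curve[i].in_knee)
        --i;
    return curve[i].out_knee + (x - curve[i].in_knee) * curve[i].slope;
}

int main(void)
{
    uint32_t samples[] = { 100, 2500, 4095 };
    for (int i = 0; i < 3; ++i)
        printf("companded %4u -> linear %7u\n", samples[i], decompand(samples[i]));
    return 0;
}

A programmable table of this kind is one way an ISP front end can absorb per-vendor WDR formats in software-configurable hardware rather than dedicated per-technique logic.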
international conference on consumer electronics | 2014
Niraj Nandan
Innovative video solutions are required for many customer-centric applications (video conferencing, video surveillance, transcoding, streaming video, etc.). Increasing frame rates and frame sizes demand more DDR bandwidth; for example, the video processing engine (IVAHD) needs close to 3 GByte/sec of system bandwidth (encoder) to support 4K@30fps. With concurrent multimedia sub-systems, efficient data transfer between the IVAHD shared Level 2 memory (SL2) and DDR becomes even more important. The amount of data, the 2D nature of the data, and 2D data byte alignment conflicting with the interconnect word (16 Byte) alignment cannot be addressed by generic DMAs. This paper discusses the DMA engine (VDMA), running at 266 MHz, which enables IVAHD to perform 4K processing per LCU (16×16) within 200 cycles. The final design in a 28nm CMOS process is expected to consume 4 mW of encode power and take around 0.50 mm2 after actual place and route (consisting of 1280 KGates with 24 contexts, each of size 128 Byte).
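To illustrate the alignment problem (my own toy model, not the VDMA design): when every access on a 16-byte-aligned interconnect must be rounded out to whole words, a narrow or misaligned 2D block row drags extra bytes across the bus.

/* Overhead of word-aligned access for an unaligned 2D block. */
#include <stdio.h>

#define WORD 16u   /* interconnect word size in bytes */

/* Bytes actually moved for a width x height block whose rows start at byte
 * offset 'offset' within a 16-byte word, when accesses must be word-aligned. */
static unsigned bytes_on_bus(unsigned offset, unsigned width, unsigned height)
{
    unsigned first = offset / WORD;
    unsigned last  = (offset + width + WORD - 1) / WORD;
    return (last - first) * WORD * height;
}

int main(void)
{
    unsigned useful = 16 * 16;                   /* a 16x16 luma block */
    unsigned moved  = bytes_on_bus(5, 16, 16);   /* rows misaligned by 5 bytes */
    printf("useful bytes: %u, bytes on bus: %u, overhead: %.0f %%\n",
           useful, moved, 100.0 * (moved - useful) / useful);
    return 0;
}

In this toy case a misaligned 16×16 block doubles the bytes moved, which is the kind of inefficiency a purpose-built 2D-aware DMA engine is meant to avoid.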
international symposium on signal processing and information technology | 2014
Niraj Nandan; Mihir Mody; Hetul Sanghvi; Prithvi Shankar
Transformation and quantization in block-based video codecs introduce blocking artifacts at block edges. A specially optimized video filter, called the de-blocking filter, is applied on 4×4/8×8 block boundaries to enhance visual quality and improve prediction efficiency. Most recent video codecs, such as H.264, H.265 (HEVC) and VC-1, use an in-loop de-blocking filter (LPF) in the decoder path. Each video codec standard defines a fixed order of filter operations so that universal decoder output is consistent. This standard-defined fixed edge order is not optimal for various de-blocking Hardware Accelerator (HWA) architectures, which would otherwise have to compromise on performance, power or area. Pipelining of unfiltered pixel loading with the filter operation, internal storage to keep partially or fully filtered pixels, and the order of storage of fully filtered pixels are some of the challenges that are difficult to meet with the standard-defined edge order. In this paper, a novel approach to customizing the edge order is discussed for differing architectural requirements and for various video codec standards. The resulting filtered data with the optimized edge order matches that obtained with the standard-defined edge order.
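The equivalence argument can be pictured with a toy dependency check (a deliberate over-simplification of the real per-sample dependencies, and not the paper's method): a reordered schedule is acceptable as long as no edge is filtered before the edges whose output it reads, so the final pixels match the standard order.

/* Toy schedule checker for one block's edges: vertical edges V0..V3 and
 * horizontal edges H0..H3, with the simplified rule that every horizontal
 * edge depends on all vertical edges having been filtered first. */
#include <stdio.h>

enum { V0, V1, V2, V3, H0, H1, H2, H3, NEDGES };

static int is_horizontal(int e) { return e >= H0; }

static int order_is_legal(const int *order)
{
    int v_done = 0;
    for (int i = 0; i < NEDGES; ++i) {
        if (!is_horizontal(order[i]))
            v_done++;
        else if (v_done < 4)
            return 0;            /* horizontal edge scheduled too early */
    }
    return 1;
}

int main(void)
{
    int standard[] = { V0, V1, V2, V3, H0, H1, H2, H3 };
    int custom[]   = { V3, V1, V0, V2, H2, H0, H3, H1 };  /* reordered, still legal */
    int bad[]      = { V0, H0, V1, V2, V3, H1, H2, H3 };  /* H0 before all verticals */
    printf("standard: %d  custom: %d  bad: %d\n",
           order_is_legal(standard), order_is_legal(custom), order_is_legal(bad));
    return 0;
}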
international conference on acoustics, speech, and signal processing | 2014
Hetul Sanghvi; Mihir Mody; Niraj Nandan; Mahesh Mehendale; Subrangshu Das; Dipan Kumar Mandal; Nainala Vyagrheswarudu; Vijayavardhan Baireddy; Pavan Shastry
With advances in video coding standards like H.264 and HEVC, coupled with those in display technology, Ultra HD content has started entering the mainstream. This is driving the need for high computation and memory bandwidth in current multimedia SOCs. In this paper, we present a monolithic multi-format video codec engine that achieves Ultra HD performance for H.264 High Profile, reduces the external memory bandwidth requirement by 2X compared to its predecessor and takes only 5.9 mm2 of silicon area in a low-power 28nm process.
international conference on consumer electronics | 2015
Dipan Kumar Mandal; Mihir Mody; Mahesh Mehendale; Naresh Yadav; Ghone Chaitanya; Piyali Goswami; Hetul Sanghvi; Niraj Nandan
Video coding standards (e.g. H.264, HEVC) use the slice, consisting of a header and payload video data, as an independent coding unit for low-latency encode-decode and better transmission error resiliency. In typical video streams, decoding the slice header is simple enough to be done on standard embedded RISC processor architectures. However, universal decoding scenarios require handling worst-case slice header complexity, which grows to an unmanageable level, well beyond the capacity of most embedded RISC processors. Hardwiring the slice processing control logic is potentially helpful, but it reduces the flexibility to tune the decoder for error conditions, an important differentiator for the end user. This paper presents a programmable approach to accelerating slice header decoding using an Application Specific Instruction Set Processor (ASIP). Purpose-built instructions, implemented as extensions to a RISC processor (ARP32), accelerate slice processing by 30% for typical cases, reaching up to 70% for slices with worst-case decoding complexity. The approach enables real-time universal video decode for all slice-complexity scenarios without sacrificing the flexibility and adaptability to customize and differentiate the codec solution via software programmability.
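To give a feel for the workload: H.264/HEVC slice headers are dominated by bit-serial Exp-Golomb (ue(v)) parsing, which on a plain RISC core costs many data-dependent branches per syntax element. The software model below (my illustration, not the ARP32 extension instructions themselves) shows the kind of operation a purpose-built instruction can collapse into a few cycles.

/* Unsigned Exp-Golomb (ue(v)) decoder over a byte buffer. */
#include <stdint.h>
#include <stdio.h>

typedef struct { const uint8_t *buf; uint32_t bitpos; } bitreader;

static uint32_t get_bit(bitreader *br)
{
    uint32_t b = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
    br->bitpos++;
    return b;
}

static uint32_t read_ue(bitreader *br)
{
    int leading_zeros = 0;
    while (get_bit(br) == 0)               /* count leading zero bits */
        leading_zeros++;
    uint32_t suffix = 0;
    for (int i = 0; i < leading_zeros; ++i)
        suffix = (suffix << 1) | get_bit(br);
    return (1u << leading_zeros) - 1 + suffix;
}

int main(void)
{
    /* Bitstream 00111 010...: "00111" decodes to 6, then "010" decodes to 1. */
    const uint8_t bits[] = { 0x3A, 0x00 };   /* 0011 1010 0000 0000 */
    bitreader br = { bits, 0 };
    uint32_t a = read_ue(&br);
    uint32_t b = read_ue(&br);
    printf("ue = %u, ue = %u\n", a, b);
    return 0;
}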
international conference on communications | 2014
Mihir Mody; Niraj Nandan; Hetul Sanghvi
Typically, industry video solutions perform in-loop filtering on the fly (e.g. at the macroblock level for H.264 or the Coding Unit level for HEVC) and transfer the YUV data to external memory using a DMA engine. These transfers are unaligned with the processing unit (MB/CU) and are variable-sized because of the loop filtering operation, which poses challenges for the loop filter and the DMA engine in handling them. This paper proposes the concept of a "region" to handle such transfers for H.264 and previous-generation standards. For H.264, the paper proposes dividing the video frame into nine regions with common transfer attributes. For HEVC, the additional complexity of TILES (with and without loop filtering) creates a variable number of regions in the output frame; the paper therefore also proposes a flexible and generic scheme to handle output YUV transfers for TILES in HEVC. The scheme is flexible enough to encompass previous-generation video standards such as H.264 with better performance.
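A minimal sketch of the nine-region idea (my own illustration of one plausible partitioning, not the paper's exact region definition): frame-border blocks have different filtered-output sizes and offsets than interior blocks, so classifying each block as corner, border or interior yields nine classes with common transfer attributes.

/* Classify a macroblock into one of nine regions: 4 corners, 4 borders, interior. */
#include <stdio.h>

typedef enum { LEFT, MID, RIGHT }   hpos;
typedef enum { TOP, CENTER, BOTTOM } vpos;

static int region_id(int mb_x, int mb_y, int mb_w, int mb_h)
{
    hpos h = (mb_x == 0) ? LEFT : (mb_x == mb_w - 1) ? RIGHT : MID;
    vpos v = (mb_y == 0) ? TOP  : (mb_y == mb_h - 1) ? BOTTOM : CENTER;
    return (int)v * 3 + (int)h;            /* 0..8: one of nine regions */
}

int main(void)
{
    int mb_w = 120, mb_h = 68;              /* 1920x1080 in 16x16 macroblocks */
    printf("top-left MB     -> region %d\n", region_id(0, 0, mb_w, mb_h));
    printf("interior MB     -> region %d\n", region_id(10, 10, mb_w, mb_h));
    printf("bottom-right MB -> region %d\n", region_id(mb_w - 1, mb_h - 1, mb_w, mb_h));
    return 0;
}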