Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Kolin Paul is active.

Publication


Featured research published by Kolin Paul.


computer and information technology | 2010

Android on Mobile Devices: An Energy Perspective

Kolin Paul; Tapas Kumar Kundu

Mobile and embedded devices need more processing power, yet their energy consumption must remain low to preserve battery life. The Open Handset Alliance (OHA), whose members include Google, Motorola, and HTC, released Android, an open-source platform for mobile devices that is also used on netbooks and embedded platforms. Android runs on top of the Linux kernel with a custom JVM layered above it, uses a new power management framework to save power, and restricts developers to building Java applications. Google strives to make Android as energy efficient as possible to conserve battery power in mobile devices. In this work, we present the benefits of using Android on low-power embedded devices. We compared Android's Java performance with the popular Sun embedded JVM running on top of Angstrom Linux. Our work shows that Android provides a better VM design but consumes more energy because the Dalvik JVM lacks a dynamic compiler. The implication is that Android can become more energy efficient by implementing an optimized dynamic compiler in the Dalvik JVM.
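
The comparison described above ultimately reduces to combining measured runtimes with measured power draw for each virtual machine. The Python fragment below is only a minimal sketch of that style of evaluation; the VM labels and all numbers are invented for illustration and are not measurements from the paper.

```python
# Toy energy comparison: energy (J) = average power (W) * runtime (s).
# All labels and numbers below are illustrative, not the paper's data.

measurements = {
    "dalvik_no_jit":    {"runtime_s": 12.4, "avg_power_w": 1.10},
    "sun_embedded_jvm": {"runtime_s": 9.8,  "avg_power_w": 1.05},
}

for vm, m in measurements.items():
    energy_j = m["avg_power_w"] * m["runtime_s"]
    print(f"{vm}: runtime={m['runtime_s']} s, energy={energy_j:.2f} J")
```

A slower VM can therefore lose on energy even at similar average power, which is the trade-off the abstract attributes to the missing dynamic compiler.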


international conference on vlsi design | 2007

Application Specific Datapath Extension with Distributed I/O Functional Units

Nagaraju Pothineni; Anshul Kumar; Kolin Paul

The performance of an application can be improved by augmenting the processor with application-specific functional units (AFUs). Usually, a cluster of operations identified from the application forms the behavior of an AFU. Several researchers have studied the impact of input and output (I/O) constraints for a legal operation cluster on the overall achievable speedup; the general observation is that the speedup potential grows as the I/O constraints are relaxed. Going further, in this paper the authors investigate the speedup potential of AFUs in the absence of I/O constraints. The design challenge in the absence of I/O constraints is addressed in a practical manner through the identification of maximal convex subgraphs. The available register ports are usually few, but the number of inputs and outputs of the identified patterns is likely to be large. The authors address the register-port limitation by designing distributed I/O functional units, in which the operands are communicated over multiple cycles. The experimental results show that selecting maximal clusters achieves, on average, 50% higher speedup than selecting I/O-constrained operation clusters. Moreover, our identification algorithm runs 2 to 3 orders of magnitude faster than an exhaustive identification approach.
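
The distributed-I/O idea can be made concrete with a back-of-the-envelope model: when a cluster has more inputs and outputs than the register file has ports, the extra operands are streamed over additional cycles, and the speedup is judged against that transfer overhead. The Python sketch below illustrates this accounting; the port counts, latencies, and cluster sizes are invented assumptions, not the paper's experimental setup.

```python
import math

def afu_cycles(n_inputs, n_outputs, afu_latency, in_ports=2, out_ports=1):
    """Cycles to use an AFU when operands are moved through a limited
    number of register-file ports over multiple cycles (distributed I/O)."""
    input_cycles = math.ceil(n_inputs / in_ports)
    output_cycles = math.ceil(n_outputs / out_ports)
    return input_cycles + afu_latency + output_cycles

# Illustrative cluster: 6 inputs, 3 outputs, replaces 14 single-cycle ops,
# and executes in 3 cycles inside the AFU.
software_cycles = 14
hardware_cycles = afu_cycles(n_inputs=6, n_outputs=3, afu_latency=3)
print(f"speedup = {software_cycles / hardware_cycles:.2f}x")
```

Even with the multi-cycle operand transfers, a large enough cluster can still come out ahead, which is why relaxing the I/O constraints pays off.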


field-programmable technology | 2011

Compact generic intermediate representation (CGIR) to enable late binding in coarse grained reconfigurable architectures

Syed Mohammad Asad Hassan Jafri; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

In the era of platforms hosting multiple applications, where inter-application communication and concurrency patterns are arbitrary, static compile-time decision making is neither optimal nor desirable. As part of solving this problem, we present a novel method for compactly representing multiple configuration bitstreams of a single application, with varying degrees of parallelism, as a unique, compact, and customizable representation called CGIR. The stored representation is unraveled at runtime to configure the device with the optimal (e.g., in terms of energy) implementation. Our goal was to provide optimal decision-making capability to the runtime resource manager (RTM) without compromising the runtime behavior or the memory requirements of the system. The presence of multiple binaries enhances optimality by providing the RTM with multiple implementations to choose from, CGIR ensures a minimal increase in memory requirements with the addition of each binary, and the low-cost unraveling of CGIR preserves the runtime behavior. We have chosen the dynamically reconfigurable resource array (DRRA) as a vehicle to study the feasibility of our approach. Simulation results using a 16-point decimation-in-time fast Fourier transform (FFT) showed significant memory savings (up to 18% for 2 versions and 33% for 3 versions) compared to the state of the art. Formal evaluation shows that the savings increase with the number of implementations stored.
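
One way to picture why adding a version costs little extra memory is to store the configuration words shared by all versions once and keep only small per-version deltas, rebuilding a full bitstream at runtime. The Python sketch below is a hedged illustration of that delta idea; it is not the actual CGIR encoding, and the bitstream contents are invented.

```python
def build_compact_store(bitstreams):
    """Keep one full base bitstream and, for every other version, only the
    (position, word) pairs that differ from the base. Assumes equal lengths."""
    names = list(bitstreams)
    base_name, base = names[0], bitstreams[names[0]]
    deltas = {
        name: [(i, w) for i, (b, w) in enumerate(zip(base, bitstreams[name]))
               if b != w]
        for name in names[1:]
    }
    return base_name, base, deltas

def unravel(base, delta):
    """Rebuild a version's bitstream from the base and its delta at runtime."""
    words = list(base)
    for i, w in delta:
        words[i] = w
    return words

# Illustrative 8-word bitstreams for two parallelism variants of one app.
store = {"fft_serial":   [0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8],
         "fft_parallel": [0x1, 0x2, 0x9, 0x4, 0x5, 0xA, 0x7, 0x8]}
base_name, base, deltas = build_compact_store(store)
assert unravel(base, deltas["fft_parallel"]) == store["fft_parallel"]
naive = sum(len(v) for v in store.values())
compact = len(base) + sum(2 * len(d) for d in deltas.values())
print(f"naive = {naive} words, compact ~ {compact} words")
```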


international conference on embedded computer systems architectures modeling and simulation | 2013

Energy-aware task parallelism for efficient dynamic voltage and frequency scaling in CGRAs

Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

Today, coarse-grained reconfigurable architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Each application is itself composed of multiple tasks, spatially mapped to different parts of the platform. Providing a worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, dynamic voltage and frequency scaling (DVFS) is a frequently used technique: DVFS scales the voltage and/or frequency of the device based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining it with dynamic parallelism, exploiting the speedup induced by parallelism to allow more aggressive frequency and voltage scaling. These techniques, however, employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available; they are therefore likely to parallelize tasks even when this offers no speedup to the application, undermining the effectiveness of parallelism. As a solution to this problem, we present energy-aware task parallelism. Our solution relies on resource allocation graphs and an autonomous parallelism, voltage, and frequency selection algorithm. Using the resource allocation graph as a guide, the selection algorithm parallelizes a task only if its parallel version reduces the overall application execution time. Simulation results using representative applications (MPEG4, WLAN) show that our solution promises better resource utilization than the greedy algorithm. Synthesis results (using WLAN) confirm a significant reduction in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%) compared to the state of the art.
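
The selection step can be pictured as a small search over candidate (parallelism, voltage, frequency) operating points: discard points that miss the deadline or whose extra parallelism is not actually faster than a less parallel feasible point, then keep the lowest-energy survivor. The Python sketch below is only an illustration of that idea under a simple E ≈ C·V²·f·t energy model; the candidate table and constants are invented and do not reproduce the paper's algorithm or data.

```python
# Candidate operating points: parallel copies, voltage (V), frequency (MHz),
# and execution cycles. All values are illustrative.
candidates = [
    {"par": 1, "v": 1.1, "f_mhz": 400, "cycles": 4_000_000},
    {"par": 2, "v": 0.9, "f_mhz": 250, "cycles": 2_200_000},
    {"par": 4, "v": 0.9, "f_mhz": 250, "cycles": 2_150_000},  # marginal speedup
]
DEADLINE_MS = 10.0
CAP = 1e-9  # effective switched capacitance (F), illustrative

def exec_time_ms(c):
    return c["cycles"] / (c["f_mhz"] * 1e6) * 1e3

def energy_mj(c):
    # E ~ C * V^2 * f * t, scaled by the number of active copies.
    t_s = exec_time_ms(c) / 1e3
    return c["par"] * CAP * c["v"] ** 2 * (c["f_mhz"] * 1e6) * t_s * 1e3

feasible = [c for c in candidates if exec_time_ms(c) <= DEADLINE_MS]
# Reject a more parallel point if a less parallel feasible point is at least
# as fast (the "don't parallelize blindly" rule).
useful = [c for c in feasible
          if not any(o["par"] < c["par"] and exec_time_ms(o) <= exec_time_ms(c)
                     for o in feasible)]
best = min(useful, key=energy_mj)
print(f"chosen parallelism = {best['par']}, energy ~ {energy_mj(best):.3f} mJ")
```

With these made-up numbers the 4-way point survives the deadline check but loses on energy, which is exactly the situation a purely greedy parallelizer would get wrong.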


ieee computer society annual symposium on vlsi | 2006

Defect-aware design paradigm for reconfigurable architectures

Rahul Jain; Anindita Mukherjee; Kolin Paul

With advances in process technology, feature sizes are decreasing, which leads to higher defect densities, and increasingly sophisticated and costly techniques are required to avoid defects. If nanotechnology-based fabrication is applied, the yield may even go down to zero, as avoiding defects during fabrication will not be a feasible option. Hence, future architectures have to be defect tolerant. Most current defect-tolerance schemes introduce redundancy in the architecture to combat defects; alternatively, defect tolerance can be introduced in the design flow. In this paper we analyze the bottlenecks faced by current design methodologies when addressing defect tolerance. We study the performance of present place-and-route tools on a defective fabric in terms of area and critical-delay penalty, and explore routing-aware placement in this context. We propose a new cost function, CA-RISA, for improving performance in a defect-aware environment.
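
A defect-aware placement objective can be illustrated with a toy cost that combines estimated wirelength with a penalty for placing logic on or next to defective tiles. The Python below is only a hedged sketch of that general idea with invented weights; it is not the CA-RISA cost function proposed in the paper.

```python
def placement_cost(placement, nets, defects, alpha=10.0):
    """Toy defect-aware placement cost on a grid-based fabric.
    placement: block -> (x, y); nets: list of block lists; defects: set of (x, y)."""
    def hpwl(net):  # half-perimeter wirelength of one net
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    wirelength = sum(hpwl(net) for net in nets)
    # Penalize blocks sitting on a defective tile or adjacent to one, since
    # routing around defects inflates delay and channel usage.
    penalty = 0
    for (x, y) in placement.values():
        if (x, y) in defects:
            penalty += 5
        elif any((x + dx, y + dy) in defects
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1)):
            penalty += 1
    return wirelength + alpha * penalty

# Illustrative placement on a small fabric with one defective tile.
placement = {"a": (0, 0), "b": (1, 0), "c": (2, 1)}
nets = [["a", "b"], ["b", "c"]]
print(placement_cost(placement, nets, defects={(1, 1)}))
```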


asia and south pacific design automation conference | 2010

A high-level synthesis flow for custom instruction set extensions for application-specific processors

Nagaraju Pothineni; Philip Brisk; Paolo Ienne; Anshul Kumar; Kolin Paul

Custom instruction set extensions (ISEs) are added to an extensible base processor to provide application-specific functionality at a low cost. As only one ISE executes at a time, resources can be shared. This paper presents a new high-level synthesis flow targeting ISEs. We emphasize a new technique for resource allocation, binding, and port assignment during synthesis. Our method is derived from prior work on datapath merging, and increases area reduction by accounting for the cost of multiplexors that must be inserted into the resulting datapath to achieve multi-operational functionality.
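
The core trade-off in the resource-sharing step can be shown with a tiny area model: mapping a second operation onto an existing functional unit saves the area of a duplicate unit but pays for the multiplexers inserted on its inputs. The Python sketch below is a hedged illustration of that bookkeeping with invented area numbers; it is not the paper's merging algorithm.

```python
# Illustrative area costs (arbitrary units), not real library data.
AREA = {"mul32": 900, "add32": 120, "mux32_2to1": 35}

def merge_gain(unit, n_inputs):
    """Area saved by mapping one more operation onto an existing `unit`:
    we avoid a second copy of the unit, but each input now needs a 2:1 mux."""
    mux_cost = n_inputs * AREA["mux32_2to1"]
    return AREA[unit] - mux_cost

# Sharing a 32-bit multiplier between two ISEs is clearly worthwhile...
print("share mul32:", merge_gain("mul32", n_inputs=2))
# ...while for a 32-bit adder the inserted muxes eat most of the saving.
print("share add32:", merge_gain("add32", n_inputs=2))
```

Accounting for the mux cost is what keeps the merging step from sharing cheap operators where sharing does not actually reduce area.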


international symposium on quality electronic design | 2013

Energy-aware coarse-grained reconfigurable architectures using dynamically reconfigurable isolation cells

Syed Mohammad Asad Hassan Jafri; Ozan Bag; Ahmed Hemani; Nasim Farahini; Kolin Paul; Juha Plosila; Hannu Tenhunen

This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). Today, platforms host multiple applications with arbitrary inter-application communication and concurrency patterns. Each application can itself have multiple versions (implementations with different degrees of parallelism), and the optimal version can only be determined at runtime. For such scenarios, traditional worst-case designs and compile-time mapping decisions are neither optimal nor desirable. Existing solutions to this problem employ costly dedicated hardware to configure the operating point at runtime (using DVFS). As an alternative to dedicated hardware, we propose exploiting the reconfiguration features of modern CGRAs. Our solution relies on dynamically reconfigurable isolation cells (DRICs) and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS). The DRICs reduce the overheads of DVFS circuitry by configuring existing resources as isolation cells. APVFS ensures high efficiency by dynamically selecting the parallelism, voltage, and frequency trio that consumes the minimum power while meeting the deadlines on the available resources. Simulation results using representative applications (matrix multiplication, FIR, and FFT) showed up to 23% and 51% reductions in power and energy, respectively, compared to traditional DVFS designs. Synthesis results confirm a significant reduction in area overheads compared to state-of-the-art DVFS methods.
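
The power argument behind combining parallelism with voltage and frequency scaling follows the usual dynamic-power relation P ≈ α·C·V²·f: two parallel copies at half the frequency and a lower voltage can finish the same work while dissipating less power. The Python sketch below shows only the shape of this trade-off with invented capacitance and voltage values; it is not the APVFS algorithm itself.

```python
ALPHA_C = 5e-10  # activity factor x switched capacitance (F), illustrative

def dyn_power_mw(v, f_hz, copies=1):
    """Dynamic power of `copies` identical processing elements, P ~ a*C*V^2*f."""
    return copies * ALPHA_C * v ** 2 * f_hz * 1e3

# One copy at full speed vs. two copies at half frequency and reduced voltage.
single = dyn_power_mw(v=1.1, f_hz=500e6, copies=1)
dual = dyn_power_mw(v=0.85, f_hz=250e6, copies=2)
print(f"single: {single:.1f} mW, dual: {dual:.1f} mW "
      f"({(1 - dual / single) * 100:.0f}% lower)")
```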


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Compression Based Efficient and Agile Configuration Mechanism for Coarse Grained Reconfigurable Architectures

Syed Mohammad Asad Hassan Jafri; Ahmed Hemani; Kolin Paul; Juha Plosila; Hannu Tenhunen

This paper considers the possibility of speeding up configuration by reducing the size of the configware in coarse-grained reconfigurable architectures (CGRAs). Our goal was to reduce the number of configuration cycles and increase the configuration bandwidth. The proposed technique relies on multicasting and bitstream compression: multicasting reduces cycles by configuring components that perform identical functions simultaneously, in a single cycle, while bitstream compression increases the configuration bandwidth. We have chosen the dynamically reconfigurable resource array (DRRA) architecture as a vehicle to study the efficiency of this approach. In our proposed method, the configuration bitstream is compressed offline and stored in memory; when reconfiguration is required, the compressed bitstream is decompressed by an online decompressor and sent to the DRRA. Simulation results using practical applications showed up to 78% and 22% decreases in configuration cycles for completely parallel and completely serial implementations, respectively. Synthesis results confirm negligible overhead in terms of area (1.2%) and timing.
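
The two levers described above can be sketched in a few lines: group components whose configuration words are identical so that one multicast write serves the whole group, and compress the stored word stream. The Python below is a hedged illustration only; run-length encoding is used as a stand-in compressor and the component names are invented, so this is not necessarily the scheme used for the DRRA.

```python
from collections import defaultdict

def multicast_groups(component_configs):
    """Group component ids by identical configuration word so each group
    can be configured with a single (multicast) write."""
    groups = defaultdict(list)
    for comp_id, word in component_configs.items():
        groups[word].append(comp_id)
    return groups

def rle_compress(words):
    """Simple run-length encoding of the configuration word stream."""
    out = []
    for w in words:
        if out and out[-1][0] == w:
            out[-1][1] += 1
        else:
            out.append([w, 1])
    return out

configs = {"dpu0": 0xA5, "dpu1": 0xA5, "dpu2": 0x3C, "dpu3": 0xA5}
groups = multicast_groups(configs)
print(len(configs), "unicast writes ->", len(groups), "multicast writes")
print(rle_compress([0xA5, 0xA5, 0xA5, 0x3C]))
```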


international conference on vlsi design | 2010

Clocking-Based Coplanar Wire Crossing Scheme for QCA

Rajeswari Devadoss; Kolin Paul; M. Balakrishnan

Quantum-dot cellular automata (QCA) is one of the promising next-generation fabrics for circuits. Coplanar wire crossing is one of the more elegant features of this new low-power computing paradigm; however, it requires two types of cells and is known to be neither easy to fabricate nor very robust. In this work, we propose a coplanar wire crossing that uses a single type of QCA cell by applying the concept of time division multiplexing to the design of the crossing. This has significant implications for the fabrication and fault tolerance of QCA circuits.
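
The time-division-multiplexing idea can be pictured with a toy model in which the two crossing wires take turns owning the shared crossing point on alternate clock phases. The Python sketch below only illustrates this scheduling concept; it does not model QCA cell physics or the specific clocking scheme proposed in the paper.

```python
def tdm_crossing(signal_a, signal_b):
    """Interleave two bit streams onto one shared crossing point:
    phase 0 carries wire A, phase 1 carries wire B."""
    shared = []
    for a_bit, b_bit in zip(signal_a, signal_b):
        shared.append(("phase0", a_bit))  # A owns the crossing cell
        shared.append(("phase1", b_bit))  # B owns the crossing cell
    return shared

def demux(shared):
    """Recover the two original streams from the shared crossing."""
    a = [bit for phase, bit in shared if phase == "phase0"]
    b = [bit for phase, bit in shared if phase == "phase1"]
    return a, b

a_in, b_in = [1, 0, 1, 1], [0, 0, 1, 0]
assert demux(tdm_crossing(a_in, b_in)) == (a_in, b_in)
print("both wires recovered intact after sharing the crossing")
```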


international parallel and distributed processing symposium | 2012

Performance Estimation of GPUs with Cache

Arun Kumar Parakh; M. Balakrishnan; Kolin Paul

Performance estimation of an application on any processor is becoming an essential task, especially when the processor is used for high-performance computing. Our work presents a model to estimate the performance of various applications on a modern GPU. GPUs have recently become popular in high-performance computing alongside their original application domain of graphics. We have chosen NVIDIA's Fermi architecture as an example of a modern GPU. Our work is divided into two basic parts: we first estimate the computation time and then the memory access time. Instructions in the kernel contribute significantly to the computation time, so we have developed a model to count the number of instructions in the kernel; we have found our instruction-count methodology to give an exact count. Memory access time is calculated in three steps: address trace generation, cache simulation, and computation of the average memory latency per warp. Finally, the computation time is combined with the memory access time to predict the total execution time. The model has been tested with micro-benchmarks as well as real-life kernels such as Blowfish encryption, matrix multiplication, and image smoothing. We found that our average estimation errors for these applications range from -7.76% to 55%.
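
The estimation flow splits total kernel time into a compute part driven by the dynamic instruction count and a memory part driven by simulated cache behavior. The sketch below shows that combination with a deliberately simplified analytical model; the issue rate, latencies, hit rate, and kernel sizes are illustrative assumptions, not Fermi parameters or results from the paper.

```python
def estimate_kernel_time_us(n_warps, insts_per_warp, mem_accesses_per_warp,
                            hit_rate, issue_rate_ipc=1.0, clock_ghz=1.15,
                            hit_lat_cycles=30, miss_lat_cycles=400):
    """Very simplified GPU time model: compute time from the instruction
    count, memory time from an average per-warp latency weighted by the
    cache hit rate, summed with no overlap assumed."""
    compute_cycles = n_warps * insts_per_warp / issue_rate_ipc
    avg_mem_lat = hit_rate * hit_lat_cycles + (1 - hit_rate) * miss_lat_cycles
    memory_cycles = n_warps * mem_accesses_per_warp * avg_mem_lat
    total_cycles = compute_cycles + memory_cycles
    return total_cycles / (clock_ghz * 1e3)  # cycles per microsecond

# Illustrative kernel: 512 warps, 2000 instructions and 64 loads per warp,
# with an assumed 80% cache hit rate.
print(f"{estimate_kernel_time_us(512, 2000, 64, hit_rate=0.8):.0f} us")
```

In the paper's flow the hit rate comes from an actual cache simulation of the generated address trace rather than an assumed constant, but the final combination step has this overall shape.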

Collaboration


Dive into Kolin Paul's collaboration.

Top Co-Authors

Ahmed Hemani, Royal Institute of Technology
M. Balakrishnan, Indian Institute of Technology Delhi
Hannu Tenhunen, Royal Institute of Technology
Juha Plosila, Information Technology University
Sanjiva Prasad, Indian Institute of Technology Delhi
Pei Liu, Royal Institute of Technology
Indra Narayan Kar, Indian Institute of Technology Delhi
Niladri Sekhar Tripathy, Indian Institute of Technology Delhi