Tulika Mitra

National University of Singapore

Publication

Featured research published by Tulika Mitra.

ACM Transactions on Embedded Computing Systems | 2008

The worst-case execution-time problem—overview of methods and survey of tools

Reinhard Wilhelm; Jakob Engblom; Andreas Ermedahl; Niklas Holsti; Stephan Thesing; David B. Whalley; Guillem Bernat; Christian Ferdinand; Reinhold Heckmann; Tulika Mitra; Frank Mueller; Isabelle Puaut; Peter P. Puschner; Jan Staschulat; Per Stenström

The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

Science of Computer Programming | 2007

Chronos: A timing analyzer for embedded software

Xianfeng Li; Yun Liang; Tulika Mitra; Abhik Roychoudhury

Estimating the Worst Case Execution Time (WCET) of real-time embedded software is an important problem. WCET is defined as the upper bound b on the execution time of a program P on a processor X such that for any input the execution time of P on X is guaranteed to not exceed b. Such WCET estimates are crucial for schedulability analysis of real-time systems. In this paper, we present Chronos, a static analysis tool for generating WCET estimates of C programs. It performs detailed micro-architectural modeling to capture the timing effects of the underlying processor platform. Consequently, we can provide a safe but tight WCET estimate of a given C program running on a complex modern processor. Chronos is an open-source distribution specifically suited to the needs of the research community. We support processor models captured by the popular SimpleScalar architectural simulator rather than targeting specific commercial processors. This makes Chronos flexible, extensible and easily accessible to the researcher.
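
To make the definition of the bound b concrete, the following is a minimal sketch of a path-based bound on an acyclic control-flow graph. It is not how Chronos works (the abstract describes detailed micro-architectural modeling); the graph, the block names, and the per-block cycle costs are all hypothetical, and each block is assumed to have a fixed worst-case cost.

```python
from functools import lru_cache

# Hypothetical control-flow graph: basic block -> successor blocks (acyclic).
CFG = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Hypothetical worst-case cost of each basic block, in cycles.
COST = {"entry": 5, "then": 20, "else": 8, "exit": 3}


def wcet_bound(cfg, cost, start):
    """Return the cost of the most expensive path from `start` to a sink."""

    @lru_cache(maxsize=None)
    def longest(block):
        successors = cfg[block]
        tail = max((longest(s) for s in successors), default=0)
        return cost[block] + tail

    return longest(start)


# Every input follows some path through the graph, so no execution can
# exceed the cost of the most expensive path under these assumptions.
bound = wcet_bound(CFG, COST, "entry")
```

With these assumed costs, the path through "then" dominates, so the bound equals the cost of entry, then, and exit combined.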

Real-Time Systems Symposium | 2005

WCET centric data allocation to scratchpad memory

Vivy Suhendra; Tulika Mitra; Abhik Roychoudhury; Ting Chen

Scratchpad memory is a popular choice for on-chip storage in real-time embedded systems. The allocation of code/data to scratchpad memory is performed at compile time, leading to predictable memory access latencies. Current scratchpad memory allocation techniques improve the average-case execution time of tasks. For hard real-time systems, on the other hand, worst-case execution time (WCET) is a key metric. In this paper, we propose scratchpad allocation techniques for data memory that aim to minimize a task's WCET. We first develop an integer linear programming (ILP) based solution which constructs the optimal allocation assuming that all program paths are feasible. Next, we employ branch-and-bound search to more accurately construct the optimal allocation by exploiting infeasible path information. However, the branch-and-bound search is too time-consuming in practice. Therefore, we design fast heuristic searches that achieve near-optimal allocations for all our benchmarks.
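
The idea of choosing an allocation that minimizes the worst path, rather than the average case, can be sketched with a toy example. This is an exhaustive search over a handful of variables, not the ILP, branch-and-bound, or heuristic methods of the paper; the variable sizes, per-path access counts, latencies, and scratchpad capacity are all assumed values, and every path is treated as feasible.

```python
from itertools import combinations

# Hypothetical data variables and their sizes in bytes.
SIZES = {"a": 4, "b": 4, "c": 8}

# Hypothetical access counts of each variable along each program path.
PATHS = {
    "p1": {"a": 10, "b": 0, "c": 2},
    "p2": {"a": 0, "b": 8, "c": 2},
}

SPM_LATENCY = 1   # assumed cycles per scratchpad access
MEM_LATENCY = 10  # assumed cycles per main-memory access
CAPACITY = 8      # assumed scratchpad size in bytes


def path_time(accesses, in_spm):
    """Memory access time of one path under a given allocation."""
    return sum(
        count * (SPM_LATENCY if var in in_spm else MEM_LATENCY)
        for var, count in accesses.items()
    )


def wcet(in_spm):
    """Worst path time, assuming all paths are feasible."""
    return max(path_time(accesses, in_spm) for accesses in PATHS.values())


def best_allocation():
    """Try every subset that fits and keep the one with the lowest WCET."""
    names = sorted(SIZES)
    candidates = (
        frozenset(subset)
        for r in range(len(names) + 1)
        for subset in combinations(names, r)
        if sum(SIZES[v] for v in subset) <= CAPACITY
    )
    # Ties are broken by variable names to keep the result deterministic.
    return min(candidates, key=lambda s: (wcet(s), sorted(s)))
```

In this toy setup, placing only the most frequently accessed variable "a" in the scratchpad shortens p1 but leaves p2 as the new worst path, which is why the search evaluates the maximum over paths for each candidate allocation.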

Compilers, Architecture, and Synthesis for Embedded Systems | 2006

Integrated scratchpad memory optimization and task scheduling for MPSoC architectures

Vivy Suhendra; Chandrashekar Raghavan; Tulika Mitra

Multiprocessor system-on-chip (MPSoC) is an integrated circuit containing multiple instruction-set processors on a single chip that implements most of the functionality of a complex electronic system. An MPSoC architecture is, in general, customized for an embedded application. A critical component of this customization process is the on-chip memory system configuration. Embedded systems increasingly employ software-controlled scratchpad memory (SPM) due to its inherent advantages in terms of area, energy, and timing predictability compared to caches. An application-specific flexible partitioning of the on-chip SPM budget among the processors is critical for performance optimization. Moreover, scheduling the tasks of an application onto the processors and partitioning the SPM are inter-dependent even though these steps are decoupled in the traditional design space exploration process. In this work, we design an integrated task mapping, scheduling, SPM partitioning, and data allocation technique based on an Integer Linear Programming (ILP) formulation. Our ILP formulation explores the optimal performance limit and shows that integrated task scheduling and SPM optimization improves performance by up to 80% for embedded applications.

Design Automation Conference | 2008

Exploring locking & partitioning for predictable shared caches on multi-cores

Vivy Suhendra; Tulika Mitra

Multi-core architectures consisting of multiple processing cores on a chip have become increasingly prevalent. Synthesizing hard real-time applications onto these platforms is quite challenging, as the contention among the cores for various shared resources leads to inherent timing unpredictability. This paper proposes the use of shared cache in a predictable manner through a combination of locking and partitioning mechanisms. We explore possible design choices and evaluate their effects on the worst-case application performance. Our study reveals certain design principles that strongly dictate the performance of a predictable memory hierarchy.

Design Automation Conference | 2013

Hierarchical power management for asymmetric multi-core in dark silicon era

Thannirmalai Somu Muthukaruppan; Mihai Pricopi; Vanchinathan Venkataramani; Tulika Mitra; Sanjay Vishin

Asymmetric multi-core architectures integrating cores with diverse power-performance characteristics are emerging as a promising alternative in the dark silicon era, where only a fraction of the cores on chip can be powered on due to thermal limits. We introduce a hierarchical power management framework for asymmetric multi-cores that builds on control theory and coordinates multiple controllers in a synergistic manner to achieve optimal power-performance efficiency while respecting the thermal design power budget. We integrate our framework within Linux and implement and evaluate it on a real ARM big.LITTLE asymmetric multi-core platform.

International Symposium on Computer Architecture | 1997

Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Sriram Vajapeyam; Tulika Mitra

Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular, register renaming a large number of instructions per cycle is difficult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small.

Design Automation Conference | 2004

Characterizing embedded applications for instruction-set extensible processors

Pan Yu; Tulika Mitra

Extensible processors, which allow customization for an application domain by extending the core instruction set architecture, are becoming increasingly popular for embedded systems. However, existing techniques restrict the set of possible candidates for custom instructions by imposing a variety of constraints. As a result, the true extent of performance improvement achievable by extensible processors for embedded applications remains unknown. Moreover, it is unclear how the interplay among these restrictions impacts the performance potential. Our careful examination of this issue shows that significant speedup can only be obtained by relaxing some of the constraints to a reasonable extent. In particular, to the best of our knowledge, ours is the first work that studies the impact of relaxing the control flow constraint by identifying instructions across basic blocks and indicates 5% to 148% relative speedup for different applications.

Real-Time Systems Symposium | 2009

Timing Analysis of Concurrent Programs Running on Shared Cache Multi-Cores

Yan Li; Vivy Suhendra; Yun Liang; Tulika Mitra; Abhik Roychoudhury

Memory accesses form an important source of timing unpredictability. Timing analysis of real-time embedded software thus requires bounding the time for memory accesses. Multiprocessing, a popular approach for performance enhancement, opens up the opportunity for concurrent execution. However, due to contention for any shared memory by different processing cores, memory access behavior becomes more unpredictable, and hence harder to analyze. In this paper, we develop a timing analysis method for concurrent software running on multi-cores with a shared instruction cache. Communication across tasks is by message passing where the message mailboxes are accessed via interrupt service routines. We do not handle data cache, shared memory synchronization, or code sharing across tasks. Our method progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache. Possible conflicts arising from overlapping task lifetimes are accounted for in the hit-miss classification of accesses to the shared cache, to provide safe execution time bounds. We show that our method produces lower worst-case response time (WCRT) estimates than existing shared-cache analysis on a real-world embedded application.

International Conference on Hardware/Software Codesign and System Synthesis | 2003

Accurate estimation of cache-related preemption delay

Hemendra Singh Negi; Tulika Mitra; Abhik Roychoudhury

Multitasked real-time systems often employ caches to boost performance. However, the unpredictable dynamic behavior of caches makes schedulability analysis of such systems difficult. In particular, the effect of caches needs to be considered for estimating the inter-task interference. As the memory blocks of different tasks can map to the same cache blocks, preemption of a task may introduce additional cache misses. The time penalty introduced by these misses is called the cache-related preemption delay (CRPD). In this paper, we provide a program path analysis technique to estimate CRPD. Our technique performs path analysis of both the preempted and the preempting tasks. Furthermore, we improve the accuracy of the analysis by estimating the possible states of the entire cache at each possible preemption point rather than estimating the states of each cache block independently. To avoid incurring high space requirements, the cache states can be maintained symbolically as a binary decision diagram. Experimental results indicate that we obtain tight CRPD estimates for realistic benchmarks.
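
The mapping conflict behind CRPD can be illustrated with a simplified per-set estimate for a single preemption point. This is a rough sketch and not the path analysis or the symbolic cache-state technique of the paper; the direct-mapped cache, the number of sets, the miss penalty, and the block addresses are all assumptions made for the example.

```python
NUM_SETS = 4       # assumed number of lines in a direct-mapped cache
MISS_PENALTY = 10  # assumed cycles to reload one block


def cache_set(block_addr):
    """Map a memory block address to its cache set (direct-mapped)."""
    return block_addr % NUM_SETS


def crpd_bound(useful_blocks, preempting_blocks):
    """Estimate the reload cost after one preemption.

    useful_blocks: blocks of the preempted task that are cached at the
        preemption point and reused afterwards (hypothetical input).
    preempting_blocks: blocks the preempting task may touch.
    """
    evicted_sets = {cache_set(b) for b in preempting_blocks}
    useful_sets = {cache_set(b) for b in useful_blocks}
    # Each set holding a useful block that the preempting task also
    # maps to costs at most one extra miss in a direct-mapped cache.
    return len(useful_sets & evicted_sets) * MISS_PENALTY


# Example: useful blocks occupy sets 0, 1, 2; the preempting task
# touches sets 1 and 2, so two blocks may need to be reloaded.
delay = crpd_bound([0, 1, 2], [5, 6, 9])
```

Treating each cache set independently like this is the kind of per-block estimate the abstract contrasts with analyzing the state of the entire cache at each preemption point.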
