Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kunal Agrawal is active.

Publication


Featured research published by Kunal Agrawal.


Real-time Systems | 2013

Multi-core real-time scheduling for generalized parallel task models

Abusayeed Saifullah; Jing Li; Kunal Agrawal; Chenyang Lu; Christopher D. Gill

Multi-core processors offer a significant performance increase over single-core processors. They have the potential to enable computation-intensive real-time applications with stringent timing constraints that cannot be met on traditional single-core processors. However, most results in traditional multiprocessor real-time scheduling are limited to sequential programming models and ignore intra-task parallelism. In this paper, we address the problem of scheduling periodic parallel tasks with implicit deadlines on multi-core processors. We first consider a synchronous task model where each task consists of segments, each segment having an arbitrary number of parallel threads that synchronize at the end of the segment. We propose a new task decomposition method that decomposes each parallel task into a set of sequential tasks. We prove that our task decomposition achieves a resource augmentation bound of 4 and 5 when the decomposed tasks are scheduled using global EDF and partitioned deadline monotonic scheduling, respectively. Finally, we extend our analysis to a directed acyclic graph (DAG) task model where each node in the DAG has a unit execution requirement. We show how these tasks can be converted into synchronous tasks such that the same decomposition can be applied and the same augmentation bounds hold. Simulations based on synthetic workload demonstrate that the derived resource augmentation bounds are safe and sufficient.
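The synchronous model and the decomposition idea can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's actual decomposition: segment deadlines here are split in proportion to segment length, whereas the paper distributes slack more carefully to obtain its augmentation bounds.

```python
def work_and_span(segments):
    # A synchronous task is a list of segments; each segment is
    # (num_parallel_threads, per_thread_execution_time).
    work = sum(n * e for n, e in segments)   # total computation
    span = sum(e for _, e in segments)       # critical-path length
    return work, span

def proportional_segment_deadlines(segments, deadline):
    # Simplified stand-in for the paper's decomposition: each segment
    # receives a share of the task deadline proportional to its length,
    # turning each thread into an ordinary sequential subtask.
    _, span = work_and_span(segments)
    return [deadline * e / span for _, e in segments]
```

Each thread of a segment then becomes a sequential task with the segment's release offset and deadline, so classical multiprocessor tests (global EDF, partitioned DM) apply.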


Real-Time Systems Symposium | 2011

Multi-core Real-Time Scheduling for Generalized Parallel Task Models

Abusayeed Saifullah; Kunal Agrawal; Chenyang Lu; Christopher D. Gill

Multi-core processors offer a significant performance increase over single-core processors. Therefore, they have the potential to enable computation-intensive real-time applications with stringent timing constraints that cannot be met on traditional single-core processors. However, most results in traditional multiprocessor real-time scheduling are limited to sequential programming models and ignore intra-task parallelism. In this paper, we address the problem of scheduling periodic parallel tasks with implicit deadlines on multi-core processors. We first consider a synchronous task model where each task consists of segments, each segment having an arbitrary number of parallel threads that synchronize at the end of the segment. We propose a new task decomposition method that decomposes each parallel task into a set of sequential tasks. We prove that our task decomposition achieves a resource augmentation bound of 2.62 and 3.42 when the decomposed tasks are scheduled using global EDF and partitioned deadline monotonic scheduling, respectively. Finally, we extend our analysis to directed acyclic graph (DAG) tasks. We show how these tasks can be converted into synchronous tasks such that the same transformation can be applied and the same augmentation bounds hold.


International World Wide Web Conference | 2011

Parallel boosted regression trees for web search ranking

Stephen Tyree; Kilian Q. Weinberger; Kunal Agrawal; Jennifer Paykin

Gradient Boosted Regression Trees (GBRT) are the current state-of-the-art learning paradigm for machine learned web-search ranking - a domain notorious for very large data sets. In this paper, we propose a novel method for parallelizing the training of GBRT. Our technique parallelizes the construction of the individual regression trees and operates using the master-worker paradigm as follows. The data are partitioned among the workers. At each iteration, the worker summarizes its data-partition using histograms. The master processor uses these to build one layer of a regression tree, and then sends this layer to the workers, allowing the workers to build histograms for the next layer. Our algorithm carefully orchestrates overlap between communication and computation to achieve good performance. Since this approach is based on data partitioning, and requires a small amount of communication, it generalizes to distributed and shared memory machines, as well as clouds. We present experimental results on both shared memory machines and clusters for two large scale web search ranking data sets. We demonstrate that the loss in accuracy induced due to the histogram approximation in the regression tree creation can be compensated for through slightly deeper trees. As a result, we see no significant loss in accuracy on the Yahoo data sets and a very small reduction in accuracy for the Microsoft LETOR data. In addition, on shared memory machines, we obtain almost perfect linear speed-up with up to about 48 cores on the large data sets. On distributed memory machines, we get a speedup of 25 with 32 processors. Due to data partitioning our approach can scale to even larger data sets, on which one can reasonably expect even higher speedups.
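The worker/master exchange can be sketched as follows. This is a hedged simplification in Python: the function names and the variance-reduction split criterion are illustrative, not the paper's exact GBRT objective, but they show why per-bin counts and target sums are sufficient for the master to choose a split.

```python
def build_histogram(values, targets, edges):
    # Worker side: summarize a data partition as per-bin counts and
    # per-bin target sums; only these summaries go to the master.
    counts = [0] * (len(edges) + 1)
    sums = [0.0] * (len(edges) + 1)
    for v, t in zip(values, targets):
        b = sum(v >= e for e in edges)   # bin index of value v
        counts[b] += 1
        sums[b] += t
    return counts, sums

def merge(h1, h2):
    # Master side: histograms from different workers simply add up.
    return ([a + b for a, b in zip(h1[0], h2[0])],
            [a + b for a, b in zip(h1[1], h2[1])])

def best_split(counts, sums):
    # Master side: choose the bin boundary that maximizes variance
    # reduction, computable from the merged histogram alone.
    total_n, total_s = sum(counts), sum(sums)
    best_bin, best_gain = None, 0.0
    left_n = left_s = 0.0
    for b in range(len(counts) - 1):
        left_n += counts[b]
        left_s += sums[b]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        gain = (left_s ** 2 / left_n + right_s ** 2 / right_n
                - total_s ** 2 / total_n)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin
```

Because each worker ships only a fixed-size histogram per feature, communication is small and independent of the partition size, which is what lets the scheme generalize from shared memory to clusters.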


Euromicro Conference on Real-Time Systems | 2014

Analysis of Federated and Global Scheduling for Parallel Real-Time Tasks

Jing Li; Jian-Jia Chen; Kunal Agrawal; Chenyang Lu; Christopher D. Gill; Abusayeed Saifullah

This paper considers the scheduling of parallel real-time tasks with implicit deadlines. Each parallel task is characterized as a general directed acyclic graph (DAG). We analyze three different real-time scheduling strategies: two well known algorithms, namely global earliest-deadline-first and global rate-monotonic, and one new algorithm, namely federated scheduling. The federated scheduling algorithm proposed in this paper is a generalization of partitioned scheduling to parallel tasks. In this strategy, each high-utilization task (utilization ≥ 1) is assigned a set of dedicated cores and the remaining low-utilization tasks share the remaining cores. We prove capacity augmentation bounds for all three schedulers. In particular, we show that if on unit-speed cores, a task set has total utilization of at most m and the critical-path length of each task is smaller than its deadline, then federated scheduling can schedule that task set on m cores of speed 2, G-EDF can schedule it with speed (3 + √5)/2 ≈ 2.618, and G-RM can schedule it with speed 2 + √3 ≈ 3.732. We also provide lower bounds on the speedup and show that the bounds are tight for federated scheduling and G-EDF when m is sufficiently large.
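The core-assignment step for a high-utilization task can be sketched as below, assuming the rule is to dedicate just enough cores that a greedy scheduler finishes the task's DAG by its deadline. Variable names are illustrative: work is the task's total execution time, span its critical-path length, deadline its implicit deadline.

```python
from math import ceil

def federated_cores(work, span, deadline):
    # Cores dedicated to one high-utilization task (utilization >= 1):
    # by the classic greedy-scheduling bound, n cores finish the DAG
    # within span + (work - span)/n time, so we pick the smallest n
    # that brings this completion time under the deadline.
    assert span < deadline, "critical path must fit within the deadline"
    return ceil((work - span) / (deadline - span))
```

Low-utilization tasks are then treated as sequential and share whatever cores remain.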


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Nested parallelism in transactional memory

Kunal Agrawal; Jeremy T. Fineman; Jim Sukha

This paper investigates adding transactions with nested parallelism and nested transactions to a dynamically multithreaded parallel programming language that generates only series-parallel programs. We describe XConflict, a data structure that facilitates conflict detection for a software transactional memory system which supports transactions with nested parallelism and unbounded nesting depth. For languages that use a Cilk-like work-stealing scheduler, XConflict answers concurrent conflict queries in O(1) time and can be maintained efficiently. In particular, for a program with T1 work and a span (or critical-path length) of T∞, the running time on p processors of the program augmented with XConflict is only O(T1/p + pT∞). Using XConflict, we describe CWSTM, a runtime-system design for software transactional memory which supports transactions with nested parallelism and unbounded nesting depth of transactions. The CWSTM design provides transactional memory with eager updates, eager conflict detection, strong atomicity, and lazy cleanup on aborts. In the restricted case when no transactions abort and there are no concurrent readers, CWSTM executes a transactional computation on p processors also in time O(T1/p + pT∞). Although this bound holds only under rather optimistic assumptions, to our knowledge, this result is the first theoretical performance bound on a TM system that supports transactions with nested parallelism which is independent of the maximum nesting depth of transactions.


IEEE Transactions on Parallel and Distributed Systems | 2014

Parallel Real-Time Scheduling of DAGs

Abusayeed Saifullah; David Ferry; Jing Li; Kunal Agrawal; Chenyang Lu; Christopher D. Gill

Recently, multi-core processors have become mainstream in processor design. To take full advantage of multi-core processing, computation-intensive real-time systems must exploit intra-task parallelism. In this paper, we address the problem of real-time scheduling for a general model of deterministic parallel tasks, where each task is represented as a directed acyclic graph (DAG) with nodes having arbitrary execution requirements. We prove processor-speed augmentation bounds for both preemptive and non-preemptive real-time scheduling for general DAG tasks on multi-core processors. We first decompose each DAG into sequential tasks with their own release times and deadlines. Then we prove that these decomposed tasks can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models, and is the first for a general DAG model. We also prove that the decomposition has a resource augmentation bound of 4 plus a constant non-preemption overhead for non-preemptive global EDF scheduling. To our knowledge, this is the first resource augmentation bound for non-preemptive scheduling of parallel tasks. Finally, we evaluate our analytical results through simulations that demonstrate that the derived resource augmentation bounds are safe in practice.


International Parallel and Distributed Processing Symposium | 2010

Executing task graphs using work-stealing

Kunal Agrawal; Charles E. Leiserson; Jim Sukha

NABBIT is a work-stealing library for execution of task graphs with arbitrary dependencies which is implemented as a library for the multithreaded programming language Cilk++. We prove that Nabbit executes static task graphs in parallel in time which is asymptotically optimal for graphs whose nodes have constant in-degree and out-degree. To evaluate the performance of Nabbit, we implemented a dynamic program representing the Smith-Waterman algorithm, an irregular dynamic program on a two-dimensional grid. Our experiments indicate that when task-graph nodes are mapped to reasonably sized blocks, Nabbit exhibits low overhead and scales as well as or better than other scheduling strategies. The Nabbit implementation that solves the dynamic program using a task graph even manages in some cases to outperform a divide-and-conquer implementation for directly solving the same dynamic program. Finally, we extend both the Nabbit implementation and the completion-time bounds to handle dynamic task graphs, that is, graphs whose nodes and edges are created on the fly at runtime.
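The join-counter mechanism underlying this kind of task-graph execution can be sketched sequentially. This is an illustrative simplification: Nabbit decrements these counters concurrently under a Cilk++ work-stealing scheduler, whereas here a plain queue stands in for the scheduler.

```python
from collections import deque

def execute_task_graph(nodes, edges, run):
    # Each node tracks its count of unmet dependencies; when an edge
    # source finishes, it decrements its successors' counters, and a
    # node whose counter reaches zero becomes ready to execute.
    succ = {n: [] for n in nodes}
    pending = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        pending[v] += 1
    ready = deque(n for n in nodes if pending[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        run(n)                     # execute the node's computation
        order.append(n)
        for v in succ[n]:
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)
    return order
```

Dynamic task graphs extend this by allowing `run` to create new nodes and edges on the fly, provided an edge is added before its source completes.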


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Adaptive scheduling with parallelism feedback

Kunal Agrawal; Yuxiong He; Wen Jing Hsu; Charles E. Leiserson

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on the allotted processors. In this context, the number of processors allotted to a particular job may vary during the job's execution, and the thread scheduler must adapt to these changes in processor resources. For overall system efficiency, the thread scheduler should also provide parallelism feedback to the job scheduler to avoid allotting a job more processors than it can use productively. This paper provides an overview of several adaptive thread schedulers we have developed that provide provably good history-based feedback about the job's parallelism without knowing the future of the job. These thread schedulers complete the job in near-optimal time while guaranteeing low waste. We have analyzed these thread schedulers under stringent adversarial conditions, showing that the thread schedulers are robust to various system environments and allocation policies. To analyze the thread schedulers under this adversarial model, we have developed a new technique, called trim analysis, which can be used to show that the thread scheduler provides good behavior on the vast majority of time steps, and performs poorly on only a few. When our thread schedulers are used with dynamic equipartitioning and other related job scheduling algorithms, they are O(1)-competitive against an optimal offline scheduling algorithm with respect to both mean response time and makespan for batched jobs and nonbatched jobs, respectively. Our algorithms are the first nonclairvoyant scheduling algorithms to offer such guarantees.
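A minimal sketch of history-based parallelism feedback, in the spirit of the schedulers described here: after each scheduling quantum, the thread scheduler inspects how much of its allotment the job actually used and adjusts the "desire" it reports to the job scheduler. The multiplicative rule and the parameters delta and rho are illustrative simplifications, not the exact algorithms.

```python
def next_desire(desire, allotment, used, delta=0.8, rho=2.0):
    # History-based feedback (illustrative): a quantum in which the job
    # used at least a delta fraction of its allotted processor time is
    # "efficient", so the reported desire grows by a factor rho; an
    # inefficient quantum shrinks it, limiting wasted processors.
    if used >= delta * allotment:
        return desire * rho   # efficient: ask for more processors
    return desire / rho       # inefficient: ask for fewer
```

Because the rule reacts only to observed usage, it needs no knowledge of the job's future parallelism, which is the nonclairvoyance property the paper's analysis exploits.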


Real-Time Technology and Applications Symposium | 2013

A real-time scheduling service for parallel tasks

David Ferry; Jing Li; Mahesh Mahadevan; Kunal Agrawal; Christopher D. Gill; Chenyang Lu

The multi-core revolution presents both opportunities and challenges for real-time systems. Parallel computing can yield significant speedup for individual tasks (enabling shorter deadlines, or more computation within the same deadline), but unless managed carefully may add complexity and overhead that could potentially wreck real-time performance. There is little experience to date with the design and implementation of real-time systems that allow parallel tasks, yet the state of the art cannot progress without the construction of such systems. In this work we describe the design and implementation of a scheduler and runtime dispatcher for a new concurrency platform, RT-OpenMP, whose goal is the execution of real-time workloads with intra-task parallelism.


Euromicro Conference on Real-Time Systems | 2013

Analysis of Global EDF for Parallel Tasks (Outstanding Paper Award)

Jing Li; Kunal Agrawal; Chenyang Lu; Christopher D. Gill

As multicore processors become ever more prevalent, it is important for real-time programs to take advantage of intra-task parallelism in order to support computation-intensive applications with tight deadlines. We prove that a Global Earliest Deadline First (GEDF) scheduling policy provides a capacity augmentation bound of 4-2/m and a resource augmentation bound of 2-1/m for parallel tasks in the general directed acyclic graph (DAG) model. For the proposed capacity augmentation bound of 4-2/m for implicit deadline tasks under GEDF, we prove that if a task set has a total utilization of at most m/(4-2/m) and each task's critical-path length is no more than 1/(4-2/m) of its deadline, it can be scheduled on a machine with m processors under GEDF. Our capacity augmentation bound therefore can be used as a straightforward schedulability test. For the standard resource augmentation bound of 2-1/m for arbitrary deadline tasks under GEDF, we prove that if an ideal optimal scheduler can schedule a task set on m unit-speed processors, then GEDF can schedule the same task set on m processors of speed 2-1/m. However, this bound does not lead to a schedulability test since the ideal optimal scheduler is only hypothetical and is not known. Simulations confirm that GEDF is not only safe under the capacity augmentation bound for various randomly generated task sets, but also performs surprisingly well and usually outperforms an existing scheduling technique that involves task decomposition.
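The capacity augmentation bound doubles as a schedulability test simple enough to state directly in code. This sketch assumes implicit deadlines (period equals deadline) and represents each task by its total work, critical-path length, and deadline.

```python
def gedf_schedulable(tasks, m):
    # Capacity-augmentation test from the abstract: with b = 4 - 2/m,
    # a task set passes if total utilization is at most m/b and each
    # task's critical-path length is at most deadline/b.
    # Each task is a (work, critical_path, deadline) tuple.
    b = 4 - 2 / m
    total_util = sum(w / d for w, _, d in tasks)
    return total_util <= m / b and all(cp <= d / b for _, cp, d in tasks)
```

A task set that fails this test may still be schedulable; the test is sufficient, not necessary.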

Collaboration


Dive into Kunal Agrawal's collaborations.

Top Co-Authors

Jing Li (Washington University in St. Louis)
Christopher D. Gill (Washington University in St. Louis)
Chenyang Lu (Washington University in St. Louis)
Roger D. Chamberlain (Washington University in St. Louis)
Charles E. Leiserson (Massachusetts Institute of Technology)
Jim Sukha (Massachusetts Institute of Technology)
I-Ting Angelina Lee (Washington University in St. Louis)
David Ferry (Washington University in St. Louis)