Anamitra R. Choudhury

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anamitra R. Choudhury is active.

Explore More

Publication

Featured researches published by Anamitra R. Choudhury.

ieee international conference on high performance computing data and analytics | 2012

Breaking the speed and scalability barriers for graph exploration on distributed-memory machines

Fabio Checconi; Fabrizio Petrini; Jeremiah Willcock; Andrew Lumsdaine; Anamitra R. Choudhury; Yogish Sabharwal

In this paper, we describe the challenges involved in designing a family of highly-efficient Breadth-First Search (BFS) algorithms and in optimizing these algorithms on the latest two generations of Blue Gene machines, Blue Gene/P and Blue Gene/Q. With our recent winning Graph 500 submissions in November 2010, June 2011, and November 2011, we have achieved unprecedented scalability results in both space and size. On Blue Gene/P, we have been able to parallelize a scale 38 problem with 238 vertices and 242 edges on 131,072 processing cores. Using only four racks of an experimental configuration of Blue Gene/Q, we have achieved a processing rate of 254 billion edges per second on 65,536 processing cores. This paper describes the algorithmic design and the main classes of optimizations that we have used to achieve these results.

international parallel and distributed processing symposium | 2011

Multifrontal Factorization of Sparse SPD Matrices on GPUs

Thomas George; Vaibhav Saxena; Anshul Gupta; Amik Singh; Anamitra R. Choudhury

Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.

international parallel and distributed processing symposium | 2008

Optimizations in financial engineering: The Least-Squares Monte Carlo method of Longstaff and Schwartz

Anamitra R. Choudhury; Alan King; Sunil Kumar; Yogish Sabharwal

In this paper we identify important opportunities for parallelization in the least-squares Monte Carlo (LSM) algorithm, due to Longstaff and Schwartz, for the pricing of American options. The LSM method can be divided into three phases: path-simulation, calibration and valuation. We describe how each of these phases can be parallelized, with more focus on the calibration phase, which is inherently more difficult to parallelize. We implemented these parallelization techniques on Blue Gene using the Quantlib open source financial engineering package. We achieved up to factor of 9 speed-up for the calibration phase and 18 for the complete LSM method on a 32 processor BG/P system using monomial basis functions.

international parallel and distributed processing symposium | 2011

Improved Algorithms for the Distributed Trigger Counting Problem

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Yogish Sabharwal

Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The first algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.

european symposium on algorithms | 2014

Improved Algorithms for Resource Allocation under Varying Capacity

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Shalmoli Gupta; Sambuddha Roy; Yogish Sabharwal

We consider the problem of scheduling a set of jobs on a system that offers certain resource, wherein the amount of resource offered varies over time. For each job, the input specifies a set of possible scheduling instances, where each instance is given by starting time, ending time, profit and resource requirement. A feasible solution selects a subset of job instances such that at any timeslot, the total requirement by the chosen instances does not exceed the resource available at that timeslot, and at most one instance is chosen for each job. The above problem falls under the well-studied framework of unsplittable flow problem (UFP) on line. The generalized notion of scheduling possibilities captures the standard setting concerned with release times and deadlines. We present improved algorithms based on the primal-dual paradigm, where the improvements are in terms of approximation ratio, running time and simplicity.

international conference of distributed computing and networking | 2011

An efficient decentralized algorithm for the distributed trigger counting problem

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Vijay K. Garg; Yogish Sabharwal

Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. The main result of the paper is a decentralized and randomized algorithm with expected message complexity O(n log n log w). Moreover, every processor in this algorithm receives no more than O(log n log w) messages with high probability. All the earlier algorithms for this problem have maximum message load of Ω(n log w).

foundations of software technology and theoretical computer science | 2010

A Near-linear Time Constant Factor Algorithm for Unsplittable Flow Problem on Line with Bag Constraints

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Yogish Sabharwal

Consider a scenario where we need to schedule a set of jobs on a system offering some resource (such as electrical power or communication bandwidth), which we shall refer to as bandwidth. Each job consists of a set (or bag) of job instances. For each job instance, the input specifies the start time, finish time, bandwidth requirement and profit. The bandwidth offered by the system varies at different points of time and is specified as part of the input. A feasible solution is to choose a subset of instances such that at any point of time, the sum of bandwidth requirements of the chosen instances does not exceed the bandwidth available at that point of time, and furthermore, at most one instance is picked from each job. The goal is to find a maximum profit feasible solution. We study this problem under a natural assumption called the no-bottleneck assumption (NBA), wherein the bandwidth requirement of any job instance is at most the minimum bandwidth available. We present a simple, near-linear time constant factor approximation algorithm for this problem, under NBA. When each job consists of only one job instance, the above problem is the same as the well-studied unsplittable flow problem (UFP) on lines. A constant factor approximation algorithm is known for the UFP on line, under NBA. Our result leads to an alternative constant factor approximation algorithm for this problem. Though the approximation ratio achieved by our algorithm is inferior, it is much simpler, deterministic and faster in comparison to the existing algorithms. Our algorithm runs in near-linear time (

international symposium on distributed computing | 2010

Brief announcement: a decentralized algorithm for distributed trigger counting

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Vijay K. Garg; Yogish Sabharwal

O(n*log^2 n)

international parallel and distributed processing symposium | 2013

Distributed Algorithms for Scheduling on Line and Tree Networks with Non-uniform Bandwidths

Venkatesan T. Chakaravarthy; Anamitra R. Choudhury; Sambuddha Roy; Yogish Sabharwal

), whereas the running time of the known algorithms is a high order polynomial. The core idea behind our algorithm is a reduction from the varying bandwidth case to the easier uniform bandwidth case, using a technique that we call slicing.

international conference on parallel processing | 2013