Suresh Purini
International Institute of Information Technology, Hyderabad
Publications
Featured research published by Suresh Purini.
High Performance Embedded Architectures and Compilers | 2013
Suresh Purini; Lakshya Jain
The compiler optimizations we enable, and the order in which we apply them to a program, have a substantial impact on the program's execution time. Compilers provide default optimization sequences that can give good program speedups. Because the default sequences have to optimize programs with different characteristics, they embed multiple subsequences, each of which optimizes a different class of programs. These subsequences may interact adversely with one another and limit the achievable speedup. Instead of searching for a single universally optimal sequence, we can construct a small set of good sequences such that, for every program class, the set contains a near-optimal optimization sequence. If such a good-sequences set covers all the program classes in the program space, we can choose the best sequence for a program simply by trying every sequence in the set. This approach completely circumvents the need to solve the program classification problem. Using a sequence set of around 10 sequences, we obtained average speedups of up to 14% on PolyBench programs and up to 12% on MiBench programs. Our approach is quite different from both the iterative compilation and the machine-learning-based prediction modeling techniques proposed in the literature so far. We use separate training and test datasets for cross-validation, as opposed to the leave-one-out cross-validation technique.
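The best-of-set selection step the abstract describes can be sketched as follows; this is a minimal illustration, where `compile_and_time` is a hypothetical stand-in for compiling a program with a given optimization sequence and measuring its execution time:

```python
def best_sequence(program, good_sequences, compile_and_time):
    """Try every sequence in the good-sequences set and keep the fastest."""
    best_seq, best_time = None, float("inf")
    for seq in good_sequences:
        t = compile_and_time(program, seq)
        if t < best_time:
            best_seq, best_time = seq, t
    return best_seq, best_time
```

With a set of around 10 sequences, this is about 10 compile-and-run trials per program, with no need to first classify the program.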
International Conference on Cloud Computing | 2014
Sanidhya Kashyap; Jaspal Singh Dhillon; Suresh Purini
Today, IaaS cloud providers dynamically minimize the cost of data center operations while maintaining service-level agreements (SLAs). Currently, this is achieved through live migration, an advanced state-of-the-art virtualization technology. However, existing migration techniques suffer from high network bandwidth utilization, large network data transfers, long migration times, and possible destination-VM failure during migration. In this paper, we propose Reliable Lazy Copy (RLC), a fast, efficient, and reliable migration technique. RLC provides a highly efficient and less disruptive migration scheme by utilizing the three phases of process migration. To use network bandwidth effectively and reduce the total migration time, we introduce a learning phase that estimates the writable working set (WWS) prior to migration, so that almost all pages are transferred only once. Our approach decreases the total data transfer by 1.16x-12.21x and the total migration time by a factor of 1.42x-9.84x compared with existing approaches, thus providing fast, efficient, and reliable VM migration in the cloud.
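The learning-phase idea can be illustrated with a small sketch (this is not the paper's algorithm; `dirty_page_samples` is a hypothetical list of sets, each holding the page numbers dirtied in one sampling interval, and the threshold is made up):

```python
from collections import Counter

def estimate_wws(dirty_page_samples, threshold=0.5):
    """Pages dirtied in at least `threshold` of the intervals form the WWS."""
    counts = Counter(p for sample in dirty_page_samples for p in sample)
    cutoff = threshold * len(dirty_page_samples)
    return {page for page, c in counts.items() if c >= cutoff}
```

Pages estimated to be in the WWS can be deferred to the final copy phase, so they are transferred roughly once instead of repeatedly.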
International Conference on Parallel Architectures and Compilation Techniques | 2016
Nitin Chugh; Vinay Vasista; Suresh Purini; Uday Bondhugula
This paper describes an automatic approach to accelerating image processing pipelines using FPGAs. An image processing pipeline can be viewed as a graph of interconnected stages that process images successively. Each stage typically performs point-wise, stencil, or other more complex operations on image pixels. Recent efforts have led to the development of domain-specific languages (DSLs) and optimization frameworks for image processing pipelines. In this paper, we develop an approach to map image processing pipelines expressed in the PolyMage DSL to efficient parallel FPGA designs. Our approach maximally exploits reuse and the available memory bandwidth (or chip resources). Compared to Darkroom, a state-of-the-art approach to compiling a high-level DSL to FPGAs, our approach (a) leads to designs that deliver significantly higher throughput, and (b) supports a greater variety of filters. Furthermore, for some of the benchmarks, the designs we generate improve even over pre-optimized FPGA implementations provided by vendor libraries.
IEEE/ACM International Conference on Utility and Cloud Computing | 2013
Jaspal Singh Dhillon; Suresh Purini; Sanidhya Kashyap
When multiple virtual machines (VMs) are co-scheduled on the same physical machine, they may suffer performance degradation due to contention for shared resources such as the last-level cache, hard disk, and network bandwidth. This can lead to service-level agreement violations and thereby customer dissatisfaction. The classical approach to the co-scheduling problem involves a central authority that decides a co-schedule by solving a constrained optimization problem with an objective function such as average performance degradation. In this paper, we use the theory of stable matchings to provide an alternative, game-theoretic perspective on the co-scheduling problem, wherein each VM selfishly tries to minimize its own performance degradation. We show that the co-scheduling problem can be formulated as a Stable Roommates Problem (SRP). Since certain instances of the SRP do not have any stable matching, we reduce the problem to the Stable Marriages Problem (SMP) via an initial approximation. Gale and Shapley proved that every instance of the SMP has a stable matching, which can be found in quadratic time. From a game-theoretic perspective, the SMP can be thought of as a matching game that always has a Nash equilibrium. Distributed algorithms exist for both the SRP and the SMP; a VM agent in such an algorithm need not reveal its preference list to any other VM, allowing each VM to keep its cost function private. A principal advantage of this formulation is that it opens up the possibility of applying the rich theory of matching markets from game theory to various aspects of the VM co-scheduling problem, such as stability, coalitions, and privacy, from both a theoretical and a practical standpoint. We also propose a new workload characterization technique for a combination of compute- and memory-intensive workloads.
The proposed technique uses a sentinel program and requires only two runs per workload. VMs can use this technique to decide their partner preference ranks in the SRP and SMP. The characterization technique has also been used to propose two new centralized VM co-scheduling algorithms whose performance is close to that of the optimal Blossom algorithm.
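The Gale-Shapley deferred-acceptance algorithm referenced above can be sketched as follows (the participant names and preference lists in any usage are illustrative, not from the paper):

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Return a stable matching as {proposer: acceptor}."""
    # Rank tables let acceptors compare two proposers in O(1).
    rank = {a: {p: i for i, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    free = list(proposer_prefs)           # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}                          # acceptor -> proposer
    while free:
        p = free.pop()
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in engaged:
            engaged[a] = p
        elif rank[a][p] < rank[a][engaged[a]]:
            free.append(engaged[a])       # displaced partner proposes again
            engaged[a] = p
        else:
            free.append(p)                # rejected; tries its next choice
    return {p: a for a, p in engaged.items()}
```

Each proposer proposes to each acceptor at most once, which gives the quadratic time bound mentioned above.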
International Conference on Cloud Computing | 2015
Kapil Kumar; Nehal J. Wani; Suresh Purini
The memory and core requirements of a virtual machine depend on the performance requirements of the applications hosted on it. In this paper, we propose algorithms for dynamic memory and core scaling using a combination of machine learning and feedback control techniques. These algorithms work for sequential and parallel applications, such as scientific computations, where speedup is the primary performance metric. We then use these algorithms to address the simultaneous memory-and-core allocation problem, which is more complex due to the possible correlation between the two resource requirements. All the algorithms can be applied in a black-box fashion, without instrumenting the application source code.
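The feedback-control half of the idea can be illustrated with a deliberately minimal sketch (this is not the paper's algorithm; `measure_speedup` is a hypothetical black-box probe, and the one-unit step size is made up):

```python
def scale_allocation(alloc, target, measure_speedup, max_alloc, steps=20):
    """Grow an allocation (cores or GB) until the measured speedup
    reaches the target, or the resource cap / step budget is hit."""
    for _ in range(steps):
        speedup = measure_speedup(alloc)
        if speedup >= target:
            break
        alloc = min(alloc + 1, max_alloc)   # grant one more unit
    return alloc
```

A real controller would also shrink over-provisioned allocations and damp oscillations; the point here is only the measure-then-adjust loop applied to a black-box application.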
India Software Engineering Conference | 2017
Jitendra Yasaswi; Sri Kailash; Anil Chilupuri; Suresh Purini; C. V. Jawahar
In this work, we propose a novel hybrid approach for automatic plagiarism detection in programming assignments. Most well-known plagiarism detectors either employ a text-based approach or use features based on properties of the program at the syntactic level. Both approaches, however, succumb to code obfuscation, which is a major obstacle to automatic software plagiarism detection. Our proposed method uses static features extracted from the intermediate representation of a program in a compiler infrastructure such as gcc. We demonstrate the use of unsupervised learning techniques on the extracted feature representations and show that our system is robust to code obfuscation. We test our method on assignments from an introductory programming course. Preliminary results show that our system compares favorably with popular tools such as MOSS. To visualize the local and global structure of the features, we obtain low-dimensional representations using t-SNE, a variant of Stochastic Neighbor Embedding that preserves neighborhood identity in low dimensions. Based on this idea of preserving neighborhood identity, we mine interesting information, such as the diversity of student solution approaches to a given problem. The presence of well-defined clusters in the low-dimensional visualizations demonstrates that our features capture interesting programming patterns.
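One way such IR-level features can be compared is sketched below; this is an illustration of the general idea, not the paper's system, and the feature names (opcode counts) and similarity threshold are made up:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse feature-count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def likely_plagiarized(feats_a, feats_b, threshold=0.95):
    return cosine_similarity(feats_a, feats_b) >= threshold
```

Because the features come from the compiler's intermediate representation, common obfuscations such as renaming identifiers or reformatting the source leave the vectors largely unchanged.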
IEEE/ACM International Conference on Utility and Cloud Computing | 2016
Yash Khandelwal; Suresh Purini; Puduru Viswanadha Reddy
In this paper, we formulate optimal coalition formation in federated clouds as an integer linear programming problem under the cloud service brokerage model proposed by Mashayekhy et al. We then propose a fast polynomial-time greedy algorithm to find a near-optimal coalition. The profit generated by the federation obtained using the greedy algorithm is, on average, within a negligible 0.06 percent of the optimal. The greedy algorithm finds a federation on average 200 times faster than the Merge-Split algorithm. The payoff distribution within a federation is determined using exact Banzhaf index computation, whereas the Merge-Split algorithm arrives at a payoff using an estimate of the Banzhaf values. By computing the payoff distribution after federation formation, we achieve a 66x speedup compared with the Merge-Split algorithm.
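Exact Banzhaf value computation, as used above for payoff distribution, can be sketched generically (the characteristic function `value` stands in for the federation's profit function, whose actual form in the brokerage model is not reproduced here; the exact computation enumerates all sub-coalitions, which is why it is expensive):

```python
from itertools import combinations

def banzhaf_values(players, value):
    """Exact Banzhaf value of each player: the average marginal
    contribution over all 2^(n-1) coalitions of the other players."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                total += value(s | {p}) - value(s)
        result[p] = total / 2 ** (n - 1)
    return result
```

Computing these values once, after the federation has formed, avoids re-running the exponential enumeration inside the coalition search.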
Field Programmable Gate Arrays | 2016
Ronak Kogta; Suresh Purini; Ajit Mathew
A high-level synthesis compiler translates a source program written in a high-level programming language, such as C or SystemC, into an equivalent circuit. The performance of the generated circuit, in terms of metrics such as area, frequency, and clock cycles, depends on the compiler optimizations enabled and their order of application. Finding an optimal sequence for a given program is a hard combinatorial optimization problem. In this paper, we propose a practical and search-time-efficient technique for finding a near-optimal sequence for a given program. The main idea is to strike a balance between searching for a universally good sequence (like -O3), which works for all programs, and finding a good sequence on a per-program basis. Towards that end, we construct a rich downsampled sequence set, which caters to different program classes, from the unbounded optimization sequence space by applying heuristic search algorithms to a set of microkernel benchmark programs. The optimization metric we use while constructing the downsampled sequence set is execution time on a scalar processor. Given a new program, we try all the sequences from the downsampled sequence set and pick the best. Applying this technique in the LegUp high-level synthesis compiler, we obtain 23% and 40% improvements on the CHStone and MachSuite benchmark programs, respectively. We also propose techniques to further reduce the size of the downsampled sequence set and thereby improve the sequence search time.
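The heuristic search used to build the downsampled sequence set could, for example, look like a simple hill climb over pass orderings; the sketch below is illustrative (the abstract does not specify the search algorithm), and `cost` is a hypothetical stand-in for the measured execution time of the compiled kernel:

```python
import random

def hill_climb(passes, cost, iters=100, seed=0):
    """Greedily improve a pass ordering by trying random pairwise swaps."""
    rng = random.Random(seed)
    seq = list(passes)
    best = cost(seq)
    for _ in range(iters):
        cand = seq[:]
        i, j = rng.randrange(len(cand)), rng.randrange(len(cand))
        cand[i], cand[j] = cand[j], cand[i]   # swap two passes
        c = cost(cand)
        if c < best:                           # keep only improvements
            seq, best = cand, c
    return seq, best
```

Running such a search from several starting points over the microkernel benchmarks, and keeping the distinct winners, yields a small set of sequences that covers different program classes.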
IEEE Region 10 Conference | 2015
Ashutosh Mehta; Shivani Maurya; Nawaz Sharief; Babu M. Pranay; Srivatsava Jandhyala; Suresh Purini
Real-time multimedia applications that demand very low decoding delays are becoming increasingly common. To address this challenge in error-resilient applications, many approximate computing architectures for delay-critical units have been proposed. In this paper, we propose an architecture for an approximate multiplier whose accuracy can be configured at run time. Depending on the application's requirements, the multiplier can be configured to operate in an exact mode or in any of the approximate modes, reducing its delay and dynamic power consumption. The architecture of the proposed approximate multiplier has been synthesized and simulated using Cadence design tools. For 16-bit multiplication, we demonstrate that the pass rate and propagation delay of the proposed multiplier are comparable to or better than those of most published inaccurate multipliers. The proposed approximate multiplier is successfully used in a JPEG conversion application, and the performance of the different accuracy modes is compared.
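The accuracy/cost trade-off of a run-time-configurable multiplier can be illustrated with a simple truncation-based model (the paper's hardware architecture differs; this only shows how a mode parameter trades precision for simpler logic):

```python
def approx_mul(a, b, k=0):
    """Model of a configurable approximate multiplier.
    k = 0 is the exact mode; mode k drops the k least-significant
    bits of each operand before multiplying, then rescales."""
    return ((a >> k) * (b >> k)) << (2 * k)
```

In hardware, truncating operand bits shortens the partial-product array, which is what reduces the propagation delay and dynamic power at the cost of a bounded error.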
arXiv: Hardware Architecture | 2018
Vinamra Benara; Sahithi Rampalli; Ziaul Choudhury; Suresh Purini; Uday Bondhugula