Alexey Tumanov
Carnegie Mellon University
Publications
Featured research published by Alexey Tumanov.
Symposium on Cloud Computing | 2012
Charles Reiss; Alexey Tumanov; Gregory R. Ganger; Randy H. Katz; Michael Kozuch
To better understand the challenges in developing effective cloud-based resource schedulers, we analyze the first publicly available trace data from a sizable multi-purpose cluster. The most notable workload characteristic is heterogeneity: in resource types (e.g., cores:RAM per machine) and their usage (e.g., duration and resources needed). Such heterogeneity reduces the effectiveness of traditional slot- and core-based scheduling. Furthermore, some tasks are constrained as to the kind of machine types they can use, increasing the complexity of resource assignment and complicating task migration. The workload is also highly dynamic, varying over time along most workload features, and is driven by many short jobs that demand quick scheduling decisions. While few simplifying assumptions apply, we find that many longer-running jobs have relatively stable resource utilizations, which can help adaptive resource schedulers.
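A minimal sketch of the kind of heterogeneity analysis the abstract describes, assuming a hypothetical trace of task records with illustrative field names (machine, cores, ram_gb, duration_s); the records and fields below are invented, not the actual trace format:

```python
# Heterogeneity analysis over a hypothetical cluster trace; the records
# and field names are illustrative, not the published trace schema.
from collections import Counter
from statistics import median

tasks = [
    {"machine": "A", "cores": 0.5, "ram_gb": 2.0, "duration_s": 45},
    {"machine": "B", "cores": 4.0, "ram_gb": 1.0, "duration_s": 86400},
    {"machine": "A", "cores": 1.0, "ram_gb": 8.0, "duration_s": 120},
]

# Heterogeneity in resource *shape*: cores:RAM ratios differ widely,
# so a fixed "slot" abstraction fits none of the tasks well.
ratios = [t["cores"] / t["ram_gb"] for t in tasks]
print("cores:RAM ratios:", sorted(ratios))

# Heterogeneity in *usage*: durations span orders of magnitude, so short
# jobs dominate scheduling decisions while long jobs dominate load.
print("median duration:", median(t["duration_s"] for t in tasks), "s")
print("tasks per machine type:", Counter(t["machine"] for t in tasks))
```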
European Conference on Computer Systems | 2011
Roy Bryant; Alexey Tumanov; Olga Irzak; Adin Scannell; Kaustubh R. Joshi; Matti A. Hiltunen; H. Andrés Lagar-Cavilla; Eyal de Lara
We introduce cloud micro-elasticity, a new model for cloud Virtual Machine (VM) allocation and management. Current cloud users over-provision long-lived VMs with large memory footprints to better absorb load spikes, and to conserve performance-sensitive caches. Instead, we achieve elasticity by swiftly cloning VMs into many transient, short-lived, fractional workers to multiplex physical resources at a much finer granularity. The memory of a micro-elastic clone is a logical replica of the parent VM state, including caches, yet its footprint is proportional to the workload, and often a fraction of the nominal maximum. We enable micro-elasticity through a novel technique dubbed VM state coloring, which classifies VM memory into sets of semantically-related regions, and optimizes the propagation, allocation and deduplication of these regions. Using coloring, we build Kaleidoscope and empirically demonstrate its ability to create micro-elastic cloned servers. We model the impact of micro-elasticity on a demand dataset from AT&T's cloud, and show that fine-grained multiplexing yields infrastructure reductions of 30% relative to state-of-the-art techniques for managing elastic clouds.
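An illustrative sketch, not the Kaleidoscope implementation, of the idea behind VM state coloring: tag memory regions with semantic colors and let a clone materialize only the colors its workload actually touches. The colors and page counts are invented for illustration:

```python
# Sketch of semantic memory coloring; colors and policies are assumptions
# chosen to illustrate the concept, not Kaleidoscope's actual classifier.
from enum import Enum

class Color(Enum):
    KERNEL = "kernel"      # eagerly propagated to every clone
    PAGE_CACHE = "cache"   # propagated lazily, deduplicated across clones
    HEAP = "heap"          # fetched on first access by the clone
    FREE = "free"          # never transferred

parent_memory = {0: Color.KERNEL, 1: Color.PAGE_CACHE,
                 2: Color.HEAP, 3: Color.FREE}

def clone_footprint(touched_pages):
    """A clone's resident footprint is the eagerly-copied state plus only
    the lazily-fetched pages its workload touched."""
    eager = {p for p, c in parent_memory.items() if c is Color.KERNEL}
    lazy = {p for p in touched_pages
            if parent_memory.get(p) in (Color.PAGE_CACHE, Color.HEAP)}
    return eager | lazy

# A short-lived fractional worker that touches one heap page stays far
# below the parent's nominal 4-page footprint.
print(len(clone_footprint({2})), "of", len(parent_memory), "pages resident")
```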
Symposium on Cloud Computing | 2014
Timothy Zhu; Alexey Tumanov; Michael Kozuch; Mor Harchol-Balter; Gregory R. Ganger
Meeting service level objectives (SLOs) for tail latency is an important and challenging open problem in cloud computing infrastructures. The challenges are exacerbated by burstiness in the workloads. This paper describes PriorityMeister -- a system that employs a combination of per-workload priorities and rate limits to provide tail latency QoS for shared networked storage, even with bursty workloads. PriorityMeister automatically and proactively configures workload priorities and rate limits across multiple stages (e.g., a shared storage stage followed by a shared network stage) to meet end-to-end tail latency SLOs. In real system experiments and under production trace workloads, PriorityMeister outperforms most recent reactive request scheduling approaches, with more workloads satisfying latency SLOs at higher latency percentiles. PriorityMeister is also robust to mis-estimation of underlying storage device performance and contains the effect of misbehaving workloads.
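A minimal sketch of the two mechanisms the abstract says PriorityMeister composes, per-workload priorities plus rate limits, using a token bucket at one stage; the classes, rates, and priority values are illustrative assumptions, not PriorityMeister's actual configuration:

```python
# Priorities combined with token-bucket rate limits at a single stage;
# all parameters below are invented for illustration.
import heapq

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst, self.tokens, self.t = rate, burst, burst, 0.0

    def admit(self, now):
        self.tokens = min(self.burst, self.tokens + (now - self.t) * self.rate)
        self.t = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over its rate limit for this instant

queue = []  # (priority, arrival, request); lower number = higher priority
limits = {"latency-sensitive": TokenBucket(rate=100, burst=5),
          "batch": TokenBucket(rate=1000, burst=1000)}
priority = {"latency-sensitive": 0, "batch": 1}

def enqueue(workload, arrival, req):
    # A request that exceeds its bucket is demoted behind in-profile
    # traffic, bounding how much a bursty workload can delay others.
    p = priority[workload] if limits[workload].admit(arrival) else 9
    heapq.heappush(queue, (p, arrival, req))

enqueue("batch", 0.0, "req-1")
enqueue("latency-sensitive", 0.001, "req-2")
print([r for _, _, r in sorted(queue)])  # latency-sensitive served first
```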
IEEE Virtual Reality Conference | 2007
Alexey Tumanov; Robert S. Allison; Wolfgang Stuerzlinger
Application designers of collaborative distributed virtual environments must account for the influence of the network connection and its detrimental effects on user performance. Based upon analysis and classification of existing latency compensation techniques, this paper introduces a novel approach to latency amelioration in the form of a two-tier predictor-estimator framework. The technique is variability-aware due to its proactive sender-side prediction of a pose a variable time into the future. The prediction interval required is estimated based on current and past network delay characteristics. This latency estimate is subsequently used by a Kalman filter-based predictor to replace the measurement event with a predicted pose that matches the event's arrival time at the receiving workstation. The compensation technique was evaluated in a simulation through an offline playback of real head motion data and network delay traces collected under a variety of real network conditions. The experimental results indicate that the variability-aware approach significantly outperforms a state-of-the-art one, which assumes a constant system delay.
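A sketch of the two-tier idea: estimate the current network delay, then ask a Kalman filter to predict the pose that far ahead. This 1-D constant-velocity filter with assumed noise parameters is only illustrative; the paper's predictor operates on full head poses and measured delay traces:

```python
# 1-D constant-velocity Kalman filter predicting a variable horizon ahead;
# Q, R, and the sample data are assumptions made for illustration.
import numpy as np

F = lambda dt: np.array([[1, dt], [0, 1]])   # constant-velocity model
H = np.array([[1.0, 0.0]])
Q = np.eye(2) * 1e-3                         # process noise (assumed)
R = np.array([[1e-2]])                       # measurement noise (assumed)

x, P = np.zeros(2), np.eye(2)                # state: [position, velocity]

def kf_step(z, dt):
    global x, P
    x, P = F(dt) @ x, F(dt) @ P @ F(dt).T + Q        # predict to "now"
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)     # Kalman gain
    x = x + (K @ (z - H @ x)).ravel()                # correct with sample
    P = (np.eye(2) - K @ H) @ P

def predict_ahead(delay_s):
    # Variability-aware: the prediction horizon tracks the *current*
    # delay estimate rather than assuming a constant system delay.
    return (F(delay_s) @ x)[0]

for z in [0.0, 0.1, 0.21, 0.33]:             # synthetic pose samples
    kf_step(np.array([z]), dt=0.1)
print("pose predicted ~50 ms ahead:", predict_ahead(0.05))
```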
European Conference on Computer Systems | 2017
Aaron Harlap; Alexey Tumanov; Andrew Chung; Gregory R. Ganger; Phillip B. Gibbons
Many shared computing clusters allow users to utilize excess idle resources at lower cost or priority, with the proviso that some or all may be taken away at any time. But exploiting such dynamic resource availability, and the often fluctuating markets for it, requires agile elasticity and effective acquisition strategies. Proteus aggressively exploits such transient revocable resources to do machine learning (ML) cheaper and/or faster. Its parameter server framework, AgileML, efficiently adapts to bulk additions and revocations of transient machines, through a novel 3-stage active-backup approach, with minimal use of more costly non-transient resources. Its BidBrain component adaptively allocates resources from multiple EC2 spot markets to minimize average cost per unit of work as transient resource availability and cost change over time. Our evaluations show that Proteus reduces cost by 85% relative to non-transient pricing, and by 43% relative to previous approaches, while simultaneously reducing runtimes by up to 37%.
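A hedged sketch of a BidBrain-style acquisition decision: among candidate spot markets, pick the one minimizing expected cost per unit of work, folding in the risk that a revocation wastes in-progress work. The prices, revocation rates, and throughputs below are invented, not figures from the paper's evaluation:

```python
# Choosing among spot markets by expected cost per unit of work;
# all numbers are illustrative assumptions.
markets = {
    #             $/hr   revocations/hr  relative throughput
    "c4.xlarge":  (0.06, 0.30,           1.0),
    "c4.2xlarge": (0.13, 0.05,           2.0),
    "on-demand":  (0.40, 0.00,           1.0),  # non-transient fallback
}

def expected_cost_per_work(price, revoke_rate, throughput, recovery_s=60):
    # Each revocation wastes roughly the work lost while the framework
    # re-balances onto the remaining machines (recovery_s is assumed).
    effective = throughput * max(0.0, 1 - revoke_rate * recovery_s / 3600)
    return float("inf") if effective == 0 else price / effective

best = min(markets, key=lambda m: expected_cost_per_work(*markets[m]))
print("acquire from:", best)
```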
arXiv: Distributed, Parallel, and Cluster Computing | 2017
Robert Nishihara; Philipp Moritz; Stephanie Wang; Alexey Tumanov; William Paul; Johann Schleier-Smith; Richard Liaw; Mehrdad Niknami; Michael I. Jordan; Ion Stoica
Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.
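A minimal sketch of the programming model the paper argues for: tasks submitted dynamically, with each round's tasks constructed only after the previous round's futures resolve. Plain `concurrent.futures` stands in for the proposed distributed framework here; it illustrates the API shape, not the millisecond-latency performance, and both task functions are invented stand-ins:

```python
# Adaptive construction of a task graph via futures; simulate() and
# update() are hypothetical stand-ins for heterogeneous kernels.
from concurrent.futures import ThreadPoolExecutor

def simulate(policy, seed):
    return (policy + seed) % 7    # stand-in for a simulation kernel

def update(policy, results):
    return policy + sum(results)  # stand-in for a learning step

with ThreadPoolExecutor() as pool:
    policy = 0
    for step in range(3):
        # The task graph is not known up front: each round's tasks
        # depend on the policy produced by the previous round.
        futures = [pool.submit(simulate, policy, s) for s in range(4)]
        policy = update(policy, [f.result() for f in futures])
    print("final policy:", policy)
```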
ACM Transactions on Storage | 2014
Lianghong Xu; James Cipar; Elie Krevat; Alexey Tumanov; Nitin Gupta; Michael Kozuch; Gregory R. Ganger
Elastic storage systems can be expanded or contracted to meet current demand, allowing servers to be turned off or used for other tasks. However, the usefulness of an elastic distributed storage system is limited by its agility: how quickly it can increase or decrease its number of servers. Due to the large amount of data they must migrate during elastic resizing, state-of-the-art designs usually have to make painful trade-offs among performance, elasticity, and agility. This article describes the state of the art in elastic storage and a new system, called SpringFS, that can quickly change its number of active servers, while retaining elasticity and performance goals. SpringFS uses a novel technique, termed bounded write offloading, that restricts the set of servers where writes to overloaded servers are redirected. This technique, combined with the read offloading and passive migration policies used in SpringFS, minimizes the work needed before deactivation or activation of servers. Analysis of real-world traces from Hadoop deployments at Facebook and various Cloudera customers and experiments with the SpringFS prototype confirm SpringFS's agility, show that it reduces the amount of data migrated for elastic resizing by up to two orders of magnitude, and show that it cuts the percentage of active servers required by 67--82%, outdoing state-of-the-art designs by 6--120%.
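An illustrative sketch of bounded write offloading: writes aimed at an overloaded server are redirected, but only into a small fixed offload set, so few servers accumulate fresh data and the rest can be deactivated without migration. The overload threshold, offload-set size, and queue depths are invented for illustration:

```python
# Bounded write offloading; thresholds and server names are assumptions.
OFFLOAD_SET = {"s0", "s1"}    # the only servers allowed to absorb
OVERLOAD_QUEUE_DEPTH = 8      # redirected writes

queue_depth = {"s0": 2, "s1": 3, "s2": 9, "s3": 1}

def place_write(primary):
    if queue_depth[primary] < OVERLOAD_QUEUE_DEPTH:
        return primary        # common case: write where it belongs
    # Overloaded: redirect, but only within the bounded offload set, so
    # the cleanup work needed before shrinking the system stays small.
    return min(OFFLOAD_SET, key=queue_depth.get)

for server in ("s2", "s3", "s2"):
    chosen = place_write(server)
    queue_depth[chosen] += 1
    print(f"write for {server} -> {chosen}")
```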
European Conference on Computer Systems | 2018
Jun Woo Park; Alexey Tumanov; Angela H. Jiang; Michael Kozuch; Gregory R. Ganger
The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8--23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.
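A sketch of the core contrast the abstract draws: scoring a scheduling decision against the full runtime distribution from relevant history rather than a single point estimate. The history, deadline, and utility function below are invented for illustration:

```python
# Point estimate vs. full runtime distribution; all values are assumed.
history = [110, 120, 125, 130, 600]   # past runtimes (s) of similar jobs

def utility(runtime_s, deadline_s=300):
    return 1.0 if runtime_s <= deadline_s else 0.0   # meets SLO or not

# A point estimate (the median) says the job always makes its deadline...
point = sorted(history)[len(history) // 2]
print("point-estimate utility:", utility(point))

# ...but the distribution exposes the 20% chance of a ~5x straggler,
# letting the scheduler plan around the anticipated uncertainty.
expected = sum(utility(r) for r in history) / len(history)
print("expected utility over distribution:", expected)
```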
Symposium on Cloud Computing | 2012
Alexey Tumanov; James Cipar; Gregory R. Ganger; Michael Kozuch
Operating Systems Design and Implementation | 2016
Sangeetha Abdu Jyothi; Carlo Curino; Ishai Menache; Shravan M. Narayanamurthy; Alexey Tumanov; Jonathan Yaniv; Ruslan Mavlyutov; Íñigo Goiri; Subramaniam Venkatraman Krishnan; Jana Kulkarni; Sriram Rao