Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yuxiong He is active.

Publications


Featured research published by Yuxiong He.


International Conference on Distributed Computing Systems | 2012

Provably-Efficient Job Scheduling for Energy and Fairness in Geographically Distributed Data Centers

Shaolei Ren; Yuxiong He; Fei Xu

Decreasing the soaring energy cost is imperative in large data centers. Meanwhile, limited computational resources need to be fairly allocated among different organizations, and latency is another major concern for resource management. Energy cost, resource allocation fairness, and latency are thus important but often conflicting metrics in scheduling data center workloads. In this paper, we exploit the variation of electricity prices across time and location, and study the problem of scheduling batch jobs that originate from multiple organizations/users and are scheduled to multiple geographically distributed data centers. We propose a provably-efficient online scheduling algorithm, GreFar, which optimizes the energy cost and fairness among different organizations subject to queueing delay constraints. GreFar does not require any statistical information about workload arrivals or electricity prices. We prove that it can minimize the cost, defined as an affine combination of energy cost and weighted fairness, arbitrarily close to that of the optimal offline algorithm with future information. Moreover, by appropriately setting the control parameters, GreFar achieves a desirable tradeoff among energy cost, fairness, and latency.
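
To make the price-versus-delay tradeoff concrete, here is a toy sketch that dispatches queued jobs only where the backlog outweighs the price-weighted energy cost. The function names, the single job class, and the comparison rule are illustrative assumptions in the spirit of online drift-plus-penalty control, not GreFar's actual algorithm.

```python
# Hypothetical sketch of price-aware batch-job dispatch: each step, weigh
# the current electricity price against queueing backlog so jobs run
# where/when power is cheap unless queues grow too long. Illustrative
# only; not the paper's algorithm.

def dispatch(queues, prices, capacity, V=500.0):
    """Choose how many jobs each data center runs this time step.

    queues   -- jobs waiting per data center
    prices   -- current electricity price per data center ($/kWh)
    capacity -- max jobs a data center can run per step
    V        -- control knob: larger V favors cheap energy over low delay
    """
    decisions = {}
    for dc in queues:
        # Run jobs only when backlog "pressure" outweighs the
        # price-weighted energy cost (a simplified one-class comparison).
        if queues[dc] > V * prices[dc]:
            decisions[dc] = min(queues[dc], capacity[dc])
        else:
            decisions[dc] = 0
    return decisions

queues = {"dc_east": 120, "dc_west": 30}
prices = {"dc_east": 0.09, "dc_west": 0.04}   # illustrative prices
capacity = {"dc_east": 50, "dc_west": 50}
print(dispatch(queues, prices, capacity))
```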


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Adaptive scheduling with parallelism feedback

Kunal Agrawal; Yuxiong He; Wen-Jing Hsu; Charles E. Leiserson

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on the allotted processors. In this context, the number of processors allotted to a particular job may vary during the job's execution, and the thread scheduler must adapt to these changes in processor resources. For overall system efficiency, the thread scheduler should also provide parallelism feedback to the job scheduler to avoid allotting a job more processors than it can use productively. This paper provides an overview of several adaptive thread schedulers we have developed that provide provably good history-based feedback about the job's parallelism without knowing the future of the job. These thread schedulers complete the job in near-optimal time while guaranteeing low waste. We have analyzed these thread schedulers under stringent adversarial conditions, showing that they are robust to various system environments and allocation policies. To analyze the thread schedulers under this adversarial model, we have developed a new technique, called trim analysis, which can be used to show that a thread scheduler provides good behavior on the vast majority of time steps and performs poorly on only a few. When our thread schedulers are used with dynamic equipartitioning and other related job scheduling algorithms, they are O(1)-competitive against an optimal offline scheduling algorithm with respect to both mean response time and makespan, for batched and nonbatched jobs, respectively. Our algorithms are the first nonclairvoyant scheduling algorithms to offer such guarantees.
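
As a companion illustration of the two-level model, here is a minimal sketch of a dynamic-equipartitioning job scheduler that respects each job's reported desire, so parallelism feedback prevents over-allotment. The function and parameter names are assumptions for illustration, not the paper's code.

```python
# Illustrative kernel-level allotment: divide P processors among jobs,
# capping each job at its reported desire and splitting the remainder
# as evenly as possible among jobs that still want more.

def equipartition(desires, P):
    """desires: {job_id: requested processors}; returns allotments."""
    allot = {j: 0 for j in desires}
    active = {j for j, d in desires.items() if d > 0}
    while P > 0 and active:
        share = max(1, P // len(active))
        for j in sorted(active):          # sorted() copies, safe to mutate
            grant = min(share, desires[j] - allot[j], P)
            allot[j] += grant
            P -= grant
            if allot[j] == desires[j]:
                active.discard(j)         # desire satisfied: cap reached
            if P == 0:
                break
    return allot

# Job B's small desire caps its share; the surplus flows to A and C.
print(equipartition({"A": 10, "B": 2, "C": 6}, P=12))
```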


Conference on Information and Knowledge Management | 2012

G-SPARQL: a hybrid engine for querying large attributed graphs

Sherif Sakr; Sameh Elnikety; Yuxiong He

We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses the types of queries that are of great interest for applications that model their data as large graphs, such as pattern matching, reachability, and shortest-path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of graph nodes and edges). We describe an algebraic compilation mechanism for our proposed query language, which extends the relational algebra and is based on the basic construct for building SPARQL queries, the triple pattern. We describe a hybrid memory/disk representation of large attributed graphs in which only the topology of the graph is maintained in memory, while the graph data is stored in a relational database. The execution engine of our proposed query language pushes parts of the query plan inside the relational database while processing other parts with memory-based algorithms, as necessary. Experimental results on real datasets demonstrate the efficiency and the scalability of our approach and show that it outperforms native graph databases by several factors.
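
The hybrid split can be illustrated with a toy plan: value-based predicates are pushed into a relational store as SQL, while a topology-dependent operator (here, plain reachability) runs over an in-memory adjacency structure. The schema, data, and helper names below are invented for illustration and are not G-SPARQL's actual engine.

```python
# Toy sketch of hybrid memory/disk query execution: attributes live in a
# relational database, topology lives in memory.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_attr (node INT, key TEXT, value TEXT)")
conn.executemany("INSERT INTO node_attr VALUES (?,?,?)",
                 [(1, "role", "dev"), (2, "role", "dev"), (3, "role", "pm")])

adjacency = {1: [2], 2: [3], 3: []}   # graph topology kept in memory

def nodes_where(key, value):
    """Pushed-down value predicate: evaluated inside the database."""
    rows = conn.execute(
        "SELECT node FROM node_attr WHERE key=? AND value=?", (key, value))
    return {r[0] for r in rows}

def reachable(src):
    """Memory-based operator: simple DFS over the in-memory topology."""
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacency[n])
    return seen

# "Find nodes with role=pm reachable from some node with role=dev."
devs = nodes_where("role", "dev")
pms = nodes_where("role", "pm")
print({t for s in devs for t in reachable(s) if t in pms and t != s})
```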


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2014

Predictive parallelization: taming tail latencies in web search

Myeongjae Jeon; Saehoon Kim; Seung-won Hwang; Yuxiong He; Sameh Elnikety; Alan L. Cox; Scott Rixner

Web search engines are optimized to reduce the high-percentile response time to consistently provide fast responses to almost all user queries. This is a challenging task because the query workload exhibits large variability, consisting of many short-running queries and a few long-running queries that significantly impact the high-percentile response time. With modern multicore servers, parallelizing the processing of an individual query is a promising solution to reduce query execution time, but it gives limited benefits compared to sequential execution since most queries see little or no speedup when parallelized. The root of this problem is that short-running queries, which dominate the workload, do not benefit from parallelization. They incur a large parallelization overhead, taking scarce resources from long-running queries. On the other hand, parallelization substantially reduces the execution time of long-running queries with low overhead and high parallelization efficiency. Motivated by these observations, we propose a predictive parallelization framework with two parts: (1) predicting long-running queries, and (2) selectively parallelizing them. For the first part, prediction should be accurate and efficient. For accuracy, we study a comprehensive feature set covering both term features (reflecting dynamic pruning efficiency) and query features (reflecting query complexity). For efficiency, to keep overhead low, we avoid expensive features that have excessive requirements such as large memory footprints. For the second part, we use the predicted query execution time to parallelize long-running queries and process short-running queries sequentially. We implement and evaluate the predictive parallelization framework in Microsoft Bing search. Our measurements show that under moderate to heavy load, the predictive strategy reduces the 99th-percentile response time by 50% (from 200 ms to 100 ms) compared with prior approaches that parallelize all queries.
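
A minimal sketch of the selective-parallelization decision, assuming a stand-in predictor and an invented threshold (the real system uses a learned predictor over term and query features):

```python
# Predicted long queries fan out across index shards in parallel; short
# queries run sequentially to avoid parallelization overhead.

from concurrent.futures import ThreadPoolExecutor

LONG_QUERY_MS = 50.0   # illustrative threshold separating short from long

def predict_ms(query):
    # Stand-in predictor; real features include term statistics that
    # reflect dynamic-pruning efficiency and query complexity.
    return 8.0 * len(query.split())

def process(query, shard):
    return f"results({query!r}, shard={shard})"

def execute(query, pool, shards=4):
    if predict_ms(query) < LONG_QUERY_MS:
        return [process(query, 0)]          # short query: sequential
    futures = [pool.submit(process, query, s) for s in range(shards)]
    return [f.result() for f in futures]    # long query: parallel

with ThreadPoolExecutor(max_workers=4) as pool:
    print(execute("cats", pool))
    print(execute("rare multi term query about distributed scheduling", pool))
```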


Symposium on Cloud Computing | 2012

Zeta: scheduling interactive services with partial execution

Yuxiong He; Sameh Elnikety; James R. Larus; Chenyu Yan

This paper presents a scheduling model for a class of interactive services in which requests are time bounded and lower result quality can be traded for shorter execution time. These applications include web search engines, finance servers, and other interactive, on-line services. We develop an efficient scheduling algorithm, Zeta, that allocates processor time among service requests to maximize the quality and minimize the variance of the response. Zeta exploits the concavity of the request quality profile to distribute processing time among outstanding requests. By executing some requests partially (obtaining much or most of the benefit of a full execution), Zeta frees resources for other requests, which might otherwise have timed out and produced no results. Compared to scheduling algorithms that consider only deadline or quality-profile information, Zeta improves overall response quality and reduces response-quality variance, yielding significant improvement in the high-percentile response quality. We implemented and deployed Zeta in the Microsoft Bing web search engine and evaluated its performance in a production environment with realistic workloads. Measurements show that at the same response quality and latency as the production system, Zeta increases system capacity by 29% by improving both average and high-percentile response quality. We also implemented Zeta in a finance server that computes option prices. In this application, Zeta improves average response quality by 28% and the 99th-percentile quality by 80%. Using a simulation, we also compared Zeta to the offline optimal schedule and other scheduling algorithms. Although Zeta only approaches the optimal, it provides better performance than prior algorithms under a wide variety of operating conditions.
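
The concavity argument can be made concrete with a greedy allocator that always grants the next slice of processor time to the request with the largest marginal quality gain. The exponential quality profile and all names below are illustrative assumptions, not Zeta's actual policy.

```python
# With concave quality profiles, each extra slice of time yields
# diminishing quality, so repeatedly granting the next slice to the
# request with the largest marginal gain maximizes total quality for a
# fixed time budget.

import heapq
import math

def quality(profile_scale, t):
    # Concave quality profile: fast early gains, diminishing returns.
    return profile_scale * (1.0 - math.exp(-t))

def allocate(requests, budget_slices):
    """requests: {request_id: profile_scale}; returns slices granted."""
    granted = {r: 0 for r in requests}
    # Max-heap keyed on the marginal gain of each request's next slice.
    heap = [(-(quality(s, 1) - quality(s, 0)), r) for r, s in requests.items()]
    heapq.heapify(heap)
    for _ in range(budget_slices):
        _, r = heapq.heappop(heap)
        granted[r] += 1
        t = granted[r]
        gain = quality(requests[r], t + 1) - quality(requests[r], t)
        heapq.heappush(heap, (-gain, r))
    return granted

# The steeper profile (q2) earns more slices, but q1 is still served
# partially rather than starved entirely.
print(allocate({"q1": 1.0, "q2": 3.0}, budget_slices=6))
```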


International Conference on Data Engineering | 2012

Horton: Online Query Execution Engine for Large Distributed Graphs

Mohamed Sarwat; Sameh Elnikety; Yuxiong He; Gabriel Kliot

Graphs are used in many large-scale applications, such as social networking. The management of these graphs poses new challenges because such graphs are too large for a single server to manage efficiently. Current distributed techniques such as MapReduce and Pregel are not well-suited to processing interactive ad-hoc queries against large graphs. In this paper, we demonstrate Horton, a distributed interactive query execution engine for large graphs. Horton defines a query language that allows the expression of regular-language reachability queries and provides a query execution engine with a query optimizer that enables interactive execution of queries on large distributed graphs in parallel. In the demo, we show the functionality of Horton managing a large graph for a social networking application called Codebook, whose graph represents data on software components, developers, development artifacts such as bug reports, and their interactions in large software projects.
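
A toy sketch of regular-language reachability: the label sequence of a path query acts as a small automaton, and a BFS explores (node, automaton-state) pairs in the product of the graph and the automaton. The graph data and query are invented, and the sketch handles only a fixed label sequence rather than Horton's full query language with closures.

```python
# Evaluate a label-path query such as "Manages.Commits" by tracking how
# far along the label sequence each BFS frontier entry has matched.

from collections import deque

edges = {  # node -> list of (edge_label, neighbor)
    "alice": [("Manages", "bob")],
    "bob":   [("Commits", "checkin42")],
    "checkin42": [],
}

def reach(start, labels):
    """Nodes reached from `start` along a path matching `labels`."""
    frontier = deque([(start, 0)])   # (graph node, automaton state)
    results, seen = set(), set()
    while frontier:
        node, state = frontier.popleft()
        if state == len(labels):
            results.add(node)        # full label sequence matched
            continue
        for label, nxt in edges[node]:
            if label == labels[state] and (nxt, state + 1) not in seen:
                seen.add((nxt, state + 1))
                frontier.append((nxt, state + 1))
    return results

# "Which check-ins were committed by someone Alice manages?"
print(reach("alice", ["Manages", "Commits"]))
```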


International Conference on Distributed Computing Systems | 2006

An empirical evaluation of work stealing with parallelism feedback

Kunal Agrawal; Yuxiong He; Charles E. Leiserson

A-STEAL is a provably good adaptive work-stealing thread scheduler that provides parallelism feedback to a multiprocessor job scheduler. A-STEAL uses a simple multiplicative-increase, multiplicative-decrease algorithm to provide continual parallelism feedback to the job scheduler in the form of processor requests. Although jobs scheduled by A-STEAL can be shown theoretically to complete in near-optimal time asymptotically while utilizing at least a constant fraction of the allotted processors, the constants in the analysis leave open the question of whether A-STEAL works well in practice. This paper confirms with simulation studies that A-STEAL performs well when scheduling adaptively parallel work-stealing jobs on large-scale multiprocessors. Our studies monitored the behavior of A-STEAL on a simulated multiprocessor system using synthetic workloads. We measured the completion time and waste of A-STEAL on over 2300 job runs using a variety of processor availability profiles. Linear-regression analysis indicates that A-STEAL provides almost perfect linear speedup. In addition, A-STEAL typically wasted less than 20% of the processor cycles allotted to the job. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora, Blumofe, and Plaxton, which does not employ parallelism feedback. On moderately to heavily loaded large machines with predetermined availability profiles, A-STEAL typically completed jobs more than twice as quickly, despite being allotted the same or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP. We also compared the utilization of A-STEAL and ABP when many jobs with varying characteristics use the same multiprocessor. These experiments provide evidence that A-STEAL consistently provides higher utilization than ABP for a variety of job mixes.
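
A minimal sketch of the multiplicative-increase, multiplicative-decrease feedback loop; the thresholds and the deprived-quantum rule here are simplifications in the spirit of A-STEAL, not its exact definition.

```python
# After each scheduling quantum, adjust the processor "desire" reported
# to the job scheduler based on how the last quantum went.

RHO = 2          # multiplicative increase/decrease factor (assumed)
DELTA = 0.8      # utilization threshold for an "efficient" quantum

def next_desire(desire, allotted, useful_fraction):
    """useful_fraction: share of allotted cycles spent on useful work
    rather than (failed) steal attempts during the last quantum."""
    if useful_fraction < DELTA:
        return max(1, desire // RHO)   # inefficient quantum: back off
    if allotted >= desire:
        return desire * RHO            # efficient and satisfied: grow
    return desire                      # efficient but deprived: hold

d = 1
for allotted, util in [(1, 0.99), (2, 0.97), (4, 0.95), (8, 0.50), (4, 0.90)]:
    d = next_desire(d, allotted, util)
    print(f"allotted={allotted:2d} utilization={util:.2f} -> desire={d}")
```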


International Conference on Data Engineering | 2014

Mercury: A memory-constrained spatio-temporal real-time search on microblogs

Amr Magdy; Mohamed F. Mokbel; Sameh Elnikety; Suman Nath; Yuxiong He

This paper presents Mercury, a system for real-time support of top-k spatio-temporal queries on microblogs, where users are able to browse recent microblogs near their locations. With high arrival rates of microblogs, Mercury ensures real-time query response within a tight memory-constrained environment. Mercury bounds its search space to include only those microblogs that have arrived within certain spatial and temporal boundaries, and only the top-k microblogs, according to a spatio-temporal ranking function, are returned in the search results. Mercury employs: (a) a scalable dynamic in-memory index structure that is capable of digesting all incoming microblogs, (b) an efficient query processor that exploits the in-memory index through spatio-temporal pruning techniques that reduce the number of visited microblogs needed to return the final answer, (c) an index size tuning module that dynamically finds and adjusts the minimum index size to ensure that incoming queries will be answered accurately, and (d) a load-shedding technique that trades a slight decrease in query accuracy for significant storage savings. Extensive experimental results based on a real-time Twitter Firehose feed and actual locations of Bing search queries show that Mercury supports arrival rates of up to 64K microblogs/second with an average query latency of 4 ms.
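
A toy sketch of the bounded top-k idea: prune candidates to a temporal window and a spatial boundary, then rank the survivors by a weighted spatio-temporal score. The weights, units, and data layout below are assumptions, not Mercury's ranking function or index.

```python
# Bounded spatio-temporal top-k over an in-memory list of microblogs.

import heapq
import time

def score(post, now, query_loc, alpha=0.5):
    """Lower is better: weighted mix of recency and spatial distance."""
    dx = post["x"] - query_loc[0]
    dy = post["y"] - query_loc[1]
    dist = (dx * dx + dy * dy) ** 0.5
    age = now - post["ts"]
    return alpha * age + (1 - alpha) * dist

def topk(posts, query_loc, k=3, window_s=3600, radius=5.0):
    now = time.time()
    candidates = [
        p for p in posts
        if now - p["ts"] <= window_s                 # temporal bound
        and abs(p["x"] - query_loc[0]) <= radius     # spatial bound
        and abs(p["y"] - query_loc[1]) <= radius
    ]
    return heapq.nsmallest(k, candidates, key=lambda p: score(p, now, query_loc))

now = time.time()
posts = [
    {"id": 1, "x": 0.5, "y": 0.2, "ts": now - 30},
    {"id": 2, "x": 4.0, "y": 4.0, "ts": now - 7200},   # too old: pruned
    {"id": 3, "x": 9.0, "y": 9.0, "ts": now - 10},     # too far: pruned
    {"id": 4, "x": 1.0, "y": 1.0, "ts": now - 600},
]
print([p["id"] for p in topk(posts, query_loc=(0.0, 0.0))])
```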


Very Large Data Bases | 2013

Horton+: a distributed system for processing declarative reachability queries over partitioned graphs

Mohamed Sarwat; Sameh Elnikety; Yuxiong He; Mohamed F. Mokbel

Horton+ is a graph query processing system that executes declarative reachability queries on a partitioned attributed multi-graph. It comprises a query language, a query optimizer, and a distributed execution engine. The query language expresses declarative reachability queries and supports closures and predicates on node and edge attributes to match graph paths. We introduce three algebraic operators, select, traverse, and join, and a query is compiled into an execution plan containing these operators. Because reachability queries access graph elements in a random access pattern, the graph is maintained in the main memory of a cluster of servers to reduce query execution time. We develop a distributed execution engine that processes a query plan in parallel on the graph servers. Since the query language is declarative, we build a query optimizer that uses graph statistics to estimate predicate selectivity. We experimentally evaluate the system performance on a cluster of 16 graph servers using synthetic graphs as well as a real graph from an application that uses reachability queries. The evaluation shows (1) the efficiency of the optimizer in reducing query execution time, (2) system scalability with the size of the graph and with the number of servers, and (3) the convenience of using declarative queries.
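
One optimizer decision such a system faces can be sketched directly: estimate the starting set at each end of a reachability query from predicate-selectivity statistics and traverse from the more selective end to shrink intermediate results. The statistics and cost model below are invented for illustration, not Horton+'s actual optimizer.

```python
# Selectivity-driven choice of traversal direction for a path query.

node_counts = {"Person": 1_000_000, "BugReport": 20_000}
predicate_selectivity = {("Person", "name='Ada'"): 1e-6,
                         ("BugReport", "severity='critical'"): 0.05}

def estimated_start_set(label, predicate):
    """Estimated number of nodes satisfying the endpoint predicate."""
    return node_counts[label] * predicate_selectivity[(label, predicate)]

def choose_direction(left, right):
    """left/right: (node label, predicate) at each end of the path."""
    l, r = estimated_start_set(*left), estimated_start_set(*right)
    return "left-to-right" if l <= r else "right-to-left"

# ~1 matching Person vs ~1000 matching BugReports: start from the Person.
print(choose_direction(("Person", "name='Ada'"),
                       ("BugReport", "severity='critical'")))
```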


Architectural Support for Programming Languages and Operating Systems | 2015

Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services

E. Haque; Yong hun Eom; Yuxiong He; Sameh Elnikety; Ricardo Bianchini; Kathryn S. McKinley

Interactive services, such as Web search, recommendations, games, and finance, must respond quickly to satisfy customers. Achieving this goal requires optimizing tail (e.g., 99th+ percentile) latency. Although every server is multicore, parallelizing individual requests to reduce tail latency is challenging because (1) service demand is unknown when requests arrive; (2) blindly parallelizing all requests quickly oversubscribes hardware resources; and (3) parallelizing the numerous short requests will not improve tail latency. This paper introduces Few-to-Many (FM) incremental parallelization, which dynamically increases parallelism to reduce tail latency. FM uses request service demand profiles and hardware parallelism in an offline phase to compute a policy, represented as an interval table, which specifies when and how much software parallelism to add. At runtime, FM adds parallelism as specified by the interval table, indexed by dynamic system load and request execution time progress. The longer a request executes, the more parallelism FM adds. We evaluate FM in Lucene, an open-source enterprise search engine, and in Bing, a commercial Web search engine. FM improves the 99th-percentile response time by up to 32% in Lucene and by up to 26% in Bing, compared to prior state-of-the-art parallelization. Compared to running requests sequentially in Bing, FM improves tail latency by a factor of two. These results illustrate that incremental parallelism is a powerful tool for reducing tail latency.
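
The interval-table mechanism can be sketched directly; the table contents, load levels, and thresholds below are invented for illustration, not FM's computed policy.

```python
# Indexed by system load and by how long a request has already executed,
# the interval table says how many threads the request should be running.
# The longer a request runs, the more parallelism it receives.

# load level -> list of (elapsed-ms lower bound, target thread count)
INTERVAL_TABLE = {
    "low":  [(0, 1), (5, 2), (20, 4), (80, 8)],
    "high": [(0, 1), (20, 2), (120, 4)],   # under load, ramp up later
}

def target_parallelism(load, elapsed_ms):
    degree = 1
    for lower_bound, threads in INTERVAL_TABLE[load]:
        if elapsed_ms >= lower_bound:
            degree = threads
    return degree

# Short requests finish before parallelism is ever added; long requests
# ramp up incrementally, and more slowly when the system is loaded.
for elapsed in (1, 10, 50, 200):
    print(elapsed, target_parallelism("low", elapsed),
          target_parallelism("high", elapsed))
```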

Collaboration


Dive into Yuxiong He's collaborations.

Top Co-Authors

Shaolei Ren

University of California


Wen-Jing Hsu

Nanyang Technological University


Charles E. Leiserson

Massachusetts Institute of Technology


Hongyang Sun

École normale supérieure de Lyon


Kunal Agrawal

Washington University in St. Louis
