Hongqiang Harry Liu | Researchain

Publication

Featured research published by Hongqiang Harry Liu.

acm special interest group on data communication | 2015

Dynamic scheduling of network updates

Xin Jin; Hongqiang Harry Liu; Rohan Gandhi; Srikanth Kandula; Ratul Mahajan; Ming Zhang; Jennifer Rexford; Roger Wattenhofer

We present Dionysus, a system for fast, consistent network updates in software-defined networks. Dionysus encodes as a graph the consistency-related dependencies among updates at individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different switches. This dynamic scheduling is the key to its speed; prior update methods are slow because they pre-determine a schedule, which does not adapt to runtime conditions. Testbed experiments and data-driven simulations show that Dionysus improves the median update speed by 53--88% in both wide area and data center networks compared to prior methods.
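To illustrate what "dynamic scheduling" over a dependency graph means, here is a minimal sketch, not Dionysus's actual scheduler. It assumes an acyclic dependency graph whose nodes are all listed as keys, and per-update durations that stand in for how quickly each (hypothetical) switch applies its change. An update is issued the moment its last prerequisite finishes, rather than at a precomputed round boundary.

```python
import heapq


def dynamic_schedule(deps, duration):
    """Issue each update as soon as all of its prerequisites have completed.

    deps:     update -> set of updates that must finish first (acyclic here;
              every update must appear as a key)
    duration: update -> seconds the (hypothetical) switch needs to apply it
    Returns (issue_order, makespan).
    """
    pending = {u: set(ds) for u, ds in deps.items()}
    dependents = {u: [] for u in deps}
    for u, ds in deps.items():
        for d in ds:
            dependents[d].append(u)

    events = []  # min-heap of (finish_time, update)
    issue_order = []

    def issue(update, now):
        issue_order.append(update)
        heapq.heappush(events, (now + duration[update], update))

    # Updates with no prerequisites can be sent immediately.
    for u, ds in pending.items():
        if not ds:
            issue(u, 0.0)

    makespan = 0.0
    while events:
        now, done = heapq.heappop(events)  # next switch to report completion
        makespan = now
        for v in dependents[done]:
            pending[v].discard(done)
            if not pending[v]:
                issue(v, now)  # react to runtime speed, not a fixed plan

    if len(issue_order) != len(deps):
        raise ValueError("dependency cycle: some updates can never be issued")
    return issue_order, makespan
```

With `deps = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}}` and durations A=1, B=5, C=1, D=1, update C is issued at t=1 and finishes at t=2, instead of waiting for the slow switch B as a fixed round-based plan would. The whole update completes at t=6.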

acm special interest group on data communication | 2013

zUpdate: updating data center networks with zero loss

Hongqiang Harry Liu; Xin Wu; Ming Zhang; Lihua Yuan; Roger Wattenhofer; David A. Maltz

Datacenter networks (DCNs) are constantly evolving due to various updates such as switch upgrades and VM migrations. Each update must be carefully planned and executed in order to avoid disrupting the many mission-critical, interactive applications hosted in DCNs. The key challenge arises from the inherent difficulty of synchronizing changes across many devices, which may result in unforeseen transient link load spikes or even congestion. We present zUpdate, a primitive for performing congestion-free network updates under asynchronous switch and traffic matrix changes. We formulate the update problem using a network model and apply our model to a variety of representative update scenarios in DCNs. We develop novel techniques to handle several practical challenges in realizing zUpdate, implement a zUpdate prototype on OpenFlow switches, and deploy it on a testbed that resembles a real DCN topology. Our results, from both real-world experiments and large-scale trace-driven simulations, show that zUpdate can effectively perform congestion-free updates in production DCNs.
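The transient-congestion problem can be made concrete with a check of one transition step. This is a sketch under an assumption: it uses the common conservative bound for asynchronous updates, where any flow may still follow its old split on a link while another already follows its new one. It is not presented as zUpdate's full formulation, and the flow and link names are hypothetical.

```python
def transition_is_congestion_free(old, new, capacity):
    """old/new: flow -> {link: load on that link}; capacity: link -> capacity.

    While switches update asynchronously, each flow may be in its old or its
    new state at any given link, so the worst-case load on a link is bounded
    by the sum over flows of max(old load, new load).
    """
    flows = set(old) | set(new)
    for link, cap in capacity.items():
        worst = sum(
            max(old.get(f, {}).get(link, 0.0), new.get(f, {}).get(link, 0.0))
            for f in flows
        )
        if worst > cap:
            return False
    return True


# Two 6-unit flows swapping 10-unit links: each link could briefly carry 12.
old = {"f1": {"L1": 6.0}, "f2": {"L2": 6.0}}
new = {"f1": {"L2": 6.0}, "f2": {"L1": 6.0}}
capacity = {"L1": 10.0, "L2": 10.0}
print(transition_is_congestion_free(old, new, capacity))  # False
```

Even though both the old and the new state fit within capacity on their own, the direct swap fails the check. A planner built on this test would have to find a different transition.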

acm special interest group on data communication | 2012

Optimizing cost and performance for content multihoming

Hongqiang Harry Liu; Ye Wang; Yang Richard Yang; Hao Wang; Chen Tian

Many large content publishers use multiple content distribution networks to deliver their content, and many commercial systems have become available to help a broader set of content publishers to benefit from using multiple distribution networks, which we refer to as content multihoming. In this paper, we conduct the first systematic study on optimizing content multihoming, by introducing novel algorithms to optimize both performance and cost for content multihoming. In particular, we design a novel, efficient algorithm to compute assignments of content objects to content distribution networks for content publishers, considering both cost and performance. We also design a novel, lightweight client adaptation algorithm executing at individual content viewers to achieve scalable, fine-grained, fast online adaptation to optimize the quality of experience (QoE) for individual viewers. We prove the optimality of our optimization algorithms and conduct systematic, extensive evaluations, using real charging data, content viewer demands, and performance data, to demonstrate the effectiveness of our algorithms. We show that our content multihoming algorithms reduce publishing cost by up to 40%. Our client algorithm executing in browsers reduces viewer QoE degradation by 51%.

acm special interest group on data communication | 2015

Duet: cloud scale load balancing with hardware and software

Rohan Gandhi; Hongqiang Harry Liu; Y. Charlie Hu; Guohan Lu; Jitendra Padhye; Lihua Yuan; Ming Zhang

Load balancing is a foundational function of datacenter infrastructures and is critical to the performance of online services hosted in datacenters. As the demand for cloud services grows, expensive and hard-to-scale dedicated hardware load balancers are being replaced with software load balancers that scale using a distributed data plane running on commodity servers. Software load balancers offer low cost, high availability and high flexibility, but suffer from high latency and low capacity per load balancer, making them less than ideal for applications that demand high throughput, low latency, or both. In this paper, we present Duet, which offers all the benefits of a software load balancer, along with low latency and high availability, at next to no cost. We do this by exploiting a hitherto overlooked resource in data center networks: the switches themselves. We show how to embed the load balancing functionality into existing hardware switches, thereby achieving organic scalability at no extra cost. For flexibility and high availability, Duet seamlessly integrates the switch-based load balancer with a small deployment of software load balancers. We enumerate and solve several architectural and algorithmic challenges involved in building such a hybrid load balancer. We evaluate Duet using a prototype implementation, as well as extensive simulations driven by traces from our production data centers. Our evaluation shows that Duet provides 10x more capacity than a software load balancer, at a fraction of the cost, while reducing latency by a factor of 10 or more, and is able to quickly adapt to network dynamics including failures.
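The hybrid design can be sketched in two pieces. The first is ECMP-style hashing that pins a flow to one backend server (DIP) behind a virtual IP (VIP). The second is a placement step that decides which VIPs fit in a switch's limited tables, with the rest falling back to software load balancers. Both functions, the one-entry-per-DIP table model, and the greedy heaviest-first rule are illustrative assumptions, not Duet's published assignment algorithm.

```python
import hashlib


def pick_dip(five_tuple, dips):
    """Map a flow's 5-tuple to one DIP by hashing, ECMP-style,
    so every packet of the same flow reaches the same backend server."""
    key = "|".join(map(str, five_tuple)).encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return dips[h % len(dips)]


def place_vips(vips, table_entries):
    """vips: VIP name -> (traffic, number of DIPs).

    Hypothetical greedy placement: serve the heaviest VIPs from one switch's
    limited tables (assumed to cost one entry per DIP) and fall back to
    software load balancers for whatever does not fit.
    """
    hardware, software = [], []
    free = table_entries
    for name, (_traffic, ndips) in sorted(vips.items(), key=lambda kv: -kv[1][0]):
        if ndips <= free:
            hardware.append(name)
            free -= ndips
        else:
            software.append(name)
    return hardware, software
```

For example, `place_vips({"web": (100, 4), "db": (50, 8), "api": (80, 3)}, 8)` puts `web` and `api` on the switch and leaves `db` to the software tier. Placing by traffic reflects the intuition that the switch should absorb as much load as its memory allows.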

acm special interest group on data communication | 2015

Traffic engineering with forward fault correction

Hongqiang Harry Liu; Srikanth Kandula; Ratul Mahajan; Ming Zhang; David Gelernter

Faults such as link failures and high switch configuration delays can cause heavy congestion and packet loss. Because it takes time to detect and react to faults, these conditions can last a long time, even tens of seconds. We propose forward fault correction (FFC), a proactive approach to handling faults. FFC spreads network traffic such that freedom from congestion is guaranteed under arbitrary combinations of up to k faults. We show how FFC can be practically realized by compactly encoding the constraints that arise from this large number of possible faults and solving them efficiently using sorting networks. Experiments with data from real networks show that, with negligible loss in overall network throughput, FFC can reduce data loss by a factor of 7--130 in well-provisioned networks, and reduce the loss of high-priority traffic to almost zero in well-utilized networks.
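The reason sorting networks help can be shown on a single flow. This sketch assumes the flow is split across tunnels with allocations a_t, that each fault disables exactly one tunnel (for example, link-disjoint tunnels), and that traffic is rescaled onto the surviving tunnels. Under those assumptions, the rate the flow can still carry after any k faults is Σ a_t minus the sum of the k largest a_t. Enumerating every fault combination grows combinatorially. A few passes of a bubble-sort network compute the same quantity using only pairwise max/min compare-and-swaps, the operation that a sorting-network encoding turns into compact constraints. The function names are hypothetical.

```python
from itertools import combinations


def worst_case_residual_bruteforce(alloc, k):
    """Enumerate every set of up to k failed tunnels (exponential in k)."""
    total = sum(alloc)
    worst = total
    for r in range(1, min(k, len(alloc)) + 1):
        for failed in combinations(range(len(alloc)), r):
            worst = min(worst, total - sum(alloc[i] for i in failed))
    return worst


def sum_k_largest_sorting_network(alloc, k):
    """k passes of a bubble-sort network built from compare-and-swap steps."""
    x = list(alloc)
    n = len(x)
    passes = min(k, n)
    for p in range(passes):
        for i in range(n - 1 - p):
            lo, hi = min(x[i], x[i + 1]), max(x[i], x[i + 1])
            x[i], x[i + 1] = lo, hi
        # After pass p, position n-1-p holds the (p+1)-th largest value.
    return sum(x[n - passes:])


def ffc_residual(alloc, k):
    """Rate guaranteed to survive any k tunnel failures."""
    return sum(alloc) - sum_k_largest_sorting_network(alloc, k)


print(ffc_residual([4, 1, 3, 2], 2))  # 3
```

With allocations 4, 1, 3, 2 and k=2, the worst case loses the tunnels carrying 4 and 3, leaving 3. A flow whose demand is at most 3 is therefore protected against any two tunnel failures.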

symposium on operating systems principles | 2017

CrystalNet: Faithfully Emulating Large Production Networks

Hongqiang Harry Liu; Yibo Zhu; Jitu Padhye; Jiaxin Cao; Sri Tallapragada; Nuno P. Lopes; Andrey Rybalchenko; Guohan Lu; Lihua Yuan

Network reliability is critical for large clouds and online service providers like Microsoft. Our network is large, heterogeneous, complex and undergoes constant churn. In such an environment even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools and unavoidable human errors can quickly cause large outages. A promising way to minimize such network outages is to proactively validate all network operations in a high-fidelity network emulator before they are carried out in production. To this end, we present CrystalNet, a cloud-scale, high-fidelity network emulator. It runs real network device firmware in a network of containers and virtual machines, loaded with production configurations. Network engineers can use the same management tools and methods to interact with the emulated network as they do with a production network. CrystalNet can handle heterogeneous device firmware and can scale to emulate thousands of network devices in a matter of minutes. To reduce resource consumption, it carefully selects a boundary of emulations, while ensuring correctness of propagation of network changes. Microsoft's network engineers use CrystalNet on a daily basis to test planned network operations. Our experience shows that CrystalNet enables operators to detect many issues that could trigger significant outages.

hot topics in networks | 2016

FreeFlow: High Performance Container Networking

Tianlong Yu; Shadi A. Noghabi; Shachar Raindel; Hongqiang Harry Liu; Jitu Padhye; Vyas Sekar

With the tremendous popularity gained by container technology, many applications are being containerized: split into numerous containers connected by networks. However, current container networking solutions have either poor performance or poor portability, which undermines the advantages of containerization. In this paper, we propose FreeFlow, a container networking solution that achieves both high performance and good portability. FreeFlow is based on the observation that strict isolation is unnecessary among containers that trust each other, and it can significantly boost the communication quality of containers by slightly relaxing isolation. Specifically, we enable containers on the same physical machine to communicate via shared memory, and containers on different physical machines to communicate via high-performance networking options, e.g., RDMA and DPDK. Naively bundling all of these solutions together would result in poor portability of containers and huge complexity in application development. Instead, FreeFlow leverages a network abstraction that supports all common network APIs and a centralized network orchestrator that decides how to deliver data, transparently to the applications in the containers.
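The orchestrator's role of choosing a data path per connection can be sketched as a small decision function. The `Endpoint` type, its capability flags, the preference order (shared memory, then RDMA, then DPDK), and the TCP fallback are all assumptions for illustration. The abstract names shared memory, RDMA, and DPDK but does not specify how FreeFlow ranks them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    container: str
    host: str
    rdma: bool = False  # assumed flag: host NIC supports RDMA
    dpdk: bool = False  # assumed flag: host supports DPDK


def choose_transport(src: Endpoint, dst: Endpoint) -> str:
    """Pick a data path for a src->dst container connection, as a central
    orchestrator might; the application-facing API stays the same."""
    if src.host == dst.host:
        return "shared-memory"  # co-located containers that trust each other
    if src.rdma and dst.rdma:
        return "rdma"
    if src.dpdk and dst.dpdk:
        return "dpdk"
    return "tcp"  # portable fallback when no fast path is available
```

Because the choice happens outside the application, the same container image works whether it lands next to its peer, on an RDMA-capable host, or on a plain host. That separation is the portability argument the abstract makes.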

symposium on operating systems principles | 2017

Automatically Repairing Network Control Planes Using an Abstract Representation

Aaron Gember-Jacobson; Aditya Akella; Ratul Mahajan; Hongqiang Harry Liu

The forwarding behavior of computer networks is governed by the configuration of distributed routing protocols and access filters, collectively known as the network control plane. Unfortunately, control plane configurations are often buggy, causing networks to violate important policies: e.g., specific traffic classes (defined in terms of source and destination endpoints) should always be able to reach their destination, or always traverse a waypoint. Manually repairing these configurations is daunting because of their intertwined nature across routers, traffic classes, and policies. Inspired by recent work in automatic program repair, we introduce CPR, a system that automatically computes correct, minimal repairs for network control planes. CPR casts configuration repair as a MaxSMT problem whose constraints are based on a digraph-based representation of a control plane's semantics. Crucially, this representation must capture the dependencies between traffic classes arising from the cross-traffic-class nature of control plane constructs. The MaxSMT formulation must account for these dependencies while also accounting for all policies and preferring repairs that minimize the size (e.g., number of lines) of the configuration changes. Using configurations from 96 data center networks, we show that CPR produces repairs in less than a minute for 98% of the networks, and these repairs require changing the same number of configuration lines as hand-written repairs, or fewer, in 79% of cases.
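CPR itself solves a MaxSMT problem over a digraph encoding of control-plane semantics. As a much smaller stand-in, the sketch below swaps in brute-force search. The control plane is modeled as a plain digraph of allowed adjacencies, which is an assumption. A "repair" flips candidate edges, each flip counts as one changed configuration line, and the search returns the smallest set of flips that satisfies reachability, isolation (`block`), and waypoint policies. All names and the policy tuple format are hypothetical.

```python
from itertools import combinations


def reachable(edges, src, dst, avoid=None):
    """Depth-first search over directed edges, optionally skipping one node."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for a, b in edges:
            if a == node and b not in seen and b != avoid:
                seen.add(b)
                stack.append(b)
    return False


def satisfies(edges, policies):
    """policies: ("reach", s, d) | ("block", s, d) | ("waypoint", s, d, w)."""
    for kind, src, dst, *rest in policies:
        if kind == "reach" and not reachable(edges, src, dst):
            return False
        if kind == "block" and reachable(edges, src, dst):
            return False
        if kind == "waypoint":
            # Must reach dst, and no path may bypass the waypoint.
            if not reachable(edges, src, dst) or reachable(edges, src, dst, avoid=rest[0]):
                return False
    return True


def minimal_repair(edges, candidates, policies, max_changes=3):
    """Return the smallest set of edge flips that satisfies every policy,
    or None if no repair of size <= max_changes exists."""
    for size in range(max_changes + 1):
        for flips in combinations(sorted(candidates), size):
            if satisfies(set(edges) ^ set(flips), policies):
                return set(flips)
    return None
```

In a small example with edges A→B, B→D, A→C and the policies "C reaches D" plus "A reaches D only via B", adding C→D would open a path from A to D that bypasses B. The minimal repair is instead to add C→B. This shows why repairs must account for policies across traffic classes at once, as the abstract stresses.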

acm special interest group on data communication | 2018

HyperLoop: group-based NIC-offloading to accelerate replicated transactions in multi-tenant storage systems

Daehyeok Kim; Amirsaman Memaripour; Anirudh Badam; Yibo Zhu; Hongqiang Harry Liu; Jitu Padhye; Shachar Raindel; Steven Swanson; Vyas Sekar; Srinivasan Seshan

Storage systems in data centers are an important component of large-scale online services. They typically perform replicated transactional operations for high data availability and integrity. Today, however, such operations suffer from high tail latency even with recent kernel bypass and storage optimizations, and thus affect the predictability of end-to-end performance of these services. We observe that the root cause of the problem is the involvement of the CPU, a precious commodity in multi-tenant settings, in the critical path of replicated transactions. In this paper, we present HyperLoop, a new framework that removes the CPU from the critical path of replicated transactions in storage systems by offloading them to commodity RDMA NICs, with non-volatile memory as the storage medium. To achieve this, we develop new and general NIC offloading primitives that can perform memory operations on all nodes in a replication group while guaranteeing ACID properties without CPU involvement. We demonstrate that popular storage applications can be easily optimized using our primitives. Our evaluation results with microbenchmarks and application benchmarks show that HyperLoop can reduce 99th percentile latency by about 800× with close to 0% CPU consumption on replicas.

acm special interest group on data communication | 2017

Closing the Network Diagnostics Gap with Vigil

Behnaz Arzani; Selim Ciraci; Luiz F. O. Chamon; Yibo Zhu; Hongqiang Harry Liu; Jitu Padhye; Geoff Outhred; Boon Thau Loo

Vigil started with an ambitious goal: for every TCP retransmission in our data centers, we wanted to pinpoint the network link that caused the packet drop that triggered the retransmission, with negligible diagnostic overhead and negligible changes to the networking infrastructure. This goal may sound like overkill: after all, TCP is supposed to be able to deal with a few packet losses. Packet losses might occur due to simple congestion instead of network equipment failures. Even network failures might be transient. Above all, there is a danger of drowning in a sea of data without generating any actionable intelligence.
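The summary above does not describe Vigil's inference algorithm. One simple heuristic for the stated goal, sketched here purely as an assumption, is path voting. Every flow that saw a retransmission splits one vote evenly across the links on its path, which is assumed to be known (for example, from a traceroute). Links that appear on many bad paths then rise to the top, while a link touched by a single transient drop collects only a fraction of a vote.

```python
from collections import Counter


def rank_links(retransmission_paths):
    """retransmission_paths: one list of link IDs per flow that retransmitted.

    Each flow casts a total of one vote, split evenly over its path's links;
    returns (link, score) pairs, highest score first.
    """
    votes = Counter()
    for path in retransmission_paths:
        if not path:
            continue  # no path information for this flow
        for link in path:
            votes[link] += 1.0 / len(path)
    return votes.most_common()
```

For paths `[["a", "b", "c"], ["a", "d"], ["a", "e"]]`, link `a` scores 4/3 and every other link scores at most 1/2. That is the kind of actionable ranking, rather than raw retransmission counts, that the passage argues is needed to avoid drowning in data.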
