Featured Research

Performance

Direct N-body application on low-power and energy-efficient parallel architectures

The aim of this work is to quantitatively evaluate the impact of computation on the energy consumption of ARM MPSoC platforms, exploiting CPUs, embedded GPUs and FPGAs. One of these platforms possibly represents the future of High Performance Computing systems: a prototype of an Exascale supercomputer. Performance and energy measurements are made using a state-of-the-art direct N-body code from the astrophysical domain. We provide a comparison of the time-to-solution and energy-delay product metrics for different software configurations. We have shown that FPGA technologies can be used for application kernel acceleration and are emerging as a promising alternative to "traditional" technologies for HPC, which focus purely on peak performance rather than on power efficiency.
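
As a rough illustration of the two metrics named above, the sketch below computes energy-to-solution and the energy-delay product (EDP, energy multiplied by time-to-solution). It assumes a constant average power over the run; the configuration names and numbers are hypothetical and are not taken from the paper.

```python
def energy_to_solution(avg_power_watts: float, time_s: float) -> float:
    """Energy in joules, assuming constant average power over the run."""
    return avg_power_watts * time_s


def energy_delay_product(avg_power_watts: float, time_s: float) -> float:
    """EDP in joule-seconds: energy-to-solution multiplied by time-to-solution."""
    return energy_to_solution(avg_power_watts, time_s) * time_s


# Hypothetical (average power in W, time-to-solution in s) per configuration.
configs = {"cpu": (10.0, 100.0), "gpu": (15.0, 40.0), "fpga": (8.0, 60.0)}

# Rank configurations from lowest (best) to highest EDP.
ranked = sorted(configs, key=lambda name: energy_delay_product(*configs[name]))
```

A lower EDP is better; because time enters twice, the metric penalizes slow runs more heavily than energy-to-solution alone.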

Performance

Discrete-time Queueing Model of Age of Information with Multiple Information Sources

Information freshness in IoT-based status update systems has recently been studied through the Age of Information (AoI) and Peak AoI (PAoI) performance metrics. In this paper, we study a discrete-time server arising in multi-source IoT systems which accepts incoming information packets from multiple information sources so as to be forwarded to a remote monitor for status update purposes. Under the assumption of Bernoulli information packet arrivals and a common geometric service time distribution across all the sources, we numerically obtain the exact per-source distributions of AoI and PAoI in matrix-geometric form for three different queueing disciplines: i) Non-Preemptive Bufferless (NPB), ii) Preemptive Bufferless (PB), and iii) Non-Preemptive Single Buffer with Replacement (NPSBR). The proposed numerical algorithm employs the theory of Discrete-Time Markov Chains (DTMC) of Quasi-Birth-Death (QBD) type and is matrix analytical, i.e., the algorithm is based on numerically stable and efficient vector-matrix operations. Numerical examples are provided to validate the accuracy and effectiveness of the proposed queueing model. We also present a numerical example on the optimum choice of the Bernoulli parameters in a practical IoT system with two sources with diverse AoI requirements.
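
To make the setting concrete, here is a minimal Monte Carlo sketch of the Non-Preemptive Bufferless (NPB) discipline. It is a simulation, not the paper's exact matrix-analytic algorithm. The slot conventions are assumptions: arrivals occur at the start of a slot, a packet can complete service in the slot in which it arrives, simultaneous arrivals are resolved uniformly at random, and ages start at zero. The function name is hypothetical.

```python
import random


def simulate_npb_mean_aoi(arrival_probs, service_prob, slots, seed=0):
    """Estimate the per-source mean AoI under a non-preemptive bufferless server."""
    rng = random.Random(seed)
    num_sources = len(arrival_probs)
    age = [0] * num_sources        # current AoI of each source at the monitor
    age_sum = [0] * num_sources    # running sum used for the time average
    in_service = None              # (source, generation slot) or None when idle

    for t in range(slots):
        # Bernoulli arrivals; packets finding the server busy are discarded.
        arrivals = [i for i, p in enumerate(arrival_probs) if rng.random() < p]
        if in_service is None and arrivals:
            in_service = (rng.choice(arrivals), t)

        # Every source ages by one slot.
        for i in range(num_sources):
            age[i] += 1

        # Geometric service: the packet completes with probability service_prob.
        if in_service is not None and rng.random() < service_prob:
            source, generated_at = in_service
            age[source] = t + 1 - generated_at  # age resets to the system time
            in_service = None

        for i in range(num_sources):
            age_sum[i] += age[i]

    return [total / slots for total in age_sum]
```

Such a simulation is mainly useful as a sanity check against exact results; it does not yield the AoI or PAoI distributions in closed or matrix-geometric form.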

Performance

Dispatching to Parallel Servers: Solutions of Poisson's Equation for First-Policy Improvement

Policy iteration techniques for multiple-server dispatching rely on the computation of value functions. In this context, we consider the continuous-space M/G/1-FCFS queue endowed with an arbitrarily-designed cost function for the waiting times of the incoming jobs. The associated value function is a solution of Poisson's equation for Markov chains, which in this work we solve in the Laplace transform domain by considering an ancillary, underlying stochastic process extended to (imaginary) negative backlog states. This construction enables us to issue closed-form value functions for polynomial and exponential cost functions and for piecewise compositions of the latter, in turn permitting the derivation of interval bounds for the value function in the form of power series or trigonometric sums. We review various cost approximation schemes and assess the convergence of the interval bounds these induce on the value function. The schemes considered are Taylor expansions (divergent, except for a narrow class of entire functions with low orders of growth) and uniform approximation schemes (polynomial, trigonometric), which achieve optimal convergence rates over finite intervals. This study addresses all the steps needed to implement dispatching policies for systems of parallel servers, from the specification of general cost functions to the computation of interval bounds for the value functions and the exact implementation of the first-policy improvement step.

Performance

Distributed Server Allocation for Content Delivery Networks

We propose a dynamic formulation of file-sharing networks in terms of an average cost Markov decision process with constraints. By analyzing a Whittle-like relaxation thereof, we propose an index policy in the spirit of Whittle and compare it by simulations with other natural heuristics.

Performance

Dockless Bike-Sharing Systems with Unusable Bikes: Removing, Repair and Redistribution under Batch Policies

This paper discusses a large-scale dockless bike-sharing system (DBSS) with unusable bikes, which can be removed, repaired, redistributed and reused under two batch policies: one for removing the unusable bikes from each parking region to a maintenance shop, and the other for redistributing the repaired bikes from the maintenance shop to some suitable parking regions. For such a bike-sharing system, this paper proposes and develops a new computational method by applying the RG-factorizations of block-structured Markov processes in the closed queueing networks. Unlike previous work in the queueing network literature, a key contribution of our computational method is to set up a new nonlinear matrix equation to determine the relative arrival rates, and to show that the nonlinearity comes from two different groups of processes: the failure and removing processes, and the repair and redistributing processes. Once the relative arrival rate is introduced to each node, the nodes are isolated from each other, so that the Markov processes of all the nodes are independent of each other. The Markov system of each node is then described as a block-structured Markov process whose stationary probabilities can be computed by the RG-factorizations. Based on this, the paper establishes a more general product-form solution of the closed queueing network and provides a performance analysis of the DBSS through a comprehensive discussion of the bikes' failure, removing, repair, redistributing and reuse processes under the two batch policies. We hope that our method opens a new avenue to quantitative evaluation of more general DBSSs with unusable bikes.

Performance

Domain-Sharding for Faster HTTP/2 in Lossy Cellular Networks

HTTP/2 (h2) is a new standard for Web communications that already delivers a large share of Web traffic. Unlike HTTP/1, h2 uses only one underlying TCP connection. In a cellular network with high loss and sudden spikes in latency, which the TCP stack might interpret as loss, using a single TCP connection can negatively impact Web performance. In this paper, we perform an extensive analysis of real-world cellular network traffic and design a testbed to emulate loss characteristics in cellular networks. We use the emulated cellular network to measure h2 performance in comparison to HTTP/1.1, for webpages synthesized from HTTP Archive repository data. Our results show that, in lossy conditions, h2 achieves faster page load times (PLTs) for webpages with small objects. For webpages with large objects, h2 degrades the PLT. We devise a new domain-sharding technique that isolates large and small object downloads on separate connections. Using sharding, we show that under lossy cellular conditions, h2 over multiple connections improves the PLT compared to h2 with one connection and HTTP/1.1 with six connections. Finally, we recommend that content providers and content delivery networks apply h2-aware domain-sharding to webpages currently served over h2 for improved mobile Web performance.
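
The idea of isolating large and small objects on separate connections can be sketched as a URL rewriting step that assigns each object to a shard hostname by size. This is only an illustration: the size threshold, the hostnames and the function names below are hypothetical, and the paper's actual sharding rule may differ.

```python
def shard_for(size_bytes, threshold_bytes=100_000,
              small_host="small.example.com", large_host="large.example.com"):
    """Pick a shard hostname by object size (threshold is an assumed value)."""
    return large_host if size_bytes >= threshold_bytes else small_host


def shard_urls(objects, **options):
    """Map each object path to a URL on its shard.

    `objects` maps a path such as "/img/hero.jpg" to its size in bytes.
    Browsers open a separate connection per hostname, so large downloads
    no longer share a TCP connection with small ones.
    """
    return {
        path: f"https://{shard_for(size, **options)}{path}"
        for path, size in objects.items()
    }


# Hypothetical page with one small and one large object.
page = {"/css/site.css": 12_000, "/img/hero.jpg": 850_000}
urls = shard_urls(page)
```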

Performance

Download Time Analysis for Distributed Storage Codes with Locality and Availability

The paper presents techniques for analyzing the expected download time in distributed storage systems that employ systematic availability codes. These codes provide access to hot data through the systematic server containing the object and multiple recovery groups. When a request for an object is received, it can be replicated (forked) to the systematic server and all recovery groups. We first consider the low-traffic regime and present a closed-form expression for the download time. By comparison across systems with availability, maximum distance separable (MDS), and replication codes, we demonstrate that availability codes can reduce download time in some settings but are not always optimal. In the high-traffic regime, the system consists of multiple inter-dependent Fork-Join queues, making exact analysis intractable. Accordingly, we present upper and lower bounds on the download time, and an M/G/1 queue approximation for several cases of interest. Via extensive numerical simulations, we evaluate our bounds and demonstrate that the M/G/1 queue approximation has a high degree of accuracy.
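
For the low-traffic regime, the forking scheme can be illustrated with a small Monte Carlo sketch. It assumes an otherwise empty system and i.i.d. exponential service times at every server (an assumption made here, not a claim about the paper's model): the download finishes when either the systematic server responds or all servers of some recovery group respond. Function names are hypothetical.

```python
import random


def fork_download_time(rng, num_groups, group_size, rate=1.0):
    """One sample of the download time for a forked request."""
    # The systematic server alone can serve the object.
    best = rng.expovariate(rate)
    for _ in range(num_groups):
        # A recovery group serves the object once all its servers respond.
        group_time = max(rng.expovariate(rate) for _ in range(group_size))
        best = min(best, group_time)
    return best


def mean_download_time(num_groups, group_size, samples=100_000, seed=1):
    """Monte Carlo estimate of the expected download time."""
    rng = random.Random(seed)
    total = sum(fork_download_time(rng, num_groups, group_size)
                for _ in range(samples))
    return total / samples
```

Under these assumptions, one recovery group of two servers at unit rate gives an expected download time of 2/3 (integrating the survival function 2e^{-2x} - e^{-3x}), which the estimate from `mean_download_time(1, 2)` should approach.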

Performance

Duet Benchmarking: Improving Measurement Accuracy in the Cloud

We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3x to 12.5x (5.03x on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8x to 82.4x (37.4x on average) for the SPEC CPU 2017 workloads.
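
The intuition behind evaluating paired measurements can be shown with synthetic data. In the sketch below, interference from the shared machine is modelled as a multiplicative slowdown that hits both artifacts equally when they run in parallel (duet) and independently when they run one after the other. The noise model, its ranges and the function names are assumptions made for illustration; no real benchmark is executed.

```python
import random
import statistics


def simulate_measurements(n, time_a, time_b, seed=0):
    """Return (duet_pairs, sequential_pairs) of synthetic run times."""
    rng = random.Random(seed)
    duet, sequential = [], []
    for _ in range(n):
        shared = rng.uniform(1.0, 3.0)  # interference common to a duet run
        duet.append((time_a * shared * rng.uniform(0.98, 1.02),
                     time_b * shared * rng.uniform(0.98, 1.02)))
        # Sequential runs see independent interference.
        sequential.append((time_a * rng.uniform(1.0, 3.0),
                           time_b * rng.uniform(1.0, 3.0)))
    return duet, sequential


def ratio_spread(pairs):
    """Standard deviation of the per-pair performance ratio A/B."""
    return statistics.pstdev([a / b for a, b in pairs])


duet_pairs, sequential_pairs = simulate_measurements(1000, 2.0, 1.0)
```

In this idealized model the shared slowdown cancels in the duet ratio, so `ratio_spread(duet_pairs)` is much smaller than `ratio_spread(sequential_pairs)`; how well that cancellation holds on real cloud machines is the assumption the paper examines.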

Performance

DynIMS: A Dynamic Memory Controller for In-memory Storage on HPC Systems

In order to boost the performance of data-intensive computing on HPC systems, in-memory computing frameworks, such as Apache Spark and Flink, use local DRAM for data storage. Optimizing the memory allocation to data storage is critical to delivering performance to traditional HPC compute jobs and throughput to data-intensive applications sharing the HPC resources. Current practices that statically configure in-memory storage may leave inadequate space for compute jobs or lose the opportunity to utilize more available space for data-intensive applications. In this paper, we explore techniques to dynamically adjust in-memory storage and make the right amount of space for compute jobs. We have developed a dynamic memory controller, DynIMS, which infers memory demands of compute tasks online and employs a feedback-based control model to adapt the capacity of in-memory storage. We test DynIMS using mixed HPCC and Spark workloads on an HPC cluster. Experimental results show that DynIMS can achieve up to 5x performance improvement compared to systems with static memory allocations.
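
A feedback-based capacity adjustment of this kind can be sketched as a simple proportional controller. This is not the DynIMS control model, whose details are not given here; the function name, the target utilization and the gain are hypothetical values chosen for illustration.

```python
def next_storage_capacity(current_capacity_gb, total_memory_gb, compute_usage_gb,
                          target_utilization=0.9, gain=0.5, min_capacity_gb=0.0):
    """Return the in-memory storage capacity for the next control interval.

    The controller steers total memory use (compute + storage) towards
    target_utilization * total_memory_gb. A positive error means headroom,
    so storage may grow; a negative error means storage must shrink.
    """
    used_gb = compute_usage_gb + current_capacity_gb
    error_gb = target_utilization * total_memory_gb - used_gb
    proposed_gb = current_capacity_gb + gain * error_gb
    # Clamp to the physically meaningful range.
    return min(max(proposed_gb, min_capacity_gb), total_memory_gb)


# Hypothetical node: 100 GB of memory, compute jobs holding a steady 20 GB.
capacity_gb = 60.0
for _ in range(50):
    capacity_gb = next_storage_capacity(capacity_gb, 100.0, 20.0)
```

With a steady compute demand of 20 GB, the loop settles at about 70 GB of storage; when compute demand rises, the negative error shrinks the storage capacity over the following intervals.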

Performance

Dynamic Load Balancing with Tokens

Efficiently exploiting the resources of data centers is a complex task that requires efficient and reliable load balancing and resource allocation algorithms. The former are in charge of assigning jobs to servers upon their arrival in the system, while the latter are responsible for sharing server resources between their assigned jobs. These algorithms should take account of various constraints, such as data locality, that restrict the feasible job assignments. In this paper, we propose a token-based mechanism that efficiently balances load between servers without requiring any knowledge of job arrival rates and server capacities. Assuming a balanced fair sharing of the server resources, we show that the resulting dynamic load balancing is insensitive to the job size distribution. Its performance is compared to that obtained under the best static load balancing and in an ideal system that would constantly optimize the resource utilization.
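
One plausible reading of a token-based dispatcher is sketched below: each server owns a fixed number of tokens, an arriving job takes the longest-available token among the servers it is compatible with, and the token returns to the pool when the job completes. The class name, the blocking behaviour and the token-ordering rule are assumptions made for illustration and are not confirmed details of the paper's mechanism.

```python
from collections import deque


class TokenDispatcher:
    """Assign jobs to servers using tokens, without knowing rates or capacities."""

    def __init__(self, tokens_per_server):
        # Available tokens, ordered from longest-available to most recent.
        self.available = deque()
        for server, count in tokens_per_server.items():
            self.available.extend([server] * count)

    def assign(self, compatible_servers):
        """Take the longest-available compatible token, or return None to block."""
        for server in self.available:
            if server in compatible_servers:
                # Removes the first matching token, which is the one just found.
                self.available.remove(server)
                return server
        return None

    def release(self, server):
        """Return a token to the pool when a job on `server` completes."""
        self.available.append(server)
```

For example, with `TokenDispatcher({"s1": 1, "s2": 1})`, a job restricted to `{"s2"}` takes the `s2` token and a later unrestricted job falls back to `s1`. Servers that finish jobs quickly release tokens more often and so receive more work, which is how the load adapts without measured rates.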

