Featured Research

Performance

ALERT: Accurate Learning for Energy and Timeliness

An increasing number of software applications incorporate runtime Deep Neural Networks (DNNs) to process sensor data and return inference results to humans. Effective deployment of DNNs in these interactive scenarios requires meeting latency and accuracy constraints while minimizing energy, a problem exacerbated by common system dynamics. Prior approaches handle dynamics through either (1) system-oblivious DNN adaptation, which adjusts DNN latency/accuracy tradeoffs, or (2) application-oblivious system adaptation, which adjusts resources to change latency/energy tradeoffs. In contrast, this paper improves on the state of the art by coordinating application- and system-level adaptation. ALERT, our runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously selects both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints. We evaluate ALERT on CPU and GPU platforms for image and speech tasks in dynamic environments. ALERT's holistic approach achieves more than 13% energy reduction and 27% error reduction over prior approaches that adapt solely at the application or system level. Furthermore, ALERT incurs only 3% more energy consumption and 2% higher DNN-inference error than an oracle scheme with perfect application and system knowledge.
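
To make the joint selection step concrete, here is a minimal sketch of the kind of decision the abstract describes: pick the (DNN, system configuration) pair that meets latency and accuracy constraints at the lowest predicted energy, with a runtime slowdown estimate standing in for the probabilistic volatility model. All class fields, numbers, and the energy model (latency times average power) are illustrative assumptions, not ALERT's actual implementation.

```python
# Hypothetical sketch of ALERT-style joint DNN/config selection.
from dataclasses import dataclass

@dataclass
class DNN:
    name: str
    base_latency_s: float   # profiled latency at the reference config
    error_rate: float       # profiled inference error

@dataclass
class Config:
    name: str
    speedup: float          # relative to the reference config
    power_w: float          # average power draw (assumed constant)

def select(dnns, configs, slowdown, latency_cap, error_cap):
    """slowdown: runtime estimate of current environmental interference
    (1.0 = no interference); returns the feasible pair with least energy."""
    best, best_energy = None, float("inf")
    for d in dnns:
        if d.error_rate > error_cap:
            continue
        for c in configs:
            latency = d.base_latency_s * slowdown / c.speedup
            if latency > latency_cap:
                continue
            energy = latency * c.power_w     # simple energy model
            if energy < best_energy:
                best, best_energy = (d, c), energy
    return best

dnns = [DNN("resnet18", 0.020, 0.31), DNN("resnet50", 0.045, 0.24)]
configs = [Config("low", 0.6, 8.0), Config("high", 1.0, 15.0)]
print(select(dnns, configs, slowdown=1.4, latency_cap=0.05, error_cap=0.35))
```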

Read more
Performance

Altis: Modernizing GPGPU Benchmarking

This paper presents Altis, a benchmark suite for modern GPGPU computing. Previous benchmark suites such as Rodinia and SHOC have served the research community well, but were developed years ago when hardware was more limited, software supported fewer features, and production hardware-accelerated workloads were scarce. Since that time, GPU compute density and memory capacity have grown exponentially; programmability features such as unified memory, demand paging, and HyperQ have matured; and new workloads such as deep neural networks (DNNs), graph analytics, and cryptocurrencies have emerged in production environments, stressing the hardware and software in ways that previous benchmarks did not anticipate. Drawing inspiration from Rodinia and SHOC, Altis is a benchmark suite designed for modern GPU architectures and modern GPU runtimes, representing a diverse set of application domains. By adopting and extending applications from Rodinia and SHOC, adding new applications, and focusing on CUDA platforms, Altis better represents modern GPGPU workloads to support GPGPU research in both architecture and system software.

Read more
Performance

Accelerating Discrete Wavelet Transforms on Parallel Architectures

The 2-D discrete wavelet transform (DWT) lies at the heart of many image-processing algorithms. Several recent studies have compared the performance of this transform on various shared-memory parallel architectures, especially on graphics processing units (GPUs). All of these studies, however, considered only separable calculation schemes. We show that the corresponding separable parts can be merged into non-separable units, which halves the number of steps. In addition, we introduce an optional optimization approach that reduces the number of arithmetic operations. The discussed schemes were implemented in the OpenCL framework and in pixel shaders, and then evaluated on GPUs from the two biggest vendors. We demonstrate the performance of the proposed non-separable methods by comparing them with existing separable schemes. The non-separable schemes outperform their separable counterparts in numerous setups, especially when implemented in pixel shaders.
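
To make the separable-versus-non-separable distinction concrete, the toy NumPy sketch below uses a one-level 2-D Haar transform (the paper targets more elaborate wavelet schemes on GPUs): the separable version needs a row pass followed by a column pass, while the non-separable version produces the same four subbands in a single pass over 2x2 blocks.

```python
# Toy illustration of merging two separable 2-D DWT passes into one
# non-separable pass (Haar, one level, averaging normalization).
import numpy as np

def haar_rows(x):
    lo = (x[:, 0::2] + x[:, 1::2]) / 2.0
    hi = (x[:, 0::2] - x[:, 1::2]) / 2.0
    return lo, hi

def separable(x):
    lo, hi = haar_rows(x)          # pass 1: rows
    ll, lh = haar_rows(lo.T)       # pass 2: columns of the low band
    hl, hh = haar_rows(hi.T)       # pass 2: columns of the high band
    return ll.T, lh.T, hl.T, hh.T

def non_separable(x):
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # one 2x2 block per output pixel
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0
    lh = (a + b - c - d) / 4.0     # rows low-pass, columns high-pass
    hl = (a - b + c - d) / 4.0     # rows high-pass, columns low-pass
    hh = (a - b - c + d) / 4.0     # both high-pass (diagonal detail)
    return ll, lh, hl, hh

x = np.random.rand(8, 8)
for s, n in zip(separable(x), non_separable(x)):
    assert np.allclose(s, n)       # same subbands, half the passes
```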

Read more
Performance

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCUs to express both reduction and scan with matrix multiplication and show the benefits in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs, which would otherwise be idle, achieves 89%-98% of peak memory copy bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes, which are common in machine learning and scientific applications. Our algorithm achieves this while decreasing power consumption by up to 22% for reduction and 16% for scan.
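
The NumPy sketch below shows how reduction and scan can be phrased purely as matrix multiplications on 16x16 tiles; NumPy's matmul stands in for the TCU primitive (on real hardware this would go through something like CUDA's WMMA API), and the tiling scheme is a simplified assumption rather than the paper's exact algorithm.

```python
# Reduction and inclusive scan expressed as matrix multiplications.
import numpy as np

K = 16
ones_row = np.ones((1, K))            # sums a column of K values
tri = np.tril(np.ones((K, K)))        # lower-triangular ones: prefix sums

def segment_sums(x):
    """Sum contiguous segments of length K*K using only matmuls."""
    tiles = x.reshape(-1, K, K)                   # one KxK tile per segment
    partial = ones_row @ tiles                    # column sums of each tile
    return (partial @ np.ones((K, 1))).ravel()    # sum the partial sums

def segment_scan(x):
    """Inclusive prefix sum of one length-K*K segment via matmuls."""
    tile = x.reshape(K, K)                        # row-major segment layout
    row_scan = tile @ tri.T                       # prefix sums within rows
    row_sums = tile @ np.ones((K, 1))             # total of each row
    offsets = np.tril(np.ones((K, K)), -1) @ row_sums  # exclusive row scan
    return (row_scan + offsets).ravel()

x = np.arange(K * K, dtype=float)
assert np.allclose(segment_sums(x), x.sum())
assert np.allclose(segment_scan(x), np.cumsum(x))
```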

Read more
Performance

Achieving Zero Asymptotic Queueing Delay for Parallel Jobs

Zero queueing delay is highly desirable in large-scale computing systems. Existing work has shown that it can be asymptotically achieved by using the celebrated Power-of-d-choices (Pod) policy with a probe overhead d = ω(log N / (1−λ)), and that it is impossible when d = O(1/(1−λ)), where N is the number of servers and λ is the load of the system. However, these results are based on a model where each job is an indivisible unit, which does not capture the parallel structure of jobs in today's predominant parallel computing paradigm. This paper therefore considers a model where each job consists of a batch of parallel tasks. Under this model, we propose a new notion of zero (asymptotic) queueing delay that requires the job delay under a policy to approach the job delay given by the maximum of its tasks' service times, i.e., the job delay assuming its tasks entered service immediately upon arrival. This notion quantifies the effect of queueing at the job level for jobs consisting of multiple tasks, and thus deviates from the conventional zero queueing delay for single-task jobs in the literature. We show that zero queueing delay for parallel jobs can be achieved using the batch-filling policy (a variant of the celebrated Pod policy) with a probe overhead d = ω(1 / ((1−λ) log k)) in the sub-Halfin-Whitt heavy-traffic regime, where k is the number of tasks in each job and k scales suitably with N (the number of servers). This result demonstrates that for parallel jobs, zero queueing delay can be achieved with a smaller probe overhead. We also establish an impossibility result: zero queueing delay cannot be achieved if d = exp(o(log N / log k)).
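
As a concrete illustration of one dispatch decision under batch-filling, here is a sketch assuming the water-filling variant of the policy: probe d random servers, then place the job's k tasks one at a time on the currently least-loaded probed server. Function names and the probing details are illustrative, not the paper's exact formulation.

```python
# One batch-filling dispatch step (a Pod variant for parallel jobs).
import heapq, random

def batch_filling_dispatch(queue_lengths, k, d):
    """queue_lengths: per-server queue lengths (mutated in place);
    returns the server chosen for each of the job's k tasks."""
    probed = random.sample(range(len(queue_lengths)), d)
    heap = [(queue_lengths[s], s) for s in probed]
    heapq.heapify(heap)
    placement = []
    for _ in range(k):                    # water-fill tasks greedily
        load, s = heapq.heappop(heap)
        placement.append(s)
        queue_lengths[s] += 1
        heapq.heappush(heap, (load + 1, s))
    return placement

N = 1000
lengths = [random.randint(0, 5) for _ in range(N)]
print(batch_filling_dispatch(lengths, k=16, d=32))
```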

Read more
Performance

AdaptMemBench: Application-Specific Memory Subsystem Benchmarking

Optimizing scientific applications to take full advantage of modern memory subsystems is a continual challenge for application and compiler developers. Factors beyond working set size affect performance. A benchmark framework that explores performance in an application-specific manner is essential to characterize memory performance and, at the same time, inform memory-efficient coding practices. We present AdaptMemBench, a configurable benchmark framework that measures achieved memory performance by emulating application-specific access patterns with a set of kernel-independent driver templates. This framework can explore the performance characteristics of a wide range of access patterns and, thanks to the flexibility of polyhedral code generation, can be used as a testbed for potential optimizations. We demonstrate the effectiveness of AdaptMemBench with case studies of commonly used computational kernels such as triad and multidimensional stencil patterns.
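
As a rough illustration of the kind of measurement such a framework automates, the sketch below times a triad access pattern (a[i] = b[i] + s*c[i]) across working-set sizes and reports achieved bandwidth. AdaptMemBench itself generates compiled, polyhedrally optimized drivers; the NumPy stand-in and the byte-count assumptions here are illustrative only.

```python
# Achieved-bandwidth probe for a triad access pattern at several
# working-set sizes (a crude stand-in for a compiled driver template).
import numpy as np, time

def triad_bandwidth(n, s=3.0, reps=20):
    b, c, a = np.random.rand(n), np.random.rand(n), np.empty(n)
    t0 = time.perf_counter()
    for _ in range(reps):
        np.add(b, s * c, out=a)          # triad kernel (the s*c temporary
                                         # adds traffic a compiled kernel
                                         # would avoid)
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * n * 8 * reps       # read b and c, write a; 8B doubles
    return bytes_moved / elapsed / 1e9   # GB/s

for n in (10**4, 10**6, 10**7):          # sweep the working-set size
    print(f"n={n:>8}: {triad_bandwidth(n):6.1f} GB/s")
```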

Read more
Performance

Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability

In modern data centers, energy usage is one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at both the cluster and data center levels. However, power capping approaches in the literature are not well suited to important multi-threaded applications, whose performance may not grow linearly with the thread-level parallelism because threads must synchronize when accessing shared resources, such as shared data. In this paper we consider the problem of maximizing application performance under a power cap by dynamically tuning the thread-level parallelism and the power state of the CPU cores. Based on experimental observations, we design an adaptive technique that selects, in linear time, the optimal combination of thread-level parallelism and CPU-core power state for the specific workload profile of the multi-threaded application. We evaluate our proposal on different benchmarks, configured to use different thread synchronization methods, and compare its effectiveness to state-of-the-art techniques.
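
The sketch below states the underlying selection problem with a hypothetical measure() profiling hook: among (thread count, power state) pairs whose measured power stays under the cap, keep the best-performing one. Note that this naive sweep is quadratic in the number of knob settings; the paper's contribution is performing the selection in linear time, which this sketch does not reproduce.

```python
# Exhaustive baseline for selecting (threads, power state) under a cap.

def tune(thread_counts, power_states, power_cap, measure):
    best_cfg, best_perf = None, 0.0
    for p in power_states:            # candidate CPU power states
        for t in thread_counts:       # scalability may peak before max threads
            perf, watts = measure(threads=t, power_state=p)
            if watts <= power_cap and perf > best_perf:
                best_cfg, best_perf = (t, p), perf
    return best_cfg

def fake_measure(threads, power_state):
    # Stand-in for profiling a real application (hypothetical model):
    # throughput saturates as synchronization costs grow with threads,
    # power grows with both knobs.
    perf = threads / (1 + 0.05 * threads * threads) * power_state
    watts = 20 + 4 * threads * power_state
    return perf, watts

print(tune(range(1, 17), [0.6, 0.8, 1.0], power_cap=80, measure=fake_measure))
```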

Read more
Performance

Adaptive Selection of Deep Learning Models on Embedded Systems

The recent ground-breaking advances in deep learning networks (DNNs) make them attractive for embedded systems. However, it can take a long time for DNNs to make an inference on resource-limited embedded devices. Offloading the computation to the cloud is often infeasible due to privacy concerns, high latency, or lack of connectivity. There is therefore a critical need for a way to execute DNN models effectively on the device itself. This paper presents an adaptive scheme that determines which DNN model to use for a given input, considering the desired accuracy and inference time. Our approach employs machine learning to develop a predictive model that quickly selects a pre-trained DNN for a given input and optimization constraint. We achieve this by first training the predictive model offline and then using the learnt model to select a DNN for new, unseen inputs. We apply our approach to image classification and evaluate it on a Jetson TX2 embedded deep learning platform using the ImageNet ILSVRC 2012 validation dataset, considering a range of influential DNN models. Experimental results show that our approach achieves a 7.52% improvement in inference accuracy and a 1.8x reduction in inference time over the most capable single DNN model.
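
Here is a minimal sketch of the premodel idea, with KNN as the predictor and synthetic features and labels (the abstract fixes neither choice, so both are assumptions): offline, learn which DNN is best for each training input; online, run the cheap premodel once per input and then only the chosen DNN.

```python
# Hypothetical premodel: cheap per-image features -> index of the DNN
# to run (e.g., the cheapest model that classifies the image correctly).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

models = ["mobilenet", "resnet50", "inception_v4"]   # cheapest to costliest

# Offline training on synthetic stand-ins for real features/labels.
rng = np.random.default_rng(0)
train_features = rng.random((200, 7))                # 7 cheap features/image
train_labels = rng.integers(0, len(models), 200)     # best-model index
premodel = KNeighborsClassifier(n_neighbors=5).fit(train_features,
                                                   train_labels)

# Online: one premodel prediction per input, then run only the chosen DNN.
new_image_features = rng.random((1, 7))
choice = models[premodel.predict(new_image_features)[0]]
print("selected DNN:", choice)
```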

Read more
Performance

Age of Information for Single Buffer Systems with Vacation Server

In this research, we consider age-related metrics for queueing systems with a vacation server. Assuming a single buffer at the queue to receive packets, we consider three variations of this single-buffer system: the Conventional Buffer System (CBS), the Buffer Relaxation System (BRS), and the Conventional Buffer System with Preemption in Service (CBS-P). We introduce a decomposition approach to derive closed-form expressions for the expected Age of Information (AoI), the expected Peak Age of Information (PAoI), and the variance of the peak age for these systems. We then consider the three systems with non-independent vacations, using a polling system as an example to show that the decomposition approach can be applied to derive closed-form PAoI expressions in more general settings. We explore the conditions under which each system has an advantage over the others, and we perform numerical studies to validate our results and develop insights.
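
For readers new to these metrics, the block below records the standard definitions behind the quantities in the abstract; these are the generic AoI and PAoI definitions from the literature, not the paper's closed-form results for CBS, BRS, or CBS-P.

```latex
% Standard AoI/PAoI definitions (generic, not this paper's results).
\begin{align*}
  \Delta(t) &= t - u(t), \quad u(t) = \text{generation time of the newest
  packet received by time } t, \\
  \bar{\Delta} &= \lim_{T\to\infty} \frac{1}{T}\int_0^T \Delta(t)\,dt
  \qquad \text{(expected AoI: time-average of the age process)}, \\
  A_i &= Y_i + T_i
  \qquad \text{(peak age just before the $i$-th delivery)},
\end{align*}
where $Y_i$ is the interarrival time between informative packets $i-1$
and $i$, and $T_i$ is the system time of packet $i$; hence the expected
PAoI is $\mathbb{E}[A] = \mathbb{E}[Y] + \mathbb{E}[T]$.
```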

Read more
Performance

Age of Information in a Decentralized Network of Parallel Queues with Routing and Packets Losses

The paper deals with the Age of Information (AoI) in a network of multiple sources and parallel queues with buffering capabilities, preemption in service, and losses of served packets. The queues do not communicate with each other, and packets are dispatched among the queues according to a predefined probabilistic routing. Using the Stochastic Hybrid Systems (SHS) method, we derive the average AoI of a system of two parallel queues (with and without buffering capabilities) and compare the results with those of a single queue. We show that known Queueing Theory results on packet delay do not carry over to the AoI. Unfortunately, the complexity of computing the average AoI with the SHS method grows rapidly with the number of queues. We therefore provide an upper bound on the average AoI in a system of an arbitrary number of M/M/1/(N+1) queues and show its tightness in various regimes; this bound yields a tight approximation of the average AoI at very low complexity. Finally, we present a game framework in which each source determines its best probabilistic routing decision. Using Mean Field Games, we analyze the routing game, propose an efficient iterative method to find each source's routing decision, and prove its convergence to the desired equilibrium.
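
Where the paper derives the average AoI analytically via SHS, the same quantity can be estimated by simulation. The sketch below is a simplified Monte-Carlo stand-in, assuming one source, two bufferless preemptive servers, and probabilistic routing (it omits the buffering and served-packet-loss features the paper also models); all rates are illustrative.

```python
# Empirical average AoI for Poisson arrivals routed with probability
# p / (1-p) to two parallel preemptive M/M/1/1 servers. Deliveries
# older than what the monitor already has do not reduce the age.
import random, math

def sim_aoi(lam=1.0, mu=1.0, p=0.5, horizon=1e5):
    next_arrival = random.expovariate(lam)
    in_service = [None, None]        # generation time of packet in service
    done = [math.inf, math.inf]      # completion time per server
    newest, last_t, area = 0.0, 0.0, 0.0
    while True:
        t = min(next_arrival, done[0], done[1])
        if t > horizon:
            break
        if t == next_arrival:
            q = 0 if random.random() < p else 1
            in_service[q] = t                    # preempts any packet
            done[q] = t + random.expovariate(mu)
            next_arrival = t + random.expovariate(lam)
        else:
            q = 0 if done[0] == t else 1
            g = in_service[q]
            in_service[q], done[q] = None, math.inf
            if g > newest:                       # fresh delivery
                area += (t - last_t) * ((last_t + t) / 2.0 - newest)
                newest, last_t = g, t
    area += (horizon - last_t) * ((last_t + horizon) / 2.0 - newest)
    return area / horizon

print(f"average AoI ~ {sim_aoi():.3f}")
```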

Read more
