Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Olatunji Ruwase is active.

Publication


Featured research published by Olatunji Ruwase.


ACM/IEEE International Conference on Mobile Computing and Networking | 2008

Ditto: a system for opportunistic caching in multi-hop wireless networks

Fahad R. Dogar; Amar Phanishayee; Himabindu Pucha; Olatunji Ruwase; David G. Andersen

This paper presents the design, implementation, and evaluation of Ditto, a system that opportunistically caches overheard data to improve subsequent transfer throughput in wireless mesh networks. While mesh networks have been proposed as a way to provide cheap, easily deployable Internet access, they must maintain high transfer throughput to be able to compete with other last-mile technologies. Unfortunately, doing so is difficult because multi-hop wireless transmissions interfere with each other, reducing the available capacity on the network. This problem is particularly severe in common gateway-based scenarios in which nearly all transmissions go through one or a few gateways from the mesh network to the Internet. Ditto exploits on-path as well as opportunistic caching based on overhearing to improve the throughput of data transfers and to reduce load on the gateways. It uses content-based naming to provide application-independent caching at the granularity of small chunks, a feature that is key to being able to cache partially overheard data transfers. Our evaluation of Ditto shows that it can achieve significant performance gains for cached data, increasing throughput by up to 7x over simpler on-path caching schemes, and by up to an order of magnitude over no caching.
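The content-based naming that the abstract describes can be sketched as follows. This is a minimal illustration of the idea (name each small chunk by a hash of its content so that even a partially overheard transfer yields cacheable chunks), not Ditto's actual code; `chunk_names`, `OpportunisticCache`, and the 8-byte chunk size are invented for the example.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; illustrative only (Ditto caches small chunks)

def chunk_names(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a transfer into fixed-size chunks and name each by its content hash."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]

class OpportunisticCache:
    """Cache keyed by content hash: a partially overheard transfer still
    contributes complete chunks that later requests can reuse."""
    def __init__(self):
        self.store = {}

    def overhear(self, named_chunks):
        for name, chunk in named_chunks:
            self.store[name] = chunk

    def fetch(self, name, fetch_from_gateway):
        # Serve from the local cache when possible; otherwise fall back
        # to the gateway and remember the result.
        if name in self.store:
            return self.store[name], True    # cache hit
        chunk = fetch_from_gateway(name)
        self.store[name] = chunk
        return chunk, False                  # cache miss
```

Because names depend only on chunk content, any node that overhears a chunk can serve it later, regardless of which application or transfer produced it.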


ACM Symposium on Parallel Algorithms and Architectures | 2008

Parallelizing dynamic information flow tracking

Olatunji Ruwase; Phillip B. Gibbons; Todd C. Mowry; Shimin Chen; Michael Kozuch; Michael P. Ryan

Dynamic information flow tracking (DIFT) is an important tool for detecting common security attacks and memory bugs. A DIFT tool tracks the flow of information through a monitored program's registers and memory locations as the program executes, detecting and containing/fixing problems on-the-fly. Unfortunately, sequential DIFT tools are quite slow, and DIFT is quite challenging to parallelize. In this paper, we present a new approach to parallelizing DIFT-like functionality. Extending our recent work on accelerating sequential DIFT, we consider a variant of DIFT, called relaxed DIFT, that tracks information flow only through unary operations and yet still makes sense for detecting security attacks and memory bugs. We present a parallel algorithm for relaxed DIFT, based on symbolic inheritance tracking, which achieves linear speed-up asymptotically. Moreover, we describe techniques for reducing the constant factors, so that speed-ups can be obtained even with just a few processors. We implemented the algorithm in the context of a Log-Based Architectures (LBA) system, which provides hardware support for logging a program trace and delivering it to other (monitoring) processors. Our simulation results on SPEC benchmarks and a video player show that our parallel relaxed DIFT reduces the overhead to as low as 1.2X using 9 monitoring cores on a 16-core chip multiprocessor.
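A rough sketch of why symbolic inheritance tracking parallelizes: because relaxed DIFT propagates taint only through unary operations, each location's taint at the end of a trace segment is either a constant or "inherited from some location at segment start", so segments can be summarized independently and the summaries composed cheaply in order. The data representation below is invented for illustration and is not the paper's algorithm or data layout.

```python
TAINTED, CLEAN = "tainted", "clean"

def summarize_segment(events):
    """events: list of (dst, src) where src is another location (a unary
    move) or a constant in {TAINTED, CLEAN}. Returns a summary mapping
    dst -> constant or ("inherit", loc-at-segment-start)."""
    summary = {}
    for dst, src in events:
        if src in (TAINTED, CLEAN):
            summary[dst] = src
        else:
            # dst inherits src's taint: resolve within this segment if
            # src was already written here, else record it symbolically.
            summary[dst] = summary.get(src, ("inherit", src))
    return summary

def compose(state, summary):
    """Apply a segment summary to a concrete taint state. Summaries can be
    built in parallel; only this composition pass is sequential."""
    new_state = dict(state)
    for dst, val in summary.items():
        if isinstance(val, tuple):                 # ("inherit", loc)
            new_state[dst] = state.get(val[1], CLEAN)
        else:
            new_state[dst] = val
    return new_state
```

The sequential work per segment summary is constant-sized in the number of live locations, which is the intuition behind the asymptotically linear speed-up.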


International Symposium on Computer Architecture | 2015

Page overlays: an enhanced virtual memory framework to enable fine-grained memory management

Vivek Seshadri; Gennady Pekhimenko; Olatunji Ruwase; Onur Mutlu; Phillip B. Gibbons; Michael Kozuch; Todd C. Mowry; Trishul M. Chilimbi

Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine (e.g., cache line) granularity, e.g., fine-grained de-duplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure. We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there, and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads. We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.
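The overlay lookup rule and the overlay-on-write idea can be sketched in a few lines. This is a software analogy of the hardware mechanism, with an invented `OverlayPage` class; the real design resolves overlay membership in the memory hierarchy, not in a dictionary.

```python
CACHE_LINE = 64
PAGE_SIZE = 4096

class OverlayPage:
    """A virtual page mapped to a shared physical page plus an optional
    overlay holding only the cache lines that were modified."""
    def __init__(self, physical: bytearray):
        self.physical = physical   # shared physical page (kept read-only here)
        self.overlay = {}          # cache-line index -> 64-byte bytearray

    def read_line(self, line: int) -> bytes:
        # Lines present in the overlay are accessed from there; all
        # other lines come from the regular physical page.
        if line in self.overlay:
            return bytes(self.overlay[line])
        off = line * CACHE_LINE
        return bytes(self.physical[off:off + CACHE_LINE])

    def write_line(self, line: int, data: bytes):
        # Overlay-on-write: copy only the written 64-byte cache line into
        # the overlay, instead of copying the whole 4 KB page as
        # copy-on-write would.
        assert len(data) == CACHE_LINE
        self.overlay[line] = bytearray(data)
```

The memory saving relative to copy-on-write comes from the overlay holding only dirty lines: a single-line write costs 64 bytes rather than a 4 KB page copy.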


Programming Language Design and Implementation | 2010

Decoupled lifeguards: enabling path optimizations for dynamic correctness checking tools

Olatunji Ruwase; Shimin Chen; Phillip B. Gibbons; Todd C. Mowry

Dynamic correctness checking tools (a.k.a. lifeguards) can detect a wide array of correctness issues, such as memory, security, and concurrency misbehavior, in unmodified executables at run time. However, lifeguards that are implemented using dynamic binary instrumentation (DBI) often slow down the monitored application by 10-50X, while proposals that replace DBI with hardware still see 3-8X slowdowns. The remaining overhead is the cost of performing the lifeguard analysis itself. In this paper, we explore compiler optimization techniques to reduce this overhead. The lifeguard software is typically structured as a set of event-driven handlers, where the events are individual instructions in the monitored application's dynamic instruction stream. We propose to decouple the lifeguard checking code from the application that it is monitoring so that the lifeguard analysis can be invoked at the granularity of hot paths in the monitored application. In this way, we are able to find many more opportunities for eliminating redundant work in the lifeguard analysis, even starting with well-optimized applications and hand-tuned lifeguard handlers. Experimental results with two lifeguard frameworks - one DBI-based and one hardware-assisted - show significant reduction in monitoring overhead.
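One kind of redundancy that path-granularity checking exposes can be illustrated simply: once the analysis sees a whole hot path instead of one instruction at a time, repeated checks of the same location along that path can be coalesced. The function below is an invented illustration of that opportunity, not the paper's compiler optimization.

```python
def run_lifeguard_on_path(path, check):
    """Invoke lifeguard checks at hot-path granularity: 'path' is a list of
    (opcode, address) events; 'check' is the lifeguard's per-location check.
    An instruction-granularity handler would call check() on every event;
    at path granularity, duplicate checks of the same address are elided."""
    seen = set()
    invocations = 0
    for opcode, addr in path:
        if addr not in seen:
            check(addr)          # the actual lifeguard analysis for addr
            seen.add(addr)
            invocations += 1
    return invocations
```

On a hot loop path that touches a few addresses many times, this drops the number of handler invocations from one per instruction to one per distinct location.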


International Symposium on Microarchitecture | 2009

Flexible Hardware Acceleration for Instruction-Grain Lifeguards

Shimin Chen; Michael Kozuch; Phillip B. Gibbons; Michael P. Ryan; Theodoros Strigkos; Todd C. Mowry; Olatunji Ruwase; Evangelos Vlachos; Babak Falsafi

Instruction-grain lifeguards monitor executing programs at the granularity of individual instructions to quickly detect bugs and security attacks, but their fine-grain nature incurs high monitoring overheads. This article identifies three common sources of these overheads and proposes three techniques that together constitute a general-purpose hardware acceleration framework for lifeguards.


Knowledge Discovery and Data Mining | 2015

Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems

Feng Yan; Olatunji Ruwase; Yuxiong He; Trishul M. Chilimbi

Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach since training is time-consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples with a global parameter server maintaining shared weights across these replicas. The correct choice for model and data partitioning and overall system provisioning is highly dependent on the DNN and distributed system hardware characteristics. These decisions currently require significant domain expertise and time-consuming empirical state-space exploration. This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. Also, we use these performance models to build a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show our performance models estimate DNN training time with high accuracy and our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
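The shape of such a performance model and optimizer can be sketched as below. These are toy equations invented for illustration (compute time shrinks with parallelism while parameter-server communication grows with the replica count), not the paper's actual models, and all parameter names are assumptions.

```python
def training_time_per_epoch(samples, flops_per_sample, flops_per_sec,
                            model_bytes, net_bytes_per_sec,
                            replicas, partitions):
    """Toy epoch-time model: compute is divided across all machines
    (replicas x partitions), while weight synchronization with the
    parameter server scales with the number of replicas."""
    compute = samples * flops_per_sample / (flops_per_sec * replicas * partitions)
    sync = replicas * model_bytes / net_bytes_per_sec
    return compute + sync

def best_config(max_machines, **params):
    """Scalability-optimizer sketch: enumerate replica/partition splits of
    the machine budget and pick the split minimizing modeled epoch time."""
    configs = [(r, p)
               for r in range(1, max_machines + 1)
               for p in range(1, max_machines + 1)
               if r * p <= max_machines]
    return min(configs, key=lambda c: training_time_per_epoch(
        replicas=c[0], partitions=c[1], **params))
```

Even this toy version shows the trade-off the paper formalizes: adding replicas speeds up data-parallel compute but inflates parameter-server traffic, so the best split depends on both the DNN and the hardware.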


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

SERF: efficient scheduling for fast deep neural network serving via judicious parallelism

Feng Yan; Olatunji Ruwase; Yuxiong He; Evgenia Smirni

Deep neural networks (DNNs) have enabled a variety of artificial intelligence applications. These applications are backed by large DNN models running in serving mode on a cloud computing infrastructure. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize the request response latencies. This paper characterizes the behavior of different parallelism techniques for supporting scalable and responsive serving systems for large DNNs. We identify and model two important properties of DNN workloads: homogeneous request service demand, and interference among requests running concurrently due to cache/memory contention. These properties motivate the design of SERF, a dynamic scheduling framework that is powered by an interference-aware queueing-based analytical model. We evaluate SERF in the context of an image classification service using several well-known benchmarks. The results demonstrate its accurate latency prediction and its ability to adapt to changing load conditions.


Architectural Support for Programming Languages and Operating Systems | 2017

Optimizing CNNs on Multicores for Scalability, Performance and Goodput

Samyam Rajbhandari; Yuxiong He; Olatunji Ruwase; Michael Carbin; Trishul M. Chilimbi

Convolutional Neural Networks (CNN) are a class of Artificial Neural Networks (ANN) that are highly efficient at the pattern recognition tasks that underlie difficult AI problems in a variety of domains, such as speech recognition, object recognition, and natural language processing. CNNs are, however, computationally intensive to train. This paper presents the first characterization of the performance optimization opportunities for training CNNs on CPUs. Our characterization includes insights based on the structure of the network itself (i.e., intrinsic arithmetic intensity of the convolution and its scalability under parallelism) as well as dynamic properties of its execution (i.e., sparsity of the computation). Given this characterization, we present an automatic framework called spg-CNN for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution, and two code generators: one that optimizes for sparsity, and the other that optimizes for spatial reuse in convolutions. We evaluate spg-CNN using convolutions from a variety of real-world benchmarks, and show that spg-CNN can train CNNs faster than state-of-the-art approaches by an order of magnitude.
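The sparsity opportunity in convolution can be shown with a naive 1-D example: when an input is zero, every multiply it participates in can be skipped, so work scales with the number of nonzeros. This hand-written scatter loop is only an illustration of the property spg-CNN's sparsity code generator exploits, not its generated code.

```python
def sparse_conv1d(x, w):
    """Valid-mode 1-D convolution (out[j] = sum_k x[j+k] * w[k]) written in
    scatter form: each nonzero x[i] adds x[i] * w[k] to out[i - k], and
    zero inputs are skipped entirely."""
    out = [0.0] * (len(x) - len(w) + 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue                      # sparsity: no work for zero inputs
        for k, wk in enumerate(w):
            j = i - k
            if 0 <= j < len(out):
                out[j] += xi * wk
    return out
```

With activations that are, say, 90% zero (common after ReLU), this loop does roughly a tenth of the multiplies of the dense formulation.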


Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference | 2017

HyperDrive: exploring hyperparameters with POP scheduling

Jeff Rasley; Yuxiong He; Feng Yan; Olatunji Ruwase; Rodrigo Fonseca

The quality of machine learning (ML) and deep learning (DL) models is very sensitive to many different adjustable parameters that are set before training even begins, commonly called hyperparameters. Efficient hyperparameter exploration is of great importance to practitioners in order to find high-quality models with affordable time and cost. This is however a challenging process due to a huge search space, expensive training runtime, sparsity of good configurations, and scarcity of time and resources. We develop a scheduling algorithm, POP, that quickly distinguishes among promising, opportunistic, and poor hyperparameter configurations. It infuses probabilistic model-based classification with dynamic scheduling and early termination to jointly optimize quality and cost. We also build a comprehensive hyperparameter exploration infrastructure, HyperDrive, to support existing and future scheduling algorithms for a wide range of usage scenarios across different ML/DL frameworks and learning domains. We evaluate POP and HyperDrive using complex and deep models. The results show that we speed up the training process by up to 6.7x compared with basic approaches like random/grid search and up to 2.1x compared with state-of-the-art approaches while achieving similar model quality compared with prior work.
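The promising/opportunistic/poor classification with early termination can be sketched as below. The learning-curve projection, the thresholds, and the geometric-decay assumption are all invented for illustration; they are not POP's probabilistic model.

```python
def project_final(history, budget):
    """Project a run's final accuracy from its curve so far, assuming
    per-epoch gains keep shrinking geometrically (a toy model)."""
    if len(history) < 2:
        return history[-1]
    gain = max(history[-1] - history[-2], 0.0)
    projected, g = history[-1], gain
    for _ in range(budget - len(history)):
        g *= 0.7                      # assumed decay of improvements
        projected += g
    return min(projected, 1.0)

def schedule_step(runs, budget, promising=0.8, poor=0.6):
    """POP-style scheduling sketch: bucket each running configuration by
    its projected final quality, then terminate poor ones early, give
    promising ones high priority, and keep the rest opportunistically."""
    decisions = {}
    for name, history in runs.items():
        p = project_final(history, budget)
        if p < poor:
            decisions[name] = "terminate"
        elif p >= promising:
            decisions[name] = "high-priority"
        else:
            decisions[name] = "low-priority"
    return decisions
```

Early termination is where the savings come from: configurations projected to end up poor stop consuming training resources after only a few epochs.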


Network and Distributed System Security Symposium | 2004

A Practical Dynamic Buffer Overflow Detector

Olatunji Ruwase; Monica S. Lam

Collaboration


Dive into Olatunji Ruwase's collaborations.

Top Co-Authors


Todd C. Mowry

Carnegie Mellon University


Vivek Seshadri

Carnegie Mellon University
