Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Masazumi Matsubara is active.

Publication


Featured research published by Masazumi Matsubara.


International Conference on Distributed Computing Systems | 2013

Detecting Transient Bottlenecks in n-Tier Applications through Fine-Grained Analysis

Qingyang Wang; Yasuhiko Kanemasa; Jack Li; Deepal Jayasinghe; Toshihiro Shimizu; Masazumi Matsubara; Motoyuki Kawaba; Calton Pu

Identifying the location of performance bottlenecks is a non-trivial challenge when scaling n-tier applications in computing clouds. Specifically, we observed that an n-tier application may experience significant performance loss when there are transient bottlenecks in component servers. Such transient bottlenecks arise frequently at high resource utilization and often result from transient events (e.g., JVM garbage collection) in an n-tier system and bursty workloads. Because of their short lifespan (e.g., milliseconds), these transient bottlenecks are difficult to detect using current system monitoring tools with sampling at intervals of seconds or minutes. We describe a novel transient bottleneck detection method that correlates throughput (i.e., request service rate) and load (i.e., number of concurrent requests) of each server in an n-tier system at fine time granularity. Both throughput and load can be measured through passive network tracing at millisecond-level time granularity. Using correlation analysis, we can identify the transient bottlenecks at time granularities as short as 50ms. We validate our method experimentally through two case studies on transient bottlenecks caused by factors at the system software layer (e.g., JVM garbage collection) and architecture layer (e.g., Intel SpeedStep).
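The correlation idea in this abstract can be sketched in a few lines. The data layout, thresholds, and synthetic workload below are hypothetical stand-ins for the paper's passive network traces, not the authors' implementation:

```python
from statistics import median

WINDOW_MS = 50  # aggregation granularity used in the paper

def window_stats(requests, horizon_ms, window_ms=WINDOW_MS):
    """requests: (arrival_ms, departure_ms) pairs observed at one server.
    Returns per-window (average load, throughput in requests/sec)."""
    stats = []
    for start in range(0, horizon_ms, window_ms):
        end = start + window_ms
        # average concurrency: total in-window residence time / window length
        load = sum(min(d, end) - max(a, start)
                   for a, d in requests if a < end and d > start) / window_ms
        done = sum(1 for _, d in requests if start <= d < end)
        stats.append((load, done * 1000.0 / window_ms))
    return stats

def transient_bottlenecks(stats):
    """Flag windows where load spikes while throughput collapses:
    requests queue up but the service rate does not keep pace."""
    med_load = median(l for l, _ in stats)
    med_tp = median(t for _, t in stats)
    return [i for i, (l, t) in enumerate(stats)
            if l > 2 * med_load and t < 0.5 * med_tp]
```

On a synthetic trace where one 50 ms window contains a GC-like stall (arrivals continue, completions stop), only that window is flagged, even though coarser sampling would average the stall away.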


International Conference on Timely Results in Operating Systems | 2013

Impact of DVFS on n-tier application performance

Qingyang Wang; Yasuhiko Kanemasa; Jack Li; Chien An Lai; Masazumi Matsubara; Calton Pu

Dynamic Voltage and Frequency Scaling (DVFS) has been widely deployed and proven to reduce energy consumption at low CPU utilization levels; however, our measurements of the n-tier application benchmark (RUBBoS) performance showed significant performance degradation at high utilization levels, with response time several times higher and throughput loss of up to 20%, when DVFS is turned on. Using a combination of benchmark measurements and simulation, we found two kinds of problems: large response time fluctuations due to push-back wave queuing in n-tier systems and throughput loss due to rapidly alternating bottlenecks. These problems arise from anti-synchrony between DVFS adjustment period and workload burst cycles (similar cycle length but out of phase). Simulation results (confirmed by extensive measurements) show the anti-synchrony happens routinely for a wide range of configurations. We show that a workload-sensitive DVFS adaptive control mechanism can disrupt the anti-synchrony and reduce the performance impact of DVFS at high utilization levels to 25% or less of the original.
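The anti-synchrony effect can be illustrated with a toy simulation, assuming a simple on/off burst pattern and a governor that sets the next interval's frequency from the previous interval's mean arrival rate. All constants here are illustrative, not the paper's measured values:

```python
def simulate(dvfs_period, burst_period=100, burst_len=50, steps=2000):
    """Toy model: work arrives in on/off bursts; every dvfs_period ticks
    the governor picks the next frequency from the last period's demand.
    Returns the peak queue backlog, a stand-in for response time."""
    FULL, SLOW = 2.0, 0.8           # service capacity per tick at each speed
    freq, backlog, peak = FULL, 0.0, 0.0
    window = []
    for t in range(steps):
        arrivals = 2.0 if (t % burst_period) < burst_len else 0.0
        window.append(arrivals)
        if len(window) == dvfs_period:          # governor decision point
            freq = FULL if sum(window) / dvfs_period > 0.5 else SLOW
            window = []
        backlog = max(0.0, backlog + arrivals - freq)
        peak = max(peak, backlog)
    return peak
```

With the governor period equal to the burst half-cycle but out of phase, the CPU is slowed exactly when each burst begins and backlog spikes; a much shorter (workload-sensitive) period reacts inside the burst and keeps the backlog several times lower, mirroring the adaptive-control result in the abstract.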


IEEE International Conference on Services Computing | 2013

Revisiting Performance Interference among Consolidated n-Tier Applications: Sharing is Better Than Isolation

Yasuhiko Kanemasa; Qingyang Wang; Jack Li; Masazumi Matsubara; Calton Pu

Performance unpredictability is one of the major concerns slowing down the migration of mission-critical applications into cloud computing infrastructures. One non-intuitive result is a measurement of n-tier application performance in a virtualized environment in which increasing one workload caused a competing, co-located constant workload's response time to decrease. In this paper, we investigate the sensitivity of measured performance to two factors: (1) the consolidated servers' specification of virtual machine resource availability, and (2) the burstiness of the n-tier application workload. Our first and surprising finding is that specifying complete isolation, e.g., an even 50-50 split of CPU between two co-located virtual machines (VMs), results in significantly lower performance than a fully shared allocation, e.g., up to 100% CPU for both co-located VMs. This happens even at relatively modest resource utilization levels (e.g., 40% CPU in the VMs). Second, we found that an increasingly bursty workload also increases the performance loss among the consolidated servers, even at similarly modest utilization levels (e.g., 70% overall). A potential solution to the first problem (performance loss due to resource allocation) is cross-tier-priority scheduling (giving higher priority to shorter jobs), which reduced the performance loss by a factor of two in our experiments. In contrast, bursty workloads are a more difficult problem: our measurements show they affect both the isolation and sharing strategies in virtual machine resource allocation.
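The cross-tier-priority idea (shorter jobs first) can be illustrated with a toy single-server queue. The job mix below is invented for illustration and is not taken from the paper's experiments:

```python
def mean_response(jobs, shortest_first):
    """jobs: (arrival, size) pairs served by one non-preemptive server.
    shortest_first=False is FIFO; True is the shortest-job-first idea
    behind cross-tier-priority scheduling: short requests jump the queue
    instead of waiting behind long ones."""
    pending = sorted(jobs)                 # by arrival time
    clock, i, queue, total = 0, 0, [], 0.0
    while i < len(pending) or queue:
        while i < len(pending) and pending[i][0] <= clock:
            queue.append(pending[i]); i += 1
        if not queue:                      # server idle: jump to next arrival
            clock = pending[i][0]
            continue
        if shortest_first:
            queue.sort(key=lambda j: j[1])
        arrival, size = queue.pop(0)
        clock += size
        total += clock - arrival           # response = completion - arrival
    return total / len(jobs)
```

When several short jobs queue behind a long one, letting them jump ahead cuts mean response time substantially while delaying the long job only modestly, which is the intuition behind the factor-of-two improvement reported above.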


International Conference on Cloud Computing | 2013

An Experimental Study of Rapidly Alternating Bottlenecks in n-Tier Applications

Qingyang Wang; Yasuhiko Kanemasa; Jack Li; Deepal Jayasinghe; Toshihiro Shimizu; Masazumi Matsubara; Motoyuki Kawaba; Calton Pu

Identifying the location of performance bottlenecks is a non-trivial challenge when scaling n-tier applications in computing clouds. Specifically, we observed that an n-tier application may experience significant performance loss when bottlenecks alternate rapidly between component servers. Such rapidly alternating bottlenecks arise naturally and often from resource dependencies in an n-tier system and bursty workloads. These rapidly alternating bottlenecks are difficult to detect because the saturation in each participating server may have a very short lifespan (e.g., milliseconds) compared to current system monitoring tools and practices with sampling at intervals of seconds or minutes. Using passive network tracing at fine granularity (e.g., aggregated every 50 ms), we are able to correlate throughput (i.e., request service rate) and load (i.e., number of concurrent requests) in each server of an n-tier system. Our experimental results show conclusive evidence of rapidly alternating bottlenecks caused by system software (JVM garbage collection) and middleware (VM collocation).
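A short sketch of why coarse averages hide alternating bottlenecks, assuming hypothetical per-window throughput series for an app tier and a db tier (names and numbers invented for illustration):

```python
WINDOW_MS = 50  # fine-grained aggregation interval from the paper

def bottleneck_timeline(series, ceilings):
    """series: {server: [throughput per window]}; ceilings: {server: max
    service rate}. A server is the window's bottleneck when it runs at
    >= 95% of its ceiling. Coarse monitoring sees only the run average."""
    n = len(next(iter(series.values())))
    timeline = []
    for w in range(n):
        hot = [s for s, tps in series.items()
               if tps[w] >= 0.95 * ceilings[s]]
        timeline.append(hot[0] if hot else None)
    coarse = {s: sum(tps) / len(tps) / ceilings[s]
              for s, tps in series.items()}
    return timeline, coarse
```

If the two tiers saturate in strictly alternating 50 ms windows, the timeline shows a bottleneck in every window, yet each server's run-long average utilization sits at a modest-looking 70%, which is exactly the blind spot of second- or minute-level sampling.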


IEEE International Conference on Services Computing | 2013

Variations in Performance Measurements of Multi-core Processors: A Study of n-Tier Applications

Junhee Park; Qingyang Wang; Deepal Jayasinghe; Jack Li; Yasuhiko Kanemasa; Masazumi Matsubara; Daisaku Yokoyama; Masaru Kitsuregawa; Calton Pu

The prevalence of multi-core processors has raised the question of whether applications can use the increasing number of cores efficiently in order to provide predictable quality of service (QoS). In this paper, we study the horizontal scalability of n-tier application performance within a multicore processor (MCP). Through extensive measurements of the RUBBoS benchmark, we found one major source of performance variations within MCP: the mapping of cores to virtual CPUs can significantly lower on-chip cache hit ratio, causing performance drops of up to 22% without obvious changes in resource utilization. After we eliminated these variations by fixing the MCP core mapping, we measured the impact of three mainstream hypervisors (the dominant Commercial Hypervisor, Xen, and KVM) on intra-MCP horizontal scalability. On a quad-core dual-processor (total 8 cores), we found some interesting similarities and dissimilarities among the hypervisors. An example of similarities is a non-monotonic scalability trend (throughput increasing up to 4 cores and then decreasing for more than 4 cores) when running a browse-only CPU-intensive workload. This problem can be traced to the management of last level cache of CPU packages. An example of dissimilarities among hypervisors is their handling of write operations in mixed read/write, I/O-intensive workloads. Specifically, the Commercial Hypervisor is able to provide more than twice the throughput compared to KVM. Our measurements show that both MCP cache architecture and the choice of hypervisors indeed have an impact on the efficiency and horizontal scalability achievable by applications. However, despite their differences, all three mainstream hypervisors have difficulties with the intra-MCP horizontal scalability beyond 4 cores for n-tier applications.
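The core-mapping concern can be sketched minimally, assuming a hypothetical two-package, eight-core topology like the paper's quad-core dual-processor testbed:

```python
# Hypothetical topology: two quad-core packages (8 cores total),
# one last-level cache (LLC) per package.
PACKAGE = {core: core // 4 for core in range(8)}

def shares_llc(vcpu_to_core):
    """vcpu_to_core: physical cores backing one VM's vCPUs. True when all
    of them sit in one package, so the VM's threads share the LLC; a
    split mapping forces cross-package cache traffic instead, the kind
    of hidden cost behind the cache-related performance drops above."""
    return len({PACKAGE[c] for c in vcpu_to_core}) == 1
```

On Linux such a mapping would typically be fixed with the hypervisor's vCPU pinning controls (or, for a process, `os.sched_setaffinity`), which is what "fixing the MCP core mapping" amounts to in practice.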


Proceedings Innovative Architecture for Future Generation High-Performance Processors and Systems | 1997

Effectiveness of register preloading on CP-PACS node processor

Hiroshi Nakamura; K. Itakura; Masazumi Matsubara; Taisuke Boku; Kisaburo Nakazawa

CP-PACS is a massively parallel processor (MPP) for large-scale scientific computations. In September 1996, CP-PACS, equipped with 2048 processors, began its operation at the University of Tsukuba. At that time, CP-PACS was the fastest MPP in the world on the LINPACK benchmark. CP-PACS was designed to achieve very high performance in large scientific/engineering applications. It is well known that an ordinary data cache is not effective in such applications, because the data size is much larger than the cache size and because there is little temporal locality. Thus, a special mechanism for hiding long memory access latency is indispensable. Cache prefetching is a well-known technique for this purpose. In addition to cache prefetching, the CP-PACS node processors implement a register preloading mechanism. This mechanism enables the processor to transfer required floating-point data directly (not via the data cache) between main memory and the floating-point registers in a pipelined way. We compare register preloading with cache prefetching by measuring the real performance of the CP-PACS processor and the HP PA-8000 processor, which implement cache prefetching and/or register preloading.
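The latency-hiding argument can be captured in a toy cost model; all cycle counts below are illustrative, not CP-PACS or PA-8000 measurements:

```python
def loop_cycles(n, mem_latency, pipelined):
    """Toy cost model for streaming n array elements with no cache reuse.
    Blocking loads stall for the full memory latency on every element;
    register preloading (or prefetching issued far enough ahead) overlaps
    that latency with compute, paying it only once to fill the pipeline."""
    compute = 2                        # illustrative FP cycles per element
    if pipelined:
        return mem_latency + n * compute
    return n * (mem_latency + compute)
```

With a 40-cycle memory latency and 1000 elements, the pipelined variant costs roughly 2040 cycles against 42000 for blocking loads, which is the qualitative gap that makes a latency-hiding mechanism indispensable for cache-unfriendly scientific loops.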


IEEE International Conference on High Performance Computing, Data and Analytics | 2001

PIO: Parallel I/O System for Massively Parallel Processors

Taisuke Boku; Masazumi Matsubara; K. Itakura

In this paper, we propose a parallel I/O system that utilizes a parallel commodity network attached to multiple I/O processors of parallel processing systems. I/O requests from the user application are automatically distributed across the parallel I/O channels to achieve the best utilization of the parallel network, and the system provides highly scalable performance according to the number of channels. We also provide an API for easy user-level programming and, based on it, a set of utilities for high-speed data transfer and real-time parallelized visualization.
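The request-distribution idea can be sketched as a simple round-robin striping plan; the function and parameter names here are hypothetical, not PIO's actual API:

```python
def stripe(request_bytes, channels, chunk):
    """Deal one large I/O request out as fixed-size chunks, round-robin
    over the parallel channels, so every link carries data at once and
    aggregate bandwidth scales with the number of channels."""
    plan = {c: [] for c in range(channels)}
    offset, c = 0, 0
    while offset < request_bytes:
        size = min(chunk, request_bytes - offset)
        plan[c].append((offset, size))   # (file offset, chunk length)
        offset += size
        c = (c + 1) % channels
    return plan
```

For example, a 10 KiB request over 4 channels with 4 KiB chunks lands on three channels in one round, and the per-channel plans cover the request exactly once with no overlap.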


Archive | 2002

Job scheduling system for parallel computer

Masazumi Matsubara (松原 正純)


Archive | 2011

Apparatus and method for controlling live-migrations of a plurality of virtual machines

Atsuji Sekiguchi; Masazumi Matsubara; Yuji Wada; Yasuhide Matsumoto


Archive | 2011

Trouble pattern creating program and trouble pattern creating apparatus

Yukihiro Watanabe; Masazumi Matsubara; Atsuji Sekiguchi; Yuji Wada; Yasuhide Matsumoto

Collaboration


Dive into Masazumi Matsubara's collaboration.

Top Co-Authors


Yuji Wada

St. Marianna University School of Medicine
