
Publication


Featured research published by Hitoshi Oi.


computing frontiers | 2006

Instruction folding in a hardware-translation based java virtual machine

Hitoshi Oi

Bytecode hardware translation improves the performance of a Java Virtual Machine (JVM) with small hardware resource and complexity overhead. Instruction folding is a technique to further improve the performance of a JVM by reducing the redundancy in stack-based instruction execution. However, the variable instruction length of Java bytecode makes the folding logic complex. In this paper, we propose a folding scheme with reduced hardware complexity and evaluate its performance. For seven benchmark cases, the proposed scheme folded 6.6% to 37.1% of the bytecodes, which corresponds to 84.2% to 102% of picoJava-II's performance.
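The folding idea can be illustrated with a small sketch: a "load, load, ALU op, store" sequence that would take four stack operations executes as one register-style operation. The pattern table and bytecode stream below are illustrative assumptions, not the scheme from the paper.

```python
# Hypothetical sketch of instruction folding for a stack-based ISA.
# Foldable patterns (illustrative, not the paper's actual folding groups):
FOLDABLE = [
    ("load", "load", "alu", "store"),  # e.g. iload_1, iload_2, iadd, istore_3
    ("load", "alu", "store"),          # one operand already on the stack
]

def fold(bytecodes):
    """Greedily fold matching patterns; return (folded_ops, total_ops)."""
    folded, i = 0, 0
    while i < len(bytecodes):
        for pat in FOLDABLE:
            if tuple(bytecodes[i:i + len(pat)]) == pat:
                folded += len(pat)      # these ops execute as one
                i += len(pat)
                break
        else:
            i += 1                      # unfoldable op executes normally
    return folded, len(bytecodes)

stream = ["load", "load", "alu", "store", "branch", "load", "alu", "store"]
f, n = fold(stream)
print(f"folded {f}/{n} bytecodes ({100 * f / n:.1f}%)")  # folded 7/8 bytecodes (87.5%)
```

The hard part the paper addresses, variable bytecode lengths, is hidden here by pre-decoded abstract opcodes; real folding logic must align patterns over instructions of differing byte widths.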


frontier of computer science and technology | 2007

A Case Study: Performance Evaluation of a DRAM-Based Solid State Disk

Hitoshi Oi

The speed gap between processors and hard disk drives (HDDs) has been widening faster than the gap between processors and main memory devices. The most widely adopted methods for reducing this gap are caching and widening the datapath (such as striping a file across multiple disks). Solid state disks (SSDs) are data storage devices that look like ordinary magnetic hard disk drives to programs but are actually built from semiconductor memory devices and are therefore much faster than HDDs. In this paper, we present a case study of evaluating a DRAM-based SSD with an OLTP workload, which has a high bandwidth requirement and is write-intensive. We also discuss suitable ways of using the SSD, taking into account its advantages and disadvantages.
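The striping technique mentioned above can be sketched as a simple address mapping (stripe unit and array width are assumed values): consecutive logical blocks land on different disks, so a large sequential read proceeds in parallel.

```python
# Illustrative sketch (not from the paper): striping maps consecutive
# logical stripe units round-robin across disks.

STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (assumed)
NUM_DISKS = 4             # assumed array width

def locate(offset):
    """Map a logical byte offset to (disk, offset_within_disk)."""
    unit = offset // STRIPE_UNIT
    disk = unit % NUM_DISKS
    local = (unit // NUM_DISKS) * STRIPE_UNIT + offset % STRIPE_UNIT
    return disk, local

# A 256 KiB sequential read touches all four disks once each:
disks = {locate(o)[0] for o in range(0, 256 * 1024, STRIPE_UNIT)}
print(sorted(disks))  # [0, 1, 2, 3]
```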


ieee systems conference | 2013

Power-efficiency study using SPECjEnterprise2010

Hitoshi Oi; Sho Niboshi

In this paper, we present a case study of the power consumption and performance trade-offs in Java application servers. We use the industry-standard benchmark for Java application servers, SPECjEnterprise2010, on two platforms with different CPUs, AMD Phenom II and Intel Atom. We investigated the performance and power consumption behaviors against increasing system size and the relative performance of the Phenom versus the Atom. The Phenom is capable of dynamic frequency scaling (DFS), and we studied the effects of clock frequency control parameters on performance and power consumption. In terms of the maximum system size with valid quality of service (QoS) metrics, the Phenom could handle 9.7 times more transactions than the Atom. In terms of dynamic power consumption normalized to the system size, the Atom was 2.5 times more power-efficient than the Phenom. Increasing the sampling rate, one of the DFS parameters, was effective in reducing power consumption at low load levels. It reduced the dynamic power by up to 7.7 W, about 40% lower than the default setting.
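The efficiency metric used above, dynamic power (loaded minus idle power) normalized to system size, can be sketched with purely hypothetical numbers; none of the wattages or system sizes below are from the paper.

```python
# Hypothetical numbers to illustrate the metric in the abstract:
# efficiency = system size per watt of *dynamic* power (load minus idle).

def efficiency(sys_size, p_load, p_idle):
    """System-size-per-watt of dynamic power; larger is better."""
    return sys_size / (p_load - p_idle)

phenom = efficiency(sys_size=970, p_load=95.0, p_idle=60.0)  # assumed values
atom = efficiency(sys_size=100, p_load=22.0, p_idle=20.5)    # assumed values
print(f"Atom/Phenom efficiency ratio: {atom / phenom:.1f}x")
```

The point of subtracting idle power is that it isolates the energy cost of doing work from the platform's fixed baseline draw.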


Journal of Systems and Software | 2008

Local variable access behavior of a hardware-translation based Java virtual machine

Hitoshi Oi

Hardware bytecode translation is a technique to improve the performance of the Java virtual machine (JVM), especially on portable devices for which the overhead of dynamic compilation is significant. However, since the translation is done on a single-bytecode basis, a naive implementation of the JVM generates frequent memory accesses for local variables, which can be not only a performance bottleneck but also an obstacle to instruction folding. A solution to this problem is to add a small register file, dedicated to storing local variables, to the datapath of the microprocessor. However, the effectiveness of such a local variable register file depends on its size and on the local variable access behavior of the applications. In this paper, we analyze the local variable access behavior of various Java applications. In particular, we investigate the fraction of local variable accesses that are covered by a register file of varying size, which determines the chip area overhead and the operation speed. We also evaluate the effectiveness of the sliding register window for parameter passing in the context of the JVM and of on-the-fly optimization of the local variable to register file mapping. With two types of exceptions, a 16-entry register file achieves coverage of up to 98%. The first type of exception is represented by the SAXON XSLT processor, for which the effect of cold misses is significant. Adding the sliding window feature to the register file for parameter passing turns 6.2-13.3% of total accesses from misses into hits for SAXON with XSLTMark. The second type of exception is represented by the FFT, which accesses more than 16 local variables in most method invocations. In this case, on-the-fly profiling is effective. The hit ratio of a 16-entry register file for the FFT is increased from 44% to 83% by an array of 8-bit counters.
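The coverage analysis described above can be sketched minimally: given a trace of local-variable indices (the trace format here is an assumption, not the paper's), the coverage of an N-entry register file is the fraction of accesses whose variable index fits in the file.

```python
# Minimal coverage sketch (assumed trace format, not from the paper):
# an access hits the register file if its local-variable index < entries.

def coverage(trace, entries=16):
    """Fraction of local-variable accesses covered by an N-entry file."""
    hits = sum(1 for idx in trace if idx < entries)
    return hits / len(trace)

# Toy trace of local-variable indices accessed by a method:
trace = [0, 1, 2, 1, 0, 3, 17, 2, 1, 0]      # one access beyond 16 entries
print(f"{coverage(trace, 16):.0%} covered")  # 90% covered
```

Methods like the FFT case above defeat this simple index-based mapping because they use more than 16 variables, which is what motivates the counter-based remapping the paper evaluates.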


high performance computing and communications | 2011

Performance Modeling of a Consolidated Java Application Server

Hitoshi Oi; Kazuaki Takahashi

System-level virtualization enables multiple servers to be consolidated on a single hardware platform and to share its resources more efficiently. We are currently developing a performance model of a consolidated multi-tier Java application server. The model breaks down the CPU utilization of the workload by server and transaction type, and uses these service time parameters in a network of queues to predict performance. As the target of the initial development, we use SPECjAppServer2004 running on a quad-core server consolidated by Xen. In this paper, we present the current status of the performance model development. We have found that the measured CPU utilization appears lower than the actual system saturation level. As a result, the performance model saturates at a larger system size. We have also found that while the behavior of Manage transactions is most sensitive to the system size, their service times are lower than those of other transactions. When the CPU utilization of the 4-core execution is predicted from the data of the 1- to 3-core executions, the prediction errors range from -3.6% to 43.4%, with the largest error occurring in the database domain.
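The modeling approach, per-transaction service demands feeding a network of queues, can be sketched with standard operational-analysis formulas and assumed numbers (the demands and throughput below are not from the paper): utilization U = X * D, and a simple open M/M/1-style station gives residence time R = D / (1 - U).

```python
# Sketch of utilization-based prediction with assumed service demands.
# U = X * D (throughput x service demand); R = D / (1 - U) per station.

demands = {"appserver": 0.004, "database": 0.002}   # seconds/txn (assumed)
throughput = 120.0                                  # txn/s (assumed)

for station, d in demands.items():
    u = throughput * d
    r = d / (1 - u)
    print(f"{station}: U={u:.0%}, R={r * 1000:.2f} ms")
```

The paper's observation that measured utilization understates saturation is exactly the failure mode of such a model: if D is underestimated, the predicted saturation point (U reaching 1) moves to a larger system size than reality.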


network computing and applications | 2009

Optimizations of Large Receive Offload in Xen

Fumio Nakajima; Hitoshi Oi

Xen provides logically independent computing environments (domains), and I/O devices can be multiplexed so that each domain behaves as if it had its own instances of the I/O devices. These benefits come with performance overhead, and the network interface is one of the most typical cases. Previously, we ported large receive offload (LRO) into the physical and virtual network interfaces of Xen and evaluated its effectiveness. In this paper, two optimizations are attempted to further improve the network performance of Xen. First, copying packets at the bridge within the driver domain is eliminated. The aggregated packets are flushed to the upper layer of the network stack when the kernel polls the network device driver. Our second optimization is to increase the number of aggregated packets by waiting for every other polling cycle before flushing the packets. Compared to the original LRO, the first optimization reduces the packet handling overhead in the driver domain from 13.4 to 13.0 clock cycles per transferred byte. However, it also increases the overhead in the guest domain from 7.1 to 7.7, and the overall improvement in throughput is negligible. The second optimization reduces the overhead in the driver and guest domains from 13.4 to 3.3 and from 7.1 to 5.9, respectively. The receive throughput is improved from 577 Mbps to 748 Mbps.
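Why aggregation cuts cycles-per-byte can be shown with a toy cost model (all costs below are assumed, not measurements from the paper): the fixed per-packet cost is paid once per aggregated super-packet, so aggregating k MTU-sized frames divides that fixed overhead by k.

```python
# Toy cost model (assumed parameters) of LRO-style aggregation:
# total cycles = fixed per-packet cost + per-byte cost * bytes delivered.

def cycles_per_byte(per_packet_cycles, per_byte_cycles, mtu, aggregated):
    """Handling cost in cycles per byte when `aggregated` frames share
    one trip through the per-packet path."""
    bytes_delivered = mtu * aggregated
    total = per_packet_cycles + per_byte_cycles * bytes_delivered
    return total / bytes_delivered

no_lro = cycles_per_byte(10000, 2.0, 1500, 1)   # assumed costs
lro8 = cycles_per_byte(10000, 2.0, 1500, 8)     # 8 frames per flush
print(f"no LRO: {no_lro:.1f} c/B, LRO x8: {lro8:.1f} c/B")
```

The paper's second optimization (flushing every other poll) is, in this model's terms, a way to raise `aggregated` at the cost of added latency per flush.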


languages, compilers, and tools for embedded systems | 2005

On the design of the local variable cache in a hardware translation-based java virtual machine

Hitoshi Oi

Hardware bytecode translation is a technique to improve the performance of the Java Virtual Machine (JVM), especially on portable devices for which dynamic compilation is infeasible. However, since the translation is done on a single-bytecode basis, it is likely to generate frequent memory accesses for local variables, which can be a performance bottleneck. In this paper, we propose to add a small register file to the datapath of the hardware-translation based JVM and use it as a local variable cache. We evaluate the effectiveness of the local variable cache against its size, which determines the chip area overhead and the operating speed. We also discuss mechanisms for efficient parameter passing and on-the-fly profiling. With two types of exceptions, a 16-entry local variable cache achieved hit ratios of 60 to 98%. The first type of exception is represented by the FFT, which accesses more than 16 local variables. In this case, on-the-fly profiling was effective. The hit ratio of the 16-entry cache for the FFT was increased from 44 to 83%. The second type of exception is represented by the SAXON XSLT processor, for which cold misses were significant. The proposed parameter passing mechanism turned 6.4 to 13.3% of total accesses from misses into hits to the local variable cache.
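The on-the-fly profiling idea, choosing which local variables deserve a cache entry, can be sketched with saturating access counters; the details below (counter width, ranking policy) are assumptions, not the paper's mechanism.

```python
# Sketch of counter-based profiling (details assumed): 8-bit saturating
# counters rank local variables by access frequency, and the hottest
# `entries` variables are mapped to the local variable cache.

def hot_mapping(trace, entries=16, counter_max=255):
    """Return the set of variable indices that earn a cache entry."""
    counters = {}
    for idx in trace:
        counters[idx] = min(counters.get(idx, 0) + 1, counter_max)
    hottest = sorted(counters, key=counters.get, reverse=True)[:entries]
    return set(hottest)

trace = [0, 20, 20, 20, 1, 2, 20, 30]   # variable 20 is hottest
mapped = hot_mapping(trace, entries=2)
print(sorted(mapped))  # the two most-accessed variables
```

This is how a method such as the FFT, with more live locals than cache entries, can still achieve a high hit ratio: the cache holds the frequently accessed subset rather than the lowest-numbered one.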


international conference on cloud computing | 2010

Application of Fuzzy Control Theory in Resource Management of a Consolidated Server

Sho Niboshi; Hitoshi Oi

A virtualized system incorporates multiple systems into a single physical computer as virtual domains. Many data centers and server systems are organized using virtualization technology to merge several computer systems. On such a shared system, the resource manager is the key factor affecting performance. However, resource management in current systems does not provide accurate resource allocation, because it utilizes only information from the virtual machines and disregards the state of the running applications. This paper presents a CPU resource controller that takes the state of the applications as input and produces the minimum resource allocation that keeps application performance at an acceptable level. In particular, it employs a two-layered controller. The first-layer controller makes resource requests based on the relationship between the state and the resource demand of each application, modeled by fuzzy control theory. This approach is an efficient way to represent the resource allocation model, since fuzzy control theory deals with imprecise and uncertain problems. The second-layer controller adjusts the requests to the system capacity and builds the layout of resource capacities based on the relative quality of service (QoS) performance between applications. For resource separation, a common resource controller imposes a hard limit on the amount of resource a given domain can consume. Our controller allocates resources with the most effective capacity configuration. Under certain specified conditions, the controller does not set the capacities and instead allows domains to use free time when the resource is idle. This eliminates unused resources and achieves relatively high resource usage. Finally, the resource controller is evaluated on a virtualized system, and its advantages over conventional resource allocation methods are shown.
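The first-layer idea, fuzzy rules mapping an observed application state to a CPU-share request, can be sketched as follows; the membership functions, rule outputs, and the single "load" input are all assumptions for illustration, not the paper's controller.

```python
# Minimal fuzzy-control sketch (assumed rules, not from the paper):
# triangular memberships classify load as LOW/MED/HIGH, and a
# weighted-average (Sugeno-style) defuzzification yields a CPU share.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def cpu_request(load):
    """Defuzzified CPU-share request (%) for a normalized load in [0, 1]."""
    rules = [
        (tri(load, -0.5, 0.0, 0.5), 10.0),   # load LOW  -> small share
        (tri(load, 0.0, 0.5, 1.0), 50.0),    # load MED  -> medium share
        (tri(load, 0.5, 1.0, 1.5), 90.0),    # load HIGH -> large share
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den

print(f"{cpu_request(0.75):.0f}% CPU share")  # halfway between MED and HIGH
```

The appeal noted in the abstract is visible here: the rule base tolerates an imprecise state-to-demand relationship, and intermediate loads blend adjacent rules smoothly instead of requiring an exact model.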


frontier of computer science and technology | 2006

Towards a Low Power Virtual Machine for Wireless Sensor Network Motes

Hitoshi Oi; Chris J. Bleakley

Virtual machines (VMs) have been proposed as an efficient programming model for wireless sensor network (WSN) devices. However, the processing overhead required for VM execution has a significant impact on the power consumption and battery lifetime of these devices. This paper analyses the sources of power consumption in the Mate VM for WSNs. The paper proposes a generalised processor architecture allowing for hardware acceleration of VM execution. It then proposes a number of hardware accelerators for Mate VM execution and assesses their effectiveness.


international symposium on parallel and distributed processing and applications | 2012

Workload Analysis of SPECjEnterprise2010

Hitoshi Oi; Sho Niboshi

In this paper, we present a case study of measuring the performance of SPECjEnterprise2010 and analyzing its workload on two different configurations, in which either the application server or the database server is the performance bottleneck. The CPU utilizations of the application and database servers behave differently as the system size increases: they follow square-root-like and quadratic-like curves, respectively. By measuring each transaction type separately, we find that the source of these non-linear factors is the Browse transaction. Also, the sum of the CPU utilizations of each transaction type executed individually overestimates the total CPU utilization when all transaction types are executed simultaneously. In the performance modeling methodology for SPECjAppServer2004 found in the literature, the CPU time for each transaction is assumed to be constant and is obtained by measuring individual executions. However, from our observations, this methodology cannot be directly applied to SPECjEnterprise2010.
