
Publication


Featured research published by Takuya Nakaike.


International Symposium on Computer Architecture | 2015

Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8

Takuya Nakaike; Rei Odaira; Matthew Gaudet; Maged M. Michael; Hisanobu Tomari

Transactional Memory (TM) is a programming paradigm that aims to combine simple concurrent programming with high concurrent performance. Hardware Transactional Memory (HTM) provides hardware support for TM-based programming and has lower overhead than Software Transactional Memory (STM), a software-based implementation of TM. Four commercial systems now offer HTM: IBM Blue Gene/Q, IBM zEnterprise EC12, Intel Core, and IBM POWER8. Our work is the first to compare the performance of these four HTM systems. We measured the STAMP benchmarks, the most widely used TM benchmark suite, and also evaluated the specific features of each HTM system. Our experimental results show that: (1) no single HTM system is more scalable than the others in all of the benchmarks, (2) there are measurable performance differences among the HTM systems in some benchmarks, and (3) each HTM system has its own implementation characteristics that limit its scalability.
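
The pattern common to all four systems is a speculative code path bracketed by a begin/end pair plus a non-speculative fallback, since every commercial HTM is best-effort. Below is a minimal sketch of that pattern for one of the four (Intel TSX, via GCC's RTM intrinsics, compiled with gcc -mrtm); the lock and counter names are illustrative, not from the paper.

```c
#include <immintrin.h>   /* _xbegin, _xend, _xabort; compile with gcc -mrtm */
#include <stdatomic.h>

static atomic_int fallback_lock;      /* 0 = free, 1 = held */
static long shared_counter;           /* illustrative shared state */

void increment(void)
{
    for (int retries = 0; retries < 3; retries++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))   /* read the lock so a       */
                _xabort(0xff);                 /* fallback holder aborts us */
            shared_counter++;                  /* speculative update */
            _xend();                           /* commit atomically */
            return;
        }
        /* Aborted: status encodes the cause (conflict, capacity, ...). */
    }
    /* Non-speculative fallback; every best-effort HTM needs one,
       since any transaction can abort. */
    while (atomic_exchange(&fallback_lock, 1))
        ;                                      /* spin until acquired */
    shared_counter++;
    atomic_store(&fallback_lock, 0);
}
```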


IBM Journal of Research and Development | 2015

Transactional memory support in the IBM POWER8 processor

Hung Q. Le; Guy Lynn Guthrie; Derek Edward Williams; Maged M. Michael; Brad Frey; William J. Starke; Cathy May; Rei Odaira; Takuya Nakaike

With multi-core processors, parallel programming has taken on greater importance. Traditional parallel programming techniques based on critical sections controlled by locking have several well-known drawbacks. To allow for more efficient parallel programming with higher performance, the IBM POWER8™ processor implements a hardware transactional memory facility. Transactional memory allows groups of load and store operations to execute and commit as a single atomic unit without the use of traditional locks, thereby improving performance and simplifying the parallel programming model. The POWER8 transactional memory facility provides a robust capability to execute transactions that can survive interrupts. It also allows non-speculative accesses within transactions, which facilitates debugging and thread-level speculation. We address the unique challenges of implementing transactional memory on top of the weakly consistent memory model of the Power ISA (Instruction Set Architecture). We detail the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of this architecture, Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS), and provide performance results for these uses.
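
As a rough illustration of the TLE use case, here is a hedged sketch built on GCC's PowerPC HTM built-ins (available with gcc -mhtm on POWER8 and later); it is not the paper's code, and the lock word and data names are illustrative. The critical section runs transactionally while checking that nobody really holds the lock, and falls back to actually acquiring it on abort.

```c
/* Compile with gcc -mhtm on POWER8 or later. */
static volatile int lock_word;     /* 0 = free, 1 = held (illustrative) */
static long protected_data;

static void lock_acquire(void)
{
    while (__sync_lock_test_and_set(&lock_word, 1))
        ;                          /* spin until the lock is ours */
}
static void lock_release(void) { __sync_lock_release(&lock_word); }

void elided_critical_section(void)
{
    if (__builtin_tbegin(0)) {     /* transactional state entered */
        if (lock_word)             /* someone really holds the lock: */
            __builtin_tabort(0);   /* we cannot run alongside them   */
        protected_data++;          /* speculative critical section */
        __builtin_tend(0);         /* commit; the lock was never taken */
    } else {
        /* Abort path: execute the critical section under the real lock. */
        lock_acquire();
        protected_data++;
        lock_release();
    }
}
```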


IEEE International Symposium on Workload Characterization | 2016

Workload characterization for microservices

Takanori Ueda; Takuya Nakaike; Moriyoshi Ohara

The microservice architecture is a new framework for constructing a Web service as a collection of small services that communicate with each other. It is becoming increasingly popular because it can accelerate agile software development, deployment, and operation practices. As a result, cloud service providers are expected to host an increasing number of microservices, which can generate significant resource pressure on the cloud infrastructure. We want to understand the characteristics of microservice workloads in order to design an infrastructure optimized for microservices. In this paper, we used Acme Air, an open-source benchmark for Web services, and analyzed the behavior of two versions of the benchmark, microservice and monolithic, for two widely used language runtimes, Node.js and Java. We observed a significant overhead due to the microservice architecture; the performance of the microservice version can be 79.2% lower than that of the monolithic version on the same hardware configuration. On Node.js, the microservice version consumed 4.22 times more time in the Node.js libraries than the monolithic version to process one user request. On Java, the microservice version also consumed more time in the application server than the monolithic version. We explain these performance differences from both hardware and software perspectives. We also discuss network virtualization in Docker, a common infrastructure for microservices, which has a non-negligible impact on performance. These findings give clues for developing optimization techniques in language runtimes and hardware for microservice workloads.


IEEE International Symposium on Workload Characterization | 2014

Thread-level speculation on off-the-shelf hardware transactional memory

Rei Odaira; Takuya Nakaike

Thread-level speculation can speed up a single-threaded application by splitting its execution into multiple tasks and speculatively executing those tasks in multiple threads. Efficient thread-level speculation requires hardware support for memory conflict detection, store buffering, and execution rollback; in addition, previous research has proposed advanced optimization facilities, such as ordered transactions and data forwarding. Recently, implementations of hardware transactional memory (HTM) have come onto the market, providing minimal hardware support for thread-level speculation, but few of them offer the advanced optimization facilities. Thus, it is important to determine how well thread-level speculation can be realized on the current HTM implementations, and what optimization facilities should be implemented in the future. In our research, we studied thread-level speculation on the off-the-shelf HTM implementation in Intel TSX. We manually modified potentially parallel benchmarks in SPEC CPU2006 for thread-level speculation. Our experimental results showed that thread-level speculation resulted in up to an 11% speed-up even without the advanced optimization facilities, but actually degraded performance in most cases. Contrary to our expectations, the main reason for the performance loss was not the lack of hardware support for ordered transactions but transaction aborts due to memory conflicts. Our investigation suggests that future hardware should support not only ordered transactions but also memory data forwarding, data synchronization, multi-version caches, and word-level conflict detection for thread-level speculation.
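
Because off-the-shelf HTM such as Intel TSX provides no ordered transactions, task ordering must be emulated in software. A minimal sketch of the usual workaround follows (GCC's RTM intrinsics, compiled with gcc -mrtm; not the paper's code, and all names are illustrative): each task checks a shared ticket inside its transaction and aborts itself until it is the oldest task, so commits retire in original program order.

```c
#include <immintrin.h>   /* _xbegin, _xend, _xabort; compile with gcc -mrtm */

static volatile long next_to_commit;   /* index of the oldest uncommitted task */

/* Execute one speculative task (e.g., one loop iteration of its body). */
void run_task(long my_index, void (*body)(long))
{
    for (;;) {
        if (_xbegin() == _XBEGIN_STARTED) {
            body(my_index);                 /* speculative work */
            if (next_to_commit != my_index) /* reading the ticket means a     */
                _xabort(0x01);              /* predecessor's commit aborts us */
            _xend();                        /* commit in program order */
            next_to_commit = my_index + 1;  /* unblock the next task */
            return;
        }
        /* Aborted (ordering, conflict, or capacity): wait until this task
           is the oldest, then retry. A real system also needs a
           non-speculative fallback to guarantee forward progress. */
        while (next_to_commit < my_index)
            ;
    }
}
```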


Programming Language Design and Implementation | 2010

Lock elision for read-only critical sections in Java

Takuya Nakaike; Maged M. Michael

It is not uncommon in parallel workloads to encounter shared data structures with read-mostly access patterns, where operations that update data are infrequent and most operations are read-only. Typically, data consistency is guaranteed using mutual exclusion or read-write locks. The cost of atomically updating lock variables results in high overhead and high cache coherence traffic under active sharing, thus slowing down single-thread performance and limiting scalability. In this paper, we present SOLERO (Software Optimistic Lock Elision for Read-Only critical sections), a new lock implementation, based on sequential locks, for optimizing read-only critical sections in Java. SOLERO is compatible with the conventional lock implementation of Java. However, unlike the conventional implementation, only critical sections that may write data or have side effects need to update lock variables, while read-only critical sections need only read lock variables without writing them. Each writing critical section changes the lock value to a new value. Hence, a read-only critical section is guaranteed to be consistent if the lock is free and its value does not change from the beginning to the end of the read-only critical section. Using Java workloads including SPECjbb2005 and the HashMap and TreeMap Java classes, we evaluate the performance impact of applying SOLERO to read-mostly locks. Our experimental results show performance improvements across the board, often substantial, in both single-thread speed and scalability over the conventional lock implementation (mutual exclusion) and read-write locks. SOLERO improves the performance of SPECjbb2005 by 3-5% on single and multiple threads. The results using the HashMap and TreeMap benchmarks show that SOLERO outperforms the conventional lock implementation and read-write locks by substantial multiples on multiple threads.
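
SOLERO itself lives inside a Java VM's lock implementation; as a language-neutral illustration of the underlying sequential-lock idea, here is a hedged C sketch (names and data layout are illustrative, and writers are assumed to be serialized by a conventional mutex that is omitted). Readers never store to the lock word; they validate consistency by re-reading the sequence value.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint seq;            /* even = no writer, odd = write in progress */
static int shared_x, shared_y;     /* illustrative data guarded by the lock */

/* Writing critical section: bump the sequence so readers can detect it.
   (Writers themselves are assumed to be mutually excluded by a mutex.) */
void locked_write(int x, int y)
{
    atomic_fetch_add(&seq, 1);     /* now odd: readers retry or fall back */
    shared_x = x;
    shared_y = y;
    atomic_fetch_add(&seq, 1);     /* even again, with a new value */
}

/* Read-only critical section: no stores to the lock word at all.
   Returns false when the caller should fall back to normal locking. */
bool optimistic_read(int *x, int *y)
{
    unsigned before = atomic_load(&seq);
    if (before & 1u)
        return false;              /* a writer is active */
    *x = shared_x;                 /* (plain reads; a production version */
    *y = shared_y;                 /*  would use relaxed atomics here)   */
    return atomic_load(&seq) == before;   /* unchanged => consistent */
}
```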


Symposium on Code Generation and Optimization | 2010

Coloring-based coalescing for graph coloring register allocation

Rei Odaira; Takuya Nakaike; Tatsushi Inagaki; Hideaki Komatsu; Toshio Nakatani

Graph coloring register allocation tries to minimize the total cost of spilled live ranges of variables. Live-range splitting and coalescing are often performed before the coloring to further reduce the total cost. Coalescing of split live ranges, called sub-ranges, can decrease the total cost by lowering the interference degrees of their common interference neighbors. However, it can also increase the total cost because the coalesced sub-ranges can become uncolorable. In this paper, we propose coloring-based coalescing, which first performs a trial coloring and then coalesces all copy-related sub-ranges that were assigned the same color. The coalesced graph is then colored again with graph coloring register allocation. The rationale is that coalescing differently colored sub-ranges could result in spilling, because some interference neighbors prevent them from being assigned the same color. Experiments on Java programs show that the combination of live-range splitting and coloring-based coalescing reduces the static spill cost by more than 6% on average, compared to baseline coloring without splitting. In contrast, the well-known iterated and optimistic coalescing algorithms, when combined with splitting, increase the cost by more than 20%. Coloring-based coalescing improves the execution time by up to 15% and by 3% on average, while the existing algorithms improve it by up to 12% and by 1% on average.
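
A toy sketch of the decision rule on a five-node interference graph follows (illustrative data, not the paper's allocator): a greedy trial coloring runs first, and a copy-related pair of sub-ranges is coalesced only if both received the same color.

```c
#include <stdio.h>

#define N 5   /* sub-ranges (nodes) */
#define K 3   /* available registers (colors) */

static int interferes[N][N];   /* symmetric interference matrix */
static int color[N];           /* assigned color, or -1 = would spill */

/* Trial coloring: simple greedy assignment in node order. */
static void trial_color(void)
{
    for (int v = 0; v < N; v++) {
        int used[K] = {0};
        for (int u = 0; u < v; u++)
            if (interferes[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[v] = c; break; }
    }
}

int main(void)
{
    /* Illustrative graph: a chain 0-1-2-3; node 4 is isolated.
       Sub-ranges 1 and 3 are copy-related (split from one live range). */
    interferes[0][1] = interferes[1][0] = 1;
    interferes[1][2] = interferes[2][1] = 1;
    interferes[2][3] = interferes[3][2] = 1;
    trial_color();

    int a = 1, b = 3;   /* the copy-related pair */
    if (color[a] >= 0 && color[a] == color[b])
        printf("coalesce sub-ranges %d and %d (same color %d)\n", a, b, color[a]);
    else
        printf("keep %d and %d split: merging might not be colorable\n", a, b);
    return 0;
}
```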


IEEE International Symposium on Workload Characterization | 2013

Do C and Java programs scale differently on Hardware Transactional Memory?

Rei Odaira; José G. Castaños; Takuya Nakaike

People program in many different programming languages in the multi-core era, but how does each programming language affect application scalability with transactional memory? As commercial implementations of Hardware Transactional Memory (HTM) enter the market, the HTM support in two major programming languages, C and Java, is of critical importance to the industry. We studied the scalability of the same transactional memory applications written in C and Java, using the STAMP benchmarks. We performed our HTM experiments on an IBM mainframe zEnterprise EC12. We found that in 4 of the 10 STAMP benchmarks, Java was more scalable than C. The biggest factor in this higher scalability was the efficient thread-local memory allocator in our Java VM. In two of the STAMP benchmarks, C was more scalable because in C padding can be inserted efficiently among frequently updated fields to avoid false sharing. We also found that Java VM services could cause severe aborts. By fixing or avoiding these problems, we confirmed that C and Java had similar HTM scalability for the STAMP benchmarks.
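
The padding technique credited to C here is straightforward; below is a minimal sketch (the 64-byte line size and all names are illustrative; the right constant is platform-specific). Each frequently updated field gets its own cache line so that writes by different threads never falsely share a line and needlessly abort each other's transactions.

```c
#define CACHE_LINE 64   /* a common line size; choose per platform */

/* Each frequently updated counter occupies its own cache line. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];   /* keep neighbors off this line */
};

/* One counter per thread; without the pad, counters of different
   threads would share lines and ping-pong between cores. */
static struct padded_counter counters[8]
    __attribute__((aligned(CACHE_LINE)));

void add_local(int tid, long delta)
{
    counters[tid].value += delta;   /* no false sharing between threads */
}
```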


IEEE International Symposium on Workload Characterization | 2010

Real Java applications in software transactional memory

Takuya Nakaike; Rei Odaira; Toshio Nakatani; Maged M. Michael

Transactional Memory (TM) shows promise as a new concurrency control mechanism to replace lock-based synchronization. However, there have been few studies of TM systems with real applications, and the real-world benefits and barriers of TM remain unknown. In this paper, we present a detailed analysis of the behavior of real applications on a software transactional memory system. Based on this analysis, we aim to clarify what programming work is required to achieve reasonable performance in TM-based applications. We selected three existing Java applications, (1) HSQLDB, (2) the Geronimo application server, and (3) the GlassFish application server, because each has a scalability problem caused by lock contention. We identified the critical sections where lock contention frequently occurs, and modified the source code so that those critical sections are executed transactionally. However, this simple modification proved insufficient to achieve reasonable performance because of excessive data conflicts. We found that most of the data conflicts were caused by application-level optimizations, such as reusing objects to reduce memory usage. After modifying the source code to disable those optimizations, the TM-based applications showed performance higher than or competitive with the lock-based applications. Another finding is that the number of variables that actually cause data conflicts is much smaller than the number of variables that can be accessed in critical sections. This implies that performance tuning of TM-based applications may be easier than that of lock-based applications, where we need to take care of all of the variables that can be accessed in the critical sections.


Symposium on Applications and the Internet | 2004

JSP splitting for improving execution performance

Takuya Nakaike; Goh Kondoh; Hiroaki Nakamura; Fumihiko Kitayama; Shinichi Hirose

Splitting a JSP (JavaServer Pages) page into fragments can improve its execution performance when the Web application server can separately cache the Web page fragments obtained by executing the JSP fragments. If a JSP page is split into fragments according to the update frequency of each portion of the resulting Web page, then not all of the JSP fragments need to be executed again when only a single cached part of the Web page expires. In addition, the fragments of a JSP page can be reused by other JSP pages. In both cases, the execution results of all of the JSP fragments split from a JSP page must be the same as those of the JSP page before it was split. In this paper, we propose JSP splitting, a method of splitting a JSP page into fragments while maintaining the data and control dependences existing in the original page. JSP splitting automatically detects the portions needed to maintain the data and control dependences of a JSP page for the portions that developers want to split off from the page. We implemented JSP splitting in a GUI tool, and confirmed that the split JSP fragments were executed in the same way as the JSP page before the split. Experimental results show that the response time to access a Web page can be reduced by splitting a JSP page into fragments and setting different caching periods for the Web page fragments obtained by executing the JSP fragments.


IEEE International Symposium on Workload Characterization | 2014

Characterization of call-graph profiles in Java workloads

Takuya Nakaike; Hiroshi Inoue; Toshio Suganuma; Moriyoshi Ohara

Method inlining is a standard optimization in optimizing compilers. Java Just-In-Time (JIT) compilers use a size threshold for their inlining decisions: they favor inlining methods that are smaller than the threshold and do not inline larger methods. This has been a simple and effective strategy, but we should be able to make even better inlining decisions with a cost-benefit model, given accurate and lightweight call-graph profiles. We characterized the call-graph profiles of twelve SPECjvm2008 benchmarks with two metrics, method size and call frequency, to study opportunities for a better tradeoff between performance benefit and compilation cost than the existing size-based inlining algorithm achieves in a dynamic compilation environment. We found that inlining methods smaller than a size threshold is not always beneficial for performance: most of the call-graph edges to the small methods had quite low call frequencies, so excluding the low-weight call-graph edges from inlining is an opportunity to reduce the compilation cost. On the other hand, a few call-graph edges to methods larger than the threshold had high call frequencies, and inlining those highly weighted call-graph edges is an opportunity to improve performance. We developed a new cost-benefit-based inlining algorithm to capture these opportunities and make inlining decisions more cost-effective. Our evaluation showed that the new algorithm improves performance significantly while reducing the compilation cost, compared to the widely used size-based inlining algorithm.
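
As a hedged sketch of what such a decision rule can look like (the paper's actual model and constants are not reproduced here; every name and number below is illustrative), compare a size-only test with one that weighs a call edge's profiled frequency against the compilation cost of inlining the callee:

```c
#include <stdbool.h>

struct call_edge {
    long call_frequency;   /* profiled calls along this edge */
    int  callee_size;      /* size of the callee method */
};

#define SIZE_THRESHOLD   35    /* classic size-only rule */
#define BENEFIT_PER_CALL 4     /* saved call/return overhead per call */
#define COST_PER_BYTE    10    /* extra compilation cost per inlined byte */

/* The widely used rule: inline anything small enough. */
static bool inline_by_size(const struct call_edge *e)
{
    return e->callee_size <= SIZE_THRESHOLD;
}

/* A cost-benefit rule: inline a large but hot callee,
   and skip a small but cold one. */
static bool inline_by_cost_benefit(const struct call_edge *e)
{
    return e->call_frequency * (long)BENEFIT_PER_CALL
         > e->callee_size * (long)COST_PER_BYTE;
}
```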
