
Publication


Featured research published by Takeshi Nanri.


IEEE Antennas and Propagation Magazine | 2012

Development of a CUDA Implementation of the 3D FDTD Method

Matthew Livesey; James F. Stack; Fumie Costen; Takeshi Nanri; Norimasa Nakashima; Seiji Fujino

The use of general-purpose computing on a GPU is an effective way to accelerate the FDTD method. This paper introduces flexibility to the theoretically best available approach. It examines the performance on both Tesla- and Fermi-architecture GPUs, and identifies the best way to determine the GPU parameters for the proposed method.
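The heart of any FDTD implementation, whether on CPU or GPU, is a leapfrog stencil update of the electric and magnetic fields. The sketch below is a minimal 1-D NumPy version for illustration only; the paper's method operates on 3-D grids with CUDA kernels, and the update coefficients here are arbitrary placeholder values, not taken from the paper.

```python
import numpy as np

def fdtd_1d_step(E, H, ce=0.5, ch=0.5):
    """One leapfrog update of a 1-D FDTD grid (illustrative sketch;
    ce and ch are hypothetical update coefficients)."""
    # Update H from the spatial difference of E, then E from H.
    H[:-1] += ch * (E[1:] - E[:-1])
    E[1:-1] += ce * (H[1:-1] - H[:-2])
    return E, H

E = np.zeros(8); E[4] = 1.0   # a single-point excitation
H = np.zeros(8)
E, H = fdtd_1d_step(E, H)
```

On a GPU, each grid point of these array updates maps naturally onto one thread, which is why memory access patterns dominate the achievable performance.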


International Parallel and Distributed Processing Symposium | 2009

A robust dynamic optimization for MPI Alltoall operation

Hyacinthe Nzigou Mamadou; Takeshi Nanri; Kazuaki Murakami

The performance of Message Passing Interface (MPI) collective communications is a critical, widely discussed issue in high-performance computing. In this paper we propose a mechanism that dynamically selects the most efficient MPI Alltoall algorithm for a given system/workload situation. The method starts by grouping the fastest algorithms based on performance prediction models obtained from the point-to-point model P-LogP. Experiments performed on parallel machines equipped with InfiniBand and Gigabit Ethernet interconnects produced encouraging results, with negligible overhead for finding the most appropriate algorithm to carry out the operation. In most cases, the dynamic Alltoall largely outperforms traditional MPI implementations on different platforms.
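Model-driven selection of this kind can be sketched in a few lines: predict each candidate algorithm's runtime from a cost model, then pick the minimum. The algorithm names and formulas below are illustrative stand-ins, not the paper's actual P-LogP models.

```python
import math

def predict_time(algo, p, m, latency, bandwidth):
    """Hypothetical cost models (seconds) for two common Alltoall
    algorithms; names and formulas are illustrative only."""
    if algo == "bruck":      # log-step algorithm, favours small messages
        steps = math.ceil(math.log2(p))
        return steps * (latency + (m * p / 2) / bandwidth)
    if algo == "pairwise":   # p-1 exchange steps, favours large messages
        return (p - 1) * (latency + m / bandwidth)
    raise ValueError(algo)

def select_algorithm(p, m, latency, bandwidth):
    """Pick the candidate with the smallest predicted runtime."""
    return min(("bruck", "pairwise"),
               key=lambda a: predict_time(a, p, m, latency, bandwidth))

small = select_algorithm(p=64, m=16, latency=1e-5, bandwidth=1e9)
large = select_algorithm(p=64, m=1_000_000, latency=1e-5, bandwidth=1e9)
```

With these toy parameters, the log-step algorithm wins for small messages and the pairwise exchange for large ones, which mirrors the message-size crossover the paper's dynamic mechanism exploits.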


Proceedings of the 14th European PVM/MPI User's Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2007

Dynamic optimization of load balance in MPI broadcast

Takeshi Soga; Kouji Kurihara; Takeshi Nanri; Motoyoshi Kurokawa; Kazuaki Murakami

There are many algorithms that compose a broadcast from point-to-point communications, such as binary tree and binomial tree. Though many implementations of these algorithms are provided in MPI libraries such as MPICH [1], most of them assume that all processes begin the broadcast at the same time, so the order of the point-to-point communications is arranged numerically according to the rank of each process. In practice, however, each process starts the broadcast at a different time, mainly because of workload imbalance among processes, and this causes unnecessary waiting. Moreover, in an algorithm such as binomial tree, the amount of communication differs between processes, so the load imbalance is amplified when a heavy workload and heavy communication coincide on the same process. Our method aims to solve these problems dynamically, by profiling the workload of each rank at runtime and adjusting the order of point-to-point communications according to that information.

Among the algorithms used for MPI_Bcast, binomial tree is one of the most popular. It broadcasts a message to all M processes in the group in log M steps of point-to-point communications: at each step, every process that has already received the data sends it to a process that has not. Because the number of send operations differs between ranks, if a heavy workload is assigned to a rank that invokes many sends in the tree, the overall load imbalance lengthens the wait at the ranks that receive the message from the heavy-loaded rank.
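The classic binomial-tree schedule the paper reorders at runtime can be computed directly: in each step, the distance between sender and receiver doubles. The sketch below (root fixed at rank 0, function name illustrative) shows why rank 0 performs the most sends.

```python
def binomial_bcast_schedule(p, root=0):
    """Return the (sender, receiver) pairs of each step of a
    binomial-tree broadcast among p ranks (illustrative sketch)."""
    steps = []
    have = {root}          # ranks that already hold the message
    dist = 1
    while dist < p:
        pairs = [(s, s + dist) for s in sorted(have)
                 if s + dist < p and s + dist not in have]
        for _, r in pairs:
            have.add(r)
        steps.append(pairs)
        dist *= 2
    return steps

sched = binomial_bcast_schedule(8)
```

For p = 8 the schedule has 3 steps and 7 sends in total, with rank 0 sending in every step; assigning a heavy workload to rank 0 would therefore delay the most receivers, which is exactly the imbalance the runtime reordering targets.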


International Symposium on Computing and Networking | 2015

On-the-Fly Automated Storage Tiering with Caching and both Proactive and Observational Migration

Kazuichi Oe; Takeshi Nanri; Koji Okamura

A hybrid storage system consisting of SSDs and HDDs improves the performance of user IOs by moving IO-concentrated areas from the HDDs to the SSDs, so it is important for the system to identify such areas and move them to the SSDs promptly. We studied the IO concentrations of operational storage servers and found one type that continues for up to an hour within a narrow region of the storage volume. The narrow region either stays in the same area or moves to a neighboring one after three minutes on average, and it appears at random logical block addresses (LBAs). We call such IO concentrations wandering IO concentrated areas (WIOCAs). To handle them, we developed on-the-fly automated storage tiering (OTF-AST) with a caching system and both proactive and observational migration. This method dynamically migrates WIOCAs to the SSD tier of OTF-AST and hands the other IO concentrations over to the cache. The proactive migration feature migrates candidate areas that will likely become part of a WIOCA in the near future. The observational migration feature, in turn, estimates the HDD overhead by checking user response times and, when that overhead is low, migrates sub-LUNs that were but are no longer part of a WIOCA from the SSD back to the HDD. Experimental results showed that our method was 113% faster than Facebook FlashCache on one of the Microsoft Research Cambridge workloads.
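The two decisions described above, spotting a hot region and proactively marking its neighbours as migration candidates, can be sketched as follows. Region size, threshold, and the one-region drift assumption are all illustrative choices, not the paper's tuned parameters.

```python
from collections import Counter

def find_hot_regions(accesses, region_size, threshold):
    """Group LBA accesses into fixed-size regions and return those
    whose access count meets a threshold -- a toy stand-in for the
    paper's detection of wandering IO concentrated areas (WIOCAs)."""
    counts = Counter(lba // region_size for lba in accesses)
    return {region for region, c in counts.items() if c >= threshold}

def proactive_candidates(hot_now, hot_prev):
    """A WIOCA tends to drift to a neighbouring region after a few
    minutes, so regions adjacent to hot ones are candidates for
    proactive migration (drift model is an assumption here)."""
    neighbours = {r + d for r in hot_now for d in (-1, 1) if r + d >= 0}
    return neighbours - hot_now - hot_prev

hot = find_hot_regions([5, 6, 7, 5, 6, 900], region_size=8, threshold=3)
cand = proactive_candidates(hot, hot_prev=set())
```

A real tiering engine would track these counters per time window and weigh migration cost against the observed HDD overhead, as the observational feature does.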


High Performance Computing for Computational Science - Vector and Parallel Processing | 2016

The design of advanced communication to reduce memory usage for exa-scale systems

Shinji Sumimoto; Yuichiro Ajima; Kazushige Saga; Takafumi Nose; Naoyuki Shida; Takeshi Nanri

Current MPI (Message Passing Interface) communication libraries require memory in proportion to the number of processes, and therefore cannot be used on exa-scale systems. This paper proposes a global-memory-based communication design that reduces memory usage for exa-scale communication. To realize it, we propose truly global-memory-based communication primitives called Advanced Communication Primitives (ACPs). ACPs provide a global address space with remote atomic memory operations on the global memory, RDMA (Remote Direct Memory Access) based remote memory copy, a global heap allocator, and global data libraries. ACPs differ from other communication libraries in that, being global-memory based, their housekeeping memory can be distributed across processes, and programmers explicitly control memory usage through ACPs. A preliminary result puts the memory usage of ACPs at 70 MB on one million processes.
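The scaling argument is worth making concrete. If a conventional library keeps per-peer state, total memory per process grows linearly with the process count, whereas the reported ACP figure works out to roughly constant per-process housekeeping. The per-peer byte count below is an illustrative assumption, not a measured value.

```python
def mpi_style_memory(p, per_peer_bytes=4096):
    """Per-process memory when per-peer state is kept for all p peers
    (per_peer_bytes is a hypothetical figure for illustration)."""
    return p * per_peer_bytes

def acp_style_memory_per_process():
    """70 MB across one million processes is about 70 bytes of
    housekeeping per process (the paper's reported preliminary result)."""
    return 70

p = 1_000_000
conventional_gb = mpi_style_memory(p) / 2**30
acp_total_mb = acp_style_memory_per_process() * p / 2**20
```

Under these assumptions the conventional design needs gigabytes per process at one million ranks, while the ACP-style total stays near the reported 70 MB, which is the gap the global-memory design is meant to close.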


Parallel, Distributed and Network-Based Processing | 2015

Channel Interface: A Primitive Model for Memory Efficient Communication

Takeshi Nanri; Takeshi Soga; Yuichiro Ajima; Yoshiyuki Morie; Hiroaki Honda; Taizo Kobayashi; Toshiya Takami; Shinji Sumimoto

Though systems are growing towards exa-scale computation, the amount of memory available on each computing node is expected to stay the same or decrease; memory efficiency is therefore becoming an important issue for achieving scalability. This paper points out the problem of memory inefficiency in the de facto standard parallel programming model, the Message Passing Interface (MPI). To solve this problem, the channel interface is introduced. It enables programmers to allocate and de-allocate channels explicitly, so that a program consumes just enough memory for communication. In addition, by keeping the message transfer supported by a channel as simple as possible, both the memory consumption and the message-handling overhead of the interface can be kept minimal. The paper presents a sample implementation of the interface and examines its memory efficiency using models of memory consumption and performance.


International Symposium on Antennas and Propagation | 2012

Impact of GPU memory access patterns on FDTD

Matthew Livesey; James F. Stack; Fumie Costen; Takeshi Nanri; Norimasa Nakashima; Seiji Fujino

The application of general-purpose computing on a GPU is an effective way to accelerate the FDTD method. This work explores the domain decomposition techniques in the literature and extends the theoretically best approach with additional flexibility. We examine the performance on both Tesla- and Fermi-architecture GPUs and identify the best way to determine the GPU parameters for the proposed method.


International Conference on High Performance Computing and Simulation | 2011

Effect of dynamic algorithm selection of Alltoall communication on environments with unstable network speed

Takeshi Nanri; Motoyoshi Kurokawa

As HPC systems grow in size, the performance of collective communications is becoming an important issue. Usually, the algorithm used for a collective communication is chosen according to statically specified thresholds on the message size and the number of processes. However, on recent HPC systems that employ fat-tree or torus interconnect topologies, the network speed has become unpredictable, mainly because of contention, whose effect depends heavily on the relative locations of the compute nodes. At the same time, to reduce the number of idle nodes, job schedulers increasingly assign compute nodes flexibly, without considering their relative positions, which makes network performance unstable. As an approach to finding an appropriate algorithm even in such environments, a dynamic method, STAR-MPI, has been proposed: it examines each algorithm at runtime and uses the empirical data to choose the one suited to the given situation. This paper first examines the effect of STAR-MPI in an environment with unstable network speed. Experiments showed that the dynamic approach was effective, but the cost of testing slow algorithms limited the benefit. The authors then propose an enhancement in which algorithms predicted to be relatively slow are discarded from the list of candidates; the predictions use the performance models of the algorithms together with the latency and bandwidth measured at the first call of the collective communication. At this point, the effect of the enhancement shown in the experimental results is not significant. However, the results indicate that better performance could be achieved with a more cost-effective prediction method and by tuning the thresholds and factors used in the enhancement.
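The proposed enhancement, pruning predicted-slow algorithms before any empirical trials, can be sketched as a filter over modelled runtimes. The cost models and the slack factor below are illustrative assumptions, not the paper's calibrated values.

```python
def prune_candidates(algos, models, latency, bandwidth, p, m, slack=2.0):
    """Drop any algorithm whose modelled time exceeds `slack` times the
    best prediction, so runtime trials only cover plausible winners
    (sketch of the paper's enhancement; parameters are illustrative)."""
    pred = {a: models[a](p, m, latency, bandwidth) for a in algos}
    best = min(pred.values())
    return [a for a in algos if pred[a] <= slack * best]

# Hypothetical models using latency/bandwidth from the first call.
models = {
    "bruck":    lambda p, m, l, b: p.bit_length() * (l + m * p / (2 * b)),
    "pairwise": lambda p, m, l, b: (p - 1) * (l + m / b),
}
kept = prune_candidates(["bruck", "pairwise"], models,
                        latency=1e-5, bandwidth=1e9, p=64, m=16)
```

The surviving candidates would then go through STAR-MPI's usual empirical trials, so the cost of testing clearly slow algorithms is avoided.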


Symposium on Computer Architecture and High Performance Computing | 2007

Performance Analysis and Linear Optimization Modeling of All-to-all Collective Communication Algorithms

Hyacinthe Nzigou Mamadou; G. de Melo Baptista Domingues; Takeshi Nanri; Kazuaki Murakami

The performance of collective communication operations still represents a critical issue for high performance computing systems. Users of parallel machines need a good grasp of how different communication patterns and styles affect the performance of message-passing applications. This paper reports our analysis of collective communication algorithms in the context of the MPI programming paradigm, carried out by extending a standard point-to-point communication model, P-LogP. We focus on MPI Alltoall, since this function is among the most communication-intensive collective operations known. To reduce the gap between predicted and measured run time, all the system parameters are taken into account in the total performance estimation by applying linear regression modeling to the empirical data. Results on InfiniBand clusters show that the final performance prediction models accurately capture the communication behavior of all algorithms, even for large message sizes and large numbers of processors.
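Augmenting an analytical model with linear regression over empirical data amounts to fitting residual system effects as a function of, for example, message size. The measurements below are hypothetical numbers for illustration, not data from the paper.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a + b*x, the kind of
    regression used to correct model predictions with empirical data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical measurements: Alltoall runtime (s) vs message size (bytes).
sizes = [1024, 4096, 16384, 65536]
times = [2.0e-4, 3.5e-4, 9.5e-4, 3.4e-3]
a, b = fit_linear(sizes, times)
pred = a + b * 262144   # extrapolate to a 256 KiB message
```

In the paper's setting the regressors would include all the measured system parameters rather than message size alone, but the fitting step is the same.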


Computational Sciences and Optimization | 2009

A Dynamic Solution for Efficient MPI Collective Communications

Hyacinthe Nzigou Mamadou; Feng Long Gu; Vivien Oddou; Takeshi Nanri; Kazuaki Murakami

The performance of Message Passing Interface collective communication algorithms is a critical issue widely discussed in both academia and industry. To achieve and maintain high performance in MPI implementations even under random system behavior, the collective operations must be adapted to both the cluster platform and the workload of the user program. In this paper we propose DYN_Alltoall, a dynamic version of the traditional MPI_Alltoall implementation, based on performance predictions derived from the P-LogP model. Experiments performed on clusters equipped with different interconnect networks, InfiniBand and Gigabit Ethernet, produced encouraging results with negligible overhead for finding the most appropriate algorithm. In most cases, the dynamic Alltoall largely outperforms the traditional MPI implementations on different platforms.

Collaboration


Dive into Takeshi Nanri's collaborations.

Top Co-Authors


Hiroyuki Sato

Jichi Medical University
