Kenji Suehiro
NEC
Publication
Featured research published by Kenji Suehiro.
international conference on supercomputing | 1998
Kenji Suehiro; Hitoshi Murai; Yoshiki Seo
This paper describes new fast integer sorting methods for single vector and shared-memory parallel vector computers, based on the bucket sort algorithm. Existing vectorization methods for bucket sort have made great efforts to avoid store conflicts of vector scatter operations, and are therefore not very efficient. The vectorization methods shown in this paper, namely the retry method, the split vector method, and the mask vector method, all actively utilize the nature of the store conflicts to achieve high performance. The parallelization method in this paper uses a feature of shared-memory machines and dynamically changes the partitioning of histogram arrays without any overhead. By combining the retry and the parallelization methods, we obtained the world's fastest results for the IS program (Class B) in the NAS Parallel Benchmarks on the NEC SX4. Our methods are also applicable to a wide range of particle simulation programs.
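The histogram-based approach the paper builds on can be illustrated with a short sketch: a plain counting/bucket sort in which each worker accumulates into a private histogram, so concurrent scatter-store conflicts on shared counters never arise. This is an assumed simplification for illustration, not the paper's retry, split vector, or mask vector methods.

```python
# Sketch of a conflict-free parallel bucket (counting) sort.
# Keys are assumed to be integers in [0, nbuckets).

def bucket_sort(keys, nbuckets, nworkers=4):
    # Phase 1: each worker counts its chunk into a private histogram,
    # so no two workers ever update the same counter.
    chunks = [keys[i::nworkers] for i in range(nworkers)]
    hists = [[0] * nbuckets for _ in range(nworkers)]
    for hist, chunk in zip(hists, chunks):
        for k in chunk:
            hist[k] += 1
    # Phase 2: merge the private histograms into global bucket counts.
    counts = [sum(h[b] for h in hists) for b in range(nbuckets)]
    # Phase 3: prefix-sum the counts, then scatter each key to its bucket.
    starts = [0] * nbuckets
    for b in range(1, nbuckets):
        starts[b] = starts[b - 1] + counts[b - 1]
    out = [0] * len(keys)
    for k in keys:
        out[starts[k]] = k
        starts[k] += 1
    return out
```

On a vector machine the interesting work is in Phase 3, where the scatter stores conflict whenever equal keys land in the same bucket; the paper's methods exploit exactly those conflicts rather than avoiding them.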
parallel computing | 2004
Takashi Yanagawa; Kenji Suehiro
international conference on parallel processing | 1996
Tsunehiko Kamachi; Kazuhiro Kusano; Kenji Suehiro; Yoshiki Seo; Masanori Tamura; Shoichi Sakon
The Earth Simulator (ES) is a large-scale, distributed-memory, parallel computer system consisting of 640 processor nodes (PNs) with shared-memory vector multiprocessors (64 GFLOPS/PN, 5120 APs in total; AP: arithmetic processor). All the nodes are connected via a high-speed (16 GB/s) single-stage crossbar network called the Interconnection Network (IN). The operating system for the Earth Simulator is based on SUPER-UX, the UNIX operating system for the SX series scientific supercomputers. In order to realize high-performance parallel processing on the highly parallel machine, the operating system is enhanced for scalability. The Earth Simulator system is managed as a two-level cluster system called the Super Cluster System. In the Super Cluster System, the Earth Simulator system is divided into 40 clusters (16 PNs/cluster). A single controller called the Super Cluster Control Station (SCCS) manages all these clusters. This management system provides Single System Image (SSI) operation, management, and job control for the large-scale multi-node system. The Job Scheduler (JS) and NQS running on the SCCS control all jobs of the system. They schedule resources such as processing nodes and files which have not usually been treated as scheduling resources. This allows efficient scheduling of large-scale jobs. The MPI library (MPI/ES) and the HPF compiler (HPF/ES) are available for distributed parallel programming on the Earth Simulator. MPI/ES conforms to the MPI 2.0 standard and is optimized to exploit the hardware features. HPF/ES conforms to the core part of HPF 2.0 and supports some features of the HPF 2.0 approved extensions and the HPF/JA 1.0 extensions. HPF/ES suitably handles the three-level parallelism of the Earth Simulator system, that is, vectorization, shared-memory parallelization, and distributed-memory parallelization. Moreover, HPF/ES extends the language to easily handle irregular problems.
Archive | 1994
Yoshiki Seo; Tsunehiko Kamachi; Yukimitsu Watanabe; Kazuhiro Kusano; Kenji Suehiro; Yukimasa Shiroto
This paper presents methods for generating communication when compiling HPF programs for distributed-memory machines. We introduce the concept of an iteration template corresponding to an iteration space. Our HPF compiler performs the loop-iteration mapping through a two-level mapping of the iteration template, in the same way as data mapping is performed in HPF. Making use of this unified mapping model of the data and the loops, communication for nonlocal accesses is handled by data realignment between the user-declared alignment and the optimal alignment, which ensures that only local accesses occur inside the loop. This strategy results in an effective means of dealing with communication for arrays with undefined mapping, a simple scheme for generating communication, and high portability of the HPF compiler. Experimental results on the NEC Cenju-3 distributed-memory machine demonstrate the effectiveness of our approach: the execution time of the compiler-generated program was within 10% of that of the hand-parallelized program.
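The mapping idea can be made concrete with a toy HPF-style BLOCK distribution. The helper names below are hypothetical, chosen for illustration; the compiler itself works on an iteration template rather than these functions.

```python
# Toy HPF-style BLOCK distribution: each of p processors owns one
# contiguous block of an n-element array, and under the owner-computes
# rule a loop iteration runs on the processor that owns the element it
# assigns to. An access whose owner differs from the executing
# processor is nonlocal and requires communication.

def block_size(n, p):
    """Elements per block for an n-element array on p processors."""
    return -(-n // p)  # ceil(n / p)

def owner(i, n, p):
    """Processor that owns global index i."""
    return i // block_size(n, p)

def to_local(i, n, p):
    """Map global index i to (owner, local index on that owner)."""
    b = block_size(n, p)
    return i // b, i % b
```

For example, with n = 10 and p = 4 the block size is 3, so global element 9 lives on processor 3 at local index 0; a loop iteration on processor 0 that reads that element would trigger communication.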
Scientific Programming | 1997
Tsunehiko Kamachi; Andreas Müller; Roland Rühl; Yoshiki Seo; Kenji Suehiro; M. Tamura
This paper presents a performance estimator, a prototype of which is implemented within the parallel programming environment PCASE. The estimation is based on static performance prediction, using not only design information about the target machines but also benchmarking results. Additionally, communication costs are estimated with a hierarchical memory machine model, which lets users see all the underlying communication costs on distributed-memory machines for each parallel loop. Guided by these estimates, users can interactively optimize the data distribution and appropriately choose which loops to vectorize or parallelize. Moreover, the skeleton profiling method is presented: it makes high-speed trace generation (the execution count of each statement) possible by deleting statements that do not affect the execution path of a program.
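The kind of trace the estimator consumes, an execution count per statement, can be sketched with Python's tracing hook. This is an assumed toy stand-in: the paper's skeleton profiler instead speeds up trace generation by deleting statements that cannot change the execution path before running the program.

```python
import sys
from collections import Counter

def trace_counts(func, *args):
    """Run func(*args) and count how many times each line executes."""
    counts = Counter()

    def tracer(frame, event, arg):
        if event == "line":
            counts[frame.f_lineno] += 1
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, counts

def demo(n):
    s = 0
    for i in range(n):
        s += i
    return s
```

Running `trace_counts(demo, 5)` returns the result 10 together with per-line counts in which the loop body line is recorded 5 times; a skeleton profiler would obtain the same counts while skipping the (here trivial) non-control-flow work.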
Concurrency and Computation: Practice and Experience | 2002
Hitoshi Murai; Takuya Araki; Yasuharu Hayashi; Kenji Suehiro; Yoshiki Seo
We have developed a compilation system which extends High Performance Fortran (HPF) in various respects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives give additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows the integration of Message Passing Interface (MPI) primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and the related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.
ieee international conference on high performance computing data and analytics | 2005
Yasuharu Hayashi; Kenji Suehiro
We are developing HPF/SX V2, a High Performance Fortran (HPF) compiler for vector parallel machines. It provides some unique extensions as well as the features of HPF 2.0 and HPF/JA. In particular, this paper describes four of them: (1) the ON directive of HPF 2.0; (2) the REFLECT and LOCAL directives of HPF/JA; (3) vectorization directives; and (4) automatic parallelization. We evaluate these features with several benchmark programs on the NEC SX-5. The results show that each of them achieved a 5-8 times speedup in 8-CPU parallel execution and that the four features are useful for vector parallel execution. We also evaluate the overall performance of HPF/SX V2 using over 30 well-known benchmark programs from HPFBench, the APR Benchmarks, the GENESIS Benchmarks, and the NAS Parallel Benchmarks. About half of the programs showed good performance, while the other half revealed weaknesses of the compiler, especially in its runtime routines, which must be improved to put the compiler to practical use.
Archive | 1999
Kenji Suehiro; Hitoshi Murai
We have developed the HPF (High Performance Fortran) compiler HPF/SX V2 as an interface for distributed-memory parallel programming. HPF is a de facto standard language for parallel programs; in HPF, parallel programs can be written just by inserting comment directives into existing serial Fortran programs. This paper treats two parallelization methods in HPF/SX V2 on an SMP (Symmetric Multiprocessor) cluster system, each node of which is built by connecting multiple vector PEs (Processor Elements) to a shared memory. One is hybrid parallelization, which consists of vectorization on a PE, multi-thread parallelization within a node, and distributed-memory parallelization across nodes. The other is flat parallelization, which consists of vectorization and distributed-memory parallelization only. We compare hybrid parallelization with flat parallelization by evaluating several typical codes. The results show that hybrid parallelization is particularly beneficial when a reduction in memory usage is expected.
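A back-of-the-envelope model, with assumed numbers chosen purely for illustration, shows one reason hybrid parallelization can save memory: arrays that must be replicated are held once per process under flat parallelization but only once per node under hybrid parallelization.

```python
# Illustrative memory model (assumed, not figures from the paper):
# a replicated array needs one copy per distributed-memory rank.
# Flat parallelization runs one rank per PE (nodes * pes_per_node
# copies); hybrid parallelization runs one rank per node, whose
# threads share the single copy (nodes copies).

def replicated_copies(nodes, pes_per_node, hybrid):
    return nodes if hybrid else nodes * pes_per_node

flat_copies = replicated_copies(4, 8, hybrid=False)    # 32 copies
hybrid_copies = replicated_copies(4, 8, hybrid=True)   # 4 copies
```

Under these assumptions a 4-node system with 8 PEs per node holds 8 times fewer replicated copies in hybrid mode, which matches the paper's observation that hybrid parallelization pays off when memory reduction is expected.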
Nec Research & Development | 2003
Kenji Suehiro; Yasuharu Hayashi; Haruhito Hirosawa; Yoshiki Seo
Nec Research & Development | 1998
Yasuharu Hayashi; Shoichi Sakon; Yoshiki Seo; Kenji Suehiro; M. Tamura; H. Murai