Publication


Featured research published by Tohru Nojiri.


international solid-state circuits conference | 2010

A 45nm 37.3GOPS/W heterogeneous multi-core SoC

Yoichi Yuyama; Masayuki Ito; Yoshikazu Kiyoshige; Yusuke Nitta; Shigezumi Matsui; Osamu Nishii; Atsushi Hasegawa; Makoto Ishikawa; Tetsuya Yamada; Junichi Miyakoshi; Koichi Terada; Tohru Nojiri; Masashi Satoh; Hiroyuki Mizuno; Kunio Uchiyama; Yasutaka Wada; Keiji Kimura; Hironori Kasahara; Hideo Maejima

We develop a heterogeneous multi-core SoC for applications such as digital TV systems with IP networks (IP-TV), including image recognition and database search. Figure 5.3.1 shows the chip features. This SoC is capable of decoding 1080i audio/video data using part of the SoC (one general-purpose CPU core, a video processing unit called VPU5, and a sound processing unit called SPU) [1]. Four dynamically reconfigurable processors called FE [2] are integrated and have a total theoretical performance of 41.5GOPS with a power consumption of 0.76W. Two 1024-way matrix processors called MX-2 [3] are integrated and have a total theoretical performance of 36.9GOPS with a power consumption of 1.10W. Overall, the performance per watt of our SoC is 37.3GOPS/W at 1.15V, the highest among comparable processors [4–6] excluding special-purpose codecs. The operation granularities of the CPU, FE, and MX-2 are 32bit, 16bit, and 4bit, respectively, so the most appropriate processor can be assigned to each task. Compared to homogeneous multi-core SoCs [4], a heterogeneous multi-core approach is one of the most promising ways to attain high performance at low frequency, and hence low power, for consumer electronics and scientific applications. For example, for the image-recognition application in the IP-TV system, the FEs are assigned to the optical-flow calculation [7] on VGA (640×480) video data at 15fps, which requires 0.62GOPS. The MX-2s are used for face detection and calculation of feature quantities of the VGA video data at 15fps, which requires 30.6GOPS. In addition, the general-purpose CPU cores are used for database search based on the results of the above operations, which requires further enhancement of CPU performance. The automatic parallelization compiler analyzes the parallelism of the data flow, generates coarse-grain tasks, and schedules the tasks onto the general-purpose CPUs and FEs so as to minimize execution time while accounting for data-transfer overhead.
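As a rough cross-check of the figures quoted above, the efficiency of each accelerator class follows directly from the stated theoretical GOPS and power numbers. The short Python calculation below is a back-of-the-envelope sketch based only on those numbers, not on anything additional from the paper.

    # Back-of-the-envelope check of the accelerator efficiency figures in the
    # abstract (theoretical peak performance in GOPS, power in watts).
    fe  = {"gops": 41.5, "watts": 0.76}   # four FE reconfigurable processors, total
    mx2 = {"gops": 36.9, "watts": 1.10}   # two MX-2 matrix processors, total

    for name, unit in (("FE", fe), ("MX-2", mx2)):
        print(f"{name}: {unit['gops'] / unit['watts']:.1f} GOPS/W")

    # Combined accelerator efficiency; the chip-level 37.3GOPS/W at 1.15V also
    # accounts for the CPU cores, VPU5, SPU, and interconnect.
    total_gops = fe["gops"] + mx2["gops"]
    total_watts = fe["watts"] + mx2["watts"]
    print(f"FE + MX-2 combined: {total_gops / total_watts:.1f} GOPS/W")

Plugging in the numbers gives roughly 54.6GOPS/W for the FEs and 33.5GOPS/W for the MX-2s; the chip-level 37.3GOPS/W is lower because it also covers the CPU cores and the other on-chip units.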


international symposium on microarchitecture | 2009

Domain Partitioning Technology for Embedded Multicore Processors

Tohru Nojiri; Yuki Kondo; Naohiko Irie; Masayuki Ito; Hajime Sasaki; Hideo Maejima

Today's embedded systems require both real-time control functions and IT functions. Integrating multiple operating systems on a multicore processor is one way to meet these requirements. However, in this approach, one operating system's failure can bring down the other operating systems. To address this issue, the authors propose a multidomain embedded-system architecture with a physical partitioning controller.
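The abstract leaves the mechanism abstract, but the idea of a physical partitioning controller can be illustrated with a small model: each bus master belongs to a domain, and every physical access is checked against that domain's permitted regions before it reaches memory or a device. The Python sketch below is hypothetical; the class name, region granularity, and API are assumptions for illustration, not the authors' design.

    # Hypothetical model of a physical partitioning controller: an access from
    # a domain is allowed only inside that domain's own address regions.
    class PartitionController:
        def __init__(self):
            # domain id -> list of (start, end, writable) physical regions
            self.regions = {}

        def assign(self, domain, start, end, writable=True):
            self.regions.setdefault(domain, []).append((start, end, writable))

        def check(self, domain, addr, is_write):
            for start, end, writable in self.regions.get(domain, []):
                if start <= addr < end and (writable or not is_write):
                    return True
            return False  # fault is reported instead of corrupting another domain

    ppc = PartitionController()
    ppc.assign("realtime", 0x4000_0000, 0x4800_0000)        # control-OS memory
    ppc.assign("it",       0x8000_0000, 0xA000_0000)        # IT-OS memory
    assert ppc.check("realtime", 0x4000_1000, is_write=True)
    assert not ppc.check("it", 0x4000_1000, is_write=True)  # cross-domain write blocked

In this model a failing IT-domain operating system cannot overwrite the real-time domain's memory, which is the property the proposed architecture aims to guarantee in hardware.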


Heterogeneous Multicore Processor Technologies for Embedded Systems | 2012

Heterogeneous Multicore Processor Technologies for Embedded Systems

Kunio Uchiyama; Fumio Arakawa; Hironori Kasahara; Tohru Nojiri; Hideyuki Noda; Yasuhiro Tawara; Akio Idehara; Kenichi Iwata; Hiroaki Shikano

To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores, including CPU cores and special-purpose processor cores that achieve a high degree of arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. It provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and a software point of view; discusses a new, high-performance and energy-efficient approach to designing SoCs for digitally converged embedded systems; covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs; and describes three chips developed according to the defined heterogeneous multicore architecture, including their chip implementations, software environments, and working applications.


international symposium on computer architecture | 1986

Microprogrammable processor for object-oriented architecture

Tohru Nojiri; Shumpei Kawasaki; Kousuke Sakoda

An advanced microprocessor has been developed for the high-performance execution of object-oriented language programs. In object-oriented languages, improving frequent or complex operations such as dynamic type checking, procedure calls, and storage management contributes to an increase in overall performance. To improve their performance, the microprocessor adopts large on-chip register files, a large EPROM for the microstore, and ingenious instruction-dispatching and tag-handling mechanisms. By treating frequently accessed data specially, i.e., allocating activation records in the register files, much of the data traffic can be effectively localized within the chip, and the complexity of procedure calls as well as the burden imposed on storage management can be alleviated. The tag-handling mechanisms efficiently perform dynamic type checking. As a result, the microprocessor, together with an efficient microprogram, executes object-oriented language programs much faster than existing computers. Furthermore, it can efficiently execute other high-level languages, especially AI languages, by using corresponding microprograms.
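To make the tag-handling idea concrete, the sketch below models dynamically typed values as (tag, payload) pairs and dispatches an add operation on the tags, trapping to a generic handler on a tag mismatch. This is an illustrative Python model of tag-based dynamic type checking in general, not the microprocessor's actual microprogram or tag format.

    # Illustrative model of tag-based dynamic type checking: every value carries
    # a type tag, and the fast path is taken only when the tags match.
    INT, FLOAT, OBJ = range(3)

    def generic_add(a, b):
        # slow path, standing in for a microcoded or software handler
        return (FLOAT, float(a[1]) + float(b[1]))

    def tagged_add(a, b):
        tag_a, val_a = a
        tag_b, val_b = b
        if tag_a == tag_b == INT:      # fast path: the kind of check tag hardware accelerates
            return (INT, val_a + val_b)
        if tag_a == tag_b == FLOAT:
            return (FLOAT, val_a + val_b)
        return generic_add(a, b)       # tag mismatch: fall back to the slow path

    print(tagged_add((INT, 2), (INT, 3)))      # prints (0, 5): an INT-tagged result
    print(tagged_add((INT, 2), (FLOAT, 0.5)))  # prints (1, 2.5): a FLOAT-tagged result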


2012 International Green Computing Conference (IGCC) | 2012

Cooling efficiency aware workload placement using historical sensor data on IT-facility collaborative control

Masayoshi Mase; Jun Okitsu; Eiichi Suzuki; Tohru Nojiri; Kentaro Sano; Hayato Shimizu

A priority metric for IT equipment is proposed, and a method for extracting it from historical sensor data is devised. The proposed method consists of classifying cooling efficiency from sampled sensor data and calculating the priority metric from statistics on these cooling-efficiency classes. A proof-of-concept experiment using the priority metric was conducted in a server room whose thermal environment included IT equipment units and cooling facilities. The results of the experiment indicate that IT-workload consolidation based on the proposed metric reduces the variation in server-room temperatures.
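The abstract describes the method only at a high level; one plausible reading is sketched below in Python: each sensor sample is assigned a cooling-efficiency class (here, by temperature rise per unit of IT load) and a per-server priority is computed from statistics over those classes. The thresholds, class scores, and choice of statistic are assumptions for illustration, not values from the paper.

    # Hypothetical sketch: derive a workload-placement priority for each server
    # from historical (inlet_temp_rise_degC, power_kW) sensor samples.
    from statistics import mean

    def efficiency_class(temp_rise, power):
        ratio = temp_rise / power          # degrees of heating per kW of IT load
        if ratio < 1.0:
            return "good"
        if ratio < 2.0:
            return "fair"
        return "poor"

    def priority(samples):
        """Higher value -> better candidate for additional workload."""
        score = {"good": 1.0, "fair": 0.5, "poor": 0.0}
        return mean(score[efficiency_class(t, p)] for t, p in samples)

    history = {"server-a": [(0.8, 1.2), (0.9, 1.1)],
               "server-b": [(2.5, 1.0), (2.2, 0.9)]}
    ranked = sorted(history, key=lambda s: priority(history[s]), reverse=True)
    print(ranked)   # consolidate workload onto the servers ranked first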


Archive | 2012

Heterogeneous Multicore Architecture

Kunio Uchiyama; Fumio Arakawa; Hironori Kasahara; Tohru Nojiri; Hideyuki Noda; Yasuhiro Tawara; Akio Idehara; Kenichi Iwata; Hiroaki Shikano

In order to satisfy the high-performance and low-power requirements of advanced embedded systems with greater flexibility, it is necessary to develop parallel processing on chips by taking advantage of the advances being made in semiconductor integration. Figure 2.1 illustrates the basic architecture of our heterogeneous multicore [1, 2]. Several low-power CPU cores and special-purpose processor (SPP) cores, such as a digital signal processor, a media processor, and a dynamically reconfigurable processor, are embedded on a chip. In the figure, the number of CPU cores is m. There are two types of SPP cores, SPPa and SPPb, on the chip, and the values n and k represent the respective numbers of SPPa and SPPb cores.

Each processor core includes a processing unit (PU), a local memory (LM), and a data transfer unit (DTU) as its main elements. The PU executes various kinds of operations. For example, in a CPU core, the PU includes arithmetic units, register files, a program counter, and control logic, and executes machine instructions. In some SPP cores, such as the dynamically reconfigurable processor, the PU processes a large quantity of data in parallel using its array of arithmetic units. The LM is a small, low-latency memory and is mainly accessed by the PU in the same core during the PU's execution. Some cores may have caches as well as an LM, or may have only caches without an LM. The LM is necessary to meet the real-time requirements of embedded systems: the access time to a cache is non-deterministic because of cache misses, whereas access to an LM is deterministic. By putting a program and its data in the LM, we can accurately estimate the execution cycles of a program that has hard real-time requirements.

The data transfer unit (DTU) is embedded in each core so that internal operations in the core and data transfers between cores and memories can be executed in parallel. Each PU processes the data in its LM or cache, while the DTU simultaneously executes memory-to-memory data transfers between cores. The DTU is similar to a direct memory access controller (DMAC): it executes a command that transfers data between several kinds of memories, checks and waits for the end of the transfer, and so on. Some DTUs are capable of command chaining, in which multiple commands are executed in order.

The frequency and voltage controller (FVC) connected to each core controls the frequency, voltage, and power supply of that core independently and reduces the total power consumption of the chip. If the frequencies or power supplies of the core's PU, DTU, and LM can be controlled independently, the FVC can vary them individually; for example, it can stop the PU's clock while keeping the DTU and LM running when the core is executing only data transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory that is shared by the cores. Each core is connected to the on-chip interconnect, which may be one of several types of buses or crossbar switches. The chip is also connected to the off-chip main memory, which has a large capacity but high latency.
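The structure described above maps naturally onto a small data model. The sketch below uses the chapter's terminology (CPU, SPPa, SPPb, PU, LM, DTU, FVC, CSM), but every concrete number in it is an assumption chosen only to make the example run; it is not a description of RP-1, RP-2, or RP-X.

    # Illustrative data model of the heterogeneous multicore architecture above:
    # m CPU cores plus n SPPa and k SPPb cores, each with a PU, an LM, and a DTU,
    # sharing an on-chip CSM.
    from dataclasses import dataclass, field

    @dataclass
    class Core:
        kind: str            # "CPU", "SPPa" (e.g. a DSP), "SPPb" (e.g. a reconfigurable array)
        lm_kb: int           # local memory: small, with deterministic access latency
        has_cache: bool      # some cores have caches instead of, or besides, an LM
        dtu_chaining: bool   # whether the core's DTU supports command chaining
        freq_mhz: int        # per-core frequency, set by the FVC

    @dataclass
    class Chip:
        cores: list = field(default_factory=list)
        csm_kb: int = 1024   # on-chip shared memory used by all cores

    def build_chip(m=2, n=4, k=2):
        chip = Chip()
        chip.cores += [Core("CPU", 32, True, False, 600) for _ in range(m)]
        chip.cores += [Core("SPPa", 128, False, True, 300) for _ in range(n)]
        chip.cores += [Core("SPPb", 256, False, True, 300) for _ in range(k)]
        return chip

    print(len(build_chip().cores), "cores on chip")  # prints: 8 cores on chip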


2013 IEEE COOL Chips XVI | 2013

Hardware support for resource partitioning in real-time embedded systems

Tetsuro Honmura; Yuki Kondoh; Tetsuya Yamada; Masashi Takada; Takumi Nitoh; Tohru Nojiri; Keisuke Toyama; Yasuhiko Saitoh; Hirofumi Nishi; Mikiko Sato; Mitaro Namiki

Today's embedded systems require multiple functions, such as real-time control and information technology, and integrating these functions on a multi-core processor is one effective solution. However, this approach increases overhead because resources must be partitioned in order to protect them. We developed hardware support called ExVisor/XVS to reduce the overhead of partitioning resources while preserving real-time characteristics. It features a physical address management module (PAM) that performs direct address translation using a single-level page table based on an embedded system's memory usage. We evaluated the overhead of a virtual machine's (VM) resource access through register-transfer-level (RTL) simulation and implementation on a field-programmable gate array (FPGA); the overhead was less than 5.6% compared with the resource-access time of a single-core processor.
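The key mechanism in the abstract is direct address translation through a single-level page table sized to an embedded system's memory usage. The sketch below is a hypothetical Python model of that kind of one-step lookup; the page size, table layout, and permission bits are assumptions, not the ExVisor/XVS register-level design.

    # Hypothetical model of single-level page-table translation for partitioning
    # a VM's physical addresses: a single table lookup, no multi-level walk.
    PAGE_SHIFT = 12                    # assume 4KiB pages
    PAGE_MASK = (1 << PAGE_SHIFT) - 1

    class SingleLevelPAM:
        def __init__(self, num_pages):
            # each entry: (machine_frame, writable) or None if unmapped
            self.table = [None] * num_pages

        def map(self, guest_page, machine_frame, writable=True):
            self.table[guest_page] = (machine_frame, writable)

        def translate(self, guest_addr, is_write=False):
            entry = self.table[guest_addr >> PAGE_SHIFT]
            if entry is None or (is_write and not entry[1]):
                raise PermissionError("access outside the VM's partition")
            return (entry[0] << PAGE_SHIFT) | (guest_addr & PAGE_MASK)

    pam = SingleLevelPAM(num_pages=256)
    pam.map(guest_page=3, machine_frame=0x42)
    print(hex(pam.translate(0x3010)))  # prints 0x42010

Because the table is indexed directly by the guest page number, translation costs a single bounded lookup, which is the kind of property that keeps resource-access overhead small and deterministic.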


Archive | 2012

Application Programs and Systems

Kunio Uchiyama; Fumio Arakawa; Hironori Kasahara; Tohru Nojiri; Hideyuki Noda; Yasuhiro Tawara; Akio Idehara; Kenichi Iwata; Hiroaki Shikano

This chapter describes the evaluation of a heterogeneous multicore architecture using a widely used advanced audio coding (AAC) [1] audio encoder implemented on a fabricated chip. The AAC encoder is used for audio playback by various embedded systems. A processing scheme for the heterogeneous multicore architecture that exploits its hierarchical memories and data transfer units was newly investigated, and the execution time and power consumption of the encoding were measured.
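One likely ingredient of such a processing scheme, overlapping DTU transfers with computation on the local memories, is double buffering. The Python sketch below illustrates that generic pattern only; the function names and block sizes are placeholders, not the chapter's actual AAC encoder implementation.

    # Generic double-buffering sketch: while the processor core encodes one
    # local-memory buffer, the data transfer unit fills the other one.
    def dtu_load(frame_index):
        """Stand-in for a DTU transfer from main memory into local memory."""
        return [frame_index] * 1024        # one block of audio samples (dummy data)

    def encode(block):
        """Stand-in for encoding one block on the processor core."""
        return sum(block)                  # placeholder for the real AAC work

    def encode_stream(num_frames):
        buffers = [dtu_load(0), None]      # prefetch the first frame
        results = []
        for i in range(num_frames):
            if i + 1 < num_frames:
                # on the chip this transfer runs concurrently with encode()
                buffers[(i + 1) % 2] = dtu_load(i + 1)
            results.append(encode(buffers[i % 2]))
        return results

    print(len(encode_stream(8)), "frames encoded")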


Archive | 1997

VLIW system with predicated instruction execution for individual instruction fields

Koichi Terada; Keiji Kojima; Yoshifumi Fujikawa; Tohru Nojiri; Kiyokazu Nishioka


Archive | 2002

Network topology management system, management apparatus, management method, management program, and storage media that records management program

Yasunori Kaneda; Tohru Nojiri
