Tero Säntti

Information Technology University

Publication

Featured research published by Tero Säntti.

Microprocessors and Microsystems | 2013

Mapping multiple applications with unbounded and bounded number of cores on many-core networks-on-chip

Bo Yang; Liang Guang; Tero Säntti; Juha Plosila

With increasing processing capability and communication scalability, the many-core Network-on-Chip (NoC) provides great potential for massively parallel computing. Running multiple applications simultaneously on a many-core NoC is a promising approach to implement high performance parallel processing. This paper presents a novel methodology for mapping multiple applications adaptively with unbounded or bounded number of cores. Composed of application mapping and task mapping, the proposed two-step mapping methodology provides minimized communication energy consumption and execution time for multiple applications. It is evaluated by several kernels and real applications with a variety of settings on a NoC simulator. The quantitative experiments demonstrate the superior performance and energy efficiency of the proposed mapping methods.

design and diagnostics of electronic circuits and systems | 2010

Tree-model based mapping for energy-efficient and low-latency Network-on-Chip

Bo Yang; Thomas Canhao Xu; Tero Säntti; Juha Plosila

With the NoC size growing constantly, efficient algorithms are needed to provide power/performance-aware task mapping on massively parallel systems. In this paper a novel tree-model based mapping algorithm is proposed, to achieve high energy efficiency and low latency on NoC platforms. A NoC is abstracted as a tree composed of a root node and median nodes at different levels. By mapping tasks starting from the root of the tree, our algorithm minimizes the communication cost and consequently reduces the energy consumption and network delay. Experimental results show that the run-time of our algorithm is decreased by 90% on average compared to the Greedy Incremental (GI) algorithm. Full system simulation also shows that for Radix traffic, compared to the original random mapping, the GI achieves 18.7% and 17.3% reduction in energy consumption and average network latency respectively, while our algorithm achieves 24.7% and 40.8% reduction respectively.
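
The abstract does not define the communication cost that the algorithm minimizes. A common choice in NoC mapping work is traffic volume weighted by hop count on a 2D mesh, and the sketch below assumes that metric. The names `hops` and `communication_cost`, the row-major node numbering, and the traffic format are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Hashable, Tuple


def hops(a: int, b: int, width: int) -> int:
    """Manhattan distance between two nodes of a 2D mesh.

    Assumption: nodes are numbered row by row, so node n sits at
    column n % width and row n // width.
    """
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def communication_cost(
    traffic: Dict[Tuple[Hashable, Hashable], float],
    placement: Dict[Hashable, int],
    width: int,
) -> float:
    """Sum of (traffic volume x hop count) over all communicating task pairs."""
    return sum(
        volume * hops(placement[src], placement[dst], width)
        for (src, dst), volume in traffic.items()
    )


# Hypothetical three-task application on a 3x3 mesh.
traffic = {("a", "b"): 10, ("b", "c"): 5}
clustered = {"a": 0, "b": 1, "c": 2}  # neighbours along the top row
spread = {"a": 0, "b": 8, "c": 2}     # task b moved to the opposite corner

clustered_cost = communication_cost(traffic, clustered, width=3)
spread_cost = communication_cost(traffic, spread, width=3)
```

Under this assumed metric the clustered placement costs 15 and the spread placement costs 50, which is the kind of gap a mapping algorithm tries to close.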

norchip | 2010

Multi-application multi-step mapping method for many-core Network-on-Chips

Bo Yang; Liang Guang; Thomas Canhao Xu; Alexander Wei Yin; Tero Säntti; Juha Plosila

Massively parallel computing performed on many-core Network-on-Chips (NoCs) is the future of computing. One feasible approach to implement parallel computing is to deploy multiple applications on the NoC simultaneously. In this paper, we propose a multi-application mapping method that starts with application mapping, which finds a region on the NoC for each application, followed by task mapping, which maps all tasks of each application into its region. In the application mapping step, several strategies based on the maximal empty rectangle (MER) technique are introduced for finding an optimal region for each application. In the task mapping step, a tree-model based algorithm is used with the purpose of reducing the communication latency and energy consumption. The experimental results show that the proposed method can achieve considerable reduction of network latency and energy consumption (up to 18%) for a given set of applications.
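
The MER-based strategies themselves are not described in the abstract. As a rough illustration of the underlying idea only, the sketch below finds the largest all-free rectangle in a mesh occupancy grid with the standard histogram-and-stack method. The function name, the grid encoding (1 = core occupied, 0 = free), and the return format are assumptions made for this example.

```python
from typing import List, Tuple


def largest_empty_rectangle(occupied: List[List[int]]) -> Tuple[int, int, int, int, int]:
    """Return (area, top, left, height, width) of a largest all-free rectangle.

    For each row, heights[c] counts consecutive free cells ending at that row
    in column c; the largest rectangle under that histogram is then found
    with a stack of column indices.
    """
    rows, cols = len(occupied), len(occupied[0])
    heights = [0] * cols
    best = (0, 0, 0, 0, 0)
    for r in range(rows):
        for c in range(cols):
            heights[c] = 0 if occupied[r][c] else heights[c] + 1
        stack: List[int] = []
        for c in range(cols + 1):
            cur = heights[c] if c < cols else 0  # sentinel flushes the stack
            while stack and heights[stack[-1]] >= cur:
                h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                w = c - left
                if h * w > best[0]:
                    best = (h * w, r - h + 1, left, h, w)
            stack.append(c)
    return best


# Hypothetical 4x4 mesh whose top-left 2x2 block is already in use.
grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
area, top, left, height, width = largest_empty_rectangle(grid)
```

A region found this way could then be handed to a task mapping step. How the paper ranks candidate regions for each application is not stated in the abstract and is not modelled here.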

norchip | 2004

Communication scheme for an advanced Java co-processor

Tero Säntti; Juha Plosila

This paper describes interface strategies for a Java co-processor (from now on JPU). The interface units are interchangeable, and share a common communication scheme towards the co-processor. The first version of the interface is designed for single CPU and single co-processor environment. The other is for a network of multiple CPUs and co-processors. The co-processor does not need to know what kind of environment it is placed in, as all communication goes through the interface unit. This modularity of the design makes the co-processor more reusable and allows system level scalability. This work is a part of a project focusing on design of an advanced Java co-processor for Java intensive SoC applications.

learning and intelligent optimization | 2012

Parameter-optimized simulated annealing for application mapping on networks-on-chip

Bo Yang; Liang Guang; Tero Säntti; Juha Plosila

Application mapping is an important issue in designing systems based on many-core networks-on-chip (NoCs). Simulated Annealing (SA) has often been used to search for an optimized solution to the application mapping problem. The parameters applied in the SA algorithm jointly control the annealing schedule and have great impact on the runtime and the quality of the final solution of the SA algorithm. The optimized parameters should be selected in a systematic way for each particular mapping problem, instead of using an identical set of empirical parameters for all problems. In this work, we apply an optimization method, the Nelder-Mead simplex method, to obtain optimized parameters of SA. The experiment shows that with optimized parameters, we can get an average 237 times speedup of the SA algorithm, compared to the work where the empirical values are used for setting parameters. For the set of benchmarks, the proposed parameter-optimized SA algorithm achieves comparable communication energy consumption using less than 1% of the iterations used in the reference work.
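
To make concrete which parameters shape the annealing schedule, here is a minimal SA sketch for mapping tasks onto a 2D mesh. It assumes a geometric cooling schedule, a swap-based move, and a hop-weighted traffic cost. The parameter names `t0`, `alpha`, and `moves_per_temp` are illustrative, and the paper's actual parameter set and cost model may differ. The Nelder-Mead tuning step is not shown.

```python
import math
import random
from typing import Dict, List, Tuple


def mapping_cost(traffic: Dict[Tuple[int, int], float], order: List[int], width: int) -> float:
    """Traffic volume times Manhattan hop count; task i is placed on node order[i]."""
    total = 0.0
    for (src, dst), volume in traffic.items():
        a, b = order[src], order[dst]
        total += volume * (abs(a % width - b % width) + abs(a // width - b // width))
    return total


def anneal(traffic, n_tasks, width, height, t0, alpha, moves_per_temp, t_end=0.01, seed=0):
    """Simulated annealing with geometric cooling (t <- alpha * t)."""
    rng = random.Random(seed)
    order = list(range(width * height))  # permutation of mesh nodes
    rng.shuffle(order)
    cost = mapping_cost(traffic, order, width)
    best, best_cost = order[:], cost
    t = t0
    while t > t_end:
        for _ in range(moves_per_temp):
            # Swap one task's node with any other node (used or free).
            i, j = rng.randrange(n_tasks), rng.randrange(len(order))
            order[i], order[j] = order[j], order[i]
            new_cost = mapping_cost(traffic, order, width)
            delta = new_cost - cost
            # Metropolis rule: always accept improvements, sometimes accept worse moves.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = order[:], cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the rejected move
        t *= alpha
    return best[:n_tasks], best_cost


# Hypothetical four-task pipeline on a 4x4 mesh.
traffic = {(0, 1): 8, (1, 2): 4, (2, 3): 2}
placement, cost = anneal(traffic, n_tasks=4, width=4, height=4,
                         t0=10.0, alpha=0.9, moves_per_temp=50)
```

In this sketch the runtime is roughly the number of temperature steps times `moves_per_temp`, so the three schedule parameters trade runtime against solution quality. That trade-off is what a parameter search such as Nelder-Mead would tune.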

biennial baltic electronics conference | 2010

Efficient bytecode optimizations for a multicore Java co-processor system

Joonas Tyystjarvi; Tero Säntti; Juha Plosila

As the bytecode produced by the Java compiler is unoptimized, the bytecode generated from certain types of idiomatic Java code is inefficient for execution in an interpreter. This effect is amplified in a co-processor system, in which a single processor must process heap accesses and virtual method calls from multiple threads. Two types of optimizations are presented which can be performed directly on bytecode during class loading and which do not require a large amount of processing time. These optimizations are shown to improve performance greatly, up to 29 % in the Embedded Caffeinemark Benchmark suite. Even higher improvements were measured in multithreaded programs.

convention of electrical and electronics engineers in israel | 2010

Multi-application mapping algorithm for Network-on-Chip platforms

Bo Yang; Liang Guang; Thomas Canhao Xu; Tero Säntti; Juha Plosila

Multi- and many-core architectures have become the mainstream computing platforms for implementing Systems-on-Chip (SoC). To efficiently utilize the abundant processing resources on future many-core Network-on-Chip (NoC) platforms, the design focus should shift from single-application to multi-application scenarios. In this paper, we propose a multiple application mapping algorithm which maps multiple applications simultaneously onto different regions on the NoC. By optimizing the placement of both applications and tasks, the algorithm aims at shortening the average communication distance which in turn achieves lower network latency and energy consumption for a set of applications. The experimental results show that, compared to the random mapping, the proposed algorithm achieves 59% and 58% reductions of average network delay and energy consumption respectively.

genetic and evolutionary computation conference | 2012

t(k)-SA: accelerated simulated annealing algorithm for application mapping on networks-on-chip

Bo Yang; Liang Guang; Tero Säntti; Juha Plosila

The Simulated Annealing (SA) algorithm is a promising method for solving combinatorial optimization problems. The only limitation of applying the SA algorithm to the application mapping problem on many-core networks-on-chip (NoCs) is its low speed. To alleviate this limitation, an accelerated SA algorithm called the tk-SA algorithm is proposed in this work. The tk-SA algorithm starts the annealing process from a lower initial temperature tk with an optimized initial mapping solution. Based on the analysis of the typical behavior of the general SA algorithm, an efficient method is proposed for determining the temperature tk. Quantitative evaluations verify that the method is capable of obtaining an appropriate tk such that the tk-SA algorithm can reproduce the behavior of the full-range SA from temperature tk. Experimental results show that compared with a parameter-optimized SA algorithm, the proposed tk-SA algorithm achieves an average speedup of 1.55 without loss of solution quality.
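
Why a lower starting temperature saves time can be seen with a small calculation. The sketch below assumes a geometric cooling schedule (t multiplied by a constant `alpha` at each step), which the abstract does not confirm, and counts the temperature steps needed from two starting points. The function name and all numeric values are illustrative and are not results from the paper.

```python
def cooling_steps(t_start: float, t_end: float, alpha: float) -> int:
    """Count geometric cooling steps until the temperature drops to t_end or below."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t_start <= 0.0 or t_end <= 0.0:
        raise ValueError("temperatures must be positive")
    steps = 0
    t = t_start
    while t > t_end:
        t *= alpha
        steps += 1
    return steps


# Hypothetical schedule: full-range SA from t0 = 100, accelerated run from tk = 10.
full_range = cooling_steps(t_start=100.0, t_end=1.0, alpha=0.9)
from_tk = cooling_steps(t_start=10.0, t_end=1.0, alpha=0.9)
```

With these made-up values the full-range schedule needs 44 temperature steps and the run starting at tk needs 22. The saving only pays off if the initial mapping is already good enough for temperature tk, which is the condition the abstract says the proposed method is designed to meet.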

international symposium on system-on-chip | 2005

Towards a Formal Power Estimation Framework for Hardware Systems

J. Tuominen; Tero Säntti; Juha Plosila

Conventionally, the correctness of functional and non-functional properties of hardware components is ensured during the design process by simulation. Moreover, different description languages are needed during development phases. Thus, by adopting the Action Systems, we are able to use the same formalism from specification down to implementation. In this study, we exploit the possibilities to formally model power consumption in the Action Systems context. The purpose is to develop a formal power estimation flow, which can be used to monitor the power consumption from abstract level down to the gate level implementation.

2009 International Conference for Technical Postgraduates (TECHPOS) | 2009

Efficient execution of switch instructions on a multicore Java co-processor system

Joonas Tyystjarvi; Tero Säntti; Juha Plosila

Techniques are presented for reducing the performance overhead of switch instructions in a multicore hardware-accelerated Java virtual machine. The bytecode instruction set is extended with two new instructions suitable for hardware implementation and the complicated switch instructions are converted in the software portion of the virtual machine into series of hardware-implemented instructions using these extensions. The performance, logic and memory usage impact of this technique is evaluated and compared with a pure software implementation. Various techniques for performing a key search in lookup switches are also evaluated.
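
The abstract does not say which key search techniques were evaluated. As a point of reference, the JVM `lookupswitch` instruction stores its match keys in ascending order, so a linear scan and a binary search are both possible. The sketch below contrasts the two in software. The function names and the key, target, and default values are hypothetical.

```python
from bisect import bisect_left
from typing import List


def lookup_linear(keys: List[int], targets: List[int], default: int, key: int) -> int:
    """Scan the match keys in order; cost grows with the number of cases."""
    for match, target in zip(keys, targets):
        if match == key:
            return target
    return default


def lookup_binary(keys: List[int], targets: List[int], default: int, key: int) -> int:
    """Binary search over the sorted match keys; cost grows with log2 of the cases."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return targets[i]
    return default


# Hypothetical switch with five sparse cases; targets stand in for branch offsets.
keys = [-5, 0, 3, 40, 1000]
targets = [10, 20, 30, 40, 50]
default_target = 99

hit = lookup_binary(keys, targets, default_target, 40)
miss = lookup_binary(keys, targets, default_target, 7)
```

Both functions return the same target for every key. They differ only in how many comparisons are needed, which matters more as the number of cases grows.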
