Johannes Ax | Researchain

Publication

Featured researches published by Johannes Ax.

international symposium on system-on-chip | 2014

A communication model and partitioning algorithm for streaming applications for an embedded MPSoC

Wayne Kelly; Martin Flasskamp; Gregor Sievers; Johannes Ax; Jianing Chen; Christian Klarhorst; Christoph Ragg; Thorsten Jungeblut; Andrew Sorensen

Energy-efficient embedded computing enables new application scenarios in mobile devices, such as software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource-efficient VLIW CPUs. Programming this number of CPU cores is a complex task that requires compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-built hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the streaming programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single-CPU implementation. Comparison with a state-of-the-art partitioning algorithm shows an average performance improvement of 34.07%.
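To illustrate the general idea of simulated-annealing partitioning mentioned in the abstract above, here is a minimal Python sketch. It is a plain textbook SA loop, not the Smart SA algorithm from the paper. The cost model (load of the busiest core plus a fixed penalty per cross-core edge), the cooling schedule, and all names (`partition_cost`, `anneal_partition`, `comm_cost`) are assumptions made for this example only.

```python
import math
import random


def partition_cost(assign, work, edges, comm_cost):
    """Load of the busiest core; each cross-core edge is charged to both endpoints."""
    load = {}
    for task, core in enumerate(assign):
        load[core] = load.get(core, 0.0) + work[task]
    for src, dst in edges:
        if assign[src] != assign[dst]:
            load[assign[src]] += comm_cost
            load[assign[dst]] += comm_cost
    return max(load.values())


def anneal_partition(work, edges, n_cores, comm_cost=1.0,
                     steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Generic simulated annealing over task-to-core assignments (illustrative only)."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_cores) for _ in work]
    cost = partition_cost(assign, work, edges, comm_cost)
    best, best_cost = list(assign), cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Neighbour move: reassign one randomly chosen task to a random core.
        task = rng.randrange(len(work))
        old_core = assign[task]
        assign[task] = rng.randrange(n_cores)
        new_cost = partition_cost(assign, work, edges, comm_cost)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(assign), cost
        else:
            assign[task] = old_core  # undo the rejected move
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical 7-stage streaming pipeline: per-task work and producer/consumer edges.
    work = [4, 3, 3, 2, 2, 1, 1]
    edges = [(i, i + 1) for i in range(len(work) - 1)]
    best, best_cost = anneal_partition(work, edges, n_cores=2)
    print("assignment:", best)
    print("estimated speedup:", sum(work) / best_cost)
```

In this toy model the estimated speedup is the single-core work divided by the bottleneck load, which loosely mirrors how the abstract reports speedup relative to a single CPU.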

international symposium on circuits and systems | 2012

A TCMS-based architecture for GALS NoCs

Thorsten Jungeblut; Johannes Ax; Mario Porrmann; Ulrich Rückert

In this work we propose a TCMS (Tightly Coupled Mesochronous Synchronizer)-based architecture for Globally-Asynchronous Locally-Synchronous (GALS) Networks-on-Chip (NoCs). The NoC is based on the GigaNoC approach, a scalable NoC featuring packet-switched wormhole routing. At a clock frequency of 750 MHz, a link bandwidth of up to 6 GByte/s is achieved. To provide high computational performance, the processing engines (PEs) are based on the CoreVA VLIW architecture. The resource efficiency of mesochronous (TCMS-based) and asynchronous (FIFO-based) communication links is analyzed. In addition, an asynchronous coupling of the PEs to the switch boxes is evaluated. This allows for multi-voltage/multi-frequency scenarios, where the performance of each PE is adapted to the current performance requirements. Analyses have shown that TCMS-based communication links and asynchronously coupled PEs allow for highly efficient GALS-based NoCs with moderate additional resource requirements.
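As a quick sanity check on the figures in the abstract above, the following Python snippet relates link bandwidth to clock frequency. It assumes one transfer per clock cycle and 1 GByte = 10^9 bytes; neither assumption is stated in the abstract, and the function name `bytes_per_cycle` is hypothetical.

```python
def bytes_per_cycle(bandwidth_bytes_per_s, clock_hz):
    """Payload moved per clock cycle, assuming one transfer per cycle."""
    return bandwidth_bytes_per_s / clock_hz


if __name__ == "__main__":
    per_cycle = bytes_per_cycle(6e9, 750e6)  # 6 GByte/s at 750 MHz
    print(per_cycle, "bytes per cycle,", per_cycle * 8, "bits per cycle")
```

Under these assumptions the reported numbers correspond to 8 bytes (64 bits) per cycle. The abstract itself does not state the link width.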

international symposium on circuits and systems | 2015

Evaluation of interconnect fabrics for an embedded MPSoC in 28 nm FD-SOI

Gregor Sievers; Johannes Ax; Nils Kucza; Martin Flaßkamp; Thorsten Jungeblut; Wayne Kelly; Mario Porrmann; Ulrich Rückert

Embedded many-core architectures contain dozens to hundreds of CPU cores that are connected via a highly scalable NoC interconnect. Our Multiprocessor-System-on-Chip CoreVA-MPSoC combines the advantages of tightly coupled bus-based communication with the scalability of NoC approaches by adding a CPU cluster as an additional level of hierarchy. In this work, we analyze different cluster interconnect implementations with 8 to 32 CPUs and compare them in terms of resource requirements and performance to hierarchical NoC approaches. Using 28 nm FD-SOI technology, the area requirement for 32 CPUs and an AXI crossbar is 5.59 mm², including 23.61% for the interconnect, at a clock frequency of 830 MHz. In comparison, a hierarchical MPSoC with 4 CPU clusters and 8 CPUs in each cluster requires only 4.83 mm², including 11.61% for the interconnect. To evaluate the performance, we use a compiler for streaming applications to map programs to the different MPSoC configurations. We use this approach for a design-space exploration to find the most efficient architecture and partitioning for an application.
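The area figures in the abstract above can be broken down with simple arithmetic. The Python sketch below only multiplies the reported totals by the reported interconnect shares. The helper name `area_breakdown` is hypothetical, and the results are derived values, not numbers quoted from the paper.

```python
def area_breakdown(total_mm2, interconnect_share):
    """Split a total area into (interconnect, remainder) in mm^2."""
    interconnect = total_mm2 * interconnect_share
    return interconnect, total_mm2 - interconnect


if __name__ == "__main__":
    configs = {
        "32 CPUs, single AXI crossbar": (5.59, 0.2361),
        "4 clusters x 8 CPUs, hierarchical": (4.83, 0.1161),
    }
    for name, (total, share) in configs.items():
        ic, rest = area_breakdown(total, share)
        print(f"{name}: interconnect {ic:.2f} mm^2, remainder {rest:.2f} mm^2")
```

The interconnect comes out at roughly 1.32 mm² for the flat crossbar and 0.56 mm² for the hierarchical design, while the remainder is about 4.27 mm² in both cases. That is consistent with both configurations containing 32 CPUs, so the area difference is essentially the interconnect.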

2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015

Comparison of Shared and Private L1 Data Memories for an Embedded MPSoC in 28nm FD-SOI

Gregor Sievers; Julian Daberkow; Johannes Ax; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert

Tightly coupling CPUs within clusters allows for low-latency, high-bandwidth communication in MPSoCs. Our CoreVA-MPSoC integrates multiple clusters using an on-chip network. In this work we introduce a shared L1 data memory that can be accessed by all CPUs of a cluster with low latency. A Mesh-of-Trees (MoT) and a crossbar are presented as topologies for the shared memory interconnect. A cluster that integrates the shared L1 memory is compared with an architecture that features an AXI interconnect and a local L1 data memory for each CPU. In addition, we consider an architecture that integrates both. We present implementation results using a 28 nm FD-SOI standard cell technology. The shared L1 memory shows similar area results compared to the local memory architecture. Place and route results of a cluster with 8 CPUs, 128 kB local and 128 kB shared L1 data memory divided into 16 memory banks show a frequency of 728 MHz and an area of 1.77 mm². To map programs to the different CPU cluster configurations, a compiler for streaming applications is used. An architecture with both local and shared L1 data memory and 4 memory banks shows the best performance results in combination with a high resource efficiency.

rapid simulation and performance evaluation methods and tools | 2016

Performance estimation of streaming applications for hierarchical MPSoCs

Martin Flasskamp; Gregor Sievers; Johannes Ax; Christian Klarhorst; Thorsten Jungeblut; Wayne Kelly; Michael Thies; Mario Porrmann

Parallel programming and effective partitioning of applications for embedded many-core architectures require optimization algorithms. However, these algorithms have to quickly evaluate thousands of different partitions. We present a fast performance estimator embedded in a parallelizing compiler for streaming applications. The estimator combines a single execution-based simulation and an analytic approach. Experimental results demonstrate that the estimator has a mean error of 2.6% and computes its estimation 2848 times faster than a cycle-accurate simulator.

network on chip architectures | 2015

System-Level Analysis of Network Interfaces for Hierarchical MPSoCs

Johannes Ax; Gregor Sievers; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann

Network Interfaces (NIs) are used in Multiprocessor Systems-on-Chip (MPSoCs) to connect CPUs to a packet-switched Network-on-Chip. In this work we introduce a new NI architecture for our hierarchical CoreVA-MPSoC. The CoreVA-MPSoC targets streaming applications in embedded systems. The main contribution of this paper is a system-level analysis of different NI configurations, considering both software and hardware costs for NoC communication. Different configurations of the NI are compared using a benchmark suite of 10 streaming applications. The best-performing NI configuration shows an average speedup of 20 for a CoreVA-MPSoC with 32 CPUs compared to a single CPU. Furthermore, we present physical implementation results using a 28 nm FD-SOI standard cell technology. A hierarchical MPSoC with 8 CPU clusters and 4 CPUs in each cluster running at 800 MHz requires an area of 4.56 mm².

Computing Platforms for Software-Defined Radio | 2017

The CoreVA-MPSoC: A Multiprocessor Platform for Software-Defined Radio

Gregor Sievers; Boris Hübener; Johannes Ax; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann

The advancement in performance of mobile devices goes hand in hand with increasing demand for communication bandwidth. In the past decade an almost unmanageable number of different wireless communication standards has emerged. In addition, the complexity of many of those standards has led to a steadily increasing demand for high-performance modem signal processing. Future SDR baseband processing can significantly benefit from the massive parallelism provided by homogeneous many-core architectures. In this chapter, the CoreVA-MPSoC is presented as an example of an embedded hierarchical multiprocessor architecture for SDR processing. Parallelism is introduced at different levels of the CoreVA-MPSoC: the basic building block is the resource-efficient VLIW processor CoreVA, providing fine-grained concurrency at the instruction level. Multiple CoreVA CPUs are combined within a CPU cluster and connected via a high-speed, low-latency interconnect. Finally, a dedicated Network-on-Chip is used to combine an arbitrary number of CPU clusters on a single chip. In addition to the hardware architecture, an MPSoC compiler for streaming applications is presented and utilized for the mapping of SDR applications to the CoreVA-MPSoC under throughput, latency, energy, and memory constraints.

2017 IEEE 11th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | 2017