Gregor Sievers | Researchain

Publication

Featured research published by Gregor Sievers.

networking architecture and storages | 2010

Design Space Exploration for Memory Subsystems of VLIW Architectures

Thorsten Jungeblut; Gregor Sievers; Mario Porrmann; Ulrich Rückert

In this work we present a design space exploration of the memory subsystem of our configurable CoreVA VLIW architecture. The development of resource-efficient processor architectures is based on a two-stage tool flow using a high-level processor specification as a reference. We evaluate several memory configurations, such as one or two memory ports, as well as different write-miss-allocation modes. Applications ranging from an LTE protocol stack and baseband processing to cryptography and multimedia are evaluated in terms of execution time and energy efficiency. Analyses have shown that an application-specific configuration of the memory subsystem can improve energy by up to 25%. Our environment allows the rapid profiling and evaluation of algorithms to choose the most efficient configuration.

international symposium on system-on-chip | 2014

A communication model and partitioning algorithm for streaming applications for an embedded MPSoC

Wayne Kelly; Martin Flasskamp; Gregor Sievers; Johannes Ax; Jianing Chen; Christian Klarhorst; Christoph Ragg; Thorsten Jungeblut; Andrew Sorensen

Energy-efficient embedded computing enables new application scenarios in mobile devices, such as software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource-efficient VLIW CPUs. Programming this number of CPU cores is a complex task requiring compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-built hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the streaming programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single CPU implementation. Comparison with a state-of-the-art partitioning algorithm shows an average performance improvement of 34.07%.
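The abstract does not describe how Smart SA works internally, so the following is only a minimal sketch of plain simulated annealing applied to a partitioning problem, not the authors' algorithm. It assumes a toy cost model (the load of the busiest core, ignoring communication costs), and the names `makespan` and `anneal_partition` as well as the per-filter costs are hypothetical.

```python
import math
import random


def makespan(assignment, costs, num_cores):
    """Load of the busiest core: a toy stand-in for the throughput bottleneck."""
    loads = [0.0] * num_cores
    for task, core in enumerate(assignment):
        loads[core] += costs[task]
    return max(loads)


def anneal_partition(costs, num_cores, steps=5000, t_start=10.0, t_end=0.01, seed=0):
    """Generic simulated annealing over task-to-core assignments (illustrative only)."""
    rng = random.Random(seed)
    current = [rng.randrange(num_cores) for _ in costs]
    current_cost = makespan(current, costs, num_cores)
    best, best_cost = current[:], current_cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Propose moving one randomly chosen task to a random core.
        task = rng.randrange(len(costs))
        old_core = current[task]
        current[task] = rng.randrange(num_cores)
        new_cost = makespan(current, costs, num_cores)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = current[:], new_cost
        else:
            current[task] = old_core  # reject: undo the move
    return best, best_cost


costs = [5, 3, 8, 2, 7, 4, 6, 1, 9, 2, 3, 5]  # hypothetical per-filter work
assignment, bottleneck = anneal_partition(costs, num_cores=4)
speedup = sum(costs) / bottleneck  # relative to running everything on one core
print(assignment, bottleneck, speedup)
```

In this toy model the speedup can never exceed the number of cores. A realistic partitioner for the CoreVA-MPSoC would also have to account for communication between cores and clusters, which this sketch deliberately leaves out.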

embedded and ubiquitous computing | 2014

CoreVA: A Configurable Resource-Efficient VLIW Processor Architecture

Boris Hübener; Gregor Sievers; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert

Mobile signal processing applications have a limited energy budget and require resource-efficient processing elements. General purpose VLIW CPUs offer a high energy efficiency and allow for the execution of a wide range of applications in this domain. In this work we present the configurable 32-bit VLIW processor architecture CoreVA. Besides the number of issue slots, it allows for a fine-grained configuration of the number and characteristics of the processor's functional units (e.g., ALUs, MACs, or LD/ST units). A design-space exploration is performed to evaluate how these functional units impact area and power consumption. The basic configuration with one ALU, MAC, DIV, and LD/ST unit has a power consumption of 11.796 mW and an area of 0.142 mm² at a clock frequency of 750 MHz in a 28 nm FD-SOI process. The maximum clock frequency in this process node is 833 MHz. To relate the hardware requirements to the possible performance gains of an application, a signal processing algorithm is used as a benchmark to evaluate the energy consumption of different hardware configurations. The lowest energy consumption is observed with a configuration of 4 issue slots using 4 ALUs, 4 MACs, and 2 LD/ST units. This is an improvement by a factor of 1.68 compared to the single issue slot configuration.
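As a rough back-of-the-envelope reading of the figures for the basic configuration, the reported power and clock frequency imply an average energy per clock cycle, and power and area imply a power density. The derived values below are simple arithmetic on the numbers in the abstract, not results reported by the authors, and the variable names are illustrative.

```python
# Figures quoted in the abstract for the basic CoreVA configuration.
power_w = 11.796e-3   # 11.796 mW
freq_hz = 750e6       # 750 MHz
area_mm2 = 0.142      # 0.142 mm²

# Average energy per clock cycle: E = P / f, converted to picojoules.
energy_per_cycle_pj = power_w / freq_hz * 1e12

# Power density in mW per mm².
power_density_mw_per_mm2 = power_w * 1e3 / area_mm2

print(f"{energy_per_cycle_pj:.3f} pJ per cycle")
print(f"{power_density_mw_per_mm2:.2f} mW/mm²")
```

This works out to roughly 15.7 pJ per cycle and about 83 mW/mm², assuming the quoted power is the average at 750 MHz.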

international symposium on circuits and systems | 2015

Evaluation of interconnect fabrics for an embedded MPSoC in 28 nm FD-SOI

Gregor Sievers; Johannes Ax; Nils Kucza; Martin Flaßkamp; Thorsten Jungeblut; Wayne Kelly; Mario Porrmann; Ulrich Rückert

Embedded many-core architectures contain dozens to hundreds of CPU cores that are connected via a highly scalable NoC interconnect. Our Multiprocessor-System-on-Chip CoreVA-MPSoC combines the advantages of tightly coupled bus-based communication with the scalability of NoC approaches by adding a CPU cluster as an additional level of hierarchy. In this work, we analyze different cluster interconnect implementations with 8 to 32 CPUs and compare them in terms of resource requirements and performance to hierarchical NoC approaches. Using 28 nm FD-SOI technology, the area requirement for 32 CPUs and an AXI crossbar is 5.59 mm², including 23.61% for the interconnect, at a clock frequency of 830 MHz. In comparison, a hierarchical MPSoC with 4 CPU clusters and 8 CPUs in each cluster requires only 4.83 mm², including 11.61% for the interconnect. To evaluate the performance, we use a compiler for streaming applications to map programs to the different MPSoC configurations. We use this approach for a design-space exploration to find the most efficient architecture and partitioning for an application.
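The two area figures can be split into an interconnect part and a remainder using the percentages given in the abstract. The following short calculation is only arithmetic on those quoted numbers, and the helper name `split_area` is hypothetical.

```python
def split_area(total_mm2, interconnect_share):
    """Split a total area into (interconnect, everything else) in mm²."""
    interconnect = total_mm2 * interconnect_share
    return interconnect, total_mm2 - interconnect


# 32 CPUs on a single AXI crossbar: 5.59 mm², 23.61% interconnect.
flat_ic, flat_rest = split_area(5.59, 0.2361)
# 4 clusters of 8 CPUs with a NoC: 4.83 mm², 11.61% interconnect.
hier_ic, hier_rest = split_area(4.83, 0.1161)

print(f"flat:         interconnect {flat_ic:.2f} mm², rest {flat_rest:.2f} mm²")
print(f"hierarchical: interconnect {hier_ic:.2f} mm², rest {hier_rest:.2f} mm²")
```

Taken at face value, the interconnect shrinks from about 1.32 mm² to about 0.56 mm², while the remaining area is about 4.27 mm² in both cases, so the saving of roughly 0.76 mm² comes almost entirely from the interconnect.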

2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015

Comparison of Shared and Private L1 Data Memories for an Embedded MPSoC in 28nm FD-SOI

Gregor Sievers; Julian Daberkow; Johannes Ax; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert

Tightly coupling CPUs within clusters allows for low latency, high bandwidth communication in MPSoCs. Our CoreVA-MPSoC integrates multiple clusters using an on-chip network. In this work we introduce a shared L1 data memory that can be accessed by all CPUs of a cluster with low latency. A Mesh-of-Trees (MoT) and a crossbar are presented as topologies for the shared memory interconnect. A cluster that integrates the shared L1 memory is compared with an architecture that features an AXI interconnect and a local L1 data memory for each CPU. In addition, we consider an architecture that integrates both. We present implementation results using a 28 nm FD-SOI standard cell technology. The shared L1 memory shows similar area results compared to the local memory architecture. Place and route results of a cluster with 8 CPUs, 128 kB local and 128 kB shared L1 data memory divided into 16 memory banks show a frequency of 728 MHz and an area of 1.77 mm². To map programs to the different CPU cluster configurations, a compiler for streaming applications is used. An architecture with both local and shared L1 data memory and 4 memory banks shows the best performance results in combination with a high resource efficiency.

ieee sensors | 2013

Pareto-optimal signal processing on low-power microprocessors

Peter Christ; Gregor Sievers; Julian Einhaus; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert

Miniaturised wireless body sensors equipped with low-power microcontrollers are used in various energy-constrained applications. The signal-processing algorithms often have to run in real time on a low computational and memory budget. In this paper we present a framework for the exploration of the design space of resource-efficient signal processing suitable for embedded processors. Using a velocity estimation algorithm for an athlete, we show which configurations of the algorithm perform best with respect to classification accuracy and runtime. Altering the sampling frequency, the feature combination, the classifier (Artificial Neural Network (ANN), Decision Tree (DT)), or the classifier's parametrisation, we obtained 15 Pareto-optimal configurations out of 1008 simulations. The highest classification accuracy of 93.92% was obtained using an ANN, and required 22422 clock cycles per classification. The lowest cycle count of 204 was obtained with a DT configuration which resulted in 84.66% accuracy.
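To illustrate what Pareto-optimal means here, the sketch below filters a list of configurations so that only those remain for which no other configuration is at least as accurate and at least as fast, and strictly better in one of the two. Only the two extreme points (93.92% at 22422 cycles, 84.66% at 204 cycles) come from the abstract; the other entries, the names, and the `pareto_front` helper are hypothetical.

```python
def pareto_front(configs):
    """Keep configs not dominated by any other (maximise accuracy, minimise cycles)."""
    front = []
    for name, acc, cycles in configs:
        dominated = any(
            (a >= acc and c <= cycles) and (a > acc or c < cycles)
            for _, a, c in configs
        )
        if not dominated:
            front.append((name, acc, cycles))
    return front


configs = [
    ("ann_best", 93.92, 22422),   # from the abstract
    ("dt_fastest", 84.66, 204),   # from the abstract
    ("hyp_a", 90.0, 5000),        # hypothetical
    ("hyp_b", 88.0, 9000),        # hypothetical, dominated by hyp_a
    ("hyp_c", 84.0, 300),         # hypothetical, dominated by dt_fastest
]

front = pareto_front(configs)
front_names = [name for name, _, _ in front]
print(front_names)
```

The quadratic scan is fine for on the order of a thousand simulations, as in the study; larger design spaces would typically sort by one objective first.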

rapid simulation and performance evaluation methods and tools | 2016

Performance estimation of streaming applications for hierarchical MPSoCs

Martin Flasskamp; Gregor Sievers; Johannes Ax; Christian Klarhorst; Thorsten Jungeblut; Wayne Kelly; Michael Thies; Mario Porrmann

Parallel programming and effective partitioning of applications for embedded many-core architectures require optimization algorithms. However, these algorithms have to quickly evaluate thousands of different partitions. We present a fast performance estimator embedded in a parallelizing compiler for streaming applications. The estimator combines a single execution-based simulation and an analytic approach. Experimental results demonstrate that the estimator has a mean error of 2.6% and computes its estimation 2848 times faster than a cycle-accurate simulator.

network on chip architectures | 2015

System-Level Analysis of Network Interfaces for Hierarchical MPSoCs

Johannes Ax; Gregor Sievers; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann

Network Interfaces (NIs) are used in Multiprocessor System-on-Chips (MPSoCs) to connect CPUs to a packet switched Network-on-Chip. In this work we introduce a new NI architecture for our hierarchical CoreVA-MPSoC. The CoreVA-MPSoC targets streaming applications in embedded systems. The main contribution of this paper is a system-level analysis of different NI configurations, considering both software and hardware costs for NoC communication. Different configurations of the NI are compared using a benchmark suite of 10 streaming applications. The best performing NI configuration shows an average speedup of 20 for a CoreVA-MPSoC with 32 CPUs compared to a single CPU. Furthermore, we present physical implementation results using a 28 nm FD-SOI standard cell technology. A hierarchical MPSoC with 8 CPU clusters and 4 CPUs in each cluster running at 800 MHz requires an area of 4.56 mm².

norchip | 2013

Design-space exploration of the configurable 32 bit VLIW processor CoreVA for signal processing applications

Gregor Sievers; Peter Christ; Julian Einhaus; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert

In this paper we present the results of a design-space exploration for a classification algorithm with respect to the inherent parallelism of the CoreVA CPU. The CoreVA is a configurable VLIW processor which has been mainly designed for energy-constrained applications. Energy-efficient signal processing is essential for real-time applications on wireless body sensors (WBSs). Using a velocity-estimation algorithm for a runner as an example, we show which hardware and algorithm configurations perform best with respect to classification accuracy, runtime, and energy consumption. We obtained 9 Pareto-optimal configurations out of 504 simulations. The highest classification accuracy of 93.4% requires 34687 clock cycles and has an energy consumption of 1.559 μJ. The lowest energy requirement of 0.015 μJ per classification is observed with a Pareto-optimal configuration at 76.3% accuracy. The three-issue VLIW configuration shows the best results with respect to the area-energy trade-off.

Computing Platforms for Software-Defined Radio | 2017

The CoreVA-MPSoC: A Multiprocessor Platform for Software-Defined Radio

Gregor Sievers; Boris Hübener; Johannes Ax; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann

The advancement in performance of mobile devices goes hand in hand with increasing demand for communication bandwidth. In the past decade an almost unmanageable number of different wireless communication standards has emerged. In addition, the complexity of many of those standards has led to a steadily increasing demand for high performance modem signal processing. Future SDR baseband processing can significantly benefit from the massive parallelism provided by homogeneous many-core architectures. In this chapter, the CoreVA-MPSoC is presented as an example of an embedded hierarchical multiprocessor architecture for SDR processing. Parallelism is introduced at different levels of the CoreVA-MPSoC: the basic building block is the resource-efficient VLIW processor CoreVA, providing fine-grained concurrency at the instruction level. Multiple CoreVA CPUs are combined within a CPU cluster and connected via a high speed, low latency interconnect. Finally, a dedicated Network on Chip is used to combine an arbitrary number of CPU clusters on a single chip. In addition to the hardware architecture, an MPSoC compiler for streaming applications is presented and used to map SDR applications to the CoreVA-MPSoC under throughput, latency, energy, and memory constraints.
