Thorsten Jungeblut
Bielefeld University
Publication
Featured research published by Thorsten Jungeblut.
IEEE Journal of Solid-state Circuits | 2013
Sven Lütkemeier; Thorsten Jungeblut; Hans Kristian Otnes Berge; Snorre Aunet; Mario Porrmann; Ulrich Rückert
An energy-efficient SoC with 32 b subthreshold RISC processor cores, 32 kB conventional cache memory, and 9T ultra-low voltage (ULV) SRAM based on a flexible and extensible architecture was fabricated on a 2.7 mm² test chip in 65 nm low power CMOS. The processor cores are based on a custom standard cell library that was designed using a multiobjective approach to optimize noise margins, switching energy, and propagation delay simultaneously. The cores operate over a supply voltage range from 200 mV (best samples) to 1.2 V with clock frequencies from 10 kHz to 94 MHz at room temperature. The lowest energy consumption per cycle of 9.94 pJ is observed at 325 mV and 133 kHz. A 2 kb ULV SRAM macro achieves minimum energy per operation at averages of 321 mV (0.030 σ/μ), 567 fJ (0.037 σ/μ), and 730 kHz (0.184 σ/μ), for an equal number of 32 b read/write operations. The off-chip performance and power management subsystem provides dynamic voltage and frequency scaling (DVFS) combined with an adaptive supply voltage generation for dynamic PVT compensation.
International Solid-State Circuits Conference | 2012
Sven Lütkemeier; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert
In recent years, subthreshold operation has become a research focus for digital systems with a limited energy budget (e.g., mobile, battery-powered devices, radio frequency identification (RFID), wireless sensor networks, or biomedical applications). Subthreshold operation allows for such low power consumption by reducing the supply voltage of the circuit below the threshold voltage of the transistors. As dynamic power depends quadratically on supply voltage, and static power depends exponentially on supply voltage, considerable power savings are achieved. At the same time, propagation delays increase due to reduced transistor currents. Effectively, energy consumption per cycle can typically be reduced by a factor of 10 using subthreshold operation.
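The quadratic dependence of dynamic energy on supply voltage mentioned above can be illustrated with a minimal Python sketch. It is an assumption-laden simplification, not the authors' model: it covers only the dynamic (switching) component, and the function name and the example voltages are chosen purely for illustration.

```python
def dynamic_energy_ratio(v_nominal, v_scaled):
    """Factor by which dynamic switching energy per cycle shrinks when the
    supply voltage drops from v_nominal to v_scaled.

    Assumes E_dyn is proportional to C * V^2 with unchanged switched
    capacitance. Leakage energy and the longer cycle time at low voltage
    are deliberately ignored.
    """
    if v_nominal <= 0 or v_scaled <= 0:
        raise ValueError("supply voltages must be positive")
    return (v_nominal / v_scaled) ** 2


if __name__ == "__main__":
    # Illustrative voltages only (volts).
    for v in (1.2, 0.6, 0.4):
        print(f"{v:.2f} V -> dynamic energy reduced by x{dynamic_energy_ratio(1.2, v):.2f}")
```

Because cycle time grows as the voltage drops, leakage energy per cycle rises and offsets part of this gain, so the total saving is smaller than the purely quadratic estimate suggests.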
networking architecture and storages | 2010
Thorsten Jungeblut; Gregor Sievers; Mario Porrmann; Ulrich Rückert
In this work we present a design space exploration of the memory subsystem of our configurable CoreVA VLIW architecture. The development of resource-efficient processor architectures is based on a two-stage tool flow using a high-level processor specification as a reference. We evaluate several memory configurations, such as one or two memory ports, as well as different write-miss-allocation modes. Applications ranging from the LTE protocol stack and baseband processing to cryptography and multimedia are evaluated in terms of execution time and energy efficiency. Analyses have shown that the application-specific configuration of the memory subsystem can improve energy by up to 25%. Our environment allows the rapid profiling and evaluation of algorithms to choose the most efficient configuration.
international symposium on system-on-chip | 2014
Wayne Kelly; Martin Flasskamp; Gregor Sievers; Johannes Ax; Jianing Chen; Christian Klarhorst; Christoph Ragg; Thorsten Jungeblut; Andrew Sorensen
Energy-efficient embedded computing enables new application scenarios in mobile devices like software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource-efficient VLIW CPUs. Programming this number of CPU cores is a complex task requiring compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-built hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the streaming programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single CPU implementation. Comparison with a state-of-the-art partitioning algorithm shows an average performance improvement of 34.07%.
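For readers unfamiliar with simulated-annealing partitioning, the following Python sketch shows the generic textbook technique applied to assigning tasks to CPU cores. It is not Smart SA, whose details the abstract does not give. The cost function (largest per-core load), the geometric cooling schedule, and all names and parameters are assumptions made for illustration.

```python
import math
import random


def makespan(assignment, costs, n_cores):
    """Load of the busiest core for a given task-to-core assignment."""
    loads = [0.0] * n_cores
    for task, core in enumerate(assignment):
        loads[core] += costs[task]
    return max(loads)


def anneal_partition(costs, n_cores, steps=20000, t_start=10.0, t_end=0.01, seed=0):
    """Generic simulated annealing: move one task at a time, accept worse
    moves with probability exp(-delta / temperature)."""
    rng = random.Random(seed)
    current = [rng.randrange(n_cores) for _ in costs]
    current_cost = makespan(current, costs, n_cores)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        task = rng.randrange(len(costs))
        old_core = current[task]
        current[task] = rng.randrange(n_cores)
        new_cost = makespan(current, costs, n_cores)
        delta = new_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = list(current), new_cost
        else:
            current[task] = old_core  # reject the move
    return best, best_cost


if __name__ == "__main__":
    task_costs = [4, 3, 3, 2, 2, 2]  # hypothetical per-task cycle counts
    mapping, cost = anneal_partition(task_costs, n_cores=2)
    print(mapping, cost)
```

A real streaming compiler would also have to account for communication between tasks placed on different cores, which this sketch leaves out.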
embedded and ubiquitous computing | 2014
Boris Hübener; Gregor Sievers; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert
Mobile signal processing applications have a limited energy budget and require resource-efficient processing elements. General purpose VLIW CPUs offer a high energy efficiency and allow for the execution of a wide range of applications in this domain. In this work we present the configurable 32-bit VLIW processor architecture CoreVA. Besides the number of issue slots, it allows for a fine-grained configuration of the amount and characteristics of the processor's functional units (e.g., ALUs, MACs, or LD/ST units). A design-space exploration is performed to evaluate how these functional units impact area and power consumption. The basic configuration with one ALU, MAC, DIV, and LD/ST unit has a power consumption of 11.796 mW and an area of 0.142 mm² at a clock frequency of 750 MHz in a 28 nm FD-SOI process. The maximum clock frequency in this process node is 833 MHz. To relate the hardware requirements to possible performance gains of the application, a signal processing algorithm is used as a benchmark to evaluate the energy consumption of different hardware configurations. The lowest energy consumption is observed with a configuration of 4 issue slots using 4 ALUs, 4 MACs, and 2 LD/ST units. This is an improvement by a factor of 1.68 compared to the single issue slot configuration.
international symposium on circuits and systems | 2012
Thorsten Jungeblut; Johannes Ax; Mario Porrmann; Ulrich Rückert
In this work we propose a TCMS (Tightly Coupled Mesochronous Synchronizer)-based architecture of Globally-Asynchronous Locally-Synchronous (GALS) Networks-on-Chip (NoCs). The NoC is based on the GigaNoC approach, a scalable NoC featuring packet-switched wormhole routing. At a clock frequency of 750 MHz a link bandwidth of up to 6 GByte/s is achieved. To provide a high computational performance, the processing engines (PEs) are based on the CoreVA VLIW architecture. The resource efficiency of mesochronous (TCMS-based) and asynchronous (FIFO-based) communication links is analyzed. In addition, an asynchronous coupling of the PE to the switch boxes is evaluated. This allows for multi-voltage/multi-frequency scenarios, where the performance of each PE is adapted to the current performance requirements. Analyses have shown that TCMS-based communication links and asynchronously coupled PEs allow for the high efficiency of GALS-based NoCs with moderate additional resource requirements.
international symposium on circuits and systems | 2015
Gregor Sievers; Johannes Ax; Nils Kucza; Martin Flaßkamp; Thorsten Jungeblut; Wayne Kelly; Mario Porrmann; Ulrich Rückert
Embedded many-core architectures contain dozens to hundreds of CPU cores that are connected via a highly scalable NoC interconnect. Our Multiprocessor-System-on-Chip CoreVA-MPSoC combines the advantages of tightly coupled bus-based communication with the scalability of NoC approaches by adding a CPU cluster as an additional level of hierarchy. In this work, we analyze different cluster interconnect implementations with 8 to 32 CPUs and compare them in terms of resource requirements and performance to hierarchical NoC approaches. Using 28 nm FD-SOI technology, the area requirement for 32 CPUs and an AXI crossbar is 5.59 mm², including 23.61% for the interconnect, at a clock frequency of 830 MHz. In comparison, a hierarchical MPSoC with 4 CPU clusters and 8 CPUs in each cluster requires only 4.83 mm², including 11.61% for the interconnect. To evaluate the performance, we use a compiler for streaming applications to map programs to the different MPSoC configurations. We use this approach for a design-space exploration to find the most efficient architecture and partitioning for an application.
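The area figures quoted above can be split into interconnect and remaining area with a short calculation. The Python sketch below assumes that each percentage refers to the share of the total area reported for that configuration; the function name is hypothetical.

```python
def split_area(total_mm2, interconnect_percent):
    """Return (interconnect area, remaining area) in mm^2, assuming the
    percentage is a share of the given total area."""
    interconnect = total_mm2 * interconnect_percent / 100.0
    return interconnect, total_mm2 - interconnect


# Figures taken from the abstract above.
flat_ic, flat_rest = split_area(5.59, 23.61)   # 32 CPUs with AXI crossbar
hier_ic, hier_rest = split_area(4.83, 11.61)   # 4 clusters of 8 CPUs each

if __name__ == "__main__":
    print(f"crossbar:     interconnect {flat_ic:.2f} mm^2, rest {flat_rest:.2f} mm^2")
    print(f"hierarchical: interconnect {hier_ic:.2f} mm^2, rest {hier_rest:.2f} mm^2")
```

Under this reading, the remaining (non-interconnect) area is nearly identical in both configurations, so the area difference comes almost entirely from the interconnect.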
international embedded systems symposium | 2009
Ralf Dreesen; Thorsten Jungeblut; Michael Thies; Mario Porrmann; Uwe Kastens; Ulrich Rückert
During a typical development process of an embedded application-specific processor (ASIP), the architecture is implemented multiple times on different levels of abstraction. As a result of this redundant specification, certain inconsistencies may show up. For example, the implementation of an instruction in the simulator may differ from the HDL implementation. To detect such inconsistencies, we use register trace comparison. Our key contribution is a generic method for systematic trace synchronization. To this end, we convert a micro-architectural trace into an architectural trace. This method considers pipeline hazards and non-uniform write latencies. To simplify the validation of a processor, we have further implemented an automatic validation environment that includes a tool which points the developer directly to erroneous instructions. The flow has been validated during the development of our CoreVA architecture for mobile applications.
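The idea of comparing register traces after synchronization can be sketched in Python as follows. This is a hypothetical illustration rather than the authors' tool: it assumes every register write in the micro-architectural trace carries a program-order sequence number, so that writes completing out of order (because of non-uniform latencies) can be put back into program order. The `Write` record and both function names are invented for this example.

```python
from typing import NamedTuple


class Write(NamedTuple):
    cycle: int   # cycle in which the write became visible in the HDL model
    seq: int     # program-order sequence number of the writing instruction
    reg: str     # destination register
    value: int   # value written


def to_architectural(micro_trace):
    """Drop timing information and restore program order."""
    ordered = sorted(micro_trace, key=lambda w: w.seq)
    return [(w.seq, w.reg, w.value) for w in ordered]


def first_mismatch(reference, candidate):
    """Index of the first differing entry, or None if the traces agree."""
    for index, (ref, got) in enumerate(zip(reference, candidate)):
        if ref != got:
            return index
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    # Simulator trace, already in program order.
    sim_trace = [(0, "r1", 3), (1, "r2", 7)]
    # HDL trace: the second instruction finishes before the first one.
    hdl_trace = [Write(4, 1, "r2", 7), Write(5, 0, "r1", 3)]
    print(first_mismatch(sim_trace, to_architectural(hdl_trace)))
```

The returned index identifies the first instruction whose register write differs between the two models, which is the kind of pointer a developer needs to locate an erroneous instruction.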
2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015
Gregor Sievers; Julian Daberkow; Johannes Ax; Martin Flasskamp; Wayne Kelly; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert
Tightly coupling CPUs within clusters allows for low latency, high bandwidth communication in MPSoCs. Our CoreVA-MPSoC integrates multiple clusters using an on-chip network. In this work we introduce a shared L1 data memory that can be accessed by all CPUs of a cluster with low latency. A Mesh-of-Trees (MoT) and a crossbar as topology for the shared memory interconnect are presented. A cluster that integrates the shared L1 memory is compared with an architecture that features an AXI interconnect and a local L1 data memory for each CPU. In addition, we consider an architecture that integrates both. We present implementation results using a 28 nm FD-SOI standard cell technology. The shared L1 memory shows similar area results compared to the local memory architecture. Place and route results of a cluster with 8 CPUs, 128 kB local and 128 kB shared L1 data memory divided into 16 memory banks show a frequency of 728 MHz and an area of 1.77 mm². To map programs to the different CPU cluster configurations, a compiler for streaming applications is used. An architecture with both local and shared L1 data memory and 4 memory banks shows the best performance results in combination with a high resource efficiency.
ieee sensors | 2013
Peter Christ; Gregor Sievers; Julian Einhaus; Thorsten Jungeblut; Mario Porrmann; Ulrich Rückert
Miniaturised wireless body sensors equipped with low-power microcontrollers are used in various energy-constrained applications. The signal-processing algorithms often need to run in real time on a low computational and memory budget. In this paper we present a framework for the exploration of the design space of resource-efficient signal processing suitable for embedded processors. Using a velocity estimation algorithm for an athlete, we show which configurations of the algorithm perform best with respect to classification accuracy and runtime. Altering the sampling frequency, the feature combination, the classifier (Artificial Neural Network (ANN), Decision Tree (DT)), or the classifier's parametrisation, we obtained 15 Pareto-optimal configurations out of 1008 simulations. The highest classification accuracy of 93.92% was obtained using an ANN, and required 22422 clock cycles per classification. The lowest cycle count of 204 was obtained with a DT configuration which resulted in 84.66% accuracy.
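Selecting Pareto-optimal configurations, as described above, means keeping every configuration that no other configuration beats on accuracy and cycle count at the same time. The Python sketch below shows a straightforward dominance filter. Only the two extreme points come from the abstract; the other entries and all names are hypothetical placeholders.

```python
def pareto_front(configs):
    """Keep configurations not dominated by any other one.

    Each entry is (name, accuracy_percent, cycles). Higher accuracy and
    lower cycle count are better. A configuration is dominated if another
    one is at least as good in both objectives and strictly better in one.
    """
    front = []
    for name, acc, cycles in configs:
        dominated = any(
            (a >= acc and c <= cycles) and (a > acc or c < cycles)
            for _, a, c in configs
        )
        if not dominated:
            front.append((name, acc, cycles))
    return front


if __name__ == "__main__":
    candidates = [
        ("ANN (reported best accuracy)", 93.92, 22422),
        ("DT (reported lowest cycles)", 84.66, 204),
        ("hypothetical A", 84.00, 500),     # worse than the DT in both objectives
        ("hypothetical B", 90.00, 30000),   # worse than the ANN in both objectives
    ]
    for entry in pareto_front(candidates):
        print(entry)
```

The pairwise check is quadratic in the number of configurations, which is unproblematic for a set on the order of the 1008 simulations mentioned.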