Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Matthias Gries is active.

Publication


Featured research published by Matthias Gries.


international solid-state circuits conference | 2010

A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS

Jason Howard; Saurabh Dighe; Yatin Hoskote; Sriram R. Vangal; David Finan; Gregory Ruhl; David Jenkins; Howard Wilson; Nitin Borkar; Gerhard Schrom; Fabrice Pailet; Shailendra Jain; Tiju Jacob; Satish Yada; Sraven Marella; Praveen Salihundam; Vasantha Erraguntla; Michael Konow; Michael Riepen; Guido Droege; Joerg Lindemann; Matthias Gries; Thomas Apel; Kersten Henriss; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart; Timothy G. Mattson

Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™ class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB on-die shared memory, for increased performance. Power is kept at a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567mm² and is implemented in 45nm high-κ metal-gate CMOS [2].
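
As an aside for readers unfamiliar with the programming model, the sketch below models the idea of cores exchanging messages through small per-tile buffers rather than coherent caches. It is a minimal Python illustration only: the Tile class, the MPB_BYTES constant, and the put/get calls are invented for this sketch and do not correspond to the chip's actual software interface.

```python
from collections import deque
from dataclasses import dataclass, field

MPB_BYTES = 16 * 1024  # 16KB message-passing buffer per tile (figure from the abstract)

@dataclass
class Tile:
    """One tile of the mesh: two cores sharing a small message-passing buffer."""
    tile_id: int
    mpb: deque = field(default_factory=deque)  # FIFO stand-in for the shared buffer
    used: int = 0

    def put(self, payload: bytes) -> bool:
        """Sender copies a message into the destination tile's MPB if space remains."""
        if self.used + len(payload) > MPB_BYTES:
            return False  # buffer full; a real sender would retry or block
        self.mpb.append(payload)
        self.used += len(payload)
        return True

    def get(self) -> bytes | None:
        """Receiver drains one message from its tile's MPB."""
        if not self.mpb:
            return None
        payload = self.mpb.popleft()
        self.used -= len(payload)
        return payload

if __name__ == "__main__":
    tiles = [Tile(i) for i in range(24)]        # 6x4 mesh of 24 tiles, 2 cores per tile
    assert tiles[5].put(b"hello from core 0")   # send side: copy into tile 5's buffer
    print(tiles[5].get())                       # receive side: b'hello from core 0'
```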


IEEE Journal of Solid-state Circuits | 2011

A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling

Jason Howard; Saurabh Dighe; Sriram R. Vangal; Gregory Ruhl; Nitin Borkar; Shailendra Jain; Vasantha Erraguntla; Michael Konow; Michael Riepen; Matthias Gries; Guido Droege; Tor Lund-Larsen; Sebastian Steibl; Shekhar Borkar; Vivek De; Rob F. Van der Wijngaart

This paper describes a multi-core processor that integrates 48 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet-switched router shared between two IA-32 cores. Core-to-core communication uses message passing while exploiting 384 KB of on-die shared memory. Fine-grained power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1 V supply, the cores operate at 1 GHz while the 2D-mesh operates at 2 GHz. As performance and voltage scale, the processor dissipates between 25 W and 125 W. The processor is implemented in 45 nm Hi-K CMOS and has 1.3 billion transistors.
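
To make the island-based DVFS scheme concrete, here is a rough Python sketch of per-island voltage and frequency changes, where a frequency increase is only granted once the island voltage supports it. Apart from the nominal 1 GHz at 1.1 V mentioned in the abstract, the VF_TABLE entries and all function names are illustrative assumptions, not the chip's register interface.

```python
# Assumed voltage/frequency operating points; only the nominal 1 GHz at 1.1 V
# comes from the abstract, the rest is invented for the example.
VF_TABLE = [  # (min_voltage_V, max_core_freq_GHz)
    (0.8, 0.533),
    (0.9, 0.800),
    (1.1, 1.000),
]

def max_freq(voltage: float) -> float:
    """Highest core frequency the given island voltage can sustain."""
    allowed = [f for v, f in VF_TABLE if voltage >= v]
    return max(allowed) if allowed else 0.0

def request_dvfs(voltage_island: dict, freq_island: dict, target_ghz: float) -> None:
    """Raise the island voltage first if needed, then switch the frequency island."""
    needed_v = min((v for v, f in VF_TABLE if f >= target_ghz),
                   default=VF_TABLE[-1][0])
    if voltage_island["volts"] < needed_v:
        voltage_island["volts"] = needed_v  # regulator raises the supply before speed-up
    freq_island["ghz"] = min(target_ghz, max_freq(voltage_island["volts"]))

if __name__ == "__main__":
    vi, fi = {"volts": 0.8}, {"ghz": 0.533}
    request_dvfs(vi, fi, 1.0)
    print(vi, fi)   # voltage raised to 1.1 V so the cores can run at 1.0 GHz
```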


Network Processor Design | 2003

Chapter 4 – Design Space Exploration of Network Processor Architectures

Lothar Thiele; Samarjit Chakraborty; Matthias Gries; Simon Künzli

Network processors (NPs) generally consist of multiple processing units, such as CPU cores, microengines, and dedicated hardware for compute-intensive tasks, together with memory units, caches, interconnections, and I/O interfaces. Following a system-on-a-chip (SoC) design approach, these resources are integrated on a single chip and must interoperate to perform packet-processing tasks at line speed. Determining a suitable hardware and software architecture for such processors involves resource allocation and partitioning decisions. The chapter presents a framework for the design space exploration of embedded systems. Architecture exploration and evaluation of network processors involve many tradeoffs and a complex interplay between hardware and software. The chapter focuses on a high level of abstraction, where the goal is to quickly identify interesting architectures that can then be evaluated further by taking lower-level details into account. Task models, task scheduling, operating system issues, and packet processor architectures all play a role in different phases of the design space exploration of packet processing devices.
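
The exploration flow the chapter advocates can be pictured as a loop over candidate configurations evaluated with coarse analytical models and filtered for Pareto optimality. The Python sketch below is only a toy of that idea: the evaluate() cost and throughput formulas and the parameter ranges are placeholders, not taken from the chapter's framework.

```python
from itertools import product

def evaluate(cores: int, cache_kb: int, accels: int) -> tuple[float, float]:
    """Toy analytical model returning (cost, packet throughput); placeholder formulas."""
    cost = cores * 1.0 + cache_kb / 256 + accels * 0.5
    throughput = cores * 1.2 + accels * 2.0 + cache_kb / 512
    return cost, throughput

def pareto(points):
    """Keep configurations that are not dominated in (lower cost, higher throughput)."""
    keep = []
    for cfg, (c, t) in points:
        dominated = any(c2 <= c and t2 >= t and (c2, t2) != (c, t)
                        for _, (c2, t2) in points)
        if not dominated:
            keep.append((cfg, (c, t)))
    return keep

if __name__ == "__main__":
    space = [((n, kb, a), evaluate(n, kb, a))
             for n, kb, a in product([2, 4, 8, 16], [128, 256, 512], [0, 1, 2])]
    for (n, kb, a), (cost, thr) in sorted(pareto(space), key=lambda x: x[1][0]):
        print(f"cores={n:2d} cache={kb:3d}KB accels={a} cost={cost:5.2f} throughput={thr:5.2f}")
```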


design automation conference | 2002

A framework for evaluating design tradeoffs in packet processing architectures

Lothar Thiele; Samarjit Chakraborty; Matthias Gries; Simon Künzli

We present an analytical method to evaluate embedded network packet processor architectures, and to explore their design space. Our approach is in contrast to those based on simulation, which tend to be infeasible when the design space is very large. We illustrate the feasibility of our method using a detailed case study.
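
The paper's analytical approach builds on arrival and service curves. As a flavor of that style of reasoning, the sketch below computes the standard network-calculus delay and backlog bounds for the special case of a leaky-bucket arrival curve and a rate-latency service curve; the numeric example values are illustrative and not from the paper.

```python
def delay_bound(b: float, r: float, R: float, T: float) -> float:
    """Worst-case delay for a (b, r) leaky-bucket flow through a (R, T) rate-latency server."""
    assert r <= R, "flow is unstable if the arrival rate exceeds the service rate"
    return T + b / R

def backlog_bound(b: float, r: float, R: float, T: float) -> float:
    """Worst-case backlog (buffer requirement) for the same flow and server."""
    assert r <= R
    return b + r * T

if __name__ == "__main__":
    # Illustrative numbers: 1500-byte bursts at 100 Mbit/s into a processing
    # element serving 200 Mbit/s with 10 us scheduling latency.
    b, r = 1500 * 8, 100e6   # burst [bits], sustained rate [bits/s]
    R, T = 200e6, 10e-6      # service rate [bits/s], latency [s]
    print(f"delay   <= {delay_bound(b, r, R, T) * 1e6:.1f} us")
    print(f"backlog <= {backlog_bound(b, r, R, T) / 8:.0f} bytes")
```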


IEEE Transactions on Very Large Scale Integration Systems | 2001

FunState - an internal design representation for codesign

Karsten Strehl; Lothar Thiele; Matthias Gries; Dirk Ziegenbein; Rolf Ernst; Jürgen Teich

In this paper, an internal design model called FunState (functions driven by state machines) is presented that enables the representation of different types of system components and scheduling mechanisms using a mixture of functional programming and state machines. It is shown how properties relevant for scheduling and verification of specification models such as Boolean dataflow, cyclostatic dataflow, synchronous dataflow, marked graphs, and communicating state machines, as well as Petri nets, may be represented in the FunState model. Examples of methods suited to FunState are described, such as scheduling and verification. They are based on the representation of the model's state transitions in the form of a periodic graph.
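
To give a feel for the "functions driven by state machines" idea, the following Python toy pairs token queues and dataflow functions with a finite state machine whose guards inspect queue fill levels. It is a loose illustration, not the formal FunState semantics; all function names and token values are made up.

```python
from collections import deque

q_in, q_mid, q_out = deque([1, 2, 3, 4]), deque(), deque()

def f_scale():   # dataflow function: consume one token, produce one token
    q_mid.append(q_in.popleft() * 10)

def f_pair():    # dataflow function: consume two tokens, produce one token
    q_out.append(q_mid.popleft() + q_mid.popleft())

# State machine: (state, guard on queue fill levels, function to fire, next state)
transitions = [
    ("SCALE", lambda: len(q_in) >= 1, f_scale, "CHECK"),
    ("CHECK", lambda: len(q_mid) >= 2, f_pair, "SCALE"),
    ("CHECK", lambda: len(q_mid) < 2 and len(q_in) >= 1, f_scale, "CHECK"),
]

state = "SCALE"
while True:
    for s, guard, fn, nxt in transitions:
        if s == state and guard():
            fn()
            state = nxt
            break
    else:
        break   # no enabled transition left: the schedule terminates

print(list(q_out))   # [30, 70] for the example input
```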


embedded software | 2001

Embedded Software in Network Processors - Models and Algorithms

Lothar Thiele; Samarjit Chakraborty; Matthias Gries; Alexander Maxiaguine; Jonas Greutert

We introduce a task model for embedded systems operating on packet streams, such as network processors. This model along with a calculus meant for reasoning about packet streams allows a unified treatment of several problems arising in the network packet processing domain such as packet scheduling, task scheduling and architecture/algorithm explorations in the design of network processors. The model can take into account quality of service constraints such as data throughput and deadlines associated with packets. To illustrate its potential, we provide two applications: (a) a new task scheduling algorithm for network processors to support a mix of real-time and non-real-time flows, (b) a scheme for design space exploration of network processors.
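
As a flavor of the mixed real-time/best-effort setting the paper targets, here is a small deadline-aware packet scheduler in Python. It is not the task scheduling algorithm proposed in the paper; the PacketScheduler class and its policy (earliest-deadline real-time packets first, best-effort packets in the remaining slack) are a simple illustrative stand-in.

```python
import heapq
from collections import deque

class PacketScheduler:
    """Serve real-time packets by earliest deadline; best-effort packets fill the slack."""

    def __init__(self):
        self.rt = []        # min-heap of (deadline, sequence number, packet)
        self.be = deque()   # FIFO queue of best-effort packets
        self._seq = 0

    def enqueue_rt(self, packet: str, deadline: float) -> None:
        heapq.heappush(self.rt, (deadline, self._seq, packet))
        self._seq += 1

    def enqueue_be(self, packet: str) -> None:
        self.be.append(packet)

    def next_packet(self) -> str | None:
        if self.rt:
            return heapq.heappop(self.rt)[2]
        if self.be:
            return self.be.popleft()
        return None

if __name__ == "__main__":
    s = PacketScheduler()
    s.enqueue_be("bulk-1")
    s.enqueue_rt("voip-A", deadline=2.0)
    s.enqueue_rt("voip-B", deadline=1.0)
    print([s.next_packet() for _ in range(3)])   # ['voip-B', 'voip-A', 'bulk-1']
```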


great lakes symposium on vlsi | 2010

A virtual platform environment for exploring power, thermal and reliability management control strategies in high-performance multicores

Andrea Bartolini; Matteo Cacciari; Andrea Tilli; Luca Benini; Matthias Gries

The use of high-end multicore processors today can incur high power density, with significant spatial and temporal variability in how workloads use resources. This situation leads to power and temperature hotspots, which in turn may lead to non-uniform ageing and accelerated chip failure. These drawbacks can be mitigated by online tuning of system performance and by adopting closed-loop thermal and reliability management policies. The development and evaluation of these policies cannot be performed solely on real hardware, due to observability and flexibility limitations, nor solely by relying on trace-driven simulation, due to the dependencies among power, thermal effects, reliability and performance. We present a complete virtual platform to develop, simulate and evaluate power, temperature and reliability management control strategies for high-performance multicores. The accuracy and effectiveness of our solution are ensured by integrating an established system simulator (Simics) with models for power consumption, temperature distribution and ageing. The models are based on characterization of real hardware. Control strategy exploration and design are carried out in the MATLAB/Simulink framework, allowing the use of control theory tools. Fast prototyping is achieved by developing a suitable interface between Simics and MATLAB/Simulink, enabling co-simulation of hardware platforms and controllers.
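
The closed loop the platform implements (functional simulation feeding power and thermal models, which in turn feed a controller that adjusts DVFS) can be sketched as below. All models and constants are toy stand-ins invented for illustration; the actual platform couples Simics with MATLAB/Simulink and hardware-characterized models, none of which appears here.

```python
def simulate_step(freq_ghz: float) -> float:
    """Stand-in for the functional simulator: returns a core activity factor in [0, 1]."""
    return min(1.0, 0.4 + 0.5 * freq_ghz)

def power_model(activity: float, freq_ghz: float, volts: float) -> float:
    """Toy dynamic-power estimate, roughly P ~ activity * f * V^2 (arbitrary scale)."""
    return 20.0 * activity * freq_ghz * volts ** 2

def thermal_model(temp_c: float, power_w: float) -> float:
    """First-order thermal model relaxing toward a power-dependent steady state."""
    t_steady = 45.0 + 0.8 * power_w   # 45 C ambient, 0.8 K/W thermal resistance (made up)
    return temp_c + 0.02 * (t_steady - temp_c)

def controller(temp_c: float, freq_ghz: float) -> float:
    """Simple threshold policy: throttle above 80 C, otherwise speed back up."""
    return max(0.5, freq_ghz - 0.1) if temp_c > 80.0 else min(2.0, freq_ghz + 0.1)

if __name__ == "__main__":
    temp, freq, volts = 50.0, 2.0, 1.1
    for _ in range(300):
        power = power_model(simulate_step(freq), freq, volts)
        temp = thermal_model(temp, power)
        freq = controller(temp, freq)   # closed loop: controller sets the next DVFS point
    print(f"after 300 steps: {temp:.1f} C at {freq:.1f} GHz")
```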


international conference on parallel architectures and compilation techniques | 2011

Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer

Nikolas Ioannou; Michael Kauschke; Matthias Gries; Marcelo Cintra

To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than today's multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consumption within a performance window, taking into consideration not only the local information for cores within frequency domains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and prediction approach for Message Passing Interface (MPI) applications, a natural choice on the SCC with its fast on-chip network and its non-coherent memory hierarchy. This phase predictor operates as the front-end to the hierarchical DVFS controller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our approach provides significant improvements in the Energy Delay Product (EDP) of as much as 27.2%, and 11.4% on average, with an average increase in execution time of 7.7% over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more effective DVFS control of the domains, compared to existing approaches.
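
As a rough illustration of phase-based DVFS control (not the paper's predictor or controller), the sketch below keys on a signature of the recent communication pattern and reuses the frequency decision made the first time a phase was observed. The frequency levels, the wait-fraction threshold, and all names are assumptions made for the example.

```python
LOW_GHZ, HIGH_GHZ = 0.533, 0.8   # illustrative frequency levels, not a measured policy

def phase_signature(recent_calls: list[str]) -> int:
    """Hash of the recent communication pattern, used as the phase identifier."""
    return hash(tuple(recent_calls))

class PhaseDVFS:
    def __init__(self):
        self.freq_for_phase: dict[int, float] = {}

    def choose_freq(self, recent_calls: list[str], wait_fraction: float) -> float:
        phase = phase_signature(recent_calls)
        if phase not in self.freq_for_phase:
            # First visit: communication-bound phases (mostly waiting on MPI)
            # can run slower without leaving the performance window.
            self.freq_for_phase[phase] = LOW_GHZ if wait_fraction > 0.5 else HIGH_GHZ
        # Repeat visits reuse the decision made when this phase was first seen.
        return self.freq_for_phase[phase]

if __name__ == "__main__":
    ctrl = PhaseDVFS()
    print(ctrl.choose_freq(["MPI_Isend", "MPI_Wait", "MPI_Wait"], wait_fraction=0.7))  # 0.533
    print(ctrl.choose_freq(["compute", "compute", "MPI_Isend"], wait_fraction=0.1))    # 0.8
```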


Computing in Science and Engineering | 2011

SCC: A Flexible Architecture for Many-Core Platform Research

Matthias Gries; Ulrich Hoffmann; Michael Konow; Michael Riepen

The Single-chip Cloud Computer (SCC) experimental processor by Intel Labs is a concept vehicle aimed at scaling future multicore processors and serving as a software research platform.


international conference on computer design | 2010

LMS-based low-complexity game workload prediction for DVFS

Benedikt Dietrich; Swaroop Nunna; Dip Goswami; Samarjit Chakraborty; Matthias Gries

While dynamic voltage and frequency scaling (DVFS) based power management has been widely studied for video processing, there is very little work on game power management. Recent work on proportional-integral-derivative (PID) controllers for predicting game workload used hand-tuned PID controller gains on relatively short game plays. This left open questions about the robustness of the PID controller and how sensitive the prediction quality is to the choice of gain values, especially for long game plays involving different scenarios and scene changes. In this paper we propose a Least Mean Squares (LMS) Linear Predictor, which is a regression model commonly used for system parameter identification. Our results show that game workload variation can be estimated using a linear-in-parameters (LIP) model. This observation dramatically reduces the complexity of parameter estimation, as the LMS Linear Predictor learns the relevant parameters of the model iteratively as the game progresses. The only parameter to be tuned by the system designer is the learning rate, which is relatively straightforward. Our experimental results using the LMS Linear Predictor show power savings and game quality comparable with those obtained from a highly tuned PID controller.
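
The LMS predictor itself is compact enough to sketch: predict the next frame's workload as a weighted sum of recent frame workloads and update the weights from the prediction error using a single learning rate. The code below is an illustrative implementation on a synthetic workload trace, not the authors' implementation; the tap count, learning rate, and trace are assumptions.

```python
import math

class LMSPredictor:
    """Linear-in-parameters workload predictor trained online with the LMS rule."""

    def __init__(self, taps: int = 4, mu: float = 1e-4):
        self.w = [0.0] * taps   # filter weights, learned as the game progresses
        self.mu = mu            # learning rate: the single designer-tuned parameter

    def predict(self, history: list[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, history))

    def update(self, history: list[float], actual: float) -> None:
        error = actual - self.predict(history)
        self.w = [wi + self.mu * error * xi for wi, xi in zip(self.w, history)]

if __name__ == "__main__":
    # Synthetic per-frame workload trace in milliseconds (a real trace would be measured).
    workload = [10.0 + 3.0 * math.sin(0.2 * n) for n in range(300)]
    taps, lms = 4, LMSPredictor(taps=4, mu=1e-4)
    for n in range(taps, len(workload)):
        history = workload[n - taps:n]
        pred = lms.predict(history)   # in a DVFS setting this would drive the frequency choice
        lms.update(history, workload[n])
    print(f"last prediction {pred:.2f} ms vs actual {workload[-1]:.2f} ms")
```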

Collaboration


Dive into Matthias Gries's collaborations.
