Csaba Andras Moritz | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Csaba Andras Moritz is active.

Explore More

Publication

Featured researches published by Csaba Andras Moritz.

field programmable custom computing machines | 1999

Parallelizing applications into silicon

Jonathan Babb; Martin C. Rinard; Csaba Andras Moritz; Walter Lee; Matthew I. Frank; Rajeev Barua; Saman P. Amarasinghe

The next decade of computing will be dominated by embedded systems, information appliances and application-specific computers. In order to build these systems, designers will need high-level compilation and CAD tools that generate architectures that effectively meet the needs of each application. In this paper we present a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into custom silicon or reconfigurable architectures. This capability is also interesting because trends in computer architecture are moving towards more reconfigurable hardware-like substrates, such as FPGA based systems. Our system works by successfully combining two resource-efficient computing disciplines: Small Memories and Virtual Wires. For a given application, the compiler first analyzes the memory access patterns of pointers and arrays in the program and constructs a partitioned memory system made up of many small memories. The computation is implemented by active computing elements that are spatially distributed within the memory array. A space-time scheduler assigns instructions to the computing elements in a way that maximizes locality and minimizes physical communication distance. It also generates an efficient static schedule for the interconnect. Finally, specialized hardware for the resulting schedule of memory accesses, wires, and computation is generated as a multi-process state machine in synthesizable Verilog. With this system, implemented as a set of SUIF compiler passes, we have successfully compiled programs into hardware and achieve specialization performance enhancements by up to an order of magnitude versus a single general purpose processor. We also achieve additional parallelization speedups similar to those obtainable using a tightly-interconnected multiprocessor.

IEEE Transactions on Parallel and Distributed Systems | 2001

LoGPG: Modeling network contention in message-passing programs

Csaba Andras Moritz; Matthew I. Frank

In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP and LogGP models to account for the impact of network contention and network interface DMA behavior on the performance of message passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50 percent of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify trade-offs between synchronous and asynchronous message passing styles.

IEEE Transactions on Circuits and Systems | 2007

Fault-Tolerant Nanoscale Processors on Semiconductor Nanowire Grids

Csaba Andras Moritz; Teng Wang; Pritish Narayanan; Michael Leuchtenburg; Yao Guo; Catherine Dezan; Mahmoud Bennaser

Nanoscale processor designs pose new challenges not encountered in the world of conventional CMOS designs and manufacturing. Nanoscale devices based on crossed semiconductor nanowires (NWs) have promising characteristics in addition to providing great density advantage over conventional CMOS devices. This density advantage could, however, be easily lost when assembled into nanoscale systems and especially after techniques dealing with high defect rates and manufacturing related layout/doping constraints are incorporated. Most conventional defect/fault-tolerance techniques are not suitable in nanoscale designs because they are designed for very small defect rates and assume arbitrary layouts for required circuits. Reconfigurable approaches face fundamental challenges including a complex interface between the micro and nano components required for programming. In this paper, we present our work on adding fault-tolerance to all components of a processor implemented on a 2-D semiconductor NW fabric called nanoscale application specific integrated circuits (NASICs). We combine and explore structural redundancy, built-in nanoscale error correcting circuitry, and system-level redundancy techniques and adapt the techniques to the NASIC fabric. Faulty signals caused by defects and other error sources are masked on-the-fly at various levels of granularity. Faults can be masked at up to 15% rates, while maintaining a 7 density advantage compared to an equivalent CMOS processor at projected 18-nm technology. Detailed analysis of yield, density, and area tradeoffs is provided for different error sources and fault distributions.

measurement and modeling of computer systems | 1998

LoGPC: modeling network contention in message-passing programs

Csaba Andras Moritz; Matthew I. Frank

In many real applications, for example those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4] models to account for the impact of network contention and network interface DMA behavior on the performance of message-passing programs.We validate LoGPC by analyzing three applications implemented with Active Messages [11, 18] on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50% of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify tradeoffs between synchronous and asynchronous message passing styles.

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems | 2000

FlexCache: A Framework for Flexible Compiler Generated Data Caching

Csaba Andras Moritz; Matthew I. Frank; Saman P. Amarasinghe

This paper describes our works in progress on FlexCache, a framework for flexible, compiler generated data caching. FlexCache substitutes the tag-memory and cache controller hardware with a compiler managed tag-like data structure, address trabslation, and tag-check. This allows the division of the data-array into separately controlled partitions, allows the selection of cache line sizes, the support of highly associative mappings, and the selection of various replacement policies on a per program basis. FlexCache leverages compile-time static information to selectively virtualize memory, to eliminate cache-tag accesses, and to guide the replacement of conflicting cache lines. For the applications studied the FlexCache compiler techniques eliminate in average more than 90% of the cacge-tag lookups, enabling the support of highly associative caching schemes. Even without any hardware support, FlexCache can outperform fixed hardware caches by improving caching effectiveness, eliminating mapping conflicts, and eliminating the cache pollution caused by register spills. The beauty of FlexCache is that its core techniques could be augmented with additional software techniques and/or hardware support (i.e., special instructions) to look into optimizing data caching in areas such as low-power, real-time systems, and high-performance microprocessors.

computing frontiers | 2004

Opportunities and challenges in application-tuned circuits and architectures based on nanodevices

Teng Wang; Zhenghua Qi; Csaba Andras Moritz

Nanoelectronics research has primarily focused on devices. By contrast, not much has been published on innovations at higher layers: we know little about how to construct circuits or architectures out of such devices. In this paper, we focus on the currently most promising nanodevice technologies, such as arrays of semiconductor nanowires (NWs) and arrays of crossed carbon nanotubes (CNTs). In contrast to general-purpose programmable fabrics (such as PLAs), we investigate nano-fabrics that, while also programmable and hierarchical, are more tuned towards an application domain (in this sense they resemble ASIC). Our goal is to achieve denser designs with better fabric utilization, efficient cascading of circuits, and routing of signals. We demonstrate detailed designs of several circuits and processor data-paths, and highlight associated challenges and opportunities for optimization.

ieee computer society annual symposium on vlsi | 2008

CMOS Control Enabled Single-Type FET NASIC

Pritish Narayanan; Michael Leuchtenburg; Teng Wang; Csaba Andras Moritz

A new hybrid CMOS-nanoscale circuit style has been developed that uses only one type of Field Effect Transistor (FET) in the logic portions of a design. This is enabled by CMOS providing control signals that coordinate the operation of the logic implemented in the nanoscale. In this paper, the new circuit style is explored, examples from a microprocessor design are shown, performance, manufacturing and density implications discussed. The system is based on the existing CMOS-nano hybrid fabric architecture NASIC, but the new circuit style reduces the requirements on devices and manufacturing from previous NASIC designs, significantly improves performance without any deterioration in circuit density.

high-performance computer architecture | 2002

The minimax cache: an energy-efficient framework for media processors

Osman S. Unsal; Israel Koren; C. Mani Krishna; Csaba Andras Moritz

This work is based on our philosophy of providing interlayer system-level power awareness in computing systems. Here, we couple this approach with our vision of multi-partitioned memory systems, where memory accesses are separated based on their static predictability and memory footprint and managed with various compiler controlled techniques. We show that media applications are mapped more efficiently when scalar memory accesses are redirected to a mini-cache. Our results indicate that a partitioned 8K cache with the scalars being mapped to a 512 byte mini-cache can be more efficient than a 16K monolithic cache from both performance and energy point of view for most applications. In extensive experiments, we report 30% to 60% energy-delay product savings over a range of system configurations and different cache sizes.

international symposium on microarchitecture | 2001

Cool-cache for hot multimedia

Osman S. Unsal; Raksit Ashok; Israel Koren; C. Mani Krishna; Csaba Andras Moritz

We claim that the unique characteristics of multimedia applications dictate media-sensitive architectural and compiler approaches to reduce the power consumption of the data cache. Our motivation is exploring energy savings for real-time multimedia workloads without sacrificing performance. In this paper, we present two complementary media-sensitive energy-saving techniques that leverage static information. While our first technique is applicable to existing architectures, in our second technique we adopt a more radical approach and propose a new caching architecture by re-evaluating the architecture-compiler interface. Our experiments show that substantial energy savings are possible in the data cache. Across a wide range of cache and architectural configurations we obtain up to 77% energy savings, while the performance varies from 14% improvement to 4% degradation depending on the application.

field-programmable custom computing machines | 1998

Exploring optimal cost-performance designs for Raw microprocessors

Csaba Andras Moritz; Donald Yeung; Anant Agarwal

The semiconductor industry roadmap projects that advance in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules inter-tile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles, in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and I/O. If the total chip area is fixed, more area devoted to processing will result in a higher processing power per node, but will lead to a fewer number of tiles. This paper presents an analytical framework using which designers can reason about the design space of Raw microprocessors. Based on an architectural model and a VLSI cost analysis, the framework computes the performance of applications, and uses an optimization process to identify designs that will execute these applications most cost-effectively.

Explore More