Publications


Featured research published by Augustus K. Uht.


IEEE Transactions on Computers | 2005

Uniprocessor performance enhancement through adaptive clock frequency control

Augustus K. Uht

Uniprocessor designs have always assumed worst-case operating conditions to set the operating clock frequency and, hence, performance. However, much more performance can be obtained under typical operating conditions, as experimentation shows, but such increased-frequency operation is subject to the possibility of system failure and, hence, data loss/corruption. Further, mobile CPUs such as those in cell phones/Internet browsers do not adapt to their current surroundings (varying temperature conditions, etc.) so as to increase or decrease operating frequency to maximize performance and/or allow operation under extreme conditions. We present a digital hardware design technique that realizes adaptive clock-frequency, performance-enhancing hardware; the technique can be tuned to approximate maximum performance. The cost is low and the design is straightforward. Experiments are presented evaluating such a design in a pipelined uniprocessor realized in a field-programmable gate array (FPGA).
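
The feedback principle behind the technique lends itself to a short software sketch. The following is a minimal simulation under stated assumptions; the margin signal, the frequency/temperature model, and all constants are hypothetical stand-ins, not the paper's hardware interface:

```python
import random

# Minimal, illustrative simulation of the adaptive clock-frequency
# feedback idea. All names and numbers (safe_frequency model, guard
# band, step size, bounds) are assumptions for the sketch.

F_MIN, F_MAX, STEP = 200.0, 600.0, 5.0    # MHz; assumed bounds and step

def safe_frequency(temp_c: float) -> float:
    """Toy model: the highest sustainable frequency drops as the
    temperature rises (illustrative numbers only)."""
    return 560.0 - 2.0 * (temp_c - 25.0)

def margin_ok(freq: float, temp_c: float, guard: float = 10.0) -> bool:
    """Stand-in for the hardware tracking/margin signal: true while the
    current frequency sits at least `guard` MHz below the failure point."""
    return freq <= safe_frequency(temp_c) - guard

def control_step(freq: float, temp_c: float) -> float:
    """One control interval: creep up while there is slack, back off as
    soon as the margin signal deasserts."""
    if margin_ok(freq, temp_c):
        return min(freq + STEP, F_MAX)
    return max(freq - STEP, F_MIN)

freq = F_MIN
for _ in range(200):                       # ambient wanders around 35 C
    temp_c = 35.0 + random.uniform(-10.0, 10.0)
    freq = control_step(freq, temp_c)
print(f"settled near {freq:.0f} MHz")      # tracks typical, not worst case
```

The controller converges on whatever the current conditions permit, which is the essence of gaining performance beyond a fixed worst-case clock.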


IEEE Computer | 2004

Going beyond worst-case specs with TEAtime

Augustus K. Uht

Virtually all engineers use worst-case component specifications for new system designs, thereby ensuring that the resulting product will operate under worst-case conditions. However, given that most systems operate under typical conditions that rarely approach worst-case demands, building such robust systems incurs a significant performance cost. Further, classic worst-case designs do not adapt to variations in either manufacturing or operating conditions. A timing-error-avoidance (TEAtime) prototype provides a circuit and system solution to these problems for synchronous digital systems. TEAtime has demonstrated much better performance than classically designed systems and also adapts well to varying temperature and supply-voltage conditions.


IEEE Computer | 1997

Branch effect reduction techniques

Augustus K. Uht; Vijay Sindagi; Sajee Somanathan

Branch effects are the biggest obstacle to gaining significant speedups when running general-purpose code on instruction-level parallel machines. The article presents a survey comparing current branch effect reduction techniques, offering hope for greater gains. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large-scale exploitation is substantial, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by the year 2000.


IEEE Transactions on Computers | 1991

A theory of reduced and minimal procedural dependencies

Augustus K. Uht

A reduced set of procedural dependencies is presented which is necessary and sufficient to describe all procedural dependencies in standard imperative codes. Hence, the set is minimal. In conjunction with reduced data dependencies, this set forms a set of minimal semantic dependencies for all traditional code. It is also shown that all forward branches in structured code are procedurally independent. The effects of limited hardware are also addressed. A possible implementation of a machine enforcing just the minimal procedural dependencies is described.
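
The forward-branch result can be illustrated with a small example of my own (not taken from the paper): two forward branches guarding disjoint statements can be rewritten as predicated assignments, after which neither guarded statement is procedurally dependent on the other branch.

```python
# Illustrative example: two forward branches guarding disjoint
# statements. Predicating each guarded statement makes the procedural
# independence explicit: S1 and S2 can be evaluated in either order or
# concurrently, since neither branch affects whether the other's code
# is reached, only whether its result is kept.

def original(a, b, x, y):
    if a > 0:        # forward branch B1
        x = x + 1    # S1, in B1's domain
    if b > 0:        # forward branch B2
        y = y * 2    # S2, in B2's domain
    return x, y

def predicated(a, b, x, y):
    p1, p2 = a > 0, b > 0          # both predicates computable up front
    x = x + 1 if p1 else x         # S1 guarded by p1
    y = y * 2 if p2 else y         # S2 guarded by p2, independent of p1
    return x, y

assert original(1, -1, 10, 3) == predicated(1, -1, 10, 3)
```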


IEEE Transactions on Parallel and Distributed Systems | 1992

Requirements for optimal execution of loops with tests

Augustus K. Uht

Both the efficient execution of branch-intensive code and knowing the bounds on such execution are important issues in computing in general and supercomputing in particular. Prior work has suggested that the hardware needed to execute code with branches optimally is exponentially dependent on the total number of dynamic branches executed, this number being proportional at least to the number of iterations of the loop. For classes of code taking at least one cycle per iteration to execute, this is not the case. For loops containing one test (normally in the form of a Boolean recurrence of order one), it is shown that the hardware necessary varies from exponential to polynomial in the length of the dependence cycle L, while execution time varies from one time cycle per iteration to less than L time cycles per iteration; the variation depends on specific code dependences. These results bring the eager evaluation of imperative code closer to fruition.
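
A concrete instance of the loop class in question may help. The following sketch is illustrative (the test and the block width are my assumptions): it shows an order-1 Boolean recurrence and how eagerly evaluating several tests per step trades hardware for cycles per iteration.

```python
# Illustrative example of the loop class studied: a loop whose only
# control decision is one test, expressible as a Boolean recurrence of
# order one: done[i] = done[i-1] OR test(i). Eager hardware can evaluate
# many test(i) speculatively and resolve the recurrence over a block at
# once, much like a parallel-prefix OR.

def test(i: int) -> bool:          # hypothetical per-iteration test
    return i * i > 50

def sequential(n: int) -> int:
    done = False
    for i in range(n):             # one trip of the dependence cycle
        done = done or test(i)     # order-1 Boolean recurrence
        if done:
            return i
    return n

def eager(n: int, width: int = 8) -> int:
    # Evaluate `width` tests per step, as eager hardware might,
    # then resolve the recurrence over the whole block at once.
    for base in range(0, n, width):
        block = [test(base + j) for j in range(min(width, n - base))]
        if any(block):             # block-wide OR of the recurrence
            return base + block.index(True)
    return n

assert sequential(100) == eager(100)
```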


IEEE Transactions on Computers | 1992

Concurrency extraction via hardware methods executing the static instruction stream

Augustus K. Uht

Hardware solutions to low-level (semantic) concurrency extraction are presented, focusing on the reduction of both control-flow and data-flow inhibitors of concurrency in general-purpose and scientific instruction streams. In the first model, CONDEL-1, an input-code control-flow model based on the code's branch domains is used in the algorithm to detect the reduced procedural dependencies in the input code. This model allows branches to execute concurrently. The cost and delay of the model's concurrency hardware are demonstrated to be relatively low, especially for the detection of concurrency beyond branches. The reduced procedural dependence techniques of CONDEL-1 are combined with high-speed reduced data dependency techniques to yield a machine model, CONDEL-2, executing standard sequential code in a manner beyond data-flow. Simulation results are presented and analyzed, showing the model's functionality and performance improvement. The beneficial effects of limited software optimizations are also reviewed.
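
The branch-domain notion that drives CONDEL-1's dependence detection can be sketched briefly. This is an illustrative reading, not the paper's hardware: a forward branch's domain is the instruction range it guards, and only instructions inside that range are procedurally dependent on the branch.

```python
# Illustrative sketch of branch domains. For a forward branch at
# instruction index b with target t, its domain is the range (b, t)
# that the branch guards; instructions at or past the target may
# proceed regardless of the branch's outcome.

def branch_domains(branches: dict) -> dict:
    """branches maps index of a forward branch -> its target index."""
    return {b: range(b + 1, t) for b, t in branches.items()}

def procedurally_dependent(instr: int, domains: dict) -> list:
    """Which branches must resolve before `instr` may commit."""
    return [b for b, dom in domains.items() if instr in dom]

# Toy static stream: a branch at 2 targets 5, a branch at 6 targets 8.
doms = branch_domains({2: 5, 6: 8})
assert procedurally_dependent(3, doms) == [2]   # inside first domain
assert procedurally_dependent(5, doms) == []    # past the target: free
```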


Great Lakes Symposium on VLSI | 2009

Central vs. distributed dynamic thermal management for multi-core processors: which one is better?

Michael Kadin; Sherief Reda; Augustus K. Uht

In this paper we investigate and contrast two techniques to maximize the performance of multi-core processors under thermal constraints. The first technique is a distributed dynamic thermal management system that maximizes total performance without exceeding given thermal constraints. In our scheme, each core adjusts its operating parameters, i.e., frequency and voltage, according to its temperature, which is measured using integrated thermal sensors. We propose a novel controller that dynamically adapts the system to simultaneously avoid timing errors and thermal violations. For comparison purposes, we implement a second technique based on a centralized, optimal runtime system that uses combinatorial optimization to calculate the optimal frequencies and voltages for the different cores to maximize total throughput under thermal constraints. To empirically validate our techniques, we put together an extensive tool chain that incorporates thermal and power-consumption simulators to characterize the performance of multi-core processors for a number of configurations ranging from 2 cores at 90 nm to 16 cores at 32 nm. Our results show that both investigated techniques are capable of delivering significant improvements (about 40% for 16 cores) over standard frequency and voltage planning techniques. From the results, we outline the main advantages and disadvantages of both techniques.
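
The distributed scheme's flavor can be conveyed with a short sketch. Everything here (the thermal set point, hysteresis thresholds, and voltage/frequency table) is an illustrative assumption, not the paper's controller; the point is that each core regulates itself from its local sensor with no global coordination.

```python
# Minimal sketch of a *distributed* dynamic thermal management loop:
# each core independently nudges its own voltage/frequency pair based
# on its local thermal sensor. All numbers are illustrative.

T_LIMIT = 85.0                     # assumed thermal constraint, deg C
VF_TABLE = [(0.9, 1.0), (1.0, 1.4), (1.1, 1.8), (1.2, 2.2)]  # (V, GHz)

class CoreController:
    def __init__(self):
        self.level = len(VF_TABLE) - 1     # start at max performance

    def step(self, temp_c: float) -> tuple:
        """One control interval: back off near the limit, creep back
        up when there is headroom (simple hysteresis)."""
        if temp_c > T_LIMIT - 2.0 and self.level > 0:
            self.level -= 1                # avoid a thermal violation
        elif temp_c < T_LIMIT - 10.0 and self.level < len(VF_TABLE) - 1:
            self.level += 1                # reclaim performance
        return VF_TABLE[self.level]

# Each core runs its own instance; no global coordination is needed,
# which is the distributed scheme's main advantage over a centralized
# optimizer (at the cost of global optimality).
cores = [CoreController() for _ in range(4)]
readings = [70.0, 84.5, 90.0, 60.0]        # hypothetical sensor values
print([c.step(t) for c, t in zip(cores, readings)])
```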


ACM SIGARCH Computer Architecture News | 1993

Extraction of massive instruction level parallelism

Augustus K. Uht

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction-level parallelism, i.e., the parallelism which exists amongst the machine instructions of a program. Speculative execution may help a lot but, it is argued, both branch prediction and eager execution are insufficient to achieve speedup factors in the tens (with respect to sequential execution) with reasonable hardware costs. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution and delivers more performance than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the standard instruction stream model, the dynamic model (in which the order is as indicated by the contents of the Program Counter). The use of the static instruction stream is proposed instead. The static instruction stream order is the same as the order of the code in memory and is independent of the execution of branches. It allows reduced branch dependencies as well. It is argued that a new version, Levo, of an old machine model, CONDEL-2, will be able to attain massive instruction-level parallelism.
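
The continuum claim can be made concrete with a small sketch of the resource-assignment rule as I read it (an illustration, not the paper's algorithm text): with prediction accuracy p, repeatedly give the next processing unit to the uncovered branch-path prefix with the highest cumulative probability.

```python
import heapq

# Illustrative sketch of disjoint eager execution's resource assignment.
# Each node is a branch-path prefix: '1' = predicted arm, '0' = the
# not-predicted arm of the next unresolved branch. With p near 1 the
# rule degenerates to pure branch prediction (one deep path); with
# p = 0.5 it degenerates to balanced eager execution (a full tree).

def dee_allocation(p: float, k: int) -> list:
    heap = [(-1.0, "")]                 # root: before any branch
    chosen = []
    while heap and len(chosen) < k:
        neg_prob, path = heapq.heappop(heap)
        prob = -neg_prob
        chosen.append((prob, path))     # grant a unit to this prefix
        heapq.heappush(heap, (-(prob * p), path + "1"))
        heapq.heappush(heap, (-(prob * (1 - p)), path + "0"))
    return chosen

# With p = 0.7 and 6 units, most units go down the predicted path, but
# the earliest not-predicted arm is covered once its probability
# exceeds that of a deep predicted-path prefix.
for prob, path in dee_allocation(0.7, 6):
    print(f"{path or '<root>'}: {prob:.3f}")
```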


European Conference on Parallel Processing | 2002

Realizing High IPC Using Time-Tagged Resource-Flow Computing

Augustus K. Uht; Alireza Khalafi; David Morano; Marcos de Alba; David R. Kaeli

In this paper we present a novel approach to exploiting ILP through the use of resource-flow computing. This model begins by executing instructions independently of the data-flow and control-flow dependencies in a program. The rest of the execution time is spent applying programmatic data-flow and control-flow constraints to arrive at a programmatically correct execution. We present the design of a machine that uses time tags and Active Stations, realizing a registerless data path. In this contribution we focus our discussion on the Execution Window elements of our machine, present instructions-per-cycle (IPC) speedups for SPECint95 and SPECint2000 programs, and discuss the scalability of our design to hundreds of processing elements.
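
The time-tag mechanism can be sketched in simplified form; names and details here are mine (a toy model, not the paper's Active Station logic). A station accepts a broadcast result for one of its source registers only if the producer is earlier in program order than the station itself and later than whatever producer it last accepted.

```python
from dataclasses import dataclass, field

# Toy model of the time-tag rule that lets resource-flow execution
# converge on a correct result. Each instruction slot ("active station")
# has a time tag equal to its program-order position; results broadcast
# as (register, value, producer tag).

@dataclass
class Operand:
    reg: str
    value: int = 0
    src_tag: int = -1          # tag of the producer we last accepted

@dataclass
class ActiveStation:
    tag: int                   # program-order time tag of this slot
    sources: dict = field(default_factory=dict)   # reg -> Operand

    def snarf(self, reg: str, value: int, prod_tag: int) -> bool:
        """Accept the broadcast iff it is the latest earlier producer;
        an accepting station would then re-execute and rebroadcast."""
        op = self.sources.get(reg)
        if op is None:
            return False
        if op.src_tag < prod_tag < self.tag:
            op.value, op.src_tag = value, prod_tag
            return True
        return False

st = ActiveStation(tag=7, sources={"r1": Operand("r1")})
assert st.snarf("r1", 10, prod_tag=3)        # earlier producer: accept
assert not st.snarf("r1", 99, prod_tag=2)    # older than what we hold
assert not st.snarf("r1", 42, prod_tag=9)    # from the future: reject
```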


International Symposium on Microarchitecture | 1987

On the combination of hardware and software concurrency extraction methods

Augustus K. Uht; Constantine D. Polychronopoulos; John F. Kolen

It has been shown that parallelism is a very promising alternative for enhancing computer performance. Parallelism, however, introduces much complexity to the programming effort. This has led to the development of automatic concurrency extraction techniques. Prior work has demonstrated that static program restructuring via compiler-based techniques provides a large degree of parallelism to the target machine. Purely hardware-based extraction techniques (without software preprocessing) have also demonstrated significant (but lesser) degrees of parallelism. This paper considers the performance effects of the combination of both hardware and software techniques. The concurrency extracted from a given set of benchmarks by each technique separately, and together, is determined via simulations and/or analysis. The "common parallelism" extracted by the two methods is thus also considered, using new metrics. The analytic techniques for predicting the performance of specific programs are also described.

Collaboration


Dive into Augustus K. Uht's collaborations.

Top Co-Authors

David Morano, Northeastern University
Ayse Yilmazer, University of Rhode Island
Resit Sendag, University of Rhode Island
Joshua J. Yi, Freescale Semiconductor
Kelley Hall, University of Rhode Island