Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Charles C. Weems is active.

Publication


Featured research published by Charles C. Weems.


International Symposium on Computer Architecture | 2003

Guided region prefetching: a cooperative hardware/software approach

Zhenlin Wang; Doug Burger; Kathryn S. McKinley; Steven K. Reinhardt; Charles C. Weems

Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic---nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%.
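The hint-gating idea can be sketched as a toy Python model (an illustration only: the real GRP encodes hints in load instructions and the prefetch engine is hardware; the trace, region size, and hint set below are invented):

```python
REGION = 4  # cache lines fetched around a miss (illustrative value)

def run(trace, hints, gated):
    """Count region prefetches issued and how many were later used.

    trace -- list of (pc, line) accesses
    hints -- set of load PCs the 'compiler' marked as safe to prefetch
    gated -- True: prefetch only on hinted loads (GRP-style);
             False: prefetch on every miss (aggressive hardware).
    """
    cache, prefetched = set(), set()
    issued = useful = 0
    for pc, line in trace:
        if line in cache:
            if line in prefetched:      # a prefetch paid off
                useful += 1
                prefetched.discard(line)
        else:
            cache.add(line)
            if not gated or pc in hints:
                for d in range(1, REGION):   # fetch the rest of the region
                    if line + d not in cache:
                        cache.add(line + d)
                        prefetched.add(line + d)
                        issued += 1
    return issued, useful

# A streaming loop (pc 1, hinted) mixed with scattered loads (pc 2):
trace = [(1, l) for l in range(8)] + [(2, l) for l in (100, 200, 300)]
```

On this trace, gating keeps every useful prefetch while dropping the wasted traffic from the scattered loads.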


International Conference on Parallel Architectures and Compilation Techniques | 2002

Using the compiler to improve cache replacement decisions

Zhenlin Wang; Kathryn S. McKinley; Arnold L. Rosenberg; Charles C. Weems

Memory performance is increasingly determining microprocessor performance and technology trends are exacerbating this problem. Most architectures use set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide these hints to the architecture. We prove that the hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs, using the evict-me algorithm in both the level 1 and 2 caches improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, simple hardware prefetching and evict-me work together to further improve performance.
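A minimal sketch of the evict-me mechanism, assuming a single cache set and treating the compiler hint as a boolean passed with each access (the paper's implementation is a one-bit tag per cache line):

```python
class EvictMeSet:
    """One cache set: LRU replacement plus a per-line evict-me bit.

    On a replacement, a line whose evict-me bit is set is chosen first;
    otherwise plain LRU applies, so hit rates never fall below LRU.
    """
    def __init__(self, ways):
        self.ways = ways
        self.lines = []          # LRU order: least recent first
        self.evict_me = {}

    def access(self, tag, evict_hint=False):
        hit = tag in self.lines
        if hit:
            self.lines.remove(tag)               # refresh LRU position
        elif len(self.lines) >= self.ways:
            marked = [t for t in self.lines if self.evict_me[t]]
            victim = marked[0] if marked else self.lines[0]
            self.lines.remove(victim)
            del self.evict_me[victim]
        self.lines.append(tag)
        self.evict_me[tag] = evict_hint
        return hit
```

In a 2-way set with accesses A, B, C, A, marking B as evict-me preserves A across the miss on C, turning the final access into a hit that plain LRU misses.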


Journal of Parallel and Distributed Computing | 1990

The DARPA image understanding benchmark for parallel computers

Charles C. Weems; Edward M. Riseman; Allen R. Hanson; Azriel Rosenfeld

This paper describes a new effort to evaluate parallel architectures applied to knowledge-based machine vision. Previous vision benchmarks have considered only execution times for isolated vision-related tasks, or a very simple image processing scenario. However, the performance of an image interpretation system depends upon a wide range of operations on different levels of representations, from processing arrays of pixels, through manipulation of extracted image events, to symbolic processing of stored models. Vision is also characterized by both bottom-up (image-based) and top-down (model-directed) processing. Thus, the costs of interactions between tasks, input and output, and system overhead must be taken into consideration. Therefore, this new benchmark addresses the issue of system performance on an integrated set of tasks. The Integrated Image Understanding Benchmark consists of a model-based object recognition problem, given two sources of sensory input, intensity and range data, and a database of candidate models. The models consist of configurations of rectangular surfaces, floating in space, viewed under orthographic projection, with the presence of both noise and spurious nonmodel surfaces. A partially ordered sequence of operations that solves the problem is specified along with a recommended algorithmic method for each step. In addition to reporting the total time and the final solution, timings are requested for each component operation, and intermediate results are output as a check on accuracy. Other factors such as programming time, language, code size, and machine configurations are reported. As a result, the benchmark can be used to gain insight into processor strengths and weaknesses and may thus help to guide the development of the next generation of parallel vision architectures.
In addition to discussing the development and specification of the new benchmark, this paper presents the results from running the benchmark on the Connection Machine, Warp, Image Understanding Architecture, Associative String Processor, Alliant FX-80, and Sequent Symmetry. The results are discussed and compared through a measurement of relative effort, which factors out the effects of differing technologies.


IEEE Computer | 1994

Associative processing and processors

Anargyros Krikelis; Charles C. Weems

Associative memory concerns the concept that one idea may trigger the recall of a different but related idea. Traditional computers, however, rely upon a memory design that stores and retrieves data by its address rather than its content. In such a search, every accessed data word must travel individually between the processing unit and the memory. The simplicity of this retrieval-by-address approach has ensured its success, but has also produced some inherent disadvantages. One is the von Neumann bottleneck, where the memory-access path becomes the limiting factor for system performance. A related disadvantage is the inability to proportionally increase the size of a unit transfer between the memory and the processor as the size of the memory scales up. Associative memory, in contrast, provides a naturally parallel and scalable form of data retrieval for both structured data (e.g. sets, arrays, tables, trees and graphs) and unstructured data (raw text and digitized signals). An associative memory can be easily extended to process the retrieved data in place, thus becoming an associative processor. This extension is merely the capability for writing a value in parallel into selected cells.
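The retrieval-by-content idea, and the parallel-write extension that turns the memory into a processor, can be sketched as follows (a software analogy; a real associative memory performs the comparison in every cell simultaneously):

```python
class AssociativeMemory:
    """Toy word-parallel associative memory: search by content,
    then write one value into all responding cells at once."""

    def __init__(self, cells):
        self.cells = [dict(c) for c in cells]

    def search(self, **key):
        """Return indices of all cells matching every field of the key."""
        return [i for i, c in enumerate(self.cells)
                if all(c.get(k) == v for k, v in key.items())]

    def multiwrite(self, responders, field, value):
        """The associative-processor extension: parallel write to responders."""
        for i in responders:
            self.cells[i][field] = value
```

For example, searching a table of surface descriptions for all cells with a given shape returns every match in one operation, and a multiwrite can then tag all of them without moving any data to the processor.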


IEICE Electronics Express | 2009

Sub-grouped superblock management for high-performance flash storages

Jung-Wook Park; Gi-Ho Park; Charles C. Weems; Shin-Dug Kim

In this paper we describe a new superblock management scheme to overcome the problem of increased erase operations that arises from increasing the degree of interleaving of memory banks in flash memory based storage devices. To improve performance, superblock management is used to increase the degree of linear interleaving of flash memory banks. However, increased interleaving may significantly increase the number of erase operations, thus decreasing device lifetime. The proposed management scheme efficiently separates hot and cold data into two different sub-groups, dramatically increasing the efficiency of superblock merging. According to our simulation results, the number of erase operations decreases by around 27.3 percent, which is enough to significantly lengthen overall device lifetime. Read performance is only slightly degraded by our approach.
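A toy model of why hot/cold separation helps, assuming a perfect hot/cold classifier and a simplified merge rule (a log block that fills with all-distinct, all-valid pages is switch-merged for free; otherwise its live pages must be copied out before the erase). The block size and workload are invented:

```python
from collections import Counter

def merge_copies(stream, separate, block_pages=4):
    """Count valid-page copies forced when log blocks fill.

    stream   -- sequence of logical page ids (repeats = hot pages)
    separate -- route hot and cold pages to different sub-group logs
    """
    hot = {p for p, n in Counter(stream).items() if n > 1}  # oracle classifier
    names = ("hot", "cold") if separate else ("all",)
    logs = {name: {"block": [], "bid": 0} for name in names}
    latest = {}                      # page -> (log name, block id) of last write
    copies = 0
    for p in stream:
        name = ("hot" if p in hot else "cold") if separate else "all"
        log = logs[name]
        log["block"].append(p)
        latest[p] = (name, log["bid"])
        if len(log["block"]) == block_pages:        # block full: merge + erase
            valid = [q for q in set(log["block"])
                     if latest[q] == (name, log["bid"])]
            if len(valid) < block_pages:            # not switch-mergeable
                copies += len(valid)                # copy live pages out
            log["block"].clear()
            log["bid"] += 1
    return copies
```

With a hot page interleaved among cold pages, the mixed log pays copy-out costs on every merge, while separation leaves the hot sub-group almost entirely invalid at erase time and lets the sequential cold sub-group switch-merge for free.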


Technical Symposium on Computer Science Education | 2011

NSF/IEEE-TCPP curriculum initiative on parallel and distributed computing: core topics for undergraduates

Sushil K. Prasad; Almadena Yu. Chtchelkanova; Sajal K. Das; Frank K. H. A. Dehne; Mohamed G. Gouda; Anshul Gupta; Joseph JáJá; Krishna Kant; Richard LeBlanc; Manish Lumsdaine; David A. Padua; Manish Parashar; Viktor K. Prasanna; Yves Robert; Arnold L. Rosenberg; Sartaj Sahni; Behrooz A. Shirazi; Alan Sussman; Charles C. Weems; Jie Wu

This paper presents the NSF/IEEE-TCPP curriculum initiative, which proposes a core set of parallel and distributed computing topics for undergraduate computer science and engineering curricula. Motivated by the shift to multicore and manycore processors, the initiative identifies the concepts in architecture, programming, and algorithms that every undergraduate should encounter, and aims to guide departments in integrating parallelism throughout their courses.


IEEE Computer | 1992

Image understanding architecture: exploiting potential parallelism in machine vision

Charles C. Weems; Edward M. Riseman; Allen R. Hanson

A hardware architecture that addresses at least part of the potential parallelism in each of the three levels of vision abstraction, low (sensory), intermediate (symbolic), and high (knowledge-based), is described. The machine, called the image understanding architecture (IUA), consists of three different, tightly coupled parallel processors; the content addressable array parallel processor (CAAPP) at the low level, the intermediate communication associative processor (ICAP) at the intermediate level, and the symbolic processing array (SPA) at the high level. The CAAPP and ICAP levels are controlled by an array control unit (ACU) that takes its directions from the SPA level. The SPA is a multiple-instruction multiple-data (MIMD) parallel processor, while the intermediate and low levels operate in multiple modes. The CAAPP operates in single-instruction multiple-data (SIMD) associative or multiassociative mode, and the ICAP operates in single-program multiple-data (SPMD) or MIMD mode.


IEEE Computer Architecture Letters | 2002

A Low Power TLB Structure for Embedded Systems

Jin-Hyuck Choi; Jung-Hoon Lee; Seh-Woong Jeong; Shin-Dug Kim; Charles C. Weems

We present a new two-level TLB (translation look-aside buffer) architecture that integrates a 2-way banked filter TLB with a 2-way banked main TLB. The objective is to reduce power consumption in embedded processors by distributing the accesses to TLB entries across the banks in a balanced manner. First, an advanced filtering technique is devised to reduce access power by adopting a sub-bank structure. Second, a bank-associative structure is applied to each level of the TLB hierarchy. Simulation results show that the Energy*Delay product can be reduced by about 40.9% compared to a fully associative TLB, 24.9% compared to a micro-TLB with 4+32 entries, and 12.18% compared to a micro-TLB with 16+32 entries.
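A rough software model of the two-level banked lookup, assuming bank selection by the low bit of the virtual page number and LRU banks (the paper's sizes and indexing differ; the point is that only one small bank is probed on a filter hit):

```python
from collections import OrderedDict

class BankedTLB:
    """A TLB whose entries are split across banks; one bank per lookup."""
    def __init__(self, entries_per_bank, banks=2):
        self.banks = [OrderedDict() for _ in range(banks)]
        self.per_bank = entries_per_bank

    def lookup(self, vpn):
        bank = self.banks[vpn % len(self.banks)]   # only one bank probed
        if vpn in bank:
            bank.move_to_end(vpn)                  # refresh LRU position
            return True
        if len(bank) >= self.per_bank:
            bank.popitem(last=False)               # evict LRU entry
        bank[vpn] = "pte"                          # fill on miss
        return False

def simulate(trace, filter_entries=2, main_entries=16):
    """Filter TLB consulted first; main TLB only on a filter miss."""
    filt, main = BankedTLB(filter_entries), BankedTLB(main_entries)
    filter_hits = main_hits = 0
    for vpn in trace:
        if filt.lookup(vpn):
            filter_hits += 1        # cheap access: tiny bank only
        elif main.lookup(vpn):
            main_hits += 1
    return filter_hits, main_hits
```

A trace with tight page locality is absorbed almost entirely by the small filter banks, while a wider working set falls through to the main TLB, which still holds the evicted translations.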


International Conference on Computer Design | 1993

The spring scheduling co-processor: a scheduling accelerator

Wayne Burleson; Jason Ko; Douglas Niehaus; Krithi Ramamritham; John A. Stankovic; Gary Wallace; Charles C. Weems

We present a novel co-processor for multiprocessor scheduling in the Spring real-time operating system. Since most dynamic scheduling problems are NP-complete, we use a heuristic algorithm which uses a smart searching scheme to find a feasible schedule for a set of specified tasks and hard deadlines. A parallel VLSI architecture for scheduling is developed that can be scaled for different numbers of tasks, numbers of resources, internal wordlengths, and future IC technologies. The scheduling architecture is implemented in a 0.8 µm CMOS technology and uses an advanced clocking scheme to allow further scaling to future technologies. With an internal clock rate of 100 MHz, a speed increase of two orders of magnitude is expected for scheduling tasks, thus removing a major bottleneck in real-time systems.
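The heuristic search itself (though not the VLSI implementation) can be sketched in software, assuming the heuristic is earliest-deadline ordering with a strong-feasibility check, one simple instance of the Spring scheduling algorithm family:

```python
def spring_schedule(tasks):
    """Heuristic backtracking search for a feasible non-preemptive schedule.

    tasks -- list of (name, duration, deadline) tuples
    At each step the earliest-deadline task is tried first (the heuristic);
    a branch is cut when some remaining task could no longer meet its
    deadline even if scheduled next (strong feasibility).
    """
    def feasible(t, remaining):
        return all(t + dur <= dl for _, dur, dl in remaining)

    def search(t, remaining, schedule):
        if not remaining:
            return schedule
        for task in sorted(remaining, key=lambda x: x[2]):  # heuristic order
            name, dur, dl = task
            if t + dur <= dl:
                rest = [x for x in remaining if x is not task]
                if feasible(t + dur, rest):
                    result = search(t + dur, rest, schedule + [name])
                    if result:
                        return result
        return None                 # no feasible schedule on this branch

    return search(0, list(tasks), [])
```

Returning None signals infeasibility, the case where a real-time system would reject the task set rather than miss a hard deadline.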


Parallel Processing Letters | 2011

High precision integer multiplication with a GPU using Strassen's algorithm with multiple FFT sizes

Niall Emmart; Charles C. Weems

We have improved our prior implementation of Strassen's algorithm for high performance multiplication of very large integers on a general purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a factor of up to 13.9 speed improvement over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, from which we obtain a factor of up to 19 speedup in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip to chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores on a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm being applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
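The FFT-based multiplication at the core of Strassen's algorithm can be sketched on a CPU in Python (8-bit limbs and a plain recursive complex FFT, nothing like the paper's GPU memory layout, but the same split, convolve, round, and carry pipeline):

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT over complex values; len(a) is a power of two."""
    n = len(a)
    if n == 1:
        return a
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def bigmul(x, y, limb_bits=8):
    """Multiply nonnegative integers via FFT convolution of limb vectors."""
    base = 1 << limb_bits
    def limbs(v):
        out = []
        while v:
            out.append(v % base)
            v //= base
        return out or [0]
    a, b = limbs(x), limbs(y)
    n = 1
    while n < len(a) + len(b):      # pad so the convolution fits
        n *= 2
    fa = fft([complex(v) for v in a] + [0] * (n - len(a)))
    fb = fft([complex(v) for v in b] + [0] * (n - len(b)))
    conv = fft([u * v for u, v in zip(fa, fb)], invert=True)
    result, carry = 0, 0
    for i, c in enumerate(conv):
        val = int(round(c.real / n)) + carry   # undo FFT scaling, round
        carry, digit = divmod(val, base)       # propagate the carry
        result += digit << (limb_bits * i)
    return result + (carry << (limb_bits * n))
```

The small limb size keeps the convolution coefficients well inside double-precision range; the paper's contribution is largely about choosing FFT sizes and memory layouts so this pipeline runs efficiently on GPU hardware.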

Collaboration


Dive into Charles C. Weems's collaborations.

Top Co-Authors

Niall Emmart
University of Massachusetts Amherst

Edward M. Riseman
University of Massachusetts Amherst

Darren J. Kerbyson
Pacific Northwest National Laboratory

Glen E. Weaver
University of Massachusetts Amherst

James Burrill
University of Massachusetts Amherst