Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Michael L. Chu is active.

Publication


Featured research published by Michael L. Chu.


international symposium on computer architecture | 2005

An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Nathan Clark; Jason A. Blome; Michael L. Chu; Scott A. Mahlke; Stuart David Biles; Krisztian Flautner

Instruction set customization is an effective way to improve processor performance. Critical portions of application dataflow graphs are collapsed for accelerated execution on specialized hardware. Collapsing dataflow subgraphs compresses the latency along critical paths and reduces the number of intermediate results stored in the register file. While custom instructions can be effective, the time and cost of designing a new processor for each application is immense. To overcome this roadblock, this paper proposes a flexible architectural framework to transparently integrate custom instructions into a general-purpose processor. Hardware accelerators are added to the processor to execute the collapsed subgraphs. A simple microarchitectural interface is provided to support a plug-and-play model for integrating a wide range of accelerators into a pre-designed and verified processor core. The accelerators are exploited using an approach of static identification and dynamic realization: the compiler is responsible for identifying profitable subgraphs, while the hardware handles discovery, mapping, and execution of compatible subgraphs. This paper presents the design of a plug-and-play transparent accelerator system and evaluates the cost/performance implications of the design.
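
As a rough illustration of the static-identification half of this approach, the sketch below enumerates small dataflow subgraphs and keeps those whose collapsed execution on a one-cycle accelerator beats their sequential critical path. The graph encoding, the single-cycle accelerator latency, and the brute-force enumeration are simplifying assumptions; the paper's real candidates must also satisfy constraints such as subgraph convexity and accelerator I/O limits.

```python
# Minimal sketch of compile-time subgraph identification for transparent
# ISA customization. All names and the cost model are illustrative.
from itertools import combinations

def critical_path(subgraph, latency, edges):
    # longest-latency chain inside the chosen subgraph (assumed a DAG)
    memo = {}
    def depth(op):
        if op not in memo:
            succs = [d for (s, d) in edges if s == op and d in subgraph]
            memo[op] = latency[op] + max((depth(d) for d in succs), default=0)
        return memo[op]
    return max(depth(op) for op in subgraph)

def profitable_subgraphs(ops, edges, latency, accel_latency=1, max_size=4):
    # brute force for illustration; real candidates must also be convex
    # and respect the accelerator's input/output port limits
    for size in range(2, max_size + 1):
        for combo in combinations(ops, size):
            sg = set(combo)
            gain = critical_path(sg, latency, edges) - accel_latency
            if gain > 0:
                yield sg, gain

ops = ["and1", "shl1", "add1", "xor1"]
edges = [("and1", "shl1"), ("shl1", "add1"), ("add1", "xor1")]
latency = {op: 1 for op in ops}
print(max(profitable_subgraphs(ops, edges, latency), key=lambda g: g[1]))
# -> ({'and1', 'shl1', 'add1', 'xor1'}, 3): collapsing saves 3 cycles
```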


international symposium on microarchitecture | 2007

Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures

Michael L. Chu; Rajiv A. Ravindran; Scott A. Mahlke

The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single-core processor.
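
A toy rendition of the profile-guided idea: memory operations that touched overlapping sets of cache lines in a profile get high affinity and are greedily kept in the same partition, while working-set footprints are balanced across caches. The data structures and the greedy policy below are illustrative stand-ins, not the paper's algorithm.

```python
# Profile-guided memory-access partitioning, heavily simplified.
from collections import defaultdict

def affinity(profile):
    """profile: {mem_op: set of cache lines touched} -> pairwise affinity."""
    aff = defaultdict(int)
    ops = list(profile)
    for i, a in enumerate(ops):
        for b in ops[i + 1:]:
            aff[(a, b)] = len(profile[a] & profile[b])
    return aff

def partition(profile, num_caches):
    aff = affinity(profile)
    parts = [set() for _ in range(num_caches)]
    load = [0] * num_caches  # working-set footprint per cache
    # place ops in decreasing-footprint order, preferring the partition
    # with the most affinity, breaking ties toward the least-loaded cache
    for op in sorted(profile, key=lambda o: -len(profile[o])):
        def score(i):
            gain = sum(aff.get((op, o), 0) + aff.get((o, op), 0)
                       for o in parts[i])
            return (gain, -load[i])
        best = max(range(num_caches), key=score)
        parts[best].add(op)
        load[best] += len(profile[op])
    return parts

profile = {"ld1": {1, 2, 3}, "ld2": {2, 3}, "st1": {9}, "ld3": {9, 10}}
print(partition(profile, 2))  # ld1/ld2 share a cache; st1/ld3 the other
```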


programming language design and implementation | 2003

Region-based hierarchical operation partitioning for multicluster processors

Michael L. Chu; Kevin Fan; Scott A. Mahlke

Clustered architectures are a solution to the bottleneck of centralized register files in superscalar and VLIW processors. The main challenge associated with clustered architectures is compiler support to effectively partition operations across the available resources on each cluster. In this work, we present a novel technique for clustering operations based on graph partitioning methods. Our approach incorporates new methods of assigning weights to nodes and edges within the dataflow graph to guide the partitioner. Nodes are assigned weights to reflect their resource usage within a cluster, while a slack distribution method intelligently assigns weights to edges to reflect the cost of inserting moves across clusters. A multilevel graph partitioning algorithm, which globally divides a dataflow graph into multiple parts in a hierarchical manner, uses these weights to efficiently generate estimates of partition quality. Our algorithm achieves an average improvement of 20% on DSP kernels and 5% on SPECint2000 for a four-cluster architecture.
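
The sketch below illustrates the slack-based edge-weighting step, assuming a topologically ordered dataflow graph: ASAP/ALAP times give each edge a slack, and zero-slack (critical-path) edges receive heavy weights so a generic multilevel partitioner avoids cutting them. The weighting constants are invented, and the paper's slack distribution is more nuanced than this binary rule.

```python
# Edge weights for a graph partitioner: a cut edge becomes an
# intercluster move, so critical edges are made expensive to cut.
def asap_alap(ops, edges, latency):
    preds = {o: [s for (s, d) in edges if d == o] for o in ops}
    succs = {o: [d for (s, d) in edges if s == o] for o in ops}
    asap = {}
    for o in ops:  # ops assumed topologically ordered
        asap[o] = max((asap[p] + latency[p] for p in preds[o]), default=0)
    length = max(asap[o] + latency[o] for o in ops)
    alap = {}
    for o in reversed(ops):
        alap[o] = min((alap[s] - latency[o] for s in succs[o]),
                      default=length - latency[o])
    return asap, alap

def edge_weights(ops, edges, latency, move_cost=1):
    asap, alap = asap_alap(ops, edges, latency)
    weights = {}
    for (s, d) in edges:
        slack = alap[d] - (asap[s] + latency[s])
        # zero-slack edges sit on the critical path: cutting one lengthens
        # the schedule by a full intercluster move, so weight them heavily
        weights[(s, d)] = move_cost * 100 if slack == 0 else move_cost
    return weights

ops = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
print(edge_weights(ops, edges, {"a": 1, "b": 2, "c": 1}))
# both edges are critical in this chain, so both get weight 100
```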


application-specific systems, architectures, and processors | 2003

Systematic register bypass customization for application-specific processors

Kevin Fan; Nathan Clark; Michael L. Chu; K. V. Manjunath; Rajiv A. Ravindran; Mikhail Smelyanskiy; Scott A. Mahlke

Register bypass provides additional datapaths to eliminate data hazards in processor pipelines. The difficulty with register bypass is that the cost of the bypass network is substantial and grows rapidly as processor width or pipeline depth increases. For a single application, many of the bypass paths have extremely low utilization. Thus, there is an important opportunity in the design of application-specific processors to remove a large fraction of the bypass cost while maintaining performance comparable to a processor with full bypass. We propose a systematic design customization process along with a bypass-cognizant compiler scheduler. For the former, we employ iterative design space exploration wherein successive processor designs are selected based on bypass utilization statistics combined with the availability of redundant bypass paths. Compiler scheduling for sparse bypass processors is accomplished by prioritizing function unit choices for each operation prior to scheduling using global information. Results show that for a 5-issue customized VLIW processor, 70% of the bypass cost is eliminated while sacrificing only 10% performance.
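
One way to picture the iterative exploration is a loop that tentatively drops the least-utilized bypass path and keeps the removal only if an evaluation function, standing in for rescheduling with the bypass-cognizant compiler, stays within a performance budget. The `evaluate` cost model and the 10% slowdown budget below are invented for illustration.

```python
# Iterative bypass-network customization, reduced to its skeleton.
def customize_bypass(paths, utilization, evaluate, max_slowdown=1.10):
    """paths: set of bypass paths; utilization: path -> use frequency."""
    baseline = evaluate(paths)
    kept = set(paths)
    for path in sorted(paths, key=lambda p: utilization[p]):
        trial = kept - {path}
        # commit the removal only if the schedule stays within budget
        if evaluate(trial) <= baseline * max_slowdown:
            kept = trial
    return kept

# toy model: each removed path adds cycles proportional to its utilization
utilization = {("ALU0", "ALU1"): 0.01, ("ALU1", "ALU0"): 0.30,
               ("MEM", "ALU0"): 0.02, ("ALU0", "MEM"): 0.25}
evaluate = lambda kept: 100 + sum(
    1000 * utilization[p] for p in utilization if p not in kept)
print(customize_bypass(set(utilization), utilization, evaluate))
# only the 1%-utilized path is removed under the 10% budget
```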


languages, compilers, and tools for embedded systems | 2007

Compiler-managed partitioned data caches for low power

Rajiv A. Ravindran; Michael L. Chu; Scott A. Mahlke

Set-associative caches are traditionally managed using hardware-based lookup and replacement schemes that have high energy overheads. Ideally, the caching strategy should be tailored to the application's memory needs, thus enabling optimal use of this on-chip storage to maximize performance while minimizing power consumption. However, doing this in hardware alone is difficult due to hardware complexity, high power dissipation, overheads of dynamic discovery of application characteristics, and increased likelihood of making locally optimal decisions. The compiler can instead determine the caching strategy by analyzing the application code and providing hints to the hardware. We propose a hardware/software co-managed partitioned cache architecture in which enhanced load/store instructions are used to control fine-grain data placement within a set of cache partitions. In comparison to traditional partitioning techniques, load and store instructions can individually specify the set of partitions for lookup and replacement. This fine-grain control can avoid conflicts, thus providing the performance benefits of highly associative caches, while saving energy by eliminating redundant tag and data array accesses. Using four direct-mapped partitions, we eliminated 25% of the tag checks and recorded an average 15% reduction in the energy-delay product compared to a hardware-managed 4-way set-associative cache.
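
A minimal sketch of what the compiler hint could look like, assuming a hypothetical encoding in which each load/store carries a lookup bitmask over partitions plus a replacement partition. The round-robin placement is a placeholder for the paper's compiler analysis; the point is that a one-bit mask means one tag check instead of four.

```python
# Toy encoding of compiler-managed partition hints on loads/stores.
def assign_partitions(accesses, num_partitions):
    """accesses: list of (instr, object) pairs. Objects the compiler
    proved disjoint go to different partitions (round-robin here)."""
    placement, annotated = {}, []
    for instr, obj in accesses:
        if obj not in placement:
            placement[obj] = len(placement) % num_partitions
        p = placement[obj]
        # (op, object, lookup bitmask, replacement partition)
        annotated.append((instr, obj, 1 << p, p))
    return annotated

for op, obj, mask, fill in assign_partitions(
        [("ld", "A"), ("st", "B"), ("ld", "A"), ("ld", "C")], 4):
    print(f"{op} {obj}: lookup=0b{mask:04b} replace=P{fill}")
# each access checks exactly one of the four partition tag arrays
```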


symposium on code generation and optimization | 2006

Compiler-directed data partitioning for multicluster processors

Michael L. Chu; Scott A. Mahlke

Multicluster architectures overcome the scaling problem of centralized resources by distributing the datapath, register file, and memory subsystem across multiple clusters connected by a communication network. Traditional compiler partitioning algorithms focus solely on distributing operations across the clusters to maximize instruction-level parallelism. The distribution of data objects is generally ignored. In this work, we examine explicit partitioning of data objects and its effects on operation partitioning. The partitioning of data objects must consider several factors: object size, access frequency/pattern, and dependence patterns between operations that manipulate the objects. This work proposes a compiler-directed approach to synergistically partition both data objects and computation across multiple clusters. First, a global view of the application determines the interaction between data memory objects and their associated computation. Next, data objects are partitioned across multiple clusters with knowledge of the associated computation required by the application. Finally, the resulting distribution of the data objects is relayed to a region-level computation partitioner, which carefully places computation operations in a performance-centric manner.
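
The two phases can be caricatured as follows: data objects are first spread across clusters by access frequency, then memory operations follow their objects' home clusters. Both policies below are deliberately naive stand-ins for the paper's global, application-aware analysis.

```python
# Synergistic data-then-computation partitioning, reduced to a toy.
def partition_objects(objects, num_clusters):
    """objects: {name: access count}. Balance total accesses per cluster."""
    load = [0] * num_clusters
    home = {}
    for name in sorted(objects, key=objects.get, reverse=True):
        c = load.index(min(load))  # least-loaded cluster gets the object
        home[name] = c
        load[c] += objects[name]
    return home

def partition_ops(ops, home):
    """ops: {op: object or None}. Memory ops go to their object's cluster;
    others default to cluster 0 (the real partitioner weighs ILP too)."""
    return {op: home.get(obj, 0) for op, obj in ops.items()}

home = partition_objects({"A": 900, "B": 500, "C": 400}, 2)
print(home)                                   # {'A': 0, 'B': 1, 'C': 1}
print(partition_ops({"ld_A": "A", "st_B": "B", "add": None}, home))
```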


symposium on code generation and optimization | 2004

FLASH: foresighted latency-aware scheduling heuristic for processors with customized datapaths

Manjunath Kudlur; Kevin Fan; Michael L. Chu; Rajiv A. Ravindran; Nathan Clark; Scott A. Mahlke

Application-specific instruction set processors (ASIPs) have the potential to meet the challenging cost, performance, and power goals of future embedded processors by customizing the hardware to suit an application. A central problem is creating compilers that are capable of dealing with the heterogeneous and nonuniform hardware created by the customization process. The processor datapath provides an effective area to customize, but specialized datapaths often have nonuniform connectivity between the function units, making the effective latency of a function unit dependent on the consuming operation. Traditional instruction schedulers break down in this environment because of their locally greedy nature: they bind the best choice for a single operation even though that choice may be poor due to a lack of communication paths. To effectively schedule with nonuniform connectivity, we propose a foresighted latency-aware scheduling heuristic (FLASH) that performs lookahead across future scheduling steps to estimate the effects of a potential binding. FLASH combines a set of lookahead heuristics to achieve effective foresight with low compile-time overhead.
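
A condensed sketch of foresighted binding, assuming a made-up effective-latency table between function units: each candidate unit is scored by the operation's own latency plus a one-level lookahead over the worst-case cost of reaching its consumers, so a locally attractive unit with poor outgoing connectivity loses.

```python
# One-level lookahead binding in the spirit of FLASH; all numbers invented.
op_latency = {"FU0": 2, "FU1": 1}      # FU1 looks better to a greedy binder
eff_lat = {("FU0", "FU0"): 1, ("FU0", "FU1"): 2,   # FU0 reaches both units
           ("FU1", "FU0"): 4, ("FU1", "FU1"): 4}   # FU1's results move slowly

def bind(num_consumers, fus=("FU0", "FU1")):
    def reach(src):
        # pessimistic lookahead: worst effective latency to any unit a
        # consumer might later be bound to
        return max(eff_lat[(src, dst)] for dst in fus)
    def score(fu):
        return op_latency[fu] + num_consumers * reach(fu)
    return min(fus, key=score)

print(bind(num_consumers=2))  # foresight picks FU0; greedy would pick FU1
```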


application-specific systems, architectures, and processors | 2004

Automatic synthesis of customized local memories for multicluster application accelerators

Manjunath Kudlur; Kevin Fan; Michael L. Chu; Scott A. Mahlke

Distributed local memories, or scratchpads, have been shown to effectively reduce cost and power consumption of application-specific accelerators while maintaining performance. The design of the local memory organization must take several factors into account, including the memory bandwidth and size requirements of the program and the distribution of program data among the memories. In addition, when register structures and function units in the accelerator are clustered, the effects of intercluster communication should be taken into account. This work proposes a technique to synthesize the local memory architecture of a clustered accelerator using a phase-ordered approach. First, the dataflow graph is pre-partitioned to define a performance-centric grouping of the operations. Second, memory synthesis is performed by combining multiple data structures into a set of physical memories that minimizes cost while maintaining a performance threshold. Finally, post-partitioning is performed to determine the final assignment of operations to clusters given the memory organization. Results show that customization reduces memory cost by 2% to 59% compared with a naive scheme that uses one physical memory per program data structure. Further, pre-partitioning is shown to reduce the intercluster communication required to achieve a fixed level of performance.
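
The memory-synthesis step can be sketched as a greedy bin-packing: data structures are merged into shared physical memories as long as their combined access rate still fits one memory's ports. The port arithmetic and cost model below are simplified assumptions, not the paper's formulation.

```python
# Greedy stand-in for local-memory synthesis under a bandwidth threshold.
def synthesize_memories(structs, ports_per_mem=1, accesses_per_cycle=1.0):
    """structs: {name: (size_words, accesses_per_cycle)} -> memories."""
    memories = []  # each: {"structs": [...], "size": int, "rate": float}
    for name, (size, rate) in sorted(structs.items(),
                                     key=lambda kv: -kv[1][1]):
        for mem in memories:
            # merge only if the shared ports still satisfy the demand
            if mem["rate"] + rate <= ports_per_mem * accesses_per_cycle:
                mem["structs"].append(name)
                mem["size"] += size
                mem["rate"] += rate
                break
        else:
            memories.append({"structs": [name], "size": size, "rate": rate})
    return memories

structs = {"coeffs": (64, 0.9), "hist": (256, 0.4), "tmp": (32, 0.5)}
for m in synthesize_memories(structs):
    print(m)  # coeffs alone; hist and tmp share one physical memory
```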


languages, compilers, and tools for embedded systems | 2007

Code and data partitioning for fine-grain parallelism

Michael L. Chu; Scott A. Mahlke

The recent shift to multicore designs for mainstream processors offers the potential to improve the performance of current applications. However, converting this potential into reality is a great challenge. The programmer and/or compiler must parallelize applications to take advantage of multiple cores. Recently, a significant amount of work has focused on areas such as new programming models and ways to exploit data-level parallelism. These methods for coarse-grain parallelization can be extremely powerful in extracting large amounts of parallel work and distributing them across the cores. However, there are still a significant number of single-threaded applications and programs that simply do not exhibit the inherent parallelism for programmers to widely spread their execution across multiple cores. This paper focuses on an alternative compiler-directed method for program parallelization by exploiting fine-grain instruction-level parallelism (ILP). Current research in interconnection networks has focused on multiple ways to increase the speed and bandwidth of communication between cores [6, 7]. Faster communication of data values between the cores can then allow applications to take advantage of parallelization at the operation and data granularity between the cores. While coarse-grain techniques can parallelize large portions of execution, our fine-grain method can use an additional dimension to further increase performance and exploit the multiple underlying cores. The challenge for exploiting fine-grain parallelism is: given an application, identify the operations that should execute on each core. This decision must take into account the communication overhead of transferring register values between the cores as well as the layout of data values in the individual caches of each core. Poor decisions can lead to communication across the interconnection network that delays the execution of other operations, as well as cache conflicts.
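
The placement trade-off described above can be made concrete with a toy cost model: placing an operation on a core costs a register transfer for every input produced elsewhere, plus a penalty if its memory operand lives in another core's cache. All constants below are invented.

```python
# Toy cost model for assigning one operation to a core.
def place_op(op, producers_core, data_core, xfer_cost=3, miss_cost=10,
             cores=(0, 1)):
    """producers_core: core of each input's producer; data_core: core
    whose cache holds the op's memory operand (None for non-memory ops)."""
    def cost(c):
        comm = sum(xfer_cost for p in producers_core if p != c)
        locality = 0 if data_core in (None, c) else miss_cost
        return comm + locality
    return min(cores, key=cost)

# inputs produced on core 0, but the data lives in core 1's cache:
# the remote-access penalty outweighs two register transfers
print(place_op("ld_x", producers_core=[0, 0], data_core=1))  # -> 1
```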


international symposium on microarchitecture | 2004

Cost-sensitive partitioning in an architecture synthesis system for multicluster processors

Michael L. Chu; Kevin Fan; Rajiv A. Ravindran; Scott A. Mahlke

Application-specific instruction processors (ASIPs) have great potential to meet the challenging demands of pervasive systems. This work presents a hierarchical system that automatically designs highly customized multicluster processors. In the first of two tightly coupled components, design space exploration heuristically searches for the basic capabilities that define the processor's overall parallelism. In the second, a hardware compiler determines the detailed architecture configuration that realizes the parallelism.
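
In miniature, the two components might look like an outer search over coarse capabilities (cluster count and issue width) scored by a stand-in compiler estimate, with the winner handed to the detailed configuration step. Every model and constant below is invented for illustration.

```python
# Skeleton of capability-level design space exploration under a cost budget.
def cost_model(clusters, width):
    # register-file area grows roughly quadratically with per-cluster width
    return clusters * (width // clusters + 1) ** 2 * 5

def evaluate(clusters, width):
    # stand-in for scheduling the app: more width helps, clustering costs
    return 1000 // width + 20 * (clusters - 1)

def explore(budget_cost=100):
    candidates = [(c, w) for c in (1, 2, 4) for w in (2, 4)]
    feasible = [(evaluate(c, w), (c, w)) for c, w in candidates
                if cost_model(c, w) <= budget_cost]
    return min(feasible)[1]  # best cycle estimate within the cost budget

print(explore())  # capabilities handed to the detailed hardware compiler
```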

Collaboration


Dive into Michael L. Chu's collaborations.

Top Co-Authors

Kevin Fan

University of Michigan
