Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Christopher Batten is active.

Publication


Featured research published by Christopher Batten.


high performance interconnects | 2008

Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics

Christopher Batten; Ajay Joshi; Jason S. Orcutt; Anatoly Khilo; Benjamin Moss; Charles W. Holzwarth; Miloš A. Popović; Hanqing Li; Henry I. Smith; Judy L. Hoyt; Franz X. Kärtner; Rajeev J. Ram; Vladimir Stojanovic; Krste Asanovic

We present a new monolithic silicon photonics technology suited for integration with standard bulk CMOS processes, which reduces costs and improves opto-electrical coupling compared to previous approaches. Our technology supports dense wavelength-division multiplexing with dozens of wavelengths per waveguide. Simulation and experimental results reveal an order of magnitude better energy-efficiency than electrical links in the same technology generation. Exploiting key features of our photonics technology, we have developed a processor-memory network architecture for future manycore systems based on an opto-electrical global crossbar. We illustrate the advantages of the proposed network architecture using analytical models and simulations with synthetic traffic patterns. For a power-constrained system with 256 cores connected to 16 DRAM modules using an opto-electrical crossbar, aggregate network throughput can be improved by approximately 8-10× compared to an optimized purely electrical network.


international symposium on computer architecture | 2004

The Vector-Thread Architecture

Ronny Krashinsky; Christopher Batten; Mark Hampton; Steve Gerding; Brian Pharris; Jared Casper; Krste Asanovic

The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We present SCALE, an instantiation of the VT architecture designed for low-power and high-performance embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of embedded applications and show that its performance is competitive with larger and more complex processors.
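The vector-fetch/thread-fetch split described above can be sketched in a few lines of plain Python. This is an illustrative toy, not SCALE's ISA: the names `vector_fetch`, `thread_fetch`, and the `VP` class are our own, and the "instruction blocks" are just Python callables.

```python
# Illustrative sketch (not SCALE's ISA): a control processor vector-fetches
# a block of work to every virtual processor (VP); any VP whose data needs
# divergent control flow issues its own thread-fetch instead.

class VP:
    def __init__(self, idx, data):
        self.idx, self.data, self.log = idx, data, []

    def run(self, block):
        block(self)

    def thread_fetch(self, block):
        # This VP directs its own control flow, independent of the others.
        block(self)

def vector_fetch(vps, block):
    """Broadcast one instruction block to all VPs."""
    for vp in vps:
        vp.run(block)

def saturate(vp):
    # Uniform work, executed identically on every VP via vector-fetch.
    if vp.data < 0:
        # Divergent case: this VP alone fetches extra work for itself.
        vp.thread_fetch(lambda v: v.log.append(("clamped", v.idx)))
        vp.data = 0
    vp.log.append(("done", vp.idx))

vps = [VP(i, d) for i, d in enumerate([3, -1, 7, -5])]
vector_fetch(vps, saturate)
print([vp.data for vp in vps])  # negative elements clamped to 0
```

The point of the sketch is the mixing: most VPs follow the broadcast block in lockstep, while individual VPs can peel off onto their own control path without disturbing the rest.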


international symposium on microarchitecture | 2009

Building Many-Core Processor-to-DRAM Networks with Monolithic CMOS Silicon Photonics

Christopher Batten; Ajay Joshi; Jason S. Orcutt; Anatol Khilo; Benjamin Moss; Charles W. Holzwarth; Miloš A. Popović; Hanqing Li; Henry I. Smith; Judy L. Hoyt; Franz X. Kärtner; Rajeev J. Ram; Vladimir Stojanovic; Krste Asanovic

Silicon photonics is a promising technology for addressing memory bandwidth limitations in future many-core processors. This article first introduces a new monolithic silicon-photonic technology, which uses a standard bulk CMOS process to reduce costs and improve energy efficiency, and then explores the logical and physical implications of leveraging this technology in processor-to-memory networks.


international symposium on computer architecture | 2010

Re-architecting DRAM memory systems with monolithically integrated silicon photonics

Scott Beamer; Chen Sun; Yong-Jin Kwon; Ajay Joshi; Christopher Batten; Vladimir Stojanovic; Krste Asanovic

The performance of future manycore processors will only scale with the number of integrated cores if there is a corresponding increase in memory bandwidth. Projected scaling of electrical DRAM architectures appears unlikely to suffice, being constrained by processor and DRAM pin-bandwidth density and by total DRAM chip power, including off-chip signaling, cross-chip interconnect, and bank access energy. In this work, we redesign the DRAM main memory system using a proposed monolithically integrated silicon photonics technology and show that our photonically interconnected DRAM (PIDRAM) provides a promising solution to all of these issues. Photonics can provide high aggregate pin-bandwidth density through dense wavelength-division multiplexing. Photonic signaling provides energy-efficient communication, which we exploit to not only reduce chip-to-chip interconnect power but to also reduce cross-chip interconnect power by extending the photonic links deep into the actual PIDRAM chips. To complement these large improvements in interconnect bandwidth and power, we decrease the number of bits activated per bank to improve the energy efficiency of the PIDRAM banks themselves. Our most promising design point yields approximately a 10x power reduction for a single-chip PIDRAM channel with similar throughput and area as a projected future electrical-only DRAM. Finally, we propose optical power guiding as a new technique that allows a single PIDRAM chip design to be used efficiently in several multi-chip configurations that provide either increased aggregate capacity or bandwidth.


IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2012

Designing Chip-Level Nanophotonic Interconnection Networks

Christopher Batten; Ajay Joshi; Vladimir Stojanovic; Krste Asanovic

Technology scaling will soon enable high-performance processors with hundreds of cores integrated onto a single die, but the success of such systems could be limited by the corresponding chip-level interconnection networks. There have been many recent proposals for nanophotonic interconnection networks that attempt to provide improved performance and energy-efficiency compared to electrical networks. This paper discusses the approach we have used when designing such networks, and provides a foundation for designing new networks. We begin by briefly reviewing the basic silicon-photonic device technology before outlining design issues and surveying previous nanophotonic network proposals at the architectural level, the microarchitectural level, and the physical level. In designing our own networks, we use an iterative process that moves between these three levels of design to meet application requirements given our technology constraints. We use our ongoing work on leveraging nanophotonics in an on-chip tile-to-tile network, processor-to-main-memory network, and dynamic random-access memory (DRAM) channel to illustrate this design process.


Nucleic Acids Research | 2010

Algorithms for automated DNA assembly

Douglas Densmore; Timothy Hsiau; Joshua T. Kittleson; Will DeLoache; Christopher Batten; J. Christopher Anderson

Generating a defined set of genetic constructs within a large combinatorial space provides a powerful method for engineering novel biological functions. However, the process of assembling more than a few specific DNA sequences can be costly, time consuming and error prone. Even if a correct theoretical construction scheme is developed manually, it is likely to be suboptimal by any number of cost metrics. Modular, robust and formal approaches are needed for exploring these vast design spaces. By automating the design of DNA fabrication schemes using computational algorithms, we can eliminate human error while reducing redundant operations, thus minimizing the time and cost required for conducting biological engineering experiments. Here, we provide algorithms that optimize the simultaneous assembly of a collection of related DNA sequences. We compare our algorithms to an exhaustive search on a small synthetic dataset and our results show that our algorithms can quickly find an optimal solution. Comparison with random search approaches on two real-world datasets show that our algorithms can also quickly find lower-cost solutions for large datasets.
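The cost savings from optimizing a *collection* of related constructs together can be seen in a toy model. This is not the paper's algorithm, just an illustration of the underlying idea: when constructs share intermediates, building each intermediate once and reusing it cuts the total number of assembly reactions. The left-to-right assembly order and part names here are invented for the example.

```python
# Toy illustration (not the paper's algorithms): assemble several related
# constructs left-to-right, memoizing shared intermediates so that a
# common prefix is only built once across the whole collection.

def assembly_steps(constructs, share=True):
    built = set()    # intermediates already on hand
    steps = 0
    for parts in constructs:
        inter = parts[0]
        for part in parts[1:]:
            inter = inter + "+" + part      # the joined intermediate
            if share and inter in built:
                continue                    # reuse it: no new reaction
            steps += 1
            built.add(inter)
    return steps

# Three related constructs sharing a promoter and, in two cases, a gene.
related = [("pA", "gfp", "term"), ("pA", "gfp", "tag"), ("pA", "rfp", "term")]
print(assembly_steps(related, share=False))  # 6 joins built independently
print(assembly_steps(related, share=True))   # 5: "pA+gfp" is reused
```

Even on three short constructs the shared intermediate saves a reaction; over a large combinatorial library of related sequences, choosing the assembly tree to maximize such sharing is exactly the kind of optimization the paper automates.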


international symposium on microarchitecture | 2004

Cache Refill/Access Decoupling for Vector Machines

Christopher Batten; Ronny Krashinsky; Steve Gerding; Krste Asanovic

Vector processors often use a cache to exploit temporal locality and reduce memory bandwidth demands, but then require expensive logic to track large numbers of outstanding cache misses to sustain peak bandwidth from memory. We present refill/access decoupling, which augments the vector processor with a Vector Refill Unit (VRU) to quickly pre-execute vector memory commands and issue any needed cache line refills ahead of regular execution. The VRU reduces costs by eliminating much of the outstanding miss state required in traditional vector architectures and by using the cache itself as a cost-effective prefetch buffer. We also introduce vector segment accesses, a new class of vector memory instructions that efficiently encode two-dimensional access patterns. Segments reduce address bandwidth demands and enable more efficient refill/access decoupling by increasing the information contained in each vector memory command. Our results show that refill/access decoupling is able to achieve better performance with less resources than more traditional decoupling methods. Even with a small cache and memory latencies as long as 800 cycles, refill/access decoupling can sustain several kilobytes of in-flight data with minimal access management state and no need for expensive reserved element buffering.
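The "two-dimensional access pattern" that a single segment instruction encodes can be made concrete with a small address-stream generator. The field names below (`base`, `n_segments`, `seg_len`, `stride`) are ours, not the SCALE ISA's; the point is that one command describes a pattern that would otherwise take several strided or unit-stride vector loads.

```python
# Illustrative sketch of the address stream one vector segment memory
# command encodes: for each of n_segments records spaced `stride` bytes
# apart, touch seg_len consecutive elements.

def segment_addresses(base, n_segments, seg_len, stride, elem_size=4):
    """Byte addresses touched by one vector segment memory command."""
    return [base + i * stride + j * elem_size
            for i in range(n_segments)
            for j in range(seg_len)]

# e.g. loading 4-element records for 3 array entries spaced 32 bytes apart
addrs = segment_addresses(base=0x1000, n_segments=3, seg_len=4, stride=32)
print([hex(a) for a in addrs])
```

Because the whole 2D pattern is visible in one command, a refill unit can compute exactly which cache lines the access will need and issue those refills far ahead of execution, which is what makes segments synergistic with refill/access decoupling.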


international symposium on microarchitecture | 2014

PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research

Derek Lockhart; Gary Zibrat; Christopher Batten

Technology trends prompting architects to consider greater heterogeneity and hardware specialization have exposed an increasing need for vertically integrated research methodologies that can effectively assess performance, area, and energy metrics of future architectures. However, constructing such a methodology with existing tools is a significant challenge due to the unique languages, design patterns, and tools used in functional-level (FL), cycle-level (CL), and register-transfer-level (RTL) modeling. We introduce a new framework called PyMTL that aims to close this computer architecture research methodology gap by providing a unified design environment for FL, CL, and RTL modeling. PyMTL leverages the Python programming language to create a highly productive domain-specific embedded language for concurrent-structural modeling and hardware design. While the use of Python as a modeling and framework implementation language provides considerable benefits in terms of productivity, it comes at the cost of significantly longer simulation times. We address this performance-productivity gap with a hybrid JIT compilation and JIT specialization approach. We introduce SimJIT, a custom JIT specialization engine that automatically generates optimized C++ for CL and RTL models. To reduce the performance impact of the remaining unspecialized code, we combine SimJIT with an off-the-shelf Python interpreter with a meta-tracing JIT compiler (PyPy). SimJIT+PyPy provides speedups of up to 72× for CL models and 200× for RTL models, bringing us within 4-6× of optimized C++ code while providing significant benefits in terms of productivity and usability.
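The concurrent-structural modeling style the abstract refers to can be sketched in plain Python. To be clear, this is NOT the PyMTL API; it only illustrates the idea of separating a component's combinational logic from its sequential (registered) state and driving both from a cycle-ticking simulator, which is the kind of model a JIT specializer can then compile to C++.

```python
# Minimal sketch of cycle-level concurrent-structural modeling in plain
# Python (illustrative only; PyMTL's real API differs): components expose
# ports, combinational and sequential behavior are separate update
# functions, and a simulator ticks them once per cycle.

class Accumulator:
    def __init__(self):
        self.in_ = 0      # input port
        self.out = 0      # output port (registered state)
        self._next = 0    # next-state value computed combinationally

    def comb(self):
        # Combinational: compute next state from current state and inputs.
        self._next = self.out + self.in_

    def seq(self):
        # Sequential: the register updates on the clock edge.
        self.out = self._next

def simulate(dut, inputs):
    trace = []
    for x in inputs:          # one iteration == one clock cycle
        dut.in_ = x
        dut.comb()
        dut.seq()
        trace.append(dut.out)
    return trace

print(simulate(Accumulator(), [1, 2, 3, 4]))  # running sums: [1, 3, 6, 10]
```

The tick loop here is exactly the kind of interpreter overhead the abstract mentions: per-cycle Python dispatch is slow, which is why specializing such models to generated C++ (SimJIT) and meta-tracing the rest (PyPy) recovers most of the lost performance.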


international symposium on microarchitecture | 2014

Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists

Ji Yun Kim; Christopher Batten

Although GPGPUs are traditionally used to accelerate workloads with regular control and memory-access structure, recent work has shown that GPGPUs can also achieve significant speedups on more irregular algorithms. Data-driven implementations of irregular algorithms are algorithmically more efficient than topology-driven implementations, but issues with memory contention and memory-access irregularity can make the former perform worse in certain cases. In this paper, we propose a novel fine-grain hardware worklist for GPGPUs that addresses the weaknesses of data-driven implementations. We detail multiple work redistribution schemes of varying complexity that can be employed to improve load balancing. Furthermore, a virtualization mechanism supports seamless work spilling to memory. A convenient shared worklist software API is provided to simplify using our proposed mechanisms when implementing irregular algorithms. We evaluate challenging irregular algorithms from the Lonestar GPU benchmark suite on a cycle-level simulator. Our findings show that data-driven implementations running on a GPGPU using the hardware worklist outperform highly optimized software-based implementations of these benchmarks running on a baseline GPGPU with speedups ranging from 1.2-2.4× and marginal area overhead.
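The data-driven vs. topology-driven distinction the abstract leans on is easy to see in a software analogue (the paper's contribution is a *hardware* worklist; this sketch only shows why worklists are algorithmically more efficient). A topology-driven BFS would sweep every node each round; the data-driven version below touches only the nodes currently on the worklist.

```python
# Software analogue of a data-driven irregular algorithm: BFS that only
# processes active nodes pulled from a worklist, instead of sweeping the
# whole graph (topology) every iteration.

from collections import deque

def bfs_data_driven(graph, src):
    dist = {src: 0}
    worklist = deque([src])      # only active nodes ever enter here
    while worklist:
        u = worklist.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                worklist.append(v)   # newly activated work
    return dist

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_data_driven(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```

On a GPGPU, the `worklist` above becomes a shared structure that thousands of threads push to and pop from concurrently; the contention and load imbalance that creates is precisely what the proposed fine-grain hardware worklist, its redistribution schemes, and its spill-to-memory virtualization are built to manage.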


international symposium on computer architecture | 2013

Microarchitectural mechanisms to exploit value structure in SIMT architectures

Ji Kim; Christopher Torng; Shreesha Srinath; Derek Lockhart; Christopher Batten

SIMT architectures improve performance and efficiency by exploiting control and memory-access structure across data-parallel threads. Value structure occurs when multiple threads operate on values that can be compactly encoded, e.g., by using a simple function of the thread index. We characterize the availability of control, memory-access, and value structure in typical kernels and observe ample amounts of value structure that is largely ignored by current SIMT architectures. We propose three microarchitectural mechanisms to exploit value structure based on compact affine execution of arithmetic, branch, and memory instructions. We explore these mechanisms within the context of traditional SIMT microarchitectures (GP-SIMT), found in general-purpose graphics processing units, as well as fine-grain SIMT microarchitectures (FG-SIMT), a SIMT variant appropriate for compute-focused data-parallel accelerators. Cycle-level modeling of a modern GP-SIMT system and a VLSI implementation of an eight-lane FG-SIMT execution engine are used to evaluate a range of application kernels. When compared to a baseline without compact affine execution, our approach can improve GP-SIMT cycle-level performance by 4-17% and can improve FG-SIMT absolute performance by 20-65% and energy efficiency up to 30% for a majority of the kernels.
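"Value structure" as defined above can be demonstrated with a toy detector: across data-parallel threads, many operands are an affine function of the thread index and can be stored as a `(base, stride)` pair instead of one value per thread. The encoding below is invented for illustration, not the paper's hardware.

```python
# Toy illustration of SIMT value structure: if each thread's operand is
# base + stride * tid, the whole warp's values compress to two scalars,
# and arithmetic on them can be done once in affine form.

def detect_affine(values):
    """Return (base, stride) if values[i] == base + stride*i, else None."""
    base, stride = values[0], values[1] - values[0]
    if all(v == base + stride * i for i, v in enumerate(values)):
        return (base, stride)
    return None

tids = list(range(8))                     # thread indices 0..7
addrs = [0x100 + 4 * t for t in tids]     # per-thread load addresses
print(detect_affine(addrs))               # (256, 4): 8 values in 2 scalars
```

Per-thread load addresses, loop counters, and induction variables all tend to have exactly this shape, which is why the paper finds "ample" value structure that compact affine execution of arithmetic, branch, and memory instructions can exploit.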

Collaboration


Dive into Christopher Batten's collaborations.

Top Co-Authors

Krste Asanovic

University of California

Vladimir Stojanovic

Massachusetts Institute of Technology

Scott Beamer

University of California

Ronny Krashinsky

Massachusetts Institute of Technology

Yong-Jin Kwon

University of California