Mark Woh
University of Michigan
Publications
Featured research published by Mark Woh.
international symposium on microarchitecture | 2008
Mark Woh; Yuan Lin; Sangwon Seo; Scott A. Mahlke; Trevor N. Mudge; Chaitali Chakrabarti; Richard Edward Bruce; Danny Kershaw; Alastair Reid; Mladen Wilder; Krisztian Flautner
With the multitude of existing and upcoming wireless standards, it is becoming increasingly difficult for hardware-only baseband processing solutions to adapt to the rapidly changing wireless communication landscape. Software defined radio (SDR) promises to deliver a cost effective and flexible solution by implementing a wide variety of wireless protocols in software. In previous work, a fully programmable multicore architecture, SODA, was proposed that was able to meet the real-time requirements of 3G wireless protocols. SODA consists of one ARM control processor and four wide single instruction multiple data (SIMD) processing elements. Each processing element consists of a scalar and a wide 512-bit 32-lane SIMD datapath. A commercial prototype based on the SODA architecture, Ardbeg (named after a brand of Scotch whisky), has been developed. In this paper, we present the architectural evolution of going from a research design to a commercial prototype, including the goals, tradeoffs, and final design choices. Ardbeg's redesign process can be grouped into the following three major areas: optimizing the wide SIMD datapath, providing long instruction word (LIW) support for SIMD operations, and adding application-specific hardware accelerators. Because SODA was originally designed with 180 nm technology, the wide SIMD datapath is re-optimized in Ardbeg for 90 nm technology. This includes re-evaluating the most efficient SIMD width, designing a wider SIMD shuffle network, and implementing faster SIMD arithmetic units. Ardbeg also provides modest LIW support by allowing two SIMD operations to issue in the same cycle. This LIW execution supports SDR algorithms' most common parallel SIMD execution patterns with minimal hardware overhead. A viable commercial SDR solution must be competitive with existing ASIC solutions. Therefore, algorithm-specific hardware is added for performance bottleneck algorithms while still maintaining enough flexibility to support multiple wireless protocols. The combination of these architectural improvements allows Ardbeg to achieve 1.5-7x speedup over SODA across multiple wireless algorithms while consuming less power.
architectural support for programming languages and operating systems | 2011
Amir Hormati; Mehrzad Samadi; Mark Woh; Trevor N. Mudge; Scott A. Mahlke
Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically only achieve a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
international symposium on microarchitecture | 2007
Yuan Lin; Hyunseok Lee; Mark Woh; Yoav Harel; Scott A. Mahlke; Trevor N. Mudge; Chaitali Chakrabarti; Krisztian Flautner
Software-defined radio (SDR) belongs to an emerging class of applications with the processing requirements of a supercomputer but the power constraints of a mobile terminal. The authors developed the signal-processing on-demand architecture (SODA), a fully programmable architecture that supports SDR, by examining two widely differing protocols, W-CDMA and 802.11a. It meets power-performance requirements by separating control and data processing and by employing ultrawide SIMD execution.
international conference on embedded computer systems architectures modeling and simulation | 2007
Mark Woh; Sangwon Seo; Hyunseok Lee; Yuan Lin; Scott A. Mahlke; Trevor N. Mudge; Chaitali Chakrabarti; Krisztian Flautner
Wireless communication for mobile terminals has been a high performance computing challenge. It requires almost supercomputer performance while consuming very little power. This requirement is being made even more challenging with the move to Fourth Generation (4G) wireless communication. It is projected that by 2010, 4G will be available with data rates from 100 Mbps to 1 Gbps. These data rates are orders of magnitude greater than current 3G technology and, consequently, will require orders of magnitude more computation power. Leading forerunners for this technology are protocols like 802.16e (mobile WiMAX) and 3GPP LTE. This paper presents an analysis of the major algorithms that comprise these 4G technologies and describes their computational characteristics. We identify the major bottlenecks that need to be overcome in order to meet the requirements of this new technology. In particular, we show that technology scaling alone of current Software Defined Radio architectures will not be able to meet these requirements. Finally, we will discuss techniques that may make it possible to meet the power/performance requirements without giving up programmability.
high performance embedded architectures and compilers | 2005
Hyunseok Lee; Yuan Lin; Yoav Harel; Mark Woh; Scott A. Mahlke; Trevor N. Mudge; Krisztian Flautner
Wireless communication is one of the most computationally demanding workloads. It is performed by mobile terminals (“cell phones”) and must be accomplished by a small battery powered system. An important goal of the wireless industry is to develop hardware platforms that can support multiple protocols implemented in software (software defined radio) to support seamless end-user service over a variety of wireless networks. An equally important goal is to provide higher and higher data rates. This paper focuses on a study of the wideband code division multiple access protocol, which is one of the dominant third generation wireless standards. We have chosen it as a representative protocol. We provide a detailed analysis of computation and processing requirements of the core algorithms along with the interactions between the components. The goal of this paper is to describe the computational characteristics of this protocol to the computer architecture community, and to provide a high-level analysis of the architectural implications to illustrate one of the protocols that would need to be accommodated in a programmable platform for software defined radio. The computation demands and power limitations of approximately 60 Gops and 100~300 mW place extremely challenging goals on such a system. Several of the key features of the wideband code division multiple access protocol that can be exploited in the architecture include high degrees of vector and task parallelism, small memory footprints for both data and instructions, limited need for complex arithmetic functions such as multiplication, and a highly variable processing load that provides the opportunity to dynamically scale voltage and frequency.
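As a rough worked example of the figures quoted above, a budget of about 60 Gops within 100 to 300 mW implies the efficiency range computed below. The function name is illustrative only and does not come from the paper.

```python
def gops_per_watt(gops: float, milliwatts: float) -> float:
    """Convert a throughput (Gops) and a power budget (mW) into Gops per watt."""
    return gops * 1000.0 / milliwatts


# Figures quoted in the abstract: about 60 Gops within a 100-300 mW budget.
tight_budget = gops_per_watt(60, 100)  # efficiency needed at the 100 mW end
loose_budget = gops_per_watt(60, 300)  # efficiency needed at the 300 mW end
print(f"required efficiency: {loose_budget:.0f} to {tight_budget:.0f} Gops/W")
```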
design automation conference | 2012
Sangwon Seo; Ronald G. Dreslinski; Mark Woh; Yongjun Park; Chaitali Chakrabarti; Scott A. Mahlke; David T. Blaauw; Trevor N. Mudge
Near-threshold operation has emerged as a competitive approach for energy-efficient architecture design. In particular, a combination of near-threshold circuit techniques and parallel SIMD computations achieves excellent energy efficiency for easy-to-parallelize applications. However, near-threshold operations suffer from delay variations due to increased process variability. This is exacerbated in wide SIMD architectures where the number of critical paths is multiplied by the SIMD width. This paper provides a systematic in-depth study of delay variations in near-threshold operations and shows that simple techniques such as structural duplication and supply voltage/frequency margining are sufficient to mitigate the timing variation problems in wide SIMD architectures at the cost of marginal area and power overhead.
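One way to see why SIMD width aggravates delay variation is a simple probability sketch. It assumes every lane meets timing independently with the same probability, which is a simplification and not the paper's model, and the per-lane probability of 0.99 is made up for illustration. The second function shows how spare lanes (one form of structural duplication) could recover yield under the same assumption.

```python
from math import comb


def all_lanes_pass(p_lane: float, width: int) -> float:
    """Probability that every one of `width` independent lanes meets timing."""
    return p_lane ** width


def pass_with_spares(p_lane: float, width: int, spares: int) -> float:
    """Probability that at least `width` of `width + spares` lanes meet timing."""
    total = width + spares
    return sum(
        comb(total, good) * p_lane ** good * (1.0 - p_lane) ** (total - good)
        for good in range(width, total + 1)
    )


# Hypothetical per-lane pass probability of 0.99, chosen only for illustration.
for width in (8, 32, 128):
    plain = all_lanes_pass(0.99, width)
    spared = pass_with_spares(0.99, width, spares=4)
    print(f"width={width:3d}  no spares={plain:.3f}  4 spares={spared:.3f}")
```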
architectural support for programming languages and operating systems | 2010
Amir Hormati; Yoonseo Choi; Mark Woh; Manjunath Kudlur; Rodric M. Rabbah; Trevor N. Mudge; Scott A. Mahlke
SIMD (Single Instruction, Multiple Data) engines are an essential part of the processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures have the capability of boosting the performance of many application domains by exploiting data-level parallelism, it is very challenging for compilers and also programmers to identify and transform parts of a program that will benefit from a particular SIMD engine. The focus of this paper is on the problem of SIMDization for the growing application domain of streaming. Streaming applications are an ideal solution for targeting multi-core architectures, such as shared/distributed memory systems, tiled architectures, and single-core systems. Since these architectures, in most cases, provide SIMD acceleration units as well, it is highly beneficial to generate SIMD code from streaming programs. Specifically, we introduce MacroSS, which is capable of performing macro-SIMDization on high-level streaming graphs. Macro-SIMDization uses high-level information such as execution rates of actors and communication patterns between them to transform the graph structure, vectorize actors of a streaming program, and generate intermediate code. We also propose low-overhead architectural modifications that accelerate shuffling of data elements between the scalar and vectorized parts of a streaming program. Our experiments show that MacroSS is capable of generating code that, on average, outperforms scalar code compiled with the current state-of-the-art auto-vectorizing compilers by 54%. Using the low-overhead data shuffling hardware, performance is improved by an additional 8% with less than 1% area overhead.
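The idea of vectorizing a streaming actor can be illustrated with a toy example. This is only a sketch of the general concept, not the MacroSS algorithm: the actor, the lane count `SIMD_WIDTH`, and all function names are assumptions, and Python lists stand in for SIMD registers.

```python
SIMD_WIDTH = 4  # assumed lane count, for illustration only


def scale_actor(item):
    """Scalar actor: consumes one item and produces one item per firing."""
    return 3 * item + 1


def run_scalar(stream):
    """Fire the scalar actor once per stream item."""
    return [scale_actor(item) for item in stream]


def scale_actor_vec(lanes):
    """Vectorized actor: one firing covers SIMD_WIDTH independent items.

    The list comprehension stands in for a single SIMD multiply-add.
    """
    return [3 * item + 1 for item in lanes]


def run_vectorized(stream):
    """Fire the vectorized actor on full groups, then finish the tail in scalar."""
    full = len(stream) - len(stream) % SIMD_WIDTH
    out = []
    for start in range(0, full, SIMD_WIDTH):
        out.extend(scale_actor_vec(stream[start:start + SIMD_WIDTH]))
    out.extend(scale_actor(item) for item in stream[full:])
    return out


data = list(range(10))
print(run_vectorized(data) == run_scalar(data))
```

Because each firing of this actor is independent of the others, grouping consecutive firings into lanes leaves the output stream unchanged while cutting the number of firings by roughly the lane count.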
IEEE Computer | 2010
Mark Woh; Scott A. Mahlke; Trevor N. Mudge; Chaitali Chakrabarti
AnySP demonstrates that power efficiency can be achieved on a fully programmable processor in the context of a future mobile terminal supporting 4G wireless and high-definition video coding.
international symposium on low power electronics and design | 2010
Sangwon Seo; Ronald G. Dreslinski; Mark Woh; Chaitali Chakrabarti; Scott A. Mahlke; Trevor N. Mudge
Power has become the most critical design constraint for embedded handheld devices. This paper proposes a power-efficient SIMD architecture, referred to as Diet SODA, for DSP applications. The key design idea is to apply near-threshold operation on a single instruction and multiple data (SIMD) architecture to significantly lower the power consumption. The major features of Diet SODA are very wide SIMD width, scatter/gather data prefetcher, and dual mode operation. A case study was performed on digital still camera (DSC) applications; the results show that Diet SODA achieves ∼130x better performance and ∼340x better energy efficiency than a DSP solution.
design, automation, and test in europe | 2011
Mark Woh; Sudhir Satpathy; Ronald G. Dreslinski; Danny Kershaw; Dennis Sylvester; David T. Blaauw; Trevor N. Mudge
Driven by continued scaling of Moore's Law, the number of processing elements on a die is increasing dramatically. Recently there has been a surge of wide single instruction multiple data architectures designed to handle computationally intensive applications like 3D graphics, high definition video, image processing, and wireless communication. A limit of the SIMD width of these types of architectures is the scalability of the interconnect network between the processing elements in terms of both area and power. To mitigate this problem, we propose the use of a new interconnect topology, XRAM, which is a low power high performance matrix style crossbar. It re-uses output buses for control programming, and stores multiple swizzle configurations at the cross points using SRAM cells, significantly reducing routing congestion and control signaling. We show that compared to conventionally implemented crossbars, the area scales with the product of input × output ports while consuming almost 50% less energy. We present an application case study, color-space conversion, utilizing XRAM and show a 1.4× gain in performance while consuming 1.5–2.5× less power.
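The notion of a crossbar that holds several stored swizzle configurations and switches between them can be sketched behaviorally. This is a minimal software model under assumed names (`SwizzleCrossbar`, `store`, `route`); it says nothing about the XRAM circuit itself, such as the SRAM cells or the reuse of output buses.

```python
class SwizzleCrossbar:
    """Behavioral model of a crossbar with several stored swizzle configurations."""

    def __init__(self, ports):
        self.ports = ports
        self.configs = []  # each entry lists the source input for every output

    def store(self, mapping):
        """Save a configuration and return its identifier."""
        if len(mapping) != self.ports:
            raise ValueError("mapping must name one source per output port")
        if any(not 0 <= src < self.ports for src in mapping):
            raise ValueError("source index out of range")
        self.configs.append(list(mapping))
        return len(self.configs) - 1

    def route(self, config_id, inputs):
        """Apply a stored configuration: output i takes inputs[mapping[i]]."""
        if len(inputs) != self.ports:
            raise ValueError("expected one value per input port")
        return [inputs[src] for src in self.configs[config_id]]


xbar = SwizzleCrossbar(4)
reverse = xbar.store([3, 2, 1, 0])    # output i reads input 3 - i
broadcast = xbar.store([0, 0, 0, 0])  # every output reads input 0
print(xbar.route(reverse, ["a", "b", "c", "d"]))
print(xbar.route(broadcast, ["a", "b", "c", "d"]))
```

Selecting a configuration by identifier, rather than resending a full mapping for each swizzle, mirrors the abstract's point that storing configurations at the cross points reduces control signaling.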
