CHIPKIT: An agile, reusable open-source framework for rapid test chip development
Paul Whatmough, Marco Donato, Glenn Ko, Sae-Kyu Lee, David Brooks, Gu-Yeon Wei
Paul N. Whatmough, Marco Donato, Glenn G. Ko, Sae Kyu Lee, David Brooks, and Gu-Yeon Wei (Harvard University, Arm Research, IBM Research)
Abstract: The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, success with tape-outs requires hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework: a reusable SoC subsystem which provides basic IO, an on-chip programmable host, off-chip hosting, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. Central to CHIPKIT is an agile RTL development flow, including a code generation tool called VGEN. Finally, we discuss best practices for full-chip validation across the entire design cycle.
Index Terms: Agile design, design reuse, testing, open-source
1 INTRODUCTION
Research test chips are the ultimate demonstration of the true value of novel computer architecture and circuit innovation. In addition, taping out test chips in a research or academic setting provides huge pedagogical value, offering real insight across the whole stack. Nonetheless, taping out test chips remains very challenging, especially for the uninitiated. Custom chips are time consuming to design, fabricate and test, and are error prone, often requiring expensive re-spins to fix problems. In this paper, we explore two key themes, agile and reusable design, to help reduce the barrier to entry for chip tape-outs. Emphasizing reuse greatly reduces development cost and at the same time minimizes the opportunity for silicon bugs, freeing the designer to focus on differentiating features. Agile design, in turn, seeks to follow a methodology where changes can be readily implemented late in the design cycle, without significant disruption or risk.
A number of ambitious open source hardware projects currently provide an exciting range of IP blocks to use in test chip projects. In addition, many IP companies offer broad access to their products for non-commercial use. However, turning a few IP blocks into a functioning chip that can be easily measured and debugged is still very challenging, due to a significant remaining experience gap in terms of both methodology and IP blocks.
This paper describes CHIPKIT¹, a straightforward framework for the rapid development of successful research test chips. We describe a front-to-back design example, drawing on multiple generations of successful test chips designed at Harvard (Fig. 1), which follow a consistent design approach. These span a range of complexities, from smaller single-accelerator microcontroller-based SoCs [1], [2], through to large multi-accelerator SoCs with Arm Cortex-A multi-core CPU clusters [3], [4]. However, they all share the same basic framework, with the same SoC subsystem for system bring up, communication and control. Following this framework has allowed new tape-outs to be developed with very low risk and a high success rate. To help researchers bootstrap their own chip designs, the content of this paper is supported by the release of our open source CHIPKIT project.
1. Available online: https://github.com/whatmough/CHIPKIT
Fig. 1. Three recent chips [1], [2], [3] built using the CHIPKIT framework.
This paper provides the following contributions:
• Reusable SoC Subsystem, a simple on-chip CPU host, a flexible interconnect, memory, basic peripherals, and robust off-chip communication and hosting (Section 2).
• Agile RTL Development Methodology, suitable for inexperienced designers, with robust RTL coding guidelines and a code generation tool, VGEN (Section 3).
• Full-Chip Validation Methodology, covering the entire design flow, which is critical to ensure functional correctness (Section 4).
Fig. 2. Reusable SoC subsystem, extended with a custom IP block.
2 REUSABLE SOC SUBSYSTEM
The design goal for the reusable SoC subsystem (Fig. 2) is to provide the minimum components to robustly handle essential bring up and test of custom IP projects. The following subsections briefly introduce the key components.
The traditional digital chip testing approach of using external pattern generators and logic analyzers to drive and read chip pins is slow and error prone. Instead, our subsystem includes two bus masters that can be used to run tests: a CPU and a UART master. This configuration allows the chip to be hosted either autonomously by the on-chip CPU, or from an external PC over USB-UART. We use a 32-bit Arm Cortex-M0 microcontroller, which is an extremely area-efficient and easy-to-use design with broad software compatibility.
We have developed and extensively used a simple and robust on-chip UART bus master peripheral (included in CHIPKIT) to allow an external PC host to drive transactions onto the on-chip interconnect. This is a very useful capability for running tests, loading binaries, moving data, etc. It also allows whole tests to be developed and run from an external PC in any language (e.g. Python), which is a very convenient approach to chip testing. The peripheral provides a simple interactive text interface in any standard terminal emulator, with no CPU overhead. Single read and write transactions on-chip are initiated using simple commands, such as R 0x70000000 to read a 32-bit word at the specified hex address. More extensive tests are easily scripted in any language using a standard serial port library. UART-to-USB transceivers on the PCB enable a PC to easily connect to the test chip over a USB cable.
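As an illustration of scripting the UART bus master from a PC, the following Python sketch wraps the text interface in a small class. Only the `R <address>` read command is taken from the text. The write command syntax, the line terminator and the reply format (one hex word per line) are assumptions made for this example, and `FakeChip` is a hypothetical stand-in for the serial port so that the sketch runs without hardware.

```python
class UartHost:
    """Drive bus transactions through a text-based UART bus master."""

    def __init__(self, port, newline="\n"):
        # `port` is any object offering write(bytes) and readline() -> bytes.
        self.port = port
        self.newline = newline  # assumed line terminator

    def _command(self, text):
        self.port.write((text + self.newline).encode("ascii"))
        return self.port.readline().decode("ascii").strip()

    def read_word(self, addr):
        # "R <hex address>" is the read command quoted in the text.
        # The reply is assumed to be a single hex word.
        return int(self._command(f"R 0x{addr:08X}"), 16)

    def write_word(self, addr, data):
        # Hypothetical write syntax, chosen to mirror the read command.
        self._command(f"W 0x{addr:08X} 0x{data:08X}")


class FakeChip:
    """Hypothetical stand-in for a serial port, backed by a dict."""

    def __init__(self):
        self.mem = {}
        self.reply = b""

    def write(self, raw):
        fields = raw.decode("ascii").split()
        addr = int(fields[1], 16)
        if fields[0] == "W":
            self.mem[addr] = int(fields[2], 16)
            self.reply = b"OK\n"
        else:
            # Unwritten locations read back as zero in this model.
            self.reply = b"0x%08X\n" % self.mem.get(addr, 0)

    def readline(self):
        return self.reply


host = UartHost(FakeChip())
host.write_word(0x70000000, 0xDEADBEEF)
print(hex(host.read_word(0x70000000)))
```

With real hardware, a `serial.Serial` object from the pyserial package could be passed in place of `FakeChip`, since it also offers `write()` and `readline()`; the port name and baud rate depend on the board.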
An on-chip interconnect allows components on the SoC to communicate. We adhere to well-documented, open industry-standard bus protocols, which allows access to a broad IP ecosystem, including verification components such as protocol checkers. In particular, we have extensively used three protocols from the AMBA standards³, selected based on required features and performance: APB for low-performance peripherals, AHB wherever possible for general-purpose use, and AXI where higher performance and more features are needed.
We typically need multiple buses, and often partition them (even on simple chips) based on usage and on traffic types and volumes, which helps with throughput, as well as design and verification. A silicon bug in an interconnect could easily hang the whole chip, and therefore the interconnect IP must be robust and carefully verified. It should also be flexible and easy to modify as the tape-out project evolves. Open-source solutions for interfacing the components in complex heterogeneous systems have been proposed [5]. However, these solutions may go beyond the requirements of smaller SoCs used for prototyping novel DSA hardware blocks. To better serve this purpose, CHIPKIT includes a simple single-layer AHB interconnect, which is very easy to set up and maintain. It makes use of the SystemVerilog (SV) interface feature to drastically simplify RTL. An SV interface is used to bundle the signals in each port, which can be either of type master or slave. These bundles can then be connected using a single modport instance. The address decoder in the interconnect is defined by a single SV header file which describes the entire memory map for the interconnect segment. Adding or removing an IP is as simple as modifying the header file for the interconnect, which makes the SoC extremely agile. An automatic default slave in the decoder catches accesses to unused regions and returns an error response, to prevent deadlocks.
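The behaviour of a table-driven address decoder with a default slave can be modelled in a few lines of Python. This is only a behavioural sketch, not the CHIPKIT SV header: the region names and base addresses below are hypothetical, and only the 64KB IMEM and DMEM sizes follow Fig. 2.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    base: int
    size: int


# Hypothetical memory map for one interconnect segment.
MEMORY_MAP = [
    Region("IMEM", 0x0000_0000, 64 * 1024),
    Region("DMEM", 0x2000_0000, 64 * 1024),
    Region("APB_BRIDGE", 0x4000_0000, 4 * 1024),
    Region("ACCELERATOR", 0x7000_0000, 1024 * 1024),
]

DEFAULT_SLAVE = "DEFAULT_SLAVE"


def check_no_overlap(regions):
    """Raise ValueError if any two regions share an address."""
    ordered = sorted(regions, key=lambda r: r.base)
    for low, high in zip(ordered, ordered[1:]):
        if low.base + low.size > high.base:
            raise ValueError(f"{low.name} overlaps {high.name}")


def decode(addr, regions=MEMORY_MAP):
    """Return the slave selected for addr.

    Unmapped addresses select the default slave, which would return
    an error response rather than leave the bus hanging.
    """
    for region in regions:
        if region.base <= addr < region.base + region.size:
            return region.name
    return DEFAULT_SLAVE


check_no_overlap(MEMORY_MAP)
print(decode(0x7000_0000), decode(0x0001_0000))
```

Keeping the whole map in one table mirrors the single-header approach described above: adding or removing an IP block changes one entry, and the overlap check can run on every edit.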
3. Specifications available online: https://developer.arm.com/architectures/system-architectures/amba
Robust off-chip IO for control and data movement is essential for painless test chip bring up and test. Where possible, reusing the same basic off-chip signal IO arrangement reduces risk and development effort. The essentials include an off-chip clock and reset, UARTs, general purpose IO (GPIO), a real-time clock (RTC) oscillator, diagnostic (DIAG) signals, and any CPU debug interfaces.
When using the on-chip CPU host, printf() calls can be retargeted to the UART slaves. This is also useful in RTL simulation, where the CPU can terminate a test at the end of a program by writing a unique ASCII code to the slave UART, which the test bench uses to end simulation.
It is often necessary to debug issues during test chip bring-up, which can be very challenging in silicon due to lack of visibility. To help increase visibility, we include a diagnostic (DIAG) pin multiplexer to allow multiple signals to be observed off-chip (Fig. 2), without requiring a large number of chip pins. The mux select signal is controlled by a memory-mapped register. Typical signals to observe include clocks, resets, power rails, power gate enables, FSM states, interrupts, etc. Using at least two DIAG pins allows relative observation, e.g. observing clock and reset at the same time. The DIAG multiplexers themselves can be trivially implemented and automatically populated with signals using a VGEN script (Section 3).
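To make the DIAG mux idea concrete, the sketch below models two DIAG pins whose selects are packed into one memory-mapped register. The signal list, the 4-bit select fields and their packing are assumptions made purely for illustration and are not taken from the CHIPKIT sources.

```python
# Hypothetical list of observable signals, in mux input order.
DIAG_SIGNALS = [
    "HCLK", "RESETn", "accel_clk", "accel_irq", "fsm_state0", "fsm_state1",
]
SEL_BITS = 4  # assumed width of each select field


def diag_select_value(pin0, pin1, signals=DIAG_SIGNALS):
    """Pack the mux selects for two DIAG pins into one register value."""
    sel0 = signals.index(pin0)
    sel1 = signals.index(pin1)
    if max(sel0, sel1) >= (1 << SEL_BITS):
        raise ValueError("select does not fit in the field")
    return sel0 | (sel1 << SEL_BITS)


def diag_observe(reg_value, levels, signals=DIAG_SIGNALS):
    """Model the levels seen on the two DIAG pins for a register value."""
    mask = (1 << SEL_BITS) - 1
    sel0 = reg_value & mask
    sel1 = (reg_value >> SEL_BITS) & mask
    return levels[signals[sel0]], levels[signals[sel1]]


# Observe clock and reset side by side, as suggested in the text.
value = diag_select_value("HCLK", "RESETn")
print(hex(value), diag_observe(value, {"HCLK": 1, "RESETn": 0}))
```

Driving the register value from a name lookup like this keeps test scripts readable and avoids hard-coded select numbers that go stale when signals are added to the mux.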
On-chip SRAM memories are included for storing binary programs and data. ROMs may also be useful, such as for a boot loader. To be bus accessible, SRAM and ROM macros require a bus interface, and CHIPKIT includes a suitable AHB SRAM interface.
Control and status registers (CSRs) are very common at both the SoC and IP level. In particular, research test chips tend to include a larger number of CSRs in order to configure experiments and turn them on or off (so-called chicken bits). CSRs can quickly become very time consuming to implement, document, modify and maintain. CHIPKIT provides an agile flow to automatically generate CSRs from a single database using a VGEN Python script, which we describe in Section 3. This approach makes it very convenient to add CSRs as needed during RTL design.
The SoC includes some basic low-bandwidth peripherals arranged on a compact APB interconnect. These include UART slaves, a watchdog timer, and a real-time clock (RTC). The RTC is especially useful for measuring the run-time of a given workload while characterizing the chip.
A simple and robust clock and reset architecture is essential in research test chips to prevent potential complications in what is a truly essential aspect of digital electronics. A single off-chip clock (HCLK) and asynchronous active-low reset (RESETn) are supplied to the chip from the PCB (Fig. 2). Due to the controlled slew rate of standard IO cells, HCLK is typically limited to a maximum of a few hundred MHz. Any faster clocks must be generated on-chip. A straightforward approach to generating fast clocks on-chip is to use an open-loop digitally-controlled oscillator (DCO), which can be implemented using a netlist of digital standard cells, without any custom hand layout.
Multiple power domains are useful on test chips, to measure power consumption of individual blocks, or to perform fine-grained dynamic voltage and frequency scaling (DVFS). However, power domains add a huge amount of complexity in both RTL development and (especially) implementation, which brings risk. For research test chips, we suggest using a lightweight approach; in particular, it is a good idea to avoid the use of power gates and level shifters, which add a huge amount of complexity in the EDA flow, along with validation overhead. This approach is usually feasible if the voltage ranges are sufficiently close and there is no strong requirement to power off domains.
The SoC subsystem is a foundation upon which research test chips can be rapidly constructed by adding new IP blocks [1], [2] or even whole subsystems [3].
The method of interfacing new blocks will largely depend on the complexity of the IP to be added. The simplest approach is a slave programming model, where a software driver programs the accelerator with data and control information, before initiating the accelerator, which executes the task and returns an interrupt on completion. Higher-performance blocks will require a more sophisticated interface, such as including a master bus interface on the accelerator to allow it to initiate data transfers independently. In fact, programming models and interfacing of accelerators is a very active area of research, especially considering things like data movement cost, virtualization and coherency [6], [7].
A fast clock domain will allow an accelerator to achieve higher performance. Note that this introduces an asynchronous or isochronous clock-domain crossing (CDC) around the bus interface, which will require a CDC bridge to ensure correct data transfer.
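The slave programming model can be sketched as a short driver routine. The register offsets and bit positions below are hypothetical, `FakeAccelerator` is a toy bus-visible model, and polling a done bit stands in for the completion interrupt described above so that the example stays self-contained.

```python
# Hypothetical register offsets and bit fields for an accelerator.
REG_SRC, REG_LEN, REG_CTRL, REG_STATUS = 0x00, 0x04, 0x08, 0x0C
CTRL_START = 0x1
STATUS_DONE = 0x1


class FakeAccelerator:
    """Toy model: reports busy for `latency_polls` status reads after
    a start, then sets the done bit on the following read."""

    def __init__(self, latency_polls=3):
        self.regs = {REG_SRC: 0, REG_LEN: 0, REG_CTRL: 0, REG_STATUS: 0}
        self.latency = latency_polls
        self.remaining = None

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == REG_CTRL and value & CTRL_START:
            self.regs[REG_STATUS] = 0
            self.remaining = self.latency

    def read(self, offset):
        if offset == REG_STATUS and self.remaining is not None:
            if self.remaining == 0:
                self.regs[REG_STATUS] = STATUS_DONE
                self.remaining = None
            else:
                self.remaining -= 1
        return self.regs[offset]


def run_job(dev, src_addr, length, max_polls=1000):
    """Program the job, start it, and wait for completion.

    Returns the number of status polls needed; raises TimeoutError if
    the done bit is never observed.
    """
    dev.write(REG_SRC, src_addr)
    dev.write(REG_LEN, length)
    dev.write(REG_CTRL, CTRL_START)
    for polls in range(1, max_polls + 1):
        if dev.read(REG_STATUS) & STATUS_DONE:
            return polls
    raise TimeoutError("accelerator did not signal completion")


print(run_job(FakeAccelerator(latency_polls=3), 0x2000_0000, 256))
```

The bounded poll loop is a deliberate choice: on real silicon an unbounded wait would hang the test script if the accelerator never completes.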
3 AGILE RTL DEVELOPMENT
Research test chip projects are typically severely time constrained. Therefore, it is important to use an RTL development approach that 1) is efficient, 2) minimizes the opportunity for bugs, and 3) is supported by front-to-back EDA tool flows. In recent years, there has been a significant research effort exploring new hardware design languages [8], [9], as well as high-level synthesis (HLS) from C++/SystemC [10]. Chipyard⁴ provides a comprehensive SoC design framework in Chisel. In contrast, CHIPKIT focuses on parameterized SystemVerilog (SV) for RTL design. Compared to Chisel, SV is mature, natively supported by EDA tools [11], and relatively well supported [12]. SV also does not require an opaque translation step to generate RTL for simulation and implementation. This can speed up validation and implementation, which tend to far exceed the time originally spent on design.
We have found that the quality of RTL coding in academia is often poor, especially in comparison to industry RTL. Poor RTL can lead to long debug cycles and is time consuming to maintain and update. It can even lead to silicon bugs. We have found that introducing strict coding guidelines can effectively solve this problem. Therefore, in this section, we discuss the components of an agile RTL development process that uses standard commercial simulation and implementation tools.
SV is a very large language with many verification-oriented features that are not relevant to writing synthesizable RTL. Therefore, we use a strict RTL coding style, which can be summarized in the following directives:
• Separate logic and registers. Makes RTL easier to parse and pipelining easier to modify.
• Use rising-edge registers with active-low async reset. Simplifies timing constraint development.
• Use the logic type exclusively. Replaces both the older wire and (very confusing) reg types. Provides compile-time checking for multiple drivers.
• Use the always_comb keyword for logic. Provides compile-time checking for unintended latches or registers.
• Use the always_ff keyword to infer registers. Provides compile-time checking for unintentional latches.
• Use automatic module instantiation connections (.*). These significantly reduce the verbosity of connecting modules and provide additional compile-time checking.
    `include "RTL.svh"
    module my_counter (
      input  logic        clock,
      input  logic        reset_n,
      input  logic        enable,
      output logic [31:0] count
    );
      // Use "logic" type exclusively, not "wire" or "reg"
      logic [31:0] count_next;
      // Use "always_comb" keyword for logic
      always_comb count_next = count + 32'd1;
      // Use a macro to infer registers
      `FF(count_next, count, clock, enable, reset_n, '0);
    endmodule // my_counter
Fig. 3. SystemVerilog coding guidelines example.
In addition to these guidelines, we also recommend the strict use of a pre-processor macro for register inference. This has a number of advantages: it 1) significantly reduces lines of code, 2) removes the risk of poor inference style, 3) enforces use of a rising-edge, async active-low reset, and 4) allows the register inference template to be changed to suit ASIC or FPGA. A macro is used instead of a module to reduce simulation overhead. Fig. 3 gives an example module for a simple counter, following our guidelines, including the use of the CHIPKIT SV header (RTL.svh), which includes a register macro (`FF()).
Physical IP such as SRAMs, IO cells, clock oscillators, and synchronizers needs to be instantiated in the RTL. It is worth remembering that various versions of these cells may be required over the lifetime of the IP or full chip, including RTL functional models as well as various ASIC and FPGA libraries. Therefore, it is helpful to wrap instantiated components inside a module, which can then be easily switched. Each set of wrapped component instantiation modules is stored in a different directory for each library, with the correct directory included at compile time.
4. Available online: https://github.com/ucb-bar/chipyard
[Figure: RTL modules containing signals tagged with a custom postfix (e.g. my_sig_csr) -> vgen -update (update database with new matching signals) -> human-editable CSV database -> vgen -generate (generate code from database) -> generated code: RTL (*.v), instance (*.v), docs (*.md), SW (*.c, *.py), tests (*.c, *.py).]
Fig. 4. VGEN agile templating flow for the CSR generation example.
In chip development, it is common to encounter repetitive coding tasks. Typical examples include SoC memory maps, CSRs, IO signals and pads, clocks and resets, and debug signals. These can be tedious and error prone, and introduce significant risk into the tape-out project. The perennial preferred approach is to write generators. CHIPKIT includes VGEN, which is a simple Python framework for writing templated code generators.
As a prototypical example, consider the implementation of CSRs (Section 2). Whenever a new CSR is added during the design process, the following need to be updated: 1) RTL, 2) documentation, 3) C/Python software views, 4) CSR tests. A typical CSR module with 100 register definitions requires over a thousand lines of RTL to be written, maintained and validated. Adding a new CSR therefore becomes an extremely time-consuming and error-prone process. VGEN automates the generation of all this code, requiring only 126 lines of code.
Fig. 4 gives an outline of the VGEN flow, which operates in two stages. The first stage is to update a CSR database with signals from the design, which can be done periodically as the RTL is developed. The VGEN tool automatically updates the database (vgen -update) by parsing RTL modules to find signal names with a matching prefix or postfix that indicates a CSR should be attached to the signal. Any matching signals are then cross-referenced against the database to see if they already exist and if any extracted parameters, such as the bitwidth of the register, have changed. If it is a new addition or a modification, the change is made in the database. The database is stored in comma-separated value (CSV) format, which allows it to be easily viewed and edited in a spreadsheet program. The CSV database can be version controlled alongside the RTL.
The second stage is to generate templated output code with values from the database (vgen -generate). For CSRs, an RTL module is generated with memory-mapped registers as described in the CSV database, along with code for a module instantiation template. Documentation in Markdown format is also generated, along with C and Python software register definitions and tests to confirm correct operation of the automatically generated code.
VGEN is a lightweight Python module. The database data structure is represented as a list of dictionaries. The keys for the dictionaries are defined in the header line of the CSV file, so it is easy to add new attributes by simply editing the CSV file. CHIPKIT currently includes example VGEN scripts for generating CSRs and IO pads, and is easily extensible to other common chip design tasks.
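The two-stage flow can be illustrated with a small stand-alone Python sketch. This is not the VGEN implementation or its API: the function names, the `_csr` postfix, the declaration pattern, the CSV column names and the Markdown layout are all assumptions chosen to mirror the description above, with the database held as a list of dictionaries keyed by the CSV header.

```python
import csv
import io
import re

# Matches declarations such as "logic [7:0] threshold_csr;" or
# "logic enable_csr;" (assumed postfix and declaration style).
CSR_DECL = re.compile(r"\blogic\b\s*(?:\[(\d+):(\d+)\])?\s*(\w+)_csr\b")
FIELDS = ["name", "width", "description"]  # assumed CSV columns


def find_csr_signals(rtl_text):
    """Stage 1a: extract {name: width} for signals tagged as CSRs."""
    found = {}
    for msb, lsb, name in CSR_DECL.findall(rtl_text):
        found[name] = int(msb) - int(lsb) + 1 if msb else 1
    return found


def update_database(rows, signals):
    """Stage 1b: add new signals and refresh changed widths in place."""
    fields = list(rows[0]) if rows else FIELDS
    by_name = {row["name"]: row for row in rows}
    for name, width in signals.items():
        if name in by_name:
            by_name[name]["width"] = str(width)
        else:
            new_row = dict.fromkeys(fields, "")
            new_row.update(name=name, width=str(width))
            rows.append(new_row)
    return rows


def load_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def dump_csv(rows):
    out = io.StringIO()
    fields = list(rows[0]) if rows else FIELDS
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def generate_markdown(rows, base=0x0, stride=4):
    """Stage 2: emit one templated output, a Markdown register table."""
    lines = ["| Offset | Name | Width | Description |",
             "| --- | --- | --- | --- |"]
    for i, row in enumerate(rows):
        lines.append(f"| 0x{base + i * stride:04X} | {row['name']} "
                     f"| {row['width']} | {row['description']} |")
    return "\n".join(lines)


rtl = "logic [7:0] threshold_csr;\nlogic enable_csr;\nlogic [31:0] count;\n"
db = load_csv("name,width,description\nthreshold,4,Detection threshold\n")
db = update_database(db, find_csr_signals(rtl))
print(dump_csv(db))
print(generate_markdown(db))
```

Because the update step preserves hand-edited columns such as the description, the CSV file remains the single source of truth. Generators for RTL, C headers or tests would read the same rows that the Markdown generator reads here.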
4 TEST CHIP VALIDATION
Validation effort typically far outweighs design, and should thus be a primary consideration. Although this is a huge topic, the following section presents some general advice and guidelines to help ensure first-time-right silicon for research test chips.
As a first step, documentation should be considered an essential form of IP validation. Adopting a Markdown format for documentation allows it to be developed and version controlled alongside the RTL. As a priority, documentation should also include a detailed block diagram of the IP. Another useful first step is to implement an integration shell, which has the interface signals described in the documentation. This is very useful for preliminary SoC development. It should compile without error, but typically does not include much, if any, functionality. As the IP matures, the integration shell can be replaced with the full RTL, accompanied by suitable integration tests.
RTL simulation, at various granularities, is the workhorse of IP development. This typically involves a testbench, a set of self-checking tests and an associated Makefile or script to run them. Smaller ad-hoc unit tests and RTL assertions should also be developed as new modules are coded.
As the IP approaches the feature-complete milestone, the timing and power closure process will proceed, and a greater breadth and depth of validation will help improve the quality of the design. Simulator coverage tools give a good indication of the maturity of test suites and which parts of the IP warrant further attention. Linting tools provide a static check for RTL coding errors and clock-domain crossing issues. Other static tools can help with optimizing clock gate enables and RAM enable efficiency. Early synthesis trials will help flush out long timing paths in the design. Power analysis will give an indication of the power-consuming blocks in the design, indicating targets for further optimization.
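An integration shell is regular enough to be generated from the documented port list. The following Python sketch emits a SystemVerilog module with the documented interface and all outputs tied to zero. It is a hypothetical helper written for this illustration, not part of CHIPKIT, and the port names in the example are invented.

```python
def integration_shell(module, ports):
    """Return SV source for an empty shell of `module`.

    `ports` is a list of (direction, width, name) tuples, where
    direction is "input" or "output". Outputs are tied to zero so the
    shell compiles cleanly and drives known values into the SoC.
    """
    decls = []
    ties = []
    for direction, width, name in ports:
        rng = f"[{width - 1}:0] " if width > 1 else ""
        decls.append(f"  {direction} logic {rng}{name}")
        if direction == "output":
            ties.append(f"  assign {name} = '0;")
    lines = [f"module {module} (", ",\n".join(decls), ");"]
    lines.extend(ties)
    lines.append(f"endmodule // {module}")
    return "\n".join(lines) + "\n"


print(integration_shell("my_ip", [
    ("input", 1, "clock"),
    ("input", 1, "reset_n"),
    ("input", 32, "wdata"),
    ("output", 32, "rdata"),
    ("output", 1, "irq"),
]))
```

Generating the shell from the same port list that appears in the documentation keeps the two in step until the full RTL replaces the shell.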
At the SoC level, many of the same guidelines discussed for the IP level apply. However, there are also a greater number of details that tend to make it difficult to achieve high test coverage. Top of the list for validation scrutiny is everything required for initial bring up, including clocks, resets, power sequencing (if any), and basic off-chip communication. It is essential that these design components work without fail, and the boot sequence must be carefully validated before tape-out. The interconnect is another critical area for validation. A comprehensive, automatically-generated test is useful to check correct operation of all regions in the memory map, both valid and unmapped. For large memories, rather than just testing the first few words, be sure to toggle all address and data bits to catch accidental signal truncation in RTL hierarchies, which is common. At the SoC level, there is a trade-off between coverage and run-time, so it may be necessary to optimize big tests to achieve the best coverage in reasonable run time. Finally, it is important to re-run IP-level tests on the SoC to check that interface and functionality assumptions hold. The ultimate goal is to run all the tests in simulation that you intend to measure on the silicon.
With some additional effort, the accuracy with which RTL simulation models real circuits can be enhanced. A good example of this is to set up an option to run simulation regressions with undefined SRAM initial states (“X” values), which removes dependence on SRAM power-up states, which will be random on real silicon. Similarly, clock domain crossings (CDCs) are a big concern in this regard. Hence, another useful simulation capability is to set any asynchronous clocks to be randomly jittered in period relative to each other, which helps to catch CDC paths with missing synchronizers. If possible, reset synchronizers should be avoided in favor of individually controllable resets, which provide greater control for debug.
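One way to toggle every address and data bit without sweeping a whole memory is a walking-bit test. The Python sketch below is a generic illustration of that technique, not a CHIPKIT test: the `write` and `read` callables are placeholders for whatever bus access method the test uses, and the region base and size in the example are hypothetical.

```python
def walking_addresses(base, size_bytes, word_bytes=4):
    """Word addresses that set each address bit in the region once,
    plus the first and last word."""
    addrs = [base]
    offset = word_bytes
    while offset < size_bytes:
        addrs.append(base + offset)
        offset <<= 1
    last = base + size_bytes - word_bytes
    if last not in addrs:
        addrs.append(last)
    return addrs


def walking_data(width=32):
    """Walking-one patterns followed by walking-zero patterns."""
    ones = [1 << i for i in range(width)]
    mask = (1 << width) - 1
    return ones + [mask ^ value for value in ones]


def check_memory(write, read, base, size_bytes):
    """Return the addresses that failed; an empty list means pass."""
    failures = []
    patterns = walking_data()
    # Data-bit pass: every data bit is driven high and low at one word.
    for value in patterns:
        write(base, value)
        if read(base) != value:
            failures.append(base)
    # Address-bit pass: write distinct data everywhere first, then read
    # back, so that aliased (truncated) addresses overwrite each other.
    addrs = walking_addresses(base, size_bytes)
    for i, addr in enumerate(addrs):
        write(addr, patterns[i % len(patterns)])
    for i, addr in enumerate(addrs):
        if read(addr) != patterns[i % len(patterns)]:
            failures.append(addr)
    return failures


# A model memory that has lost address bit 15, as a truncated port would.
broken = {}
print(check_memory(lambda a, v: broken.__setitem__(a & 0x7FFF, v),
                   lambda a: broken[a & 0x7FFF],
                   0x2000_0000, 64 * 1024))
```

For a 64KB region this touches only 16 word addresses, which keeps SoC-level run-time small while still exercising every address line.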
The test and (occasional) debug process will be much smoother if carefully considered at design time. The IP should include instrumentation to perform any measurements required to demonstrate and measure the expected operation. This typically means some kind of performance counters, usually manipulated using CSRs. For example, clock cycle or memory access counters provide essential data to characterize performance. Some kind of infinite loop or autonomous self-test mode can also be useful to allow easy average power measurements without including any data loading phase that is otherwise required to test the design.
It is almost inevitable that at some point it will be necessary to debug unexpected behaviour on real silicon. Debugging any design aspect that includes significant complexity can be very challenging, partly due to limited visibility in real silicon. Therefore, this should be considered at design time. The DIAG mux approach described in Section 2 is a cheap way to provide visibility from outside the chip, and should include all clocks, resets and other critical hardware state. Full control of clocks and resets from the SoC by means of dedicated CSRs is also essential. Finally, it is also a good general rule to make all SRAMs and register files in the chip memory mapped. Although this adds complexity, it helps when writing self-checking tests and is invaluable when debugging SRAM circuit performance.
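As a small worked example of using such counters, the sketch below turns before-and-after snapshots of a cycle counter and a memory access counter into run-time and access-rate figures. It assumes free-running 32-bit counters that wrap at most once during the measurement; the counter width, the clock frequency and the sample values are illustrative assumptions.

```python
COUNTER_BITS = 32  # assumed counter width


def counter_delta(start, end, bits=COUNTER_BITS):
    """Elapsed count between two snapshots, tolerating one wrap."""
    return (end - start) & ((1 << bits) - 1)


def summarize(cycles_start, cycles_end, accesses_start, accesses_end,
              clock_hz):
    """Derive run-time and access rate from counter snapshots."""
    cycles = counter_delta(cycles_start, cycles_end)
    accesses = counter_delta(accesses_start, accesses_end)
    return {
        "cycles": cycles,
        "runtime_s": cycles / clock_hz,
        "accesses_per_cycle": accesses / cycles if cycles else 0.0,
    }


# The cycle counter wraps during this hypothetical measurement.
print(counter_delta(0xFFFF_FFF0, 0x0000_0010))
print(summarize(0, 1000, 0, 250, clock_hz=100e6))
```

Masking the difference rather than comparing raw values means a script does not need to reset the counters before each run, which is convenient when several experiments share them.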
Successfully validating a test chip RTL codebase on an FPGA will drastically increase the chances of success on ASIC. The process of running the RTL through an FPGA tool flow can uncover a myriad of functional and timing issues. As well as helping uncover bugs in the RTL, FPGA implementation will also help with timing constraint debug. The FPGA emulation will also be significantly faster than RTL simulation, which enables much more extensive stress test regressions. Finally, FPGA emulation is a great chance to check the correct operation of off-chip interfaces, with the opportunity to run a more convincing end-to-end test, without any “magic” help from testbenches, which can do things in simulation that are not possible on a real chip, such as pre-loading SRAM in zero time. The RTL coding guidelines presented in Section 3.1, along with the guidelines for instantiated library components in Section 3.2, should make porting the SoC to FPGA straightforward.
Physical design is the final process before tape-out, and is obviously a critical focus for full-chip validation, which we discuss briefly here from the RTL design perspective. As soon as the RTL codebase compiles, the physical design flow can be developed, starting with developing timing constraints in synthesis. Beyond this, static timing analysis (STA) reports from the implementation flow provide the feedback for RTL timing closure. This process is iterative, as refactoring logic and pipelines to reduce the gate delays on a path tends to reveal other near-critical paths which need attention. Therefore, it is helpful to start using STA early in the design process as the microarchitecture solidifies. Throughout this process, VGEN code generation (Section 3.3) can be used to automatically generate repetitive design-specific scripting, such as IO pad placement scripts, as the project evolves.
As the design matures, the validation effort in the back end will ramp up. Logical equivalence check (LEC) tools allow synthesis and layout netlists to be formally checked against the RTL. These netlists should also be simulated in the RTL testbench, using the library vendor Verilog models of cells and SRAMs. Netlists are simulated in various degrees, from no annotation of delays to cells and wires (zero-delay mode) through to full annotation with dynamic timing checks. The former (zero-delay) is relatively easy to set up and should be included in regressions early in the validation stage. The latter (full timing annotation) can take some work to get running. However, annotated netlist simulation (over PVT corners) is essential to help debug any potential errors in the timing constraints, which will not be caught by STA alone.
5 CONCLUSION
Chip tape-outs can be time consuming and error prone. This paper describes CHIPKIT, an agile, reusable framework for rapidly developing robust test chips. The basis of this framework is a straightforward SoC subsystem that provides basic IO, communication and both on-chip and off-chip hosting. Research chips with various experiments can readily be built on top of this, without having to reinvent the wheel each time. With the current interest around domain-specific accelerators, we believe the CHIPKIT framework will enable more research teams to demonstrate their work in custom silicon.
ACKNOWLEDGEMENTS
This work was supported by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.
REFERENCES
[1] P. N. Whatmough, S. K. Lee, D. Brooks, and G. Wei, “DNN Engine: A 28-nm timing-error tolerant sparse deep neural network processor for IoT applications,” IEEE Journal of Solid-State Circuits, vol. 53, pp. 2722–2731, Sep. 2018.
[2] S. K. Lee, P. N. Whatmough, D. Brooks, and G. Wei, “A 16-nm always-on DNN processor with adaptive clocking and multi-cycle banked SRAMs,” IEEE Journal of Solid-State Circuits, vol. 54, pp. 1982–1992, July 2019.
[3] P. N. Whatmough, S. K. Lee, M. Donato, H. Hsueh, S. Xi, U. Gupta, L. Pentecost, G. G. Ko, D. Brooks, and G. Wei, “A 16nm SoC with a 54.5x flexibility-efficiency range from dual-core Arm Cortex-A53 to eFPGA and cache-coherent accelerators,” pp. C34–C35, June 2019.
[4] G. G. Ko, Y. Chai, M. Donato, P. N. Whatmough, T. Tambe, R. A. Rutenbar, D. Brooks, and G.-Y. Wei, “A 3mm2 programmable Bayesian inference accelerator for unsupervised machine perception using parallel Gibbs sampling in 16nm,” accepted 2020.
[5] L. P. Carloni, “The case for Embedded Scalable Platforms,” pp. 1–6, June 2016.
[6] B. Venu, “Enabling Hardware Accelerator and SoC Design Space Exploration.” https://community.arm.com/developer/research/b/articles/posts/enabling-hardware-accelerator-and-soc-design-space-exploration.
[7] S. L. Xi, Y. Yao, K. Bhardwaj, P. Whatmough, G.-Y. Wei, and D. Brooks, “SMAUG: End-to-end full-stack simulation infrastructure for deep learning workloads,” 2019.
[8] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, “Chisel: Constructing hardware in a Scala embedded language,” in DAC Design Automation Conference 2012, pp. 1212–1221, June 2012.
[9] D. Lockhart, G. Zibrat, and C. Batten, “PyMTL: A unified framework for vertically integrated computer architecture research,” pp. 280–292, Dec. 2014.
[10] B. Khailany, E. Krimer, R. Venkatesan, J. Clemons, J. S. Emer, M. Fojtik, A. Klinefelter, M. Pellauer, N. Pinckney, Y. S. Shao, S. Srinath, C. Torng, S. L. Xi, Y. Zhang, and B. Zimmer, “Invited: A modular digital VLSI flow for high-productivity SoC design,” pp. 1–6, June 2018.
[11] S. Sutherland and D. Mills, “Synthesizing SystemVerilog: Busting the myth that SystemVerilog is only for verification,” in Synopsys Users Group Conference, 2013.
[12] M. B. Taylor, “Basejump STL: SystemVerilog needs a standard template library for hardware design,” pp. 1–6, June 2018.
Paul N. Whatmough leads research on hardware for machine learning at Arm Research Boston, and is an Associate at Harvard University. His research interests include efficient algorithms, computer architecture, and circuits. Whatmough received a PhD in electrical engineering from University College London, U.K.
Marco Donato is a Research Associate at Harvard University. His research interests include novel design methodologies targeting energy-efficient, reliable circuits and architectures for emerging computing paradigms. Donato received a PhD in electrical engineering from Brown University.
Glenn G. Ko is a postdoctoral researcher at Harvard University. His research interests include machine learning, algorithm-hardware co-design and scalable accelerator architectures on the cloud and edge. Ko received a PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign and worked on Samsung Exynos SoCs prior to that.
Sae Kyu Lee is a Senior Hardware Engineer at IBM T.J. Watson Research Center. His research interests include circuits, architecture and design methodologies for energy-efficient accelerators. Lee received a PhD in electrical engineering from Harvard University.
David Brooks is the Haley Family Professor of Computer Science at Harvard University. His research interests include architectural and software approaches to address power, thermal, and reliability issues for embedded and high-performance computing systems. Brooks received a PhD in electrical engineering from Princeton University.