Breaking the Memory Wall for AI Chip with a New Dimension
Eugene Tam, Shenfei Jiang, Paul Duan, Shawn Meng, Yue Pang, Cayden Huang, Yi Han, Jacke Xie, Yuanjun Cui, Jinsong Yu, Minggui Lu
IC League Inc., Haining ([email protected])
Abstract—Recent advancements in deep learning have led to the widespread adoption of artificial intelligence (AI) in applications such as computer vision and natural language processing. As neural networks become deeper and larger, AI modeling demands outstrip the capabilities of conventional chip architectures. Memory bandwidth falls behind processing power. Energy consumption comes to dominate the total cost of ownership. Currently, memory capacity is insufficient to support the most advanced NLP models. In this work, we present a 3D AI chip, called Sunrise, with a near-memory computing architecture to address these three challenges. This distributed, near-memory computing architecture allows us to tear down the performance-limiting memory wall with an abundance of data bandwidth. We achieve the same level of energy efficiency on 40nm technology as competing chips on 7nm technology. By moving to technologies similar to those of other AI chips, we project more than ten times the energy efficiency, seven times the performance, and twenty times the memory capacity of the best current state-of-the-art chip in each benchmark.
Index Terms—Artificial intelligence, heterogeneous integration, near-memory computing, system-on-chip
I. INTRODUCTION
AI is now used widely in a diverse set of tasks, ranging from face recognition to real-time speech processing to autonomous transportation. Among these applications, deep neural networks have been responsible for the most recent AI breakthroughs, like DeepMind's AlphaGo's stunning defeat of the Go world champion. Google has developed a high-quality deep neural machine translation system between 17 languages [4]. Since then, natural language processing models have only grown more powerful and are now capable of programming, writing stories, and even composing poems. These powerful AI models require deeper and larger networks. In 2019, Nvidia published the natural language model Megatron with 8.3 billion parameters. This year, Microsoft introduced the next-generation Turing-NLG with 17 billion parameters. Recently, OpenAI released an even larger model and the current state of the art, GPT-3 [24]. With 175 billion parameters, GPT-3 takes 350GB of memory and $12 million to train [25].

It is clear that newer AI models require more parameters, more memory, and larger training sets. They will only continue to grow in size and training cost. To address these future demands, we propose a new architecture for AI chips. In this paper, we discuss two key components of this new architecture: heterogeneous chip integration technology and single-form memory.

Our results demonstrate the potential of this new architecture. Our chip, fabricated on a 40nm process, achieves performance comparable with current state-of-the-art chips, which use fabrication processes generations more advanced than 40nm. We further extrapolate the performance of our chip to current fabrication processes. Our chip is projected to hold 12 billion parameters on a single chip, while the current best on the market only holds 8 billion on a whole wafer.

II. BACKGROUND
Neural networks consist of layers of nodes and the connections between them. Each node stores a value. During inference, the first layer contains the input values. The values of the nodes in subsequent layers are calculated from the values in the previous layer's nodes. The last layer is the output of the neural network. Figure 1 shows the connections between two layers. A neural network may be deep (i.e., have a large number of layers). It may also be wide: each layer may have a large number of nodes.

Fig. 1. Fully-connected layers in a neural network.

Large amounts of computing power and training data are generally regarded as the drivers of modern breakthroughs in artificial intelligence. To sustain the momentum in the field, we need to address the three challenges for AI chips: the memory wall, energy efficiency, and on-chip memory capacity.

Computational power has increased 60% every year since the 80's, as IC fabrication technology advances with Moore's law. Yet memory performance, namely DRAM, has increased only 7% every year during the same time [5] [6]. The gap between processor and memory speed has been widening for more than two decades. As a result, data bandwidth cannot match the processing speed. This is the well-known memory wall. The memory wall has been a challenge for CPUs for decades. As AI computations consume much more data than most other applications, the memory wall is especially relevant in AI chip design.

There are two main existing approaches to addressing the memory wall problem. One is increasing the data transfer clock rate. The other is widening the transfer data width. The first approach appears in high-bandwidth memory, such as high-performance DDR memory and HBM memory. Currently, the peak performance of such memory is around 256GB/s [26]. An example of the second approach is the Interposer, where the number of connections is in the 1000s [9].

The second challenge for AI chips is power consumption. Training an AI model emits more than five times the lifetime emissions of an average car [11]. Generally, power consumption comes mainly from computing units or data transfer units. AI chips must handle not only computationally intensive but also data transfer-intensive tasks. In addition to algorithm optimization, common approaches to reducing power consumption include using more advanced IC fabrication technology and reducing the capacitance load on data transfer paths. To lower the power consumption of data transfer between DRAM [9] and AI chips, high-bandwidth memory (HBM) [10] and package substrate routing, like the Interposer, are used. Yet even with these advanced technologies, data transfer power consumption is still over 0.5mW/Gbps [16].

The third challenge for AI chips is fast memory capacity [13]. Fast memory refers to on-chip or near-chip memory, which provides data quickly without throttling the processing units. Fast memory is generally in the form of on-chip SRAM. SRAM has the advantage of fast access time, but its size limits memory capacity. As AI is applied to more complex problems, the number of parameters of neural network models grows exponentially. The largest NLP model to date has 175 billion parameters. It is crucial to keep as many parameters as possible in fast memory to avoid performance degradation. Currently, no AI chip can hold that many parameters.
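As a point of reference (our back-of-the-envelope arithmetic, assuming half-precision 2-byte weights, which the paper does not state explicitly), the parameter count alone fixes the fast-memory requirement:

\[
175 \times 10^{9}\ \text{parameters} \times 2\ \text{bytes/parameter} \approx 350\ \text{GB},
\]

consistent with the 350GB quoted for GPT-3 in Section I, and far beyond what on-chip SRAM can provide.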
Most AI chips on the market typically have a memory capacity of just 50 MB, which leaves a large gap between current memory capabilities and memory demands. The insufficiency of fast memory capacity will only continue to get worse if nothing is done to address this problem.

The main goal of our approach is to overcome memory bandwidth limitations, reduce power, and increase memory capacity. We achieve this by optimizing integration, architecture, and memory choice. The component technologies for our Sunrise chip include Heterogeneous Integration Technology on Chip (HITOC), single-form memory (UNIMEM), and an architecture specifically optimized for HITOC and UNIMEM.

III. HITOC

All AI chips are currently fabricated on a single wafer. The Sunrise chip is partitioned into separately fabricated logic and memory wafers. These two wafers are bonded face-to-face with a hybrid bonding process [7] [14] using Cu damascene-patterned surfaces, as shown in Figure 2. External IOs are brought out to the backside of the CMOS wafer with TSVs (Through-Silicon Vias) for bonding. Circuits on different wafers operate together and communicate through wires running between the two wafers. We call this approach Heterogeneous Integration Technology on Chip (HITOC).

Fig. 2. Connection between two wafers with HITOC

By fabricating logic and memory wafers separately, both the logic process and the memory process are optimized independently. In the logic process, the transistor threshold voltage is low, and the number of metal layers is large. The opposite is true in the memory process. The physical structures of basic elements in the logic process and the memory process differ. Integrating logic and memory processes thus results in a design that favors one over the other, so we choose to separate the two. Furthermore, with logic and memory on separate wafers, there are more compute elements and memory elements in the same area when compared to a single wafer. By separating logic and memory, we achieve better electrical characteristics and overall chip characteristics, as well as higher computational performance and memory capacity.

HITOC is one form of 3D IC. There are other forms of 3D IC that are widely used. One is the Interposer. The Interposer [27] connects two chips through metal lines on the substrate, as in Figure 3. Interposer connections are much denser compared to connections between chips in different packages or bonding wire connections between chips in the same package.
Fig. 3. Interposer to connect two chips
Another approach is the Through-Silicon Via. A TSV is a direct vertical connection between different levels of a chip. It consists of a conducting via which passes through the silicon substrate and connects the two sides of the wafer [Fig. 4]. Typically, the interplane via is etched and filled with metal, such as tungsten (W) or copper (Cu). Connections between chips with TSVs are denser compared to the Interposer. TSV is commonly used in HBM (high bandwidth memory).

Fig. 4. Through-Silicon Vias connect multiple chips in a stack

Compared to the Interposer and TSV, HITOC is even denser because the spacing between connections is smaller. With the Interposer, connections are in one dimension between two chips placed on the same surface. With TSV and HITOC, connections are in two dimensions between two chips stacked together. As shown in Table I, the wire pitch difference among all three approaches directly affects wire density. It ultimately makes the difference in bandwidth, which is in proportion to the number of connections [1] [8] [9].
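Before turning to the measured numbers, it helps to see the scaling that drives Table I. As a rough model (our illustration; the paper itself only states the proportionality), aggregate bandwidth is the number of parallel connections times the per-wire rate:

\[
BW_{\text{total}} = \rho \cdot A \cdot R_{\text{wire}}, \qquad \rho \approx 1/p^{2}\ \text{for a 2D bond array of pitch } p.
\]

For a hypothetical hybrid-bond pitch of $p = 1\,\mu\text{m}$, $\rho = 10^{6}$ connections/mm²; even at a modest 1 Gb/s per wire, a single mm² of bonded area supplies about $10^{6}$ Gb/s, roughly 125 TB/s. This quadratic dependence on pitch is why HITOC's bandwidth in Table I dwarfs the Interposer's.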
TABLE I
DATA PATH COMPARISONS OF INTERPOSER, TSV, AND HITOC
                      Interposer      TSV             HITOC
Wire Pitch (um)       11.5            9.2             1.1
Wire Density          86 /mm (1D)     1.2×10⁴ /mm²    8.3×10⁵ /mm²
Bandwidth (TB/s)      0.086           1.2             100

In that table, we conservatively assume the clock frequencies for data transfer are the same for comparison purposes; in practice, the clock frequency runs faster with HITOC and TSV. Data paths are shorter in HITOC than in the Interposer and TSV. Shorter data paths have smaller capacitance loading. As a result, power consumption is only 0.02pJ/b for HITOC, while it is 2.17pJ/b and 0.55pJ/b for the Interposer and TSV, respectively. With smaller capacitance loading, data transfer runs at a higher frequency, and power consumption is lower.

With shorter and denser connections, HITOC delivers, among the three approaches, the highest data transfer rate and the lowest power consumption.

IV. UNIMEM: A SINGLE MEMORY SOLUTION

To simplify the system and circumvent the need for the conventional CPU-cache-memory architecture [2], the Sunrise chip uses only DRAM, without an SRAM cache. We call this approach UniMem, as we use only a single form of memory for the whole chip. We choose to use DRAM over SRAM, since DRAM has a higher density, with a cell size of 6-12 F² compared to SRAM's cell size of 140 F² [23]. However, DRAM has a read/write latency that is around 50-90 times longer than SRAM's [23].

To counteract DRAM's slow latency, multiple localized DRAM units are pooled together to supply data to logic units [Fig. 5]. Memory access load is shared amongst the DRAM arrays in the pool.

Fig. 5. Localized dedicated memory for logic units

The computation sequence is rearranged such that parameters are reused. We adopted a weight-stationary data flow [3]: operations on the same weights are grouped so that access to weight data from memory is minimized (a minimal sketch follows, just before Fig. 6).

Data is broadcast and shared: in our weight-stationary systolic architecture, feature data and results move. Input feature data is broadcast to all Vector Processing Units (VPUs). Each VPU performs all the operations necessary to generate its output channels independently from the other cores, and the results are sent back to a central memory pool. All intermediate data are localized in the VPUs, and no exchange of such data occurs with any other units.

With localized DRAM array pooling and minimized data movement, the Sunrise chip overcomes slow DRAM latency and delivers high computation performance.

V. CHIP ARCHITECTURE

With HITOC, we have two wafers, a logic wafer and a memory wafer, bonded together [Fig. 6]. On the logic wafer, we have pools of processing units. Underneath the logic pool, on the other wafer, are pools of DRAM arrays.
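The weight-stationary flow adopted in Section IV can be made concrete with a short sketch (ours, not the authors' code; shapes and names are hypothetical). Weights stay pinned in a VPU's local DRAM while feature vectors stream past, so each weight is fetched from memory once per batch rather than once per multiply-accumulate:

    import numpy as np

    def weight_stationary_matvec(weights, feature_batch):
        """Sketch of a weight-stationary pass over one VPU's output channels.

        weights:       (out_channels, in_features) - loaded once, then reused
        feature_batch: (batch, in_features)        - streamed/broadcast to VPUs
        """
        out = np.zeros((feature_batch.shape[0], weights.shape[0]))
        # Outer loop over weight rows: each row is fetched from local DRAM
        # exactly once and reused across the whole feature batch.
        for oc, w_row in enumerate(weights):          # weights stay "stationary"
            for b, x in enumerate(feature_batch):     # feature data move past them
                out[b, oc] = np.dot(w_row, x)         # MAC operations
        return out

    # Hypothetical shapes: 4 output channels, 8 inputs, batch of 2.
    w = np.random.rand(4, 8)
    x = np.random.rand(2, 8)
    assert np.allclose(weight_stationary_matvec(w, x), x @ w.T)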
Fig. 6. HITOC with logic wafer and DRAM wafer
The logic wafer consists mainly of logic units and control units, which include a processor and a unified control engine (UCE) [Fig. 7].
Fig. 7. Top level architecture
There are two types of logic units: the data serving unit (DSU) and the vector processing unit (VPU). VPUs perform computation on data. DSUs serve data to the VPUs. Each DSU and VPU has its own multiple DRAM arrays directly bonded below the unit on the DRAM wafer. All VPUs and DSUs form their respective pools. The overall bandwidth between the DSU/VPU pools and the DRAM pool is 1.8TB/s.

Feature data are stored in the DRAM of the DSU pool and are sent to the VPU pool for computation. The results are sent back to the DSU pool. The data bandwidth between the DSU pool and the VPU pool is 13TB/s. This high bandwidth ensures that data transfer between DSU and VPU is not a bottleneck.

Because memory bandwidth is abundant in the Sunrise chip, we choose to use vectors instead of tensors as the basic computational data unit. This allows us to optimize for better computational performance on sparse tensors.

With extremely high data bandwidth on the Sunrise chip, synchronization of all modules is challenging. On this chip, all data flow and module operations are centrally controlled by a single unit called the Unified Control Engine (UCE). It consists of modules such as a Direct Memory Access controller (DMA), data path multiplexer controllers, and a function selector. All modules are fully configurable to implement different neural networks.

There is a proprietary 13-bit processor on the Sunrise chip. It mainly controls high-level tasks such as data batch movement and UCE configuration.

To minimize yield loss due to defects in memory, our DRAM PHY is capable of DRAM repair. Before shipment, the DRAM is tested, and defects are recorded in non-volatile memory (NVM). During chip power-up, the defect information is retrieved, and repairs are applied to the DRAM arrays (see the sketch after Fig. 8).

There are two chip interfaces. One is a standard SPI interface, and the other is a proprietary high-speed-port (HSP) interface. SPI is for the host to transfer commands to the chip. The HSP interface is for data transfer, with a transfer rate of 200MB/s.

The Sunrise chip has three implementation layers: logic blocks, unified data flow control configuration, and firmware [Fig. 8]. Logic blocks consist of primitive functional blocks and configurable modules. The unified data flow control configuration dictates how the configurable modules function. It also initiates predetermined sequences of operations. The top tier is the firmware. Firmware mainly modifies operation register values, changes configurations, or calls out configurations. The firmware also initiates large operations whose sequences are controlled by configuration. It is also responsible for host and chip communication. With this three-tier architecture, the Sunrise chip implements a wide range of neural networks through a combination of firmware and configuration.
Fig. 8. Implementation layers
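The power-up DRAM repair flow described above might look like the following sketch (ours; the defect-map format, names, and row-remap scheme are assumptions, since the paper only describes the test-record-repair sequence):

    # Hypothetical power-up DRAM repair flow: defects recorded at factory
    # test time in non-volatile memory are replayed into spare rows on boot.

    NVM_DEFECT_MAP = [(0, 17), (0, 912), (3, 44)]  # (array_id, failing_row)
    SPARE_ROWS_PER_ARRAY = 8
    NUM_ARRAYS = 4

    def apply_repairs(defect_map):
        """Build a per-array remap table from failing rows to spare rows."""
        remap = {a: {} for a in range(NUM_ARRAYS)}
        next_spare = {a: 0 for a in range(NUM_ARRAYS)}
        for array_id, bad_row in defect_map:
            if next_spare[array_id] >= SPARE_ROWS_PER_ARRAY:
                raise RuntimeError(f"array {array_id}: out of spare rows")
            # Route accesses to the bad row into the spare region instead.
            remap[array_id][bad_row] = next_spare[array_id]
            next_spare[array_id] += 1
        return remap

    def resolve_row(remap, array_id, row):
        """Address translation applied by the DRAM PHY on every access."""
        if row in remap[array_id]:
            return ("spare", remap[array_id][row])
        return ("main", row)

    remap = apply_repairs(NVM_DEFECT_MAP)          # done once at power-up
    assert resolve_row(remap, 0, 17) == ("spare", 0)
    assert resolve_row(remap, 1, 17) == ("main", 17)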
VI. RESULTS
We fabricated the Sunrise chip with a 40nm CMOS process and a 38nm DRAM process. The chip consists of one die from the logic wafer and one die from the memory wafer [Fig. 9]. There are 32,768 MACs on the chip, with a die size of 110 mm² (12.4 mm × 8.9 mm).

Fig. 9. Dies that make up a chip
We compare the Sunrise chip with three other leading AI chips whose information is publicly available. Table II contains key metrics and specifications for Sunrise as well as the other chips, referred to as Chip A [17], B [18], and C [19].

TABLE II
BENCHMARK RESULTS FOR SUNRISE

                            Sunrise   Chip A   Chip B    Chip C
Process                     40nm      16nm     12nm      7nm
Die Size (mm²)              110       800      709       456
Peak Performance (TOPS)     25        122      125       512
Memory Capacity (MB)        560       300      190       32
Power Consumption (W)       12        120      280       350
Memory Bandwidth (TB/s)     1.8       45       no data   3

Each chip has a different die size. We remove this factor by normalizing by die size and compare the chips on the following benchmarks [Tab. III]; a worked example of the normalization follows the list.
• Peak performance in TOPS/mm² (trillion operations per second per square millimeter). Computation performance is higher with more computation units, and the number of computation units increases with a larger die. We normalize performance by unit die area for a true performance comparison.
• Memory bandwidth per unit die area in GB/s/mm². Memory bandwidth is higher with more localized memory arrays. We normalize the memory bandwidth by unit area.
• Energy efficiency as a ratio of performance to power consumption, in TOPS/W.
• Memory capacity per unit area in MB/mm².
• Cost. The cost comparison includes standard non-recurring expenses, die cost per unit area, and, for application purposes, cost per unit performance.
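As a worked example (our arithmetic, using the Sunrise column of Table II):

\[
\frac{25\ \text{TOPS}}{110\ \text{mm}^2} \approx 0.23\ \text{TOPS/mm}^2,\quad
\frac{1800\ \text{GB/s}}{110\ \text{mm}^2} \approx 16.3\ \text{GB/s/mm}^2,\quad
\frac{560\ \text{MB}}{110\ \text{mm}^2} \approx 5.1\ \text{MB/mm}^2,\quad
\frac{25\ \text{TOPS}}{12\ \text{W}} \approx 2.08\ \text{TOPS/W},
\]

which reproduces the SUNRISE row of Table III up to rounding. Note that the bandwidth figure comes out in GB/s/mm², the unit used in the tables below.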
TABLE III
DIE-TO-DIE BENCHMARK COMPARISONS

                   Peak            Memory        Memory        Energy
                   Performance^a   Bandwidth^b   Capacity^c    Efficiency^d
SUNRISE (40nm)     0.23            16.3          5.11          2.08
Chip A (16nm)      0.15            56.2          0.38          1.02
Chip B (12nm)      0.18            no data       0.27          0.45
Chip C (7nm)       1.12            6.6           0.07          1.46

^a Measured in TOPS/mm². ^b Measured in GB/s/mm². ^c Measured in MB/mm². ^d Measured in TOPS/W.
The Sunrise chip outperforms on two of the four metrics, memory capacity and energy efficiency [Tab. III]. Its peak performance is below Chip C's. This is understandable considering that Sunrise is fabricated on a 40nm process, a process four generations behind that of Chip C.

Sunrise's memory bandwidth is below Chip A's. Chip A has a large amount of fast SRAM on chip. SRAM enables high memory bandwidth. However, SRAM memory bit cells are large and take up a large portion of the die area. This reduces memory capacity and leaves a smaller die area for computation units. As a result, Chip A lags behind Sunrise in both memory capacity and performance.

Sunrise takes a different approach to balancing the trade-offs between memory and performance. Replacing SRAM with DRAM leads to a large memory capacity and leaves more die area for computation units. With HITOC and UNIMEM technology, this architecture not only overcomes slow DRAM latency but also has sufficient memory bandwidth to support high performance. HITOC cuts down data transfer energy consumption. UNIMEM removes the SRAM cache, and thus the energy consumption associated with it. Both factors make Sunrise the most energy efficient amongst the chips in the table.
TABLE IV
COST COMPARISON IN USD

                   NRE     Die Cost   Cost per TOPS
SUNRISE (40nm)     …       11         0.43
Chip A (16nm)      …       617        2.47
Chip B (12nm)      …       296        1.19
Chip C (7nm)       …       336        0.66
In Table IV, we compare non-recurring expenses (NRE) and die costs. NRE mainly consists of process mask costs. Although the exact die costs of the other chips are not published, we estimate them based on die size, wafer costs from major foundries, and expected yields. We also include the cost of delivering the same performance; for example, Sunrise's $11 die cost spread over its 25 TOPS gives roughly $0.43 per TOPS. It is expected that the 40nm process delivers the lowest cost. Normally, a more advanced process delivers a better cost-to-performance ratio. However, our chip delivers the best cost-to-performance ratio even on a 40nm process. This is directly due to our chip architecture.

Combining all the discussed metrics, our Sunrise chip overall outperforms other leading AI chips despite its less advanced fabrication process. The HITOC and UNIMEM technology incorporated in this chip allows us to move to a new and more optimal architecture. The key benefit is being able to achieve and surpass the performance of more advanced fabrication processes with less expensive fabrication.

VII. PROJECTION
The AI chips in these comparisons are fabricated with different processes. To compare the architectures effectively, we normalize each chip to a 7nm CMOS process and a 1y DRAM process, based on factors such as density, transistor performance, and power reduction. The factors are derived from the parameters of the CMOS process (Table V) [20] [21] and the DRAM process (Table VI) [22] of leading foundries.
TABLE V
CMOS PROCESS PARAMETERS

                   Density   Performance   Power
                   Ratio     Improvement   Reduction
28 nm vs. 40 nm    2         45%           40%
16 nm vs. 28 nm    2         35%           55%
12 nm vs. 16 nm    1.2       28%           35%
10 nm vs. 16 nm    2         15%           35%
7 nm vs. 10 nm     1.65      22%           54%

TABLE VI
DRAM DENSITY

3x nm Process   1x nm Process   1y nm Process
0.04 Gb/mm²     …               …

As seen in Table V, density improves with each generation of process. More transistor computing units can be packed into the same die area. This not only improves device performance but also increases power consumption by the same factor. With a more advanced process generation, one can choose a high-performance process to get performance gains, or choose a lower-power process to get better power efficiency. When we project the parameters of each chip in our normalization calculations, we use the performance improvement parameters under the condition that power consumption stays within the common range seen in ASIC chips. Otherwise, we use the power reduction parameters.

With all the chips normalized to 7nm, the Sunrise chip architecture surpasses the other three chips in all benchmarks (Table VII).
TABLE VII
BENCHMARK COMPARISONS NORMALIZED TO 7NM PROCESS

            Peak            Memory        Memory       Energy
            Performance^a   Bandwidth^b   Capacity^c   Efficiency^d
SUNRISE     7.58            216           30.3         50.10
Chip A      0.86            122           1.50         5.38
Chip B      0.19            no data       0.90         0.83
Chip C      1.12            6.6           0.07         1.46

^a Measured in TOPS/mm². ^b Measured in GB/s/mm². ^c Measured in MB/mm². ^d Measured in TOPS/W.
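The normalization can be reproduced approximately by compounding the per-generation factors of Table V along the path from a chip's node to 7nm. The sketch below is ours: the paper does not publish its exact calculation, and the choice between the performance and power columns per hop is our reading of the text.

    # Per-generation scaling factors from Table V:
    # (density ratio, performance improvement, power reduction).
    STEPS = {
        ("40nm", "28nm"): (2.00, 0.45, 0.40),
        ("28nm", "16nm"): (2.00, 0.35, 0.55),
        ("16nm", "12nm"): (1.20, 0.28, 0.35),
        ("16nm", "10nm"): (2.00, 0.15, 0.35),
        ("10nm", "7nm"):  (1.65, 0.22, 0.54),
    }

    def scale_to_7nm(path, tops_per_mm2, use_performance=True):
        """Compound density (and optionally performance) gains along `path`."""
        for hop in zip(path, path[1:]):
            density, perf, _power = STEPS[hop]
            tops_per_mm2 *= density              # more units per mm^2
            if use_performance:
                tops_per_mm2 *= (1.0 + perf)     # each unit also runs faster
        return tops_per_mm2

    # Sunrise: 0.23 TOPS/mm^2 at 40nm, path 40 -> 28 -> 16 -> 10 -> 7.
    print(scale_to_7nm(["40nm", "28nm", "16nm", "10nm", "7nm"], 0.23))

Compounding density and performance gains this way yields roughly 8.3 TOPS/mm² for Sunrise, in the same range as (though not identical to) the 7.58 reported in Table VII; the residual difference presumably comes from the power-constrained choices described above.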
Although the Sunrise chip is on a 40nm process, its performance per unit area exceeds that of the two competing chips at 12nm and 16nm. It exceeds all three chips after normalization by process node. With the architecture of Sunrise, one can either use a less expensive process to achieve the same performance as other chips or use current processes to get better performance than other chips.

As shown in Table VII, the Sunrise chip has the highest normalized memory bandwidth. We designed just enough bandwidth as needed for chip performance. Table I shows that HITOC technology enables extremely high memory bandwidth. Even so, bandwidth after normalization exceeds that of all comparable devices.

The Sunrise chip's memory capacity is more than 13 times that of the other chips at the 40nm node. When normalized to a 7nm CMOS and 1y nm DRAM process, it would reach 20 times the memory capacities of the other chips. The gain in memory capacity is mostly a result of replacing all SRAM with DRAM, which has a density more than 14 times higher than SRAM [12]. On an 800 mm² die, our architecture could reach a storage capacity as high as 24GB. The largest memory capacity on an AI chip ever made is 18GB [15], which requires a whole wafer. With our architecture, the current memory capacity of an AI wafer can fit onto a single chip.

The Sunrise chip is more energy-efficient than all three other chips, even though it is on a less advanced process. Our high energy efficiency is due to the removal of the SRAM cache and the close proximity between memory and compute units.

Overall, the Sunrise chip meets or exceeds the benchmarks of AI chips on the market. The Sunrise architecture is projected to well exceed all benchmarks if fabricated with an advanced process.

VIII. CONCLUSION