Optimizing Memory-Access Patterns for Deep Learning Accelerators
Hongbin Zheng, Sejong Oh, Huiqing Wang, Preston Briggs, Jiading Gai, Animesh Jain, Yizhi Liu, Rich Heaton, Randy Huang, Yida Wang
Amazon Web Services
ABSTRACT
Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.
KEYWORDS
Compiler, Deep Learning Accelerator
1 INTRODUCTION
As deep learning (DL) models grow in sophistication and computational load, the traditional approach of executing DL workloads, i.e., neural networks, on CPUs and GPUs is becoming more time consuming and expensive. There is a trend to move DL workloads to custom accelerators [2, 6]. By designing domain-specific architectures, these processors are able to accelerate DL workloads and reduce energy requirements by orders of magnitude.

A typical DL model can be represented as a graph, where nodes are operators and directed edges denote the dependences between nodes. Modern accelerators mostly focus on compute-bound operators such as convolution (CONV) and general matrix multiplication (GEMM) via specially designed compute units like systolic arrays. These units are able to process multiply-accumulate operations in a highly efficient manner. On the other hand, the accelerators depend on complex software-managed scratchpads. End-to-end performance will be limited if the memory references of a neural network are not well organized. Current solutions, e.g., the XLA compiler for Google's TPU [11], handle memory-access optimization within an operator, but ignore opportunities to reduce the number of memory accesses across multiple operators. There is some global optimization work for DL models [5, 7], but no one seems to have attacked global optimization of memory-access patterns for DL accelerators.

We propose a systematic way to optimize the memory-access patterns of DL models for efficient execution on DL accelerators. Specifically, our approach takes a DL model as input and performs a number of global optimizations to remove unnecessary memory copies and to intelligently schedule the necessary memory accesses on the accelerator, maximizing memory-bandwidth usage. Experiments show that we are able to significantly reduce the impact of memory references running on a homegrown AWS machine-learning inference chip named Inferentia. The chip is available to the public through Amazon EC2 Inf1 instances (https://aws.amazon.com/ec2/instance-types/inf1/).

2 OPTIMIZING MEMORY-ACCESS PATTERNS
Our work is part of the compiler toolchain for Inferentia. The toolchain reads in the computation graph of a DL model, defines the operators via TVM [1] to build an intermediate representation (IR) that represents the whole neural network, applies analyses and optimizations to the IR, and eventually produces a low-level IR for machine-code generation. This paper focuses on a small portion of the compiler: optimizing the memory-access patterns.

A DL workload manipulates high-dimensional tensors with loop nests. Without loss of generality, we define the tensor accesses with element-wise load and store instructions inside a loop nest based on the polyhedral model [10]:

  $v_l = t_m[\vec{f}(\vec{i})]$   (load)
  $t_m[\vec{f}(\vec{i})] = v_s$   (store)

In these definitions, $\vec{i} = (i_0, i_1, \ldots, i_{n-1})$ represents a loop nest with $n$ loops, where $i_j$ is the loop index at level $j$; $t_m$ represents the $m$-dimensional tensor which is being read or written by the load/store instruction; and $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$. Since the matrix $C$ and the vector $\vec{b}$ are compile-time constants, $C\vec{i} + \vec{b}$ is an affine expression. Finally, $v_l$ in (load) represents the result of the load instruction and $v_s$ in (store) represents the data being written to $t_m[\vec{f}(\vec{i})]$ in the store instruction.
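To make the representation concrete, the following is a minimal sketch (an illustrative Python/NumPy model, not the compiler's actual IR or data structures) of an affine access function $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$ evaluated for a transpose-like store:

```python
# Minimal sketch (illustrative only): modeling an affine tensor access
# f(i) = C @ i + b from the polyhedral representation described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class AffineMap:
    C: np.ndarray  # compile-time constant matrix
    b: np.ndarray  # compile-time constant vector

    def __call__(self, i: np.ndarray) -> np.ndarray:
        # Evaluate the affine access function for a loop-index vector i.
        return self.C @ i + self.b

# Example: a "transpose" store t_s[i1, i0] = v inside a 2-deep loop nest.
f_store = AffineMap(C=np.array([[0, 1], [1, 0]]), b=np.array([0, 0]))

# Loop indices (i0, i1) = (2, 5) write tensor element t_s[5, 2].
print(f_store(np.array([2, 5])))  # -> [5 2]
```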
Our approach tries to eliminate unnecessary data movements in the workload (Section 2.1) and, for the data movement that remains, to maximize the utilization of the on-chip memory by maintaining data locality in the scratchpad (Section 2.2). Our approach was designed for DL accelerators equipped with powerful compute units and limited on-chip memory.
2.1 Data-Movement Elimination
Data-movement elimination tries to eliminate the pair of instructions $(v = t_l[\vec{f}_l(\vec{i})],\ t_s[\vec{f}_s(\vec{i})] = v)$, where the result of the load instruction, $v$, directly feeds the input of the store instruction. Such patterns are found in DL workloads by analyzing the loop nests of pairs of memory-bound operators like repeat, tile, split, transpose, strided_slice, etc. Existing solutions cannot thoroughly eliminate them without optimizing globally.

To eliminate such pairs, we first generate the reverse of $\vec{f}_s$ as $\vec{f}'_s : \vec{idx}_{t_s} \mapsto \vec{i}$. Using $\vec{f}'_s$, we build a function

  $\vec{g}_{ls} = \vec{f}_l \circ \vec{f}'_s = \vec{f}_l(\vec{f}'_s(\vec{idx}_{t_s})) : \vec{idx}_{t_s} \mapsto \vec{idx}_{t_l}$   (1)

to map the index space of tensor $t_s$ to the index space of tensor $t_l$. Using $\vec{g}_{ls}$, we rewrite the accesses that read $t_s$ so they directly read $t_l$, which in turn allows us to eliminate the stores that defined $t_s$. Specifically, for each load instruction that reads $t_s$, $v' = t_s[\vec{f}'_l(\vec{i}')]$, we build a function

  $\vec{g}' = \vec{g}_{ls} \circ \vec{f}'_l = \vec{g}_{ls}(\vec{f}'_l(\vec{i}')) = \vec{f}_l(\vec{f}'_s(\vec{f}'_l(\vec{i}'))) : \vec{i}' \mapsto \vec{idx}_{t_l}$   (2)

to map the loop indices $\vec{i}'$ to the index space of $t_l$ and rewrite the load instruction as $v' = t_l[\vec{g}'(\vec{i}')]$. Once we apply such transformations to all load instructions that read tensor $t_s$, $t_s$ can be eliminated along with all instructions defining it. We repeat this process until we cannot eliminate any more load/store pairs. The affine-function reversal and composition are implemented using the Integer Set Library [9].
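The sketch below illustrates this rewriting for the simple case where the access matrices are invertible over the integers (e.g., the permutations produced by transpose-like operators). It is a self-contained NumPy approximation with made-up tensor names and access functions; the compiler itself performs the reversal and composition with isl [9].

```python
# Illustrative sketch of the load/store elimination above, restricted to
# affine maps whose matrix C is unimodular (|det C| == 1). The production
# pass reverses and composes maps with the Integer Set Library [9].
from dataclasses import dataclass
import numpy as np

@dataclass
class AffineMap:
    C: np.ndarray
    b: np.ndarray

    def __call__(self, i):
        return self.C @ i + self.b

    def inverse(self) -> "AffineMap":
        # f'(x) = C^-1 (x - b); valid when C is unimodular.
        C_inv = np.rint(np.linalg.inv(self.C)).astype(int)
        return AffineMap(C_inv, -C_inv @ self.b)

    def compose(self, inner: "AffineMap") -> "AffineMap":
        # (self ∘ inner)(x) = self(inner(x))
        return AffineMap(self.C @ inner.C, self.C @ inner.b + self.b)

# Copy pair inside one loop nest: v = t_l[f_l(i)]; t_s[f_s(i)] = v
f_l = AffineMap(np.eye(2, dtype=int), np.array([0, 0]))        # reads t_l[i0, i1]
f_s = AffineMap(np.array([[0, 1], [1, 0]]), np.array([0, 0]))  # writes t_s[i1, i0]

# g_ls maps an index of t_s back to the index of t_l that produced it (Eq. 1).
g_ls = f_l.compose(f_s.inverse())

# A later load v' = t_s[f'_l(i')] is rewritten to v' = t_l[g'(i')] (Eq. 2),
# so the stores that defined t_s become dead and can be removed.
f_l_prime = AffineMap(np.eye(2, dtype=int), np.array([1, 0]))  # reads t_s[i0+1, i1]
g_prime = g_ls.compose(f_l_prime)

print(g_prime(np.array([3, 4])))  # index into t_l instead of t_s -> [4 4]
```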
2.2 Memory-Bank Mapping
Not all data movement in a DL workload can be removed. For the compulsory references, we try to fully exploit the available memory bandwidth. In order to maximize the internal memory bandwidth, accelerators typically organize on-chip memories into multiple banks with disjoint address spaces, each of which connects to one portion of the compute units (e.g., a specific row of the systolic array). Data movement between different banks is very slow because it goes through the main memory; therefore, tensor data needs to be carefully spread across the banks for computation. For example, in a Conv2D operator, data from different channels of the feature map and weights must be mapped to different memory banks so that the internal compute units can read and process the data in parallel. At the same time, the result of the Conv2D needs to be spread across several banks, guided by the different output channels.

In prior work [3], bank mapping focused on a single loop nest with the goal of maximizing the memory-access parallelism for that nest. We call this local bank mapping.
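As a toy illustration of what a local bank mapping computes (the bank count, NCHW layout, and helper function are assumptions for this sketch, not Inferentia specifics), a per-operator mapping can be viewed as a function from a tensor index to a (bank, in-bank offset) pair, with the channel dimension spread across banks:

```python
# Illustrative only: a "local" bank mapping for one operator, spreading the
# channel dimension of an NCHW feature map across NUM_BANKS scratchpad banks
# so that each compute-unit row reads its own channel slice in parallel.
NUM_BANKS = 8  # assumed bank count for the sketch

def local_bank_mapping(n, c, h, w, shape):
    """Map an (n, c, h, w) element index to (bank, in-bank offset)."""
    N, C, H, W = shape
    bank = c % NUM_BANKS                      # channels spread across banks
    # Elements that land in the same bank are addressed sequentially.
    offset = ((n * (C // NUM_BANKS) + c // NUM_BANKS) * H + h) * W + w
    return bank, offset

print(local_bank_mapping(0, 5, 2, 3, shape=(1, 64, 16, 16)))  # -> (5, 35)
```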
Our goal is to minimize inter-bank data movement between multiple operators (represented by multiple loop nests in our compiler). To achieve this goal, we first derive bank mappings for the operators with bank-mapping restrictions, e.g., Conv2D, MatMul, pooling, etc., and then propagate these mappings across the network based on the data dependencies between operators. We perform a fixed-point iteration to propagate the mappings to cover all operators in the neural network and to make sure that the output of an operator maps to the memory banks required by the next operator. If a tensor t has conflicting mapping requirements during the propagation, i.e., the data layout changes between consecutive operators in the network, we introduce a tensor t' and a memcopy between t and t' to represent the data movement between memory banks. Typically, for a high-dimensional tensor, we map its outer dimensions to different banks and use its inner dimensions to address different elements within the same bank to support sequential data access.
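The following is a simplified, forward-only sketch of that propagation (the operator names, mapping labels, and data structures are hypothetical, not the compiler's IR): mappings seeded by restricted operators flow along def-use edges until a fixed point is reached, and a memcopy is recorded wherever two requirements conflict.

```python
# Sketch of the global bank-mapping propagation described above. Operators
# with hardware bank-mapping restrictions seed the mapping; the worklist loop
# propagates it along def-use edges until a fixed point, and a bank-to-bank
# memcopy is materialized wherever requirements conflict.
from collections import deque

# tensor -> mapping required by an operator with a hard restriction (assumed labels).
seed = {"conv1.out": "spread_by_out_channel", "fc1.in": "spread_by_row"}

# def-use edges: producer tensor -> consumer tensors (identity-like operators
# such as relu simply forward whatever mapping they receive).
edges = {
    "conv1.out": ["relu1.out"],
    "relu1.out": ["fc1.in"],
}

mapping = dict(seed)
memcopies = []
worklist = deque(seed)

while worklist:                      # fixed-point iteration
    t = worklist.popleft()
    for user in edges.get(t, []):
        required = mapping[t]        # mapping propagated from the producer
        if user not in mapping:
            mapping[user] = required
            worklist.append(user)
        elif mapping[user] != required:
            # Conflicting requirements: keep both layouts and connect them
            # with an explicit bank-to-bank copy (t -> t').
            memcopies.append((t, user))

print(mapping)
print(memcopies)  # [('relu1.out', 'fc1.in')] for this toy example
```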
3 EVALUATION
We conducted our experiments on a homegrown AWS chip called Inferentia, specifically on an Amazon EC2 Inf1.xlarge instance. For the sake of space, we present results of a single model for each algorithm.

We tested the effectiveness of data-movement elimination on Parallel WaveNet [8]. Our optimization was able to eliminate 123 out of 124 load-store pairs. As a result, we eliminated 145 MB (out of 146 MB) of tensors that were used for intermediate storage. We saved 10% of the on-chip memory copies and 11% of the off-chip memory copies (measured in bytes).

We tested the effectiveness of global memory-bank mapping by running our compiler on ResNet-50 [4], comparing two different mapping algorithms:
- Local mapping, which generates mappings within each operator, without propagation, but keeps the output of an operator in on-chip memory if it will be directly used as the input of the next operator.
- Global mapping, as described in Section 2.2.

Taking the results from local mapping as a baseline, we saw global mapping eliminate 76% of the on-chip data copies and 37% of the off-chip copies (measured in bytes).
4 CONCLUSION
To conclude, this paper proposes a systematic approach to globally optimize the memory-access patterns of DL workloads on accelerators. Experimental results show that we are able to significantly reduce memory references for state-of-the-art networks on Inferentia, a homegrown AWS machine-learning inference chip.
REFERENCES
[1] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In USENIX Symposium on Operating Systems Design and Implementation, Vol. 13. 578–594.
[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (2016), 105–112.
[3] Wei Ding, Diana Guttman, and Mahmut Kandemir. 2014. Compiler Support for Optimizing Memory Bank-Level Parallelism. In IEEE/ACM International Symposium on Microarchitecture, Vol. 47. 571–582.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[5] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2019. Optimizing DNN Computation with Relaxed Graph Substitutions. In Proceedings of the Conference on Systems and Machine Learning, Vol. 19.
[6] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson. 2018. A Domain-Specific Architecture for Deep Neural Networks. Commun. ACM 61, 9 (2018), 50–59.
[7] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN Model Inference on CPUs. In USENIX Annual Technical Conference, Vol. 19. 1025–1040.
[8] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv preprint arXiv:1711.10433 (2017).
[9] Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral Model. In International Congress on Mathematical Software. Springer, 299–302.