Optimizing Memory-Access Patterns for Deep Learning Accelerators
Hongbin Zheng, Sejong Oh, Huiqing Wang, Preston Briggs, Jiading Gai, Animesh Jain, Yizhi Liu, Rich Heaton, Randy Huang, Yida Wang
Amazon Web Services
ABSTRACT
Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.
KEYWORDS
Compiler, Deep Learning Accelerator
1 INTRODUCTION
As deep learning (DL) models grow in sophistication and computational load, the traditional approach of executing DL workloads, i.e., neural networks, on CPUs and GPUs is becoming more time consuming and expensive. There is a trend to move DL workloads to custom accelerators [2, 6]. By designing domain-specific architectures, these processors are able to accelerate DL workloads and reduce energy requirements by orders of magnitude.

A typical DL model can be represented as a graph, where nodes are operators and directed edges denote the dependences between nodes. Modern accelerators mostly focus on compute-bound operators such as convolution (CONV) and general matrix multiplication (GEMM) via specially designed compute units like systolic arrays. These units are able to process multiply-accumulate operations in a highly efficient manner. On the other hand, the accelerators depend on complex software-managed scratchpads. End-to-end performance will be limited if the memory references of a neural network are not well organized. Current solutions, e.g., the XLA compiler for Google's TPU [11], handle memory-access optimization within an operator, but ignore opportunities to reduce the number of memory accesses across multiple operators. There is some global optimization work for DL models [5, 7], but no one seems to have attacked global optimization of memory-access patterns for DL accelerators.

We propose a systematic way to optimize the memory-access patterns of DL models for efficient execution on DL accelerators. Specifically, our approach takes a DL model as input and performs a number of global optimizations to remove unnecessary memory copies and to intelligently schedule the necessary memory accesses on the accelerator, maximizing memory-bandwidth usage. Experiments show that we are able to significantly reduce the impact of memory references running on a homegrown AWS machine-learning inference chip named Inferentia. The chip is available to the public through Amazon EC2 Inf1 instances (https://aws.amazon.com/ec2/instance-types/inf1/).

2 OPTIMIZING MEMORY-ACCESS PATTERNS
Our work is part of the compiler toolchain for Inferentia. The toolchain reads in the computation graph of a DL model, defines the operators via TVM [1] to build an intermediate representation (IR) that represents the whole neural network, applies analyses and optimizations to the IR, and eventually produces a low-level IR for machine-code generation. This paper focuses on a small portion of the compiler: optimizing the memory-access patterns.

A DL workload manipulates high-dimensional tensors with loop nests. Without loss of generality, we define the tensor accesses with element-wise load and store instructions inside a loop nest based on the polyhedral model [10]:

  $v_l = t_m[\vec{f}(\vec{i})]$   (load)
  $t_m[\vec{f}(\vec{i})] = v_s$   (store)

In these definitions, $\vec{i} = (i_0, i_1, \ldots, i_{n-1})$ represents a loop nest with $n$ loops, where $i_j$ is the loop index at level $j$; $t_m$ represents the $m$-dimensional tensor which is being read or written by the load/store instruction; and $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$. Since the matrix $C$ and the vector $\vec{b}$ are compile-time constants, $C\vec{i} + \vec{b}$ is an affine expression. Finally, $v_l$ in (load) represents the result of the load instruction and $v_s$ in (store) represents the data being written to $t_m[\vec{f}(\vec{i})]$ in the store instruction.
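To make the representation concrete, the following is a minimal sketch (an illustrative Python/NumPy model, not the compiler's actual IR or data structures) of an affine access function $\vec{f}(\vec{i}) = C\vec{i} + \vec{b}$ evaluated for a transpose-like store:

```python
# Minimal sketch (illustrative only): modeling an affine tensor access
# f(i) = C @ i + b from the polyhedral representation described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class AffineMap:
    C: np.ndarray  # compile-time constant matrix
    b: np.ndarray  # compile-time constant vector

    def __call__(self, i: np.ndarray) -> np.ndarray:
        # Evaluate the affine access function for a loop-index vector i.
        return self.C @ i + self.b

# Example: a "transpose" store t_s[i1, i0] = v inside a 2-deep loop nest.
f_store = AffineMap(C=np.array([[0, 1], [1, 0]]), b=np.array([0, 0]))

# Loop indices (i0, i1) = (2, 5) write tensor element t_s[5, 2].
print(f_store(np.array([2, 5])))  # -> [5 2]
```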
Our approach tries to eliminate unnecessary data movements in the workload (Section 2.1) and, for the data movement that remains, to maximize the utilization of the on-chip memory by maintaining data locality in the scratchpad (Section 2.2). Our approach was designed for DL accelerators equipped with powerful compute units and limited on-chip memory.
2.1 Data-Movement Elimination
Data-movement elimination tries to eliminate the pair of instructions $(v = t_l[\vec{f}_l(\vec{i})],\ t_s[\vec{f}_s(\vec{i})] = v)$, where the result of the load instruction, $v$, directly feeds the input of the store instruction. Such patterns are found in DL workloads by analyzing the loop nests of pairs of memory-bound operators like repeat, tile, split, transpose, strided_slice, etc. Existing solutions cannot thoroughly eliminate them without optimizing globally.

To eliminate such pairs, we first generate the reverse of $\vec{f}_s$ as $\vec{f}'_s : \vec{idx}_{t_s} \mapsto \vec{i}$. Using $\vec{f}'_s$, we build a function

  $\vec{g}_{ls} = \vec{f}_l \circ \vec{f}'_s = \vec{f}_l(\vec{f}'_s(\vec{idx}_{t_s})) : \vec{idx}_{t_s} \mapsto \vec{idx}_{t_l}$   (1)

to map the index space of tensor $t_s$ to the index space of tensor $t_l$. Using $\vec{g}_{ls}$, we rewrite the accesses that read $t_s$ so they directly read $t_l$, which in turn allows us to eliminate the stores that defined $t_s$. Specifically, for each load instruction that reads $t_s$, $v' = t_s[\vec{f}'_l(\vec{i}')]$, we build a function

  $\vec{g}' = \vec{g}_{ls} \circ \vec{f}'_l = \vec{g}_{ls}(\vec{f}'_l(\vec{i}')) = \vec{f}_l(\vec{f}'_s(\vec{f}'_l(\vec{i}'))) : \vec{i}' \mapsto \vec{idx}_{t_l}$   (2)

to map the loop indices $\vec{i}'$ to the index space of $t_l$ and rewrite the load instruction as $v' = t_l[\vec{g}'(\vec{i}')]$. Once we apply such transformations to all load instructions that read tensor $t_s$, $t_s$ can be eliminated along with all instructions defining it. We repeat this process until we cannot eliminate any more load/store pairs. The affine-function reversal and composition are implemented using the Integer Set Library [9].
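The sketch below illustrates this rewriting for the simple case where the access matrices are invertible over the integers (e.g., the permutations produced by transpose-like operators). It is a self-contained NumPy approximation with made-up tensor names and access functions; the compiler itself performs the reversal and composition with isl [9].

```python
# Illustrative sketch of the load/store elimination above, restricted to
# affine maps whose matrix C is unimodular (|det C| == 1). The production
# pass reverses and composes maps with the Integer Set Library [9].
from dataclasses import dataclass
import numpy as np

@dataclass
class AffineMap:
    C: np.ndarray
    b: np.ndarray

    def __call__(self, i):
        return self.C @ i + self.b

    def inverse(self) -> "AffineMap":
        # f'(x) = C^-1 (x - b); valid when C is unimodular.
        C_inv = np.rint(np.linalg.inv(self.C)).astype(int)
        return AffineMap(C_inv, -C_inv @ self.b)

    def compose(self, inner: "AffineMap") -> "AffineMap":
        # (self ∘ inner)(x) = self(inner(x))
        return AffineMap(self.C @ inner.C, self.C @ inner.b + self.b)

# Copy pair inside one loop nest: v = t_l[f_l(i)]; t_s[f_s(i)] = v
f_l = AffineMap(np.eye(2, dtype=int), np.array([0, 0]))        # reads t_l[i0, i1]
f_s = AffineMap(np.array([[0, 1], [1, 0]]), np.array([0, 0]))  # writes t_s[i1, i0]

# g_ls maps an index of t_s back to the index of t_l that produced it (Eq. 1).
g_ls = f_l.compose(f_s.inverse())

# A later load v' = t_s[f'_l(i')] is rewritten to v' = t_l[g'(i')] (Eq. 2),
# so the stores that defined t_s become dead and can be removed.
f_l_prime = AffineMap(np.eye(2, dtype=int), np.array([1, 0]))  # reads t_s[i0+1, i1]
g_prime = g_ls.compose(f_l_prime)

print(g_prime(np.array([3, 4])))  # index into t_l instead of t_s -> [4 4]
```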
2.2 Memory-Bank Mapping
Not all data movement in a DL workload can be removed. For the compulsory references, we try to fully exploit the available memory bandwidth. In order to maximize the internal memory bandwidth, accelerators typically organize on-chip memories into multiple banks with disjoint address spaces, each of which connects to one portion of the compute units (e.g., a specific row of the systolic array). Data movement between different banks is very slow because it goes through the main memory; therefore, tensor data needs to be carefully spread across the banks for computation. For example, in a Conv2D operator, data from different channels of the feature map and weights must be mapped to different memory banks so that the internal compute units can read and process the data in parallel. At the same time, the result of the Conv2D needs to be spread across several banks, guided by the different output channels.

In prior work [3], bank mapping focused on a single loop nest with the goal of maximizing the memory-access parallelism for that nest. We call this local bank mapping.
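As a toy illustration of what a local bank mapping computes (the bank count, NCHW layout, and helper function are assumptions for this sketch, not Inferentia specifics), a per-operator mapping can be viewed as a function from a tensor index to a (bank, in-bank offset) pair, with the channel dimension spread across banks:

```python
# Illustrative only: a "local" bank mapping for one operator, spreading the
# channel dimension of an NCHW feature map across NUM_BANKS scratchpad banks
# so that each compute-unit row reads its own channel slice in parallel.
NUM_BANKS = 8  # assumed bank count for the sketch

def local_bank_mapping(n, c, h, w, shape):
    """Map an (n, c, h, w) element index to (bank, in-bank offset)."""
    N, C, H, W = shape
    bank = c % NUM_BANKS                      # channels spread across banks
    # Elements that land in the same bank are addressed sequentially.
    offset = ((n * (C // NUM_BANKS) + c // NUM_BANKS) * H + h) * W + w
    return bank, offset

print(local_bank_mapping(0, 5, 2, 3, shape=(1, 64, 16, 16)))  # -> (5, 35)
```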
Our goal is to minimize inter-bank data movement between multiple operators (represented by multiple loop nests in our compiler). To achieve this goal, we first derive bank mappings for the operators with bank-mapping restrictions, e.g., Conv2D, MatMul, pooling, etc., and then propagate these mappings across the network based on the data dependencies between operators. We perform a fixed-point iteration to propagate the mappings to cover all operators in the neural network and to make sure that the output of an operator maps to the memory banks required by the next operator. If a tensor t has conflicting mapping requirements during the propagation, i.e., the data layout changes between consecutive operators in the network, we introduce a tensor t' and a memcopy between t and t' to represent the data movement between memory banks. Typically, for a high-dimensional tensor, we map its outer dimensions to different banks and use its inner dimensions to address different elements within the same bank to support sequential data access.
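The following is a simplified, forward-only sketch of that propagation (the operator names, mapping labels, and data structures are hypothetical, not the compiler's IR): mappings seeded by restricted operators flow along def-use edges until a fixed point is reached, and a memcopy is recorded wherever two requirements conflict.

```python
# Sketch of the global bank-mapping propagation described above. Operators
# with hardware bank-mapping restrictions seed the mapping; the worklist loop
# propagates it along def-use edges until a fixed point, and a bank-to-bank
# memcopy is materialized wherever requirements conflict.
from collections import deque

# tensor -> mapping required by an operator with a hard restriction (assumed labels).
seed = {"conv1.out": "spread_by_out_channel", "fc1.in": "spread_by_row"}

# def-use edges: producer tensor -> consumer tensors (identity-like operators
# such as relu simply forward whatever mapping they receive).
edges = {
    "conv1.out": ["relu1.out"],
    "relu1.out": ["fc1.in"],
}

mapping = dict(seed)
memcopies = []
worklist = deque(seed)

while worklist:                      # fixed-point iteration
    t = worklist.popleft()
    for user in edges.get(t, []):
        required = mapping[t]        # mapping propagated from the producer
        if user not in mapping:
            mapping[user] = required
            worklist.append(user)
        elif mapping[user] != required:
            # Conflicting requirements: keep both layouts and connect them
            # with an explicit bank-to-bank copy (t -> t').
            memcopies.append((t, user))

print(mapping)
print(memcopies)  # [('relu1.out', 'fc1.in')] for this toy example
```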
3 EVALUATION
We conducted our experiments on a homegrown AWS chip called Inferentia, specifically on an Amazon EC2 Inf1.xlarge instance. For the sake of space, we present results of a single model for each algorithm.

We tested the effectiveness of data-movement elimination on Parallel WaveNet [8]. Our optimization was able to eliminate 123 out of 124 load-store pairs. As a result, we eliminated 145 MB (out of 146 MB) of tensors that were used for intermediate storage. We saved 10% of the on-chip memory copies and 11% of the off-chip memory copies (measured in bytes).

We tested the effectiveness of global memory-bank mapping by running our compiler on ResNet-50 [4], comparing two different mapping algorithms:
- Local mapping, which generates mappings within each operator, without propagation, but keeps the output of an operator in on-chip memory if it will be directly used as the input of the next operator.
- Global mapping, as described in Section 2.2.

Taking the results from local mapping as a baseline, we saw global mapping eliminate 76% of the on-chip data copies and 37% of the off-chip copies (measured in bytes).
4 CONCLUSION
To conclude, this paper proposes a systematic approach to globally optimize the memory-access patterns of DL workloads on accelerators. Experimental results show that we are able to significantly reduce memory references for state-of-the-art networks on Inferentia, a homegrown AWS machine-learning inference chip.
REFERENCES
[1] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In USENIX Symposium on Operating Systems Design and Implementation, Vol. 13. 578–594.
[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (2016), 105–112.
[3] Wei Ding, Diana Guttman, and Mahmut Kandemir. 2014. Compiler Support for Optimizing Memory Bank-Level Parallelism. In IEEE/ACM International Symposium on Microarchitecture, Vol. 47. 571–582.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[5] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2019. Optimizing DNN Computation with Relaxed Graph Substitutions. In Proceedings of the Conference on Systems and Machine Learning, Vol. 19.
[6] Norman P. Jouppi, Cliff Young, Nishant Patil, and David Patterson. 2018. A Domain-Specific Architecture for Deep Neural Networks. Commun. ACM 61, 9 (2018), 50–59.
[7] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN Model Inference on CPUs. In USENIX Annual Technical Conference, Vol. 19. 1025–1040.
[8] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv preprint arXiv:1711.10433 (2017).
[9] Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral Model. In International Congress on Mathematical Software. Springer, 299–302.