A Primer for Neural Arithmetic Logic Modules
Bhumika Mistry [email protected]
Department of Vision, Learning and Control, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, United Kingdom
Katayoun Farrahi [email protected]
Department of Vision, Learning and Control, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, United Kingdom
Jonathon Hare [email protected]
Department of Vision, Learning and Control, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, United Kingdom
Abstract
Neural Arithmetic Logic Modules have become a growing area of interest, though they remain a niche field. These units are small neural networks which aim to achieve systematic generalisation in learning arithmetic operations such as {+, −, ×, ÷} while also being interpretable in their weights. This paper is the first to discuss the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of NALU, we provide an in-depth analysis to reason about the design choices of recent units. A cross-comparison between units is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment which prevent direct comparison across papers. We finish by providing a novel discussion of existing applications for NALU and research directions requiring further exploration.

Keywords: Neural Arithmetic Logic Module, Interpretability, Systematic Generalisation, Extrapolation
1. Introduction
The ability to learn by composition of already known knowledge is a form of systematic generalisation (Fodor et al., 1988), also termed compositional generalisation (Lake, 2019). Humans can learn such generalisations for arithmetic, for example, combining primitive operations such as addition (a + b) and multiplication (a × b) to produce more complex expressions (such as (a + b) × (c + d)). Humans can also transfer their skills in applying operations on simple numbers (e.g. between 1-10) to other ranges of numbers (e.g. 50-100) which are outside the range they were taught on. This ability to extrapolate, i.e. generalise to out-of-distribution (OOD) data, is a desirable property for neural networks. Research suggests neural networks struggle to extrapolate even for the simplest of tasks, such as learning the identity function (Trask et al., 2018). Rather than generalising, networks lean towards memorisation, in which the model memorises the training labels (Zhang et al., 2020).

To address this issue, Trask et al. (2018) introduce the first in a new class of neural modules which we term Neural Arithmetic Logic Modules (NALMs). Their unit, the NALU, aims to learn systematic generalisation for arithmetic computations, for example, learning the relation between an input [x1, x2, x3, x4] and an output o, where the input elements are real numbers and the output is an expression such as x1 + x2 − x4. To achieve this they incorporate an inductive learning bias such that discrete weight values can be interpreted as different primitive arithmetic operations. This form of interpretability is comparable to the definition of decomposable transparency by Lipton (2016). Though NALU shows promising improvements over networks such as Multilayer Perceptrons (MLPs) for extrapolation, the unit still presents various shortcomings in architecture, convergence, and transparency. These areas for improvement inspired the design of other units (Heim et al., 2020; Madsen and Johansen, 2020; Schlör et al., 2020; Rana et al., 2019). Due to the growing interest in NALMs, we believe it is important to have a resource, this paper, to explain current motivations, strengths, weaknesses and gaps in this line of research.

Contributions:
1. We provide the first definition to describe this research field by defining a NALM to be a neural network with the ability to model arithmetic in a generalisable manner which encourages the weights of the network to be interpretable.

2. We explain how recent modules are designed to overcome various shortcomings of NALU, including: the inability to process negative inputs and outputs, the lack of convergence and of adherence to its inductive bias, weak modelling of the division operation, and the lack of compositionality.

3. We highlight how a popular experiment for testing modules' arithmetic capabilities is inconsistent between different papers with regards to hyperparameters and experiment setup.

4. We show the usefulness of NALUs in larger differentiable applications which require arithmetic and extrapolation capabilities, while also highlighting situations in which NALU is sub-optimal.

5. We outline possible research directions regarding modelling division, robustness across different training ranges, compositionality of modelled expressions, and the effect of training alongside other types of neural architectures.
Outline:
In this paper we begin by defining a NALM, motivating their aim and uses in Section 2. Sections 3 and 4 explain the definitions of key NALMs (NALU, iNALU, NAU, NMU, and NPU) to build understanding. Using the first NALM, the NALU, as a focal point, Section 5 provides an in-depth analysis of the shortcomings of NALU to understand the motivation behind the design choices of more recent NALMs. Section 6 highlights inconsistencies in experiment setup and compares findings across existing modules. Additionally, we outline all experiments currently used to evaluate the modules. Section 7 shows the diversity of NALU's use in applications, while also indicating situations in which NALU is sub-optimal. Section 8 considers all discussed issues and outlines remaining gaps, suggesting possible research directions to take as a result.
2. What are NALMs and Why use them?
We begin by defining NALMs. More specifically, before we detail instances of NALMs, we first answer three questions: 1. What is a NALM? 2. What is the aim of a NALM? 3. Why is a NALM useful?
NALM stands for Neural Arithmetic Logic Module. Neural refers to neural networks. Arithmetic refers to the ability to learn to model arithmetic operations such as addition. Logic refers to the ability to learn operations such as selection, comparison and logic. Module refers to the neural units which model arithmetic. The term module encompasses both a single (sub-)unit and multiple (sub-)units combined together.
What kind of operations can be learnt?
Existing work has tried to model arithmetic operations including addition, subtraction, multiplication, division, square, and square-root. Other operations include logic (e.g. conjunction) (Reimann and Schwung, 2019) and control (e.g. ≤) (Faber and Wattenhofer, 2020). Selection of the relevant inputs to the modules is also learnt.

How are operations learnt?
Because a NALM is a neural network, a module can model the relation between input and output vectors via supervised learning, which trains the weights through backpropagation. Learning the relation between input and output requires learning to select the relevant elements of the input and to apply the relevant arithmetic operation/s to the selected input to create the output.
How is data represented?
The input and outputs are both vectors. Each vector element is a floating point number. Each output element can learn a different arithmetic expression. For a single data sample, this is illustrated in Figure 1, where we assume that the NALM used (made from two stacked sub-units) can learn addition, subtraction and multiplication. In practice, data would be given in batch form.
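To make the data format concrete, the short sketch below (our own illustration, not code from any of the papers discussed) builds a small batch in which each target element is a different hypothetical expression over the input elements, mirroring the structure shown in Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of input vectors; each row is one sample [x1, x2, x3, x4].
X = rng.uniform(1.0, 2.0, size=(8, 4))

# Each output element is a different (hypothetical) arithmetic expression
# over the inputs, e.g. o1 = x1 + x2 - x4 and o2 = x1 * x3.
Y = np.stack([X[:, 0] + X[:, 1] - X[:, 3],
              X[:, 0] * X[:, 2]], axis=1)

print(X.shape, Y.shape)  # (8, 4) (8, 2)
```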
The main aim of NALMs is to be utilised in larger systems while remaining interpretable. A by-product of the interpretability enables NALMs to achieve systematic generalisation in learning arithmetic expressions and to be extrapolative on OOD data.
What does Interpretability mean for NALMs?
Currently, we say a NALM is interpretable if it has decomposable transparency (Lipton, 2016). Transparency means being able to understand how the model works. Decomposability is transparency at the component level, defined by Lipton (2016) as 'each part of the model - each input, parameter, and calculation - admits an intuitive explanation'. So far, only Heim et al. (2020) have considered their NALMs in terms of decomposable transparency. A consequence of NALMs achieving this form of interpretability is that the parameters become discrete values and the calculations become compositions of arithmetic operations. The discrete parameter values result in exact solutions which are valid regardless of the data distribution, enabling generalisation on OOD data.

Figure 1: High-level example of the input-output structure of a NALM. Both networks are the same. The generalised network defines the notation of each element in the input and output. The explicit network is an example of valid input and output values.
What does Extrapolation on OOD data mean for NALMs?
Once trained, a NALM should be able to predict the output for input data which comes from a range outside the training range. Any loss in predictive accuracy should only occur due to numerical imprecision caused by hardware limitations.
The ability to learn arithmetic seems trivial in comparison to what other architectures such as LSTMs, CNNs or Transformers achieve as standalone networks, learning tasks such as arithmetic, object recognition and language modelling. So, why care about NALMs? Learning arithmetic, though it may seem a simple task, still remains unsolved for neural networks. Solving this problem requires precisely learning the underlying rules of arithmetic such that failure cases will not occur on OOD data. Therefore, before considering more complex tasks, solving the simple tasks seems reasonable. Even though NALMs specialise in arithmetic, there is no restriction on using them as part of larger end-to-end neural networks, for example, attaching a NALM to a CNN. In Section 7, we show a vast array of applications in which NALMs are being utilised. Being used as a sub-component in a larger network implies that the sub-component has the ability to learn regardless of the data distribution. Therefore, the ability to extrapolate is essential.
3. Overview of the NALU Architecture
The NALU, illustrated in Figure 2, provides the ability to model basic arithmetic operations, specifically: addition, subtraction, multiplication and division. NALU requires no indication of which operation to apply, and aims to learn weights that provide extrapolation capabilities if correctly converged. NALU comprises two sub-units, a summative unit which models {+, −} and a multiplicative unit which models {×, ÷}. Following the notation of Madsen and Johansen (2020), we denote the sub-units as NAC+ and NAC• respectively.

Figure 2: Original NALU architecture, taken from Trask et al. (2018). (Note that the learned gate matrix, which lies in R^{I×O}, is mistakenly drawn as a vector in R^{O}, the three vertical circles in blue.)

Formally, NALU is expressed as:

    W = \tanh(\hat{W}) \odot \mathrm{sigmoid}(\hat{M})                         (1)
    \mathrm{NAC}^{+}:\; a = W x                                                (2)
    \mathrm{NAC}^{\bullet}:\; m = \exp\big(W (\log(|x| + \epsilon))\big)       (3)
    g = \mathrm{sigmoid}(G x)                                                  (4)
    \mathrm{NALU}:\; \hat{y} = g \odot a + (1 - g) \odot m                     (5)

where \hat{W}, \hat{M} \in \mathbb{R}^{I \times O} are learnt matrices (where I and O represent the input and output dimension sizes). A non-linear transformation is applied to each matrix and then both are combined via element-wise multiplication to form W (equation 1). Due to the value ranges of tanh and sigmoid, W aims to have an inductive bias towards the values {−1, 0, 1}, which can be interpreted as selecting a particular operation within a sub-unit (i.e. intra-sub-unit selection). For example, in the NAC+ +1 is addition and −1 is subtraction, and in the NAC• +1 is multiplication and −1 is division. In both sub-units, 0 represents not selecting (i.e. ignoring) an input element. A sigmoidal gating mechanism (equation 4) enables selection between the sub-units (inter-sub-unit selection), where an open gate, 1, selects the NAC+ and a closed gate, 0, selects the NAC•. Once trained, the gating should ideally select a single sub-unit. G is learnt, and the gating vector g represents which sub-unit to use for each element of the output vector. Finally, equation 5 gates the sub-units and sums the result to give the output. NALU's gating only allows each output element to be a mixture of operations from the same sub-unit. Therefore, each output element is an expression combining operations from either {+, −} or {×, ÷}, but not {+, −, ×, ÷}. This issue is fixed by stacking NALUs such that the output of one NALU is the input of another. Next, we overview the architectures of some recent units.
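As a concrete reference for equations 1-5, the following NumPy sketch implements a single NALU layer forward pass. The parameter values are random and untrained, purely to show shapes and data flow; this is a minimal illustration rather than the reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nalu_forward(x, W_hat, M_hat, G, eps=1e-7):
    """Single NALU layer (Trask et al., 2018): x is (batch, I);
    W_hat, M_hat and G are (I, O). Returns (batch, O)."""
    W = np.tanh(W_hat) * sigmoid(M_hat)        # eq. 1: biased towards {-1, 0, 1}
    a = x @ W                                  # eq. 2: NAC+ (additive path)
    m = np.exp(np.log(np.abs(x) + eps) @ W)    # eq. 3: NAC• (multiplicative path)
    g = sigmoid(x @ G)                         # eq. 4: input-dependent gate
    return g * a + (1.0 - g) * m               # eq. 5: mix of the two paths

# Toy usage: 4 inputs, 2 outputs, random (untrained) parameters.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, size=(3, 4))
W_hat, M_hat, G = (rng.normal(size=(4, 2)) for _ in range(3))
print(nalu_forward(x, W_hat, M_hat, G).shape)  # (3, 2)
```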
4. NALU Influenced Units
NALU has inspired the creation of other units and sub-units, including: the Improved NALU (iNALU) (Schlör et al., 2020), the Neural Addition Unit (NAU) and Neural Multiplication Unit (NMU) (Madsen and Johansen, 2020), the Neural Power Unit (NPU) (Heim et al., 2020), the Golden Ratio NALU (G-NALU) (Rana et al., 2019), Neural Logic Rules (NLR) (Reimann and Schwung, 2019) and Neural Status Registers (NSR) (Faber and Wattenhofer, 2020). Illustrations of the existing units are found in the Unit Illustrations appendix (Table 3).

iNALU identifies key issues in NALU and modifies the unit to incorporate solutions (detailed in Section 5). It introduces methods to improve convergence and stability during training through regularisation and clipping, and decouples previously shared parameters between the sub-units.
NAU and NMU are sub-units for addition/subtraction and multiplication respectively. The architecture and initialisations of the units have strong theoretical justifications and empirical results to validate the design choices. The NAU and NMU definitions for calculating an output element indexed at o are:

    \mathrm{NAU}:\; a_{o} = \sum_{i=1}^{I} W_{i,o} \cdot x_{i}                          (6)
    \mathrm{NMU}:\; m_{o} = \prod_{i=1}^{I} (W_{i,o} \cdot x_{i} + 1 - W_{i,o})         (7)

Prior to applying the weights of a sub-unit to the input vector, each element of W is clamped to [−1, 1] if using the NAU, or [0, 1] if using the NMU.

The NPU, equation 8, focuses on improving the division ability of the NAC• by applying a complex log transformation and using real and complex weight matrices. A relevance gate (g) is also included; g learns to convert values close to 0 to 1 to avoid the output multiplication becoming 0.

    \mathrm{NPU} := \exp(W^{(r)} \log(r) - W^{(i)} k) \odot \cos(W^{(i)} \log(r) + W^{(r)} k)    (8)
    \text{where } r = g \odot (|x| + \epsilon) + (1 - g),                                        (9)
    \text{and } k_{i} = \begin{cases} 0 & x_{i} \geq 0 \\ \pi g_{i} & x_{i} < 0 \end{cases}      (10)

Additionally, a simplified version of the NPU exists, named the RealNPU, which considers only the real values of equation 8:

    \mathrm{RealNPU} := \exp(W^{(r)} \log(r)) \odot \cos(W^{(r)} k)                              (11)

G-NALU replaces the exponent in the tanh and sigmoid operations used when calculating NALU's weight matrix with the golden ratio value.
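For comparison with the NALU above, the sketch below gives direct NumPy translations of equations 6, 7 and 11 (the NAU, NMU and RealNPU forward passes). Weight clamping and learning of the relevance gate g are assumed to have already happened, and the toy weights are chosen by hand for illustration.

```python
import numpy as np

def nau(x, W):
    """NAU (eq. 6): weighted sum; W is clamped to [-1, 1] during training."""
    return x @ W

def nmu(x, W):
    """NMU (eq. 7): for each output o, the product over i of
    (W[i,o]*x[i] + 1 - W[i,o]); W is clamped to [0, 1], so W[i,o] = 0
    means input i is ignored (it contributes a factor of 1)."""
    # x: (batch, I), W: (I, O) -> broadcast to (batch, I, O), product over I.
    return np.prod(W[None, :, :] * x[:, :, None] + 1.0 - W[None, :, :], axis=1)

def real_npu(x, W_r, g, eps=1e-7):
    """RealNPU (eq. 11): signed products of powers via a log/exp transform.
    g is the relevance gate in [0, 1]; k marks gated negative inputs with pi."""
    r = g * (np.abs(x) + eps) + (1.0 - g)   # eq. 9
    k = np.where(x < 0, np.pi * g, 0.0)     # eq. 10
    return np.exp(np.log(r) @ W_r) * np.cos(k @ W_r)

# Toy usage: two inputs, one output computing x1 * x2 exactly.
x = np.array([[2.0, 3.0], [-1.5, 4.0]])
W = np.array([[1.0], [1.0]])
print(nmu(x, W))                          # [[ 6.], [-6.]]
print(real_npu(x, W, g=np.ones_like(x)))  # ~[[ 6.], [-6.]]
```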
NLR, influenced by the inductive biases in Trask et al. (2018), creates a unit to express logic rules and simple arithmetic operations via modelling AND (conjunction), OR (disjunction) and NOT (negation).
NSR models comparison-based control logic: <, >, !=, =, >=, <=. The NSR also uses the inductive bias idea from Trask et al. (2018) to constrain the parameter space, and regularisation like that of Madsen and Johansen (2020) to enforce the biases.
5. NALU’s Shortcomings and Existing Solutions
We detail the weaknesses of NALU and explain existing solutions. We focus on the iNALU, NAU, NMU and NPU when looking at solutions, as these modules focus on overcoming the shortcomings of NALU.
The NAC• cannot deal with mixed-sign inputs or negative outputs. Equation 3 requires converting negative inputs into their positive counterparts because the log transformation cannot evaluate negative values. Therefore the sign of the input is lost, causing the NAC• to be unable to have negative target values. The use of an exponent also causes the inability to have negative outputs, as the range of an exponent is strictly positive (R_{>0}). To allow for negative targets, a unit can incorporate logic to assign the correct sign to the output, such as the iNALU's sign correction mechanism (Schlör et al., 2020) or the NPU's inherent sign retrieval (Heim et al., 2020). The sign correction mechanism creates a mixed sign vector (msv) \in \mathbb{R}^{O},

    \mathrm{msv}_{o} = \prod_{i=1}^{I} \big(\mathrm{sign}(x_{i}) \cdot |W_{i,o}| + 1 - |W_{i,o}|\big),    (12)

consisting of elements {−1, 1} (assuming W has converged to the integers {−1, 0, 1}), where each element represents the correct sign for each output element. (Note the similarity in calculation between the NMU, equation 7, and the iNALU's msv, equation 12.) The msv is multiplied onto the result of equation 3, regaining the lost signs. The +1 − |W_{i,o}| term gives non-selected inputs an msv value of 1 to avoid affecting the final sign value. In the case of the RealNPU, the latter half of its definition, i.e. \odot \cos(W^{(r)} k), can be interpreted as a sign retrieval mechanism. k represents positive inputs as 0 and negative inputs as \pi (assuming the gate value has converged to select the input). Assuming convergence, the W^{(r)} values are {1, −1}, representing {×, ÷}. Two outcomes are possible from evaluating the expression: \cos(0) = 1, which leaves the output sign unchanged, or \cos(\pm\pi) = −1, which makes the output negative.
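The two sign-handling mechanisms can be seen in a small worked example. The sketch below is our own illustration, assuming converged, discrete weights for a single output element.

```python
import numpy as np

x = np.array([-2.0, 5.0, 3.0])   # one sample; x1 is negative
w = np.array([1.0, 0.0, 1.0])    # converged weights for one output: select x1 and x3

# iNALU mixed sign vector (eq. 12): non-selected inputs contribute a factor of 1,
# selected inputs contribute their sign, so msv = sign(x1) * sign(x3) = -1.
msv = np.prod(np.sign(x) * np.abs(w) + 1.0 - np.abs(w))

# NAC•-style magnitude from eq. 3: |x1 * x3| = 6 (the sign is lost by the abs).
magnitude = np.exp(np.sum(w * np.log(np.abs(x) + 1e-7)))
print(msv * magnitude)           # ~ -6.0, the sign is restored

# RealNPU sign retrieval: k is pi for (gated) negative inputs and 0 otherwise,
# so cos(w . k) is +1 or -1 depending on the parity of selected negative inputs.
k = np.where(x < 0, np.pi, 0.0)
print(np.cos(np.sum(w * k)))     # -1.0
```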
The NALU gate, responsible for selection between the NAC+ and NAC• units, is unable to converge reliably. This is due to the different convergence properties of the NAC+ and NAC• (Madsen and Johansen, 2020). Partial convergence of gate values leads to a leaky gate effect, noted by Schlör et al. (2020), where the gate allows the unit to incorrectly take both the multiplicative and the summative route, which can lead to exploding outputs. This issue is amplified when additional NALU layers are stacked. In cases where the correct gate is selected, the NALU unit still fails to converge consistently (Madsen and Johansen, 2020), implying additional architectural issues for the unit. Even when using the improved NAU and NMU sub-units, gating still leads to inferior results. Madsen and Johansen (2020) replace unit gating with unit stacking. Schlör et al. (2020) suggest using separate weights for the iNALU sub-units to improve convergence, and independent gating (removing x from equation 4) so that learning G is no longer influenced by the input. However, this provides only minimal improvements for simple arithmetic tasks.

Good initialisations are crucial for convergence. Assuming the Madsen and Johansen (2020) implementation of NALU is used for initialisation, the weight matrices are drawn from a uniform distribution with the range calculated from the fan values, and the gate matrix from a Xavier uniform initialisation with a sigmoid gain (see https://github.com/AndreasMadsen/stable-nalu/blob/2db888bf2dfcb1bba8d8065b94b7dab9dd178332/stable_nalu/layer/nac.py and https://github.com/AndreasMadsen/stable-nalu/blob/2db888bf2dfcb1bba8d8065b94b7dab9dd178332/stable_nalu/layer/_abstract_nalu.py). This results in difficulty for both optimisation and robustness. Fragility means the expected inductive bias of weight values converging to {−1, 0, 1} is difficult to achieve (Madsen and Johansen, 2020). Unsparse solutions result in a lack of transparency and hence ungeneralisable solutions.

The weight biases are achieved by adding a regularisation term for sparsity (Madsen and Johansen, 2020; Schlör et al., 2020) and using weight clamping (Madsen and Johansen, 2020). Regularisation encourages the weights to converge to the discrete values, activating and warming up over a predefined period of time to avoid overpowering the main MSE loss term. Madsen and Johansen (2020) use sparsity regularisation to enforce the relevant biases for both the NAU ({−1, 0, 1}) and NMU ({0, 1}):

    R_{\mathrm{sparse}} = \frac{1}{I \cdot O} \sum_{o=1}^{O} \sum_{i=1}^{I} \min\big(|W_{i,o}|,\; 1 - |W_{i,o}|\big).    (13)

Note that the absolute value of W_{i,o} is not necessary when using the NMU. Clamping is applied to the weights beforehand, which clamps to the ranges of the desired biases. A scaling factor

    \lambda = \hat{\lambda} \cdot \max\left(\min\left(\frac{\mathrm{iteration}_{i} - \lambda_{\mathrm{start}}}{\lambda_{\mathrm{end}} - \lambda_{\mathrm{start}}},\; 1\right),\; 0\right),    (14)

is multiplied with R_{\mathrm{sparse}} to get the final value, where the regularisation strength is scaled by a predefined \hat{\lambda}.
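Equations 13 and 14 translate directly into code. The following is a minimal NumPy sketch; the warm-up boundaries and regularisation strength are arbitrary illustrative values, not the settings used in any particular paper.

```python
import numpy as np

def sparsity_regulariser(W):
    """Eq. 13: pushes each |W[i,o]| towards 0 or 1 (W is pre-clamped to [-1, 1])."""
    abs_w = np.abs(W)
    return np.mean(np.minimum(abs_w, 1.0 - abs_w))

def regulariser_scale(iteration, lambda_hat=10.0,
                      lambda_start=1_000_000, lambda_end=2_000_000):
    """Eq. 14: linear warm-up of the regularisation strength between two iterations."""
    frac = (iteration - lambda_start) / (lambda_end - lambda_start)
    return lambda_hat * max(min(frac, 1.0), 0.0)

W = np.array([[0.9, 0.1], [-0.5, 0.0]])
print(sparsity_regulariser(W))       # 0.175 = mean(0.1, 0.1, 0.5, 0.0)
print(regulariser_scale(1_500_000))  # 5.0, i.e. halfway through the warm-up
```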
iNALU uses a piece-wise function for regularisation on the weight (\hat{W}, \hat{M}) and gate (G) parameters,

    R_{\mathrm{sparse}} = \frac{1}{t} \max\big(\min(-w, w) + t,\; 0\big),    (15)

to encourage discrete values that do not converge to near-zero values. Rather than a warm-up period, regularisation occurs only once the loss is under a pre-defined threshold, and stops once a discretisation threshold t (= 20) is met.

These methods alone would restrict the parameter space, but not the unit's output scale. To address this, Madsen and Johansen (2020) use a linear weight matrix construction (removing the need for non-linear transformations), allowing for easier optimisation, while Schlör et al. (2020) use clipping of the NAC• weights and gradients. The weights in equation 3 would be clipped to the range [log(max(|x|, ·)), 20] before the exp is applied.

Rana et al. (2019) modify the non-linear activations of the weight matrices in the G-NALU for smoother gradient propagation, as shown in Figure 3. In contrast, in an attempt to avoid falling into a local optimum, the iNALU allows multiple reinitialisations of a model during training. Through a grid search they find that having the means of the gate matrix and the NALU weight matrices \hat{M} and \hat{W} initialised to 0, -1 and 1 respectively results in the most stable units. However, even when using such initialisations, the stability problem remains for division.

Figure 3: Figure taken from Rana et al. (2019). Left: NAC+ W values over the domain of \hat{W} and \hat{M}. Right: NALU where the base value for the non-linear functions (tanh and sigmoid) uses the golden ratio rather than the exponential, resulting in smoother value transitions.

Division is NALU's weakest operation (Trask et al., 2018). Having both division and multiplication in the same sub-unit causes optimisation difficulties. Madsen and Johansen (2020) highlight the singularity issue (division by 0, or by values close to 0 bounded by an epsilon value) in the NAC•, which causes exploding outputs (see Figure 4). This issue is amplified by the operations being applied in log space. The NMU removes the use of the log and is therefore not epsilon bound; however, the NMU is only designed for multiplication. The NPU takes Madsen and Johansen (2020)'s interpretation of multiplication (using products of power functions), but applies it in a complex space, enabling division as well as multiplication (Heim et al., 2020). Though the NPU cannot fully solve the singularity issue, as a log transformation is still applied to the inputs, the relevance gate aids in smoothing the loss surface. Schlör et al. (2020) observe that reinitialising units numerous times during training can still lead to failure, implying that the issue lies in the unit architecture as well as the initialisation. Hence, division remains an open issue.

Figure 4: Taken from Madsen and Johansen (2020). Illustration of the singularity issue arising in the NAC•.

A single NALU is unable to output expressions whose operations are from both {+, −} and {×, ÷}, e.g. x1 + x2 × x3. Bogin et al. (2020) hint at NALU's inflexibility to learn different expressions from the same weights. Rana et al. (2019) develop CalcNet, a parsing algorithm, to decompose expressions before applying the NALU sub-units. However, decomposition requires fixed rules and pre-trained sub-units, which is undesirable.

A summary of the discussed NALU issues and proposed solutions is given in Table 1.
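The singularity issue can be demonstrated directly by comparing the log/exp formulation of the NAC• (equation 3) with the NMU (equation 7) on an input close to zero; the sketch below is our own minimal illustration.

```python
import numpy as np

eps = 1e-7
x = np.array([1e-6, 2.0])    # the first input is very close to zero
w = np.array([-1.0, 1.0])    # NAC• weights modelling x2 / x1 (division)

# NAC• (log/exp space): inputs near the epsilon bound blow the output up,
# and small changes to w or x produce huge changes (a near-singularity).
nac_div = np.exp(np.sum(w * np.log(np.abs(x) + eps)))
print(nac_div)               # ~1.8e6, and it grows without bound as x1 -> 0

# NMU (eq. 7) avoids the log transform entirely, but only models multiplication.
w_nmu = np.array([1.0, 1.0])
print(np.prod(w_nmu * x + 1.0 - w_nmu))  # 2e-6, stays well behaved
```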
6. Experiments and Findings of Units
To understand how the units are evaluated, we go through the experiments used in the papers for NALU, iNALU, NAU, NMU, and NPU. We indicate inconsistencies across papers in the two-layer arithmetic task setup, encouraging the need for task standardisation. An inter-unit comparison using existing findings is made to infer the best unit per operation.
Table 1: Summarised NALU shortcomings and existing proposed solutions (columns: Shortcoming; NMU; iNALU; NPU; CalcNet).

NAC• cannot have negative inputs/targets. NMU: remove the log-exponent transformation. iNALU: sign correction (mixed sign vector). NPU: sign retrieval. CalcNet: fixed rules and sign parsing.

Convergence of gate parameters. NMU: stacking. iNALU: separate gate and weights per sub-unit. NPU: -. CalcNet: -.

Fragile initialisation. NMU: theoretically valid initialisation scheme. iNALU: reinitialise the model. NPU: -. CalcNet: -.

Weight inductive bias of {-1, 0, 1} not met (non-discrete solutions). NMU: regularisation loss term. iNALU: regularisation loss term and weight clipping. NPU: -. CalcNet: -.

Unrestricted output scale. NMU: linear weight matrix. iNALU: weight and gradient clipping. NPU: -. CalcNet: -.

Gradient propagation. NMU: -. iNALU: reinitialise the model. NPU: relevance gating. CalcNet: replace the sigmoid and tanh exponents with the golden ratio.

Singularity (values close to 0). NMU: remove the log-exponent transformation. iNALU: -. NPU: complex space transformation and relevance gating. CalcNet: -.

Compositionality. NMU: -. iNALU: -. NPU: -. CalcNet: parsing algorithm.
Though it is mentioned in Trask et al. (2018) that NALU can learn to model squaring and square-rooting, we purposefully avoid analysing the ability of the multiplicative units to do the square (a^2) and square-root (\sqrt{a}) operations.

The squared operation can be solved in two ways when using a multiplication unit. Firstly, there could be two input elements with the same value, resulting in the operation a × a. Secondly, the unit can set the weight value corresponding to the input to 2. The first way is a multiplication operation (which is separately tested), and the second requires breaking the inductive bias assumption of discrete weights with a magnitude up to 1. Therefore, we avoid analysing the square operation.

For a multiplicative unit to solve the square-root operation such that the weights are interpretable requires a weight value of 0.5. Though this allows square-rooting to be modelled as a^{0.5}, it contradicts the inductive bias of discrete weights with a magnitude up to 1. Therefore, we also avoid analysing the square-root operation.

A task consistently used in all papers is the 'Static Simple Function Learning' experiment (Trask et al., 2018), which evaluates the ability of a unit to learn a trivial two-operation function. Madsen and Johansen (2020) rename this task 'Arithmetic Datasets' and introduce their own experiment setup (including details for reproducibility). Specifically, given an input vector of floats, the first (addition) layer should learn to output two values (denoted a and b) which are the sums of two different, partially overlapping slices (i.e. subsets) of the input, and the second layer should perform an operation on a and b. Figure 5 illustrates such an example. Due to the rigorous setup, evaluation metrics, and available code, we strongly suggest this experiment be used to test and compare new units.

Figure 5: Taken from Madsen and Johansen (2020). Illustration of how to get from the input vector to the target scalar for the Dataset Arithmetic Task.

iNALU's experiments 4 ('Influence of Initialization') and 5 ('Simple Function Learning Task') are copies of the task but differ from the original. Experiment 4 calculates a and b differently to Madsen and Johansen (2020) by not allowing overlap between the slices which form a and b, and experiment 5 uses different interpolation and extrapolation ranges to the original experiments. Heim et al. (2020) claim that their 'Large Scale Arithmetic' task is equivalent to the Arithmetic Dataset task. However, as shown in Table 2, there are key distinctions between the two, meaning the results from the two papers are not directly comparable.
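A sketch of how data for this task can be generated, following the description above (sums of two partially overlapping slices, followed by a single operation). The slice placement is our own simplification; the ratios and ranges follow the Madsen and Johansen (2020) column of Table 2 for illustration only.

```python
import numpy as np

def arithmetic_dataset(n_samples, input_size=100, subset_ratio=0.25,
                       overlap_ratio=0.5, op=np.multiply,
                       low=1.0, high=2.0, seed=0):
    """Two-layer arithmetic task: a and b are sums of two partially
    overlapping slices of the input; the target is op(a, b)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n_samples, input_size))

    subset = int(subset_ratio * input_size)    # length of each slice
    overlap = int(overlap_ratio * subset)      # how many elements the slices share
    a = X[:, :subset].sum(axis=1)
    b = X[:, subset - overlap:2 * subset - overlap].sum(axis=1)
    return X, op(a, b)

# Interpolation (training) range U[1, 2) and extrapolation (test) range U[2, 6).
X_train, y_train = arithmetic_dataset(1024)
X_test, y_test = arithmetic_dataset(256, low=2.0, high=6.0)
print(X_train.shape, y_train.shape)  # (1024, 100) (1024,)
```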
Table 2: Differences in the 'Large Scale Arithmetic' task used in the papers Madsen and Johansen (2020) and Heim et al. (2020). 'a' and 'b' represent summed slices of the input, and are the expected output values for the addition unit. In each row, the first value is that of Madsen and Johansen (2020) and the second that of Heim et al. (2020).

Hidden size: 2 vs. 100.
Iterations for one run: 5,000,000 vs. 50,000.
Number of seeds: 100 vs. 10.
Learning rates: 1e-3 vs. 1e-2 for addition and 5e-3 for all other operations.
Subset and overlap ratios: 0.25 and 0.5 vs. 0.5 and 0.25 (for addition, subtraction, and multiplication).
Division: a/b vs. 1/a.
Interpolation and extrapolation ranges: uniform distributions, using U[1,2) for training all operations and testing on U[2,6), vs. Sobol(-1,1) for training addition, subtraction, and multiplication and Sobol(0,0.5) for division, with testing on Sobol(-4,4) for addition, subtraction and multiplication and Sobol(-0.5,0.5) for division.
Regularisation penalty: biasing weight discretisation vs. L1 on all parameters.
Programming language: Python 3 vs. Julia.

The papers also carry out experiments on top of the two-layer arithmetic task. Trask et al. (2018) carry out a recurrent version of their static task experiment to test the NAC+, where the subsets a and b are accumulated over multiple timesteps. The purpose of this task is to generate much larger output values on which to test NALU. As well as pure arithmetic tasks, Trask et al. (2018) test NALU in other settings such as: translating numbers in text form into numerical form (e.g. 'two hundred and one' to 201), a block grid-world which requires travelling from point A to B in exactly n timesteps, and program evaluation for programs with arithmetic and control operations. MNIST is also used to evaluate NALU's abilities as part of end-to-end applications. This includes counting the occurrences of different digits, adding a sequence of digits, and parity prediction.

Madsen and Johansen (2020) also use MNIST for testing the unit's ability to act as a recurrent unit for adding or multiplying the digits. Madsen and Johansen (2020) additionally provide experiments to establish the validity of their units. These include modifying the number of redundant units, an ablation on multiplication, stress testing the stacked NAU-NMU against different input sizes, overlap ratios and subset ratios, showing the failure of gating to converge, and tuning the regularisation parameters.

Schlör et al. (2020) provide three additional experiments. Experiment 1 ('Minimal Arithmetic Task') uses a single layer to perform a single operation with no redundancy, to see the effect of different input distributions. Experiment 2 ('Input Magnitude') examines the effect of the training data by controlling the magnitude of the interpolation data. NALU fails on magnitudes greater than 1. iNALU remains unaffected for addition and subtraction. Multiplication performance is coupled to the magnitude, where the extrapolation error increases with the magnitude. Division is uncorrelated with the input magnitude. To increase the problem difficulty, experiment 3 ('Simple Arithmetic Task') introduces redundancy, where from 10 inputs only 2 are relevant. NALU improves in performance for exponentially distributed data when redundant inputs are introduced. iNALU shows improvements for multiplication, where the unit is able to succeed on previously failed training ranges such as an exponential distribution with a scale parameter of 5 (i.e. lambda 0.2), but worsens for division.

Heim et al. (2020) highlight the relevance gate's use via a toy experiment that selects one of two inputs. Additionally, they demonstrate an application of a stacked NAU-NPU unit for equation discovery on an epidemiological model.

We now compare existing findings across units. NALU is no longer considered the state-of-the-art for neural arithmetic operation learning.
For each operation the best sub-unit is as follows: addition or subtraction: NAU; multiplication: NMU; division: NPU (or RealNPU if the task is trivial).

iNALU generally outperforms NALU at the cost of additional parameters and complexity in the model. The magnitude of iNALU's improvement varies, as Schlör et al. (2020) claim vast improvements, while Heim et al. (2020) claim minor ones. For division, the iNALU and NALU performances remain comparable. Success on multiplication is dependent on the input training range. Heim et al. (2020) state the NMU outperforms iNALU on multiplication (as expected), but also on addition and subtraction. The reason lies in the architecture used. The model is a stacked NAU-NMU, meaning the addition/subtraction would be modelled by the NAU. Therefore, the NMU would only be required to act as a selector, selecting the output of the summation (i.e. have a single weight at 1 and the rest at 0). Therefore, if two NMUs are stacked together we expect failure on a pure addition/subtraction task, as shown in Appendix C.7 of Madsen and Johansen (2020). Surprisingly, the 2-layer NMU was able to get 56% success for subtraction, though 0% success for addition (Madsen and Johansen, 2020). Heim et al. (2020) is the only work (at the time of writing this paper) to experimentally compare the main units mentioned. Results show the NPU outperforms iNALU for multiplication and division. When stacked on top of a NAU, the NPU performs similarly to the NMU for addition and subtraction. The NPU is outperformed by the NMU for multiplication, however it is more consistent in converging across different runs. For addition and subtraction, the NAU-NMU is the sparsest unit (having the least number of non-zero weights). Interpretable units require the weight and gate values to be discrete. Regularisation penalties have been a popular approach to achieve this (Madsen and Johansen, 2020; Schlör et al., 2020). The NPU uses L1 regularisation for arithmetic tasks, encouraging sparsity over discretisation. This may explain the results from Heim et al. (2020) where NMU models are generally sparser than NPUs for multiplication.
7. Applications of NALU
This section describes uses of NALU as a sub-component in architectures tackling practical problems outside the domain of solving arithmetic on numeric inputs. Success and failure cases are mentioned. We choose to focus on NALU applications on the basis that the improved units discussed above can be applied in place of NALU to provide additional performance gains to the mentioned applications.
Before discussing applications, we raise awareness of a case where the NALU is not utilised for its capabilities as a NALM. The Language to Number Translation task in Trask et al. (2018) converts numbers in their string form to their numerical form (such as 'twenty one' to '21'). The NALU is applied to an LSTM's hidden state vector; it is therefore questionable whether the arithmetic capabilities of NALU are being used, as the NALU may also have to decode the numerical values from the LSTM vector.

Xiao et al. (2020) insert a NALU layer between a two-layer Gated Recurrent Unit (GRU) and a dense layer to predict vehicle trajectories on complex road sections (containing constantly changing directions). NALU improves the extrapolation capabilities to deal with abnormal input cases outside the range of the GRU hidden state outputs.

Raj et al. (2020) combine NAC+ sub-units before LSTM cells for fast training in the extraction of temporal features to classify videos of badminton strokes. They further experiment with using NAC+ units with a dense layer to learn temporal transformations, finding better performance than the LSTM-based module, with the dense modules being quicker to train. They justify the use of the NAC+ as a way to produce sparse representations of frames, as non-relevant pixels would not be selected by the NAC+, resulting in 0 values, while relevant pixels accumulate.

Zhang et al. (2019a) use deep reinforcement learning to learn to schedule views on content-delivery networks (CDNs) for crowdsourced live streaming (CLS). NALU's extrapolative ability alleviates the issue of data bias (which is the failure of models outside the training range) by using NALU to build an offline simulator to train the agent when learning to choose actions. The simulator is composed of a 2-layer LSTM with a NALU layer attached to the end. Zhang et al. (2019b) propose a novel framework (named Livesmart) for cost-efficient CLS scheduling on CDNs with a quality-of-service (QoS) guarantee. Two components required in Livesmart contain models using NALU. The first component (named the new viewer predictor) uses a stacked LSTM-NALU to predict workloads from new viewers. The second component (named the QoS characterizer) predicts the QoS of a CDN provider. This component uses a stack of Convolutional Neural Networks (CNNs), an LSTM and NALU. Both components use NALU's ability to capture OOD data to aid in dealing with rare events and unexpected data.

Wu et al. (2020) combine layers of NAC+ to learn to do addition and subtraction on vector embeddings to form novel compositions for creating analogies. Units are applied to the output of an attention module (scoring candidate analogies) that is passed through an MLP. The output of the NAC+ units is passed to an LSTM producing the final analogy encoding.

NALU has also been used with CNNs. Rajaa and Sahoo (2019) apply stacked NALUs to the end of convolution units to predict future stock prices. Rana et al. (2020) utilise the NAC+/NALU as residual connection modules in larger convolutional networks such as U-Net and fully convolutional regression networks for cell counting in images. Such connections enable better generalisation when transitioning to data with higher cell counts than the training data. However, no observations are made as to what the units learn that leads to an improvement on cell counting over the baseline models.

Chennupati et al. (2020) use NALU as part of a larger architecture to predict the runtime of code on different hardware devices configured using hyperparameters.
NALU predicts the reuse profile of the program, keeping track of the count of memory references accessed in the execution trace. NALU outperforms a Genetic Programming approach for this prediction.

Teitelman et al. (2020) explore the problem domain of cloning black-box functionality in a generalisable and interpretable way. A decision tree is trained to differentiate between different tasks of the black box. Each leaf of the tree is assigned a neural network comprising stacked dense layers with a NALU layer between them. Each neural network is able to learn the black-box behaviour for a particular task. Like Xiao et al. (2020), results showed that NALU is required to learn the more complex tasks.

Finally, Sestili et al. (2018) suggest NALU has potential use in networks which predict security defects in code. This is due to the unit's ability to work with numerical inputs in a generalisable manner, instead of limiting the application to being bound to a fixed token vocabulary requiring lookups.
There exist situations where alternative architectures are favoured over NALU. Madsen and Johansen (2020) show that the NAU/NMU outperforms NALU in the MNIST sequence task for both addition and multiplication. Dai and Muggleton (2020) show the arithmetic ability (termed background knowledge) of NALU is incapable of performing the MNIST task for addition or products when combined with an LSTM. Instead, they show a neural model for symbolic learning, which learns logic programs using pre-defined rules as background knowledge, can perform with over 95% accuracy. However, we question whether the failure is a result of NALU or of the misuse of its abilities from combining it with an LSTM. Jacovi et al. (2019) show that in black-box cloning for the Trask et al. (2018) MNIST addition task, their EstiNet model, which captures non-differentiable models, outperforms NALU. Though it can be argued that a more relevant comparison would test the NAC+ or the NAU, which are solely designed for addition. Joseph-Rivlin et al. (2019) show that although the NAC• can learn the order of a polynomial transformation to a high accuracy, it is still outperformed by a pre-defined order-two polynomial model. Results suggest that the NAC• may not have fully converged to express integer orders. Dobbels et al. (2020) found NALU was unable to extrapolate for the task of predicting far-infrared radiation fluxes from ultraviolet-mid-infrared fluxes. Though no clear reason was stated, the lack of extrapolation could be attributed to the co-dependence of features caused by applying fully connected layers prior to the unit. Jia et al. (2020) consider NALU as a hardware component, concluding that NALU has too high an area and power cost to be feasible for practical use. Implementing addition costs 17 times the area of a digital adder, and the memory requirements for weight storage are energy-inefficient for doing CPU operations.
8. Remaining Gaps
This section discusses areas which remain to be fully addressed. We focus on: division, robustness, compositionality, and the interpretability of more complex architectures.
Division remains a challenge. To date, no unit has been able to reliably solve division. Currently the NPU by Heim et al. (2020) is the best unit to use, though it would struggle with input values close to zero. Madsen and Johansen (2020) argue that modelling division is not possible due to the singularity issue. One suggestion for dealing with the zero case is to take influence from Reimann and Schwung (2019), which can have an option for signalling that an output is invalid (or in their case, all off values).

One goal of these units is to be able to extrapolate. To achieve this, a unit should be robust to being trained on any input range. Madsen and Johansen (2020) show that units are unable to achieve full success on all tested ranges (with the stacked NAU-NMU failing on a training range of [1.1, 1.2), being unable to obtain a single success). Reinitialisation of the weights during training (Schlör et al., 2020) could provide a solution, however this seems unlikely given Madsen and Johansen (2020) test against 100 model initialisations.
Compositionality is desirable. A model should be flexible, having the option to select different types of operations and to model complex mathematical expressions. Currently the two popular approaches are gating and stacking. Gating has been found not to work as expected and gives convergence issues. Stacking, though more reliable, has fewer options in operation selection than gating. Deep stacking of units (in a non-recurrent fashion) remains untested.

It remains to be understood how units influence the learning of other modules (such as recurrent networks and CNNs) and their representations, for example, seeing whether representations become more interpretable because of being trained with a unit.
9. Related Work
We outline alternative research on neural models for solving arithmetic tasks. Such works require components such as convolutions (Kaiser and Sutskever, 2016) or Transformers (Saxton et al., 2019; Lample and Charton, 2020). Neural GPUs can extrapolate to long sequence lengths (2000) from being trained on length-20 inputs, but use binary inputs rather than real numbers (Kaiser and Sutskever, 2016). Furthermore, only a few models generalise to such long sequences, though this has been improved on in Freivalds and Liepins (2017). Even more complex architectures such as Transformers, which can process numerical values, remain unsuccessful at extrapolation tasks which are simple, e.g. arithmetic using multiplication (Saxton et al., 2019), or complex, e.g. integration (Lample and Charton, 2020). Other approaches which can process raw numerical inputs include reinforcement learning or non-specialised architectures. The hierarchical reinforcement learning approach of Chen et al. (2018) requires the arithmetic operation/s to be defined in the input. The non-specialised architecture of Nollet et al. (2020) trains using task decomposition and active learning, but is not fully robust to noisy redundant inputs. In short, though various alternatives to NALMs exist, each has its own shortcomings in regard to input format, extrapolation, and robustness.
10. Conclusion
NALMs are a promising area of research for systematic generalisation. Focusing on the first NALM, the NALU, we explained the unit's limitations along with existing solutions from other units: iNALU, NAU, NMU, NPU, and CalcNet. There exists a range of applications for NALU, though some uses remain questionable. Cross-comparing units suggests inconsistencies in experiment methodology and limitations in the current state-of-the-art units. Finally, we outlined the remaining research gaps regarding: solving division, robustness, compositionality, and the interpretability of complex architectures.
Acknowledgments
We would like to thank Andreas Madsen for informative discussions and explanations regarding the Neural Arithmetic Units.
Unit Illustrations
Table 3 lists the unit illustrations given in their respective papers, in chronological order.
References
Ben Bogin, Sanjay Subramanian, Matt Gardner, and Jonathan Berant. Latent compositional representations improve systematic generalization in grounded question answering. arXiv preprint arXiv:2007.00266, 2020. URL https://arxiv.org/pdf/2007.00266.pdf.
Kaiyu Chen, Yihan Dong, Xipeng Qiu, and Zitian Chen. Neural arithmetic expression calculator, 2018. URL https://arxiv.org/pdf/1809.08590.pdf.
Gopinath Chennupati, Nandakishore Santhi, Phill Romero, and Stephan Eidenbenz. Machine learning enabled scalable performance prediction of scientific codes. arXiv preprint arXiv:2010.04212, 2020. URL https://arxiv.org/pdf/2010.04212.pdf.
Wang-Zhou Dai and Stephen H. Muggleton. Abductive knowledge induction from raw data, 2020. URL https://arxiv.org/pdf/2010.03514.pdf.
Wouter Dobbels, Maarten Baes, Sébastien Viaene, S. Bianchi, J. I. Davies, V. Casasola, C. J. R. Clark, J. Fritz, M. Galametz, F. Galliano, et al. Predicting the global far-infrared SED of galaxies via machine learning techniques. Astronomy & Astrophysics, 634:A57, 2020. URL https://arxiv.org/pdf/1910.06330.pdf.
Lukas Faber and Roger Wattenhofer. Neural status registers. arXiv preprint arXiv:2004.07085, 2020. URL https://arxiv.org/pdf/2004.07085.pdf.
Jerry A. Fodor, Zenon W. Pylyshyn, et al. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988. URL https://uh.edu/~garson/F&P1.PDF.
Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning. arXiv preprint arXiv:1702.08727, 2017. URL https://arxiv.org/pdf/1702.08727.pdf.
Niklas Heim, Tomáš Pevný, and Václav Šmídl. Neural power units. Advances in Neural Information Processing Systems, 33, 2020. URL https://papers.nips.cc/paper/2020/file/48e59000d7dfcf6c1d96ce4a603ed738-Paper.pdf.
Alon Jacovi, Guy Hadash, Einat Kermany, Boaz Carmeli, Ofer Lavi, George Kour, and Jonathan Berant. Neural network gradient-based learning of black-box function interfaces. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1e13s05YX.
T. Jia, Y. Ju, R. Joseph, and J. Gu. NCPU: An embedded neural CPU architecture on resource-constrained low power devices for real-time end-to-end performance. Pages 1097–1109, 2020. doi: 10.1109/MICRO50266.2020.00091. URL https://ieeexplore.ieee.org/document/9251958.
M. Joseph-Rivlin, A. Zvirin, and R. Kimmel. Momenêt: Flavor the moments in learning to classify shapes. Pages 4085–4094, 2019. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9022223.
Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations, ICLR, 2016. URL http://arxiv.org/abs/1511.08228.
Brenden M. Lake. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9791–9801, 2019. URL https://proceedings.neurips.cc/paper/2019/file/f4d0e2e7fc057a58f7ca4a391f01940a-Paper.pdf.
Guillaume Lample and François Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1eZYeHFDS.
Zachary C. Lipton. The mythos of model interpretability. Communications of the ACM, 61(10):35–43, June 2016. URL http://arxiv.org/abs/1606.03490.
Andreas Madsen and Alexander Rosenberg Johansen. Neural arithmetic units. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gNOeHKPS.
Bastien Nollet, Mathieu Lefort, and Frédéric Armetta. Learning arithmetic operations with a multistep deep learning. Pages 1–8. IEEE, 2020. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9206963.
Aditya Raj, Pooja Consul, and Sakar K. Pal. Fast neural accumulator (NAC) based badminton video action classification. In Proceedings of SAI Intelligent Systems Conference, pages 452–467. Springer, 2020. URL https://link.springer.com/chapter/10.1007/978-3-030-55180-3_34.
Shangeth Rajaa and Jajati Keshari Sahoo. Convolutional feature extraction and neural arithmetic logic units for stock prediction. In International Conference on Advances in Computing and Data Sciences, pages 349–359. Springer, 2019. URL https://link.springer.com/chapter/10.1007/978-981-13-9939-8_31.
Ashish Rana, Avleen Malhi, and Kary Främling. Exploring numerical calculations with CalcNet. Pages 1374–1379. IEEE, 2019. URL https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8995315.
Ashish Rana, Taranveer Singh, Harpreet Singh, Neeraj Kumar, and Prashant Singh Rana. Systematically designing better instance counting models on cell images with neural arithmetic logic units, 2020. URL https://arxiv.org/pdf/2004.06674.pdf.
Jan Niclas Reimann and Andreas Schwung. Neural logic rule layers. arXiv preprint arXiv:1907.00878, 2019. URL https://arxiv.org/pdf/1907.00878.pdf.
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gR5iR5FX.
Daniel Schlör, Markus Ring, and Andreas Hotho. iNALU: Improved neural arithmetic logic unit. Frontiers in Artificial Intelligence, 3:71, 2020. ISSN 2624-8212. doi: 10.3389/frai.2020.00071.
Carson D. Sestili, William S. Snavely, and Nathan M. VanHoudnos. Towards security defect prediction with AI. arXiv preprint arXiv:1808.09897, 2018. URL https://arxiv.org/pdf/1808.09897.pdf.
Daniel Teitelman, I. Naeh, and Shie Mannor. Stealing black-box functionality using the deep neural tree architecture. ArXiv, abs/2002.09864, 2020. URL https://arxiv.org/pdf/2002.09864.pdf.
Andrew Trask, Felix Hill, Scott E. Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pages 8035–8044, 2018. URL https://openreview.net/pdf?id=H1gNOeHKPS.
Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, and Shih-Fu Chang. Analogical reasoning for visually grounded language acquisition. arXiv preprint arXiv:2007.11668, 2020. URL https://arxiv.org/pdf/2007.11668.pdf.
Zhu Xiao, Fancheng Li, Ronghui Wu, Hongbo Jiang, Yupeng Hu, Ju Ren, Chenglin Cai, and Arun Iyengar. TrajData: On vehicle trajectory collection with commodity plug-and-play OBU devices. IEEE Internet of Things Journal, 2020. URL https://ieeexplore.ieee.org/document/9115028.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, and Yoram Singer. Identity crisis: Memorization and generalization under extreme overparameterization. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l6y0VFPr.
Rui-Xiao Zhang, Tianchi Huang, M. Ma, Haitian Pang, Xin Yao, Chenglei Wu, and L. Sun. Enhancing the crowdsourced live streaming: a deep reinforcement learning approach. Proceedings of the 29th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, 2019a. URL https://dl.acm.org/doi/10.1145/3304112.3325607.
Ruixiao Zhang, M. Ma, Tianchi Huang, Haitian Pang, X. Yao, Chenglei Wu, J. Liu, and L. Sun. Livesmart: A QoS-guaranteed cost-minimum framework of viewer scheduling for crowdsourced live streaming. Proceedings of the 27th ACM International Conference on Multimedia, 2019b. URL https://dl.acm.org/doi/10.1145/3343031.3351013.
Table 3: Unit architecture illustrations taken from the original papers, in chronological order.

NALU: Trask et al. (2018).
NLR: Reimann and Schwung (2019).
G-NALU: Rana et al. (2019) (no figure exists).
NAU: Madsen and Johansen (2020) (no figure exists).
NMU: Madsen and Johansen (2020).
NSR: Faber and Wattenhofer (2020).
iNALU: Schlör et al. (2020).
NPU: Heim et al. (2020).