Continual Learning in Low-rank Orthogonal Subspaces
Arslan Chaudhry, Naeemullah Khan, Puneet K. Dokania, Philip H. S. Torr
University of Oxford & Five AI Ltd., UK
[email protected]

Abstract
In continual learning (CL), a learner is faced with a sequence of tasks, arriving one after the other, and the goal is to remember all the tasks once the continual learning experience is finished. The prior art in CL uses episodic memory, parameter regularization or extensible network structures to reduce interference among tasks, but in the end, all the approaches learn different tasks in a joint vector space. We believe this invariably leads to interference among different tasks. We propose to learn tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Further, to keep the gradients of different tasks coming from these subspaces orthogonal to each other, we learn isometric mappings by posing network training as an optimization problem over the Stiefel manifold. To the best of our understanding, we report, for the first time, strong results over the experience-replay baseline, with and without memory, on standard classification benchmarks in continual learning.

Introduction

In continual learning, a learner experiences a sequence of tasks with the objective of remembering all or most of the observed tasks so as to speed up the transfer of knowledge to future tasks. Learning from a diverse sequence of tasks is useful, as it allows for the deployment of machine learning models that can quickly adapt to changes in the environment by leveraging past experiences. Contrary to the standard supervised learning setting, where only a single task is available and the learner can make several passes over the dataset of that task, the sequential arrival of multiple tasks poses unique challenges for continual learning. Chief among these is catastrophic forgetting [McCloskey and Cohen, 1989], whereby the global update of model parameters on the present task interferes with the learned representations of past tasks. This results in the model forgetting the previously acquired knowledge.

In neural networks, to reduce the deterioration of accumulated knowledge, existing approaches modify the network training broadly in three different ways. First, regularization-based approaches [Kirkpatrick et al., 2016, Zenke et al., 2017, Aljundi et al., 2018, Chaudhry et al., 2018, Nguyen et al., 2018] reduce the drift in network parameters that were important for solving previous tasks. Second, modular approaches [Rusu et al., 2016, Lee et al., 2017] add network components as new tasks arrive; these approaches rely on the knowledge of correct module selection at test time. Third, and perhaps the strongest, memory-based approaches [Lopez-Paz and Ranzato, 2017, Hayes et al., 2018, Isele and Cosgun, 2018, Riemer et al., 2019] maintain a small replay buffer, called episodic memory, and mitigate catastrophic forgetting by replaying the data in the buffer along with the new task data. One common feature among all three categories is that, in the end, all the tasks are learned in the same vector space, where a vector space is associated with the output of a hidden layer of the network. We believe this restriction invariably leads to forgetting of past tasks.

Code: https://github.com/arslan-chaudhry/orthog_subspace
Figure 1: ORTHOG-SUBSPACE. Each blob, with its three ellipses, represents a vector space and its subspaces at a certain layer. The projection operator at layer L keeps the subspaces orthogonal (no overlap). The overlap in the intermediate layers is minimized when the weight matrices are learned on the Stiefel manifold.

In this work, we propose to learn different tasks in different vector subspaces. We require these subspaces to be orthogonal to each other in order to prevent the learning of a task from interfering catastrophically with previous tasks. More specifically, for a point in a vector space R^m, typically the second last layer of the network, we project each task to a low-dimensional subspace via a task-specific projection matrix P ∈ R^{m×m} of rank r, where r ≪ m. The projection matrices are generated offline such that, for different tasks, they are mutually orthogonal. This simple projection in the second last layer reduces forgetting considerably in shallower networks: in a three-layer network, it improves average accuracy and reduces forgetting by large margins compared to the strongest experience replay baseline [Chaudhry et al., 2019b]. However, in deeper networks, the gradients backpropagated from the different projections of the second last layer do not remain orthogonal to each other in the earlier layers, resulting in interference in those layers. To reduce this interference, we use the fact that a gradient at an earlier layer is a transformed version of the gradient received at the projected layer, where the transformation is linear and consists of the product of the weight matrices and the diagonal Jacobian matrices of the non-linearities of the layers in between. Reducing interference then requires this transformation to be inner-product preserving, such that, if two vectors are orthogonal at the projected layer, they remain close to orthogonal after the transformation. This is equivalent to learning orthonormal weight matrices, a well-studied problem of learning on Stiefel manifolds [Absil et al., 2009, Bonnabel, 2013]. Our approach, dubbed ORTHOG-SUBSPACE, generates two projected orthogonal vectors (gradients), one for the current task and another for one of the previous tasks whose data is stored in a tiny replay buffer, and updates the network weights such that the weights remain on a Stiefel manifold. We visually describe our approach in Fig. 1. For the same amount of episodic memory,
ORTHOG-SUBSPACE improves upon the strong experience replay baseline in both average accuracy and forgetting on deeper networks.

Preliminaries

In this section, we describe the continual learning setup, followed by the necessary preliminaries for our approach.
We assume a continual learner experiencing a stream of data triplets (x_i, y_i, t_i) containing an input x_i, a target y_i, and a task identifier t_i ∈ T = {1, ..., T}. Each input-target pair (x_i, y_i) ∈ X × Y_{t_i} is an independently and identically distributed example drawn from some unknown distribution P_{t_i}(X, Y), representing the t_i-th learning task. We assume that the tasks are experienced in order, t_i ≤ t_j for all i ≤ j, and that the learner cannot store any but a few samples from P_{t_i} in a tiny replay buffer M_i. Under this setup, our goal is to estimate a predictor f = (w ∘ Φ): X × T → Y, composed of a feature extractor Φ_Θ: X → H, which is an L-layer feed-forward neural network parameterized by Θ = {W_l}_{l=1}^{L}, and a classifier w_θ: H → Y, that minimizes the multi-task error

    (1/T) Σ_{t=1}^{T} E_{(x,y)∼P_t} [ℓ(f(x,t), y)],    (1)

where H ⊆ R^m is an inner product space, Y = ∪_{t∈T} Y_t, and ℓ: Y × Y → R_{≥0} is a loss function. To further comply with the strict sequential setting, similar to prior work [Lopez-Paz and Ranzato, 2017, Riemer et al., 2019], we consider streams of data that are experienced only once. We only focus on classification tasks where either the input or the output distribution changes over time. We assume that a task descriptor, identifying the correct classification head, is given at both train and test times.

Metrics
Once the continual learning experience is finished, we measure two statistics to evaluate the quality of the algorithm: average accuracy and average maximum forgetting. First, the average accuracy is defined as

    Accuracy = (1/T) Σ_{j=1}^{T} a_{T,j},    (2)

where a_{i,j} denotes the test accuracy on task j after the model has finished experiencing task i. Second, the average maximum forgetting is defined as

    Forgetting = (1/(T−1)) Σ_{j=1}^{T−1} max_{l ∈ {1,...,T−1}} (a_{l,j} − a_{T,j}),    (3)

that is, the decrease in performance for each of the tasks between their peak accuracy and their accuracy after the continual learning experience is finished.

Orthogonal Projections and Isometries

Let the inner product in H be denoted by ⟨·,·⟩, and let v be an element of H. A matrix O ∈ R^{m×r}, where r ≪ m, parameterizes an m × m dimensional orthogonal projection matrix P, given by P = O(O^⊤O)^{-1}O^⊤, where rank(P) = r. A vector u = Pv is the projection of v in a subspace U ⊂ H with dim(U) = r. Furthermore, if the columns of O are assumed to be orthonormal, then the projection matrix simplifies to P = OO^⊤.

Definition 2.1 (Orthogonal Subspace). Subspaces U and W of a vector space H are orthogonal if ⟨u, w⟩ = 0, ∀u ∈ U, w ∈ W.

Definition 2.2 (Isometry). A linear transformation T: V → V is called an isometry if it is distance preserving, i.e., ‖T(v) − T(w)‖ = ‖v − w‖, ∀v, w ∈ V.
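To make the projection construction and Definition 2.1 concrete, the following NumPy sketch (our own illustration; the variable names are not from the paper's code base) builds two rank-r projections onto mutually orthogonal subspaces and checks their properties numerically:

import numpy as np

m, r = 64, 16                                 # ambient dimension, subspace rank

# Random orthonormal basis of R^m via QR (equivalently, Gram-Schmidt).
Q, _ = np.linalg.qr(np.random.randn(m, m))

O1, O2 = Q[:, :r], Q[:, r:2 * r]              # disjoint sets of r columns each
P1, P2 = O1 @ O1.T, O2 @ O2.T                 # P = O O^T, rank-r projections

assert np.allclose(P1 @ P1, P1)               # idempotent: P1 is a projection
assert np.allclose(P1 @ P2, 0, atol=1e-12)    # the two subspaces are orthogonal

# Definition 2.1: projections of any vectors into the two subspaces have zero
# inner product.
v, w = np.random.randn(m), np.random.randn(m)
print(np.dot(P1 @ v, P2 @ w))                 # ~ 0 up to numerical precision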
A linear transformation that preserves distance must preserve angles, and vice versa. We record this in the following theorem.

Theorem 2.1. T is an isometry iff it preserves inner products.

The proof is fairly standard and given in Appendix A.
Corollary 2.1.1. If T_1 and T_2 are two isometries, then their composition T_1 ∘ T_2 is also an isometry.

(Note that an orthogonal matrix is always square; the matrices we consider, however, can be non-square. In this work, an orthogonal matrix is used in the sense of W^⊤W = I.)
Figure 2: The gradient computed at a given point W_t on the manifold is first projected onto the tangent plane (a closed form exists for this step). The projected gradient is then retracted to a point on the manifold, giving the final update W_{t+1}.

An orthogonal matrix preserves inner products and therefore acts as an isometry of Euclidean space. Enforcing orthogonality during network training corresponds to solving the following constrained optimization problem:

    min_{θ, Θ={W_l, b_l}_{l=1}^{L}} ℓ(f(x,t), y),  s.t.  W_l^⊤ W_l = I,  ∀l ∈ {1, ..., L},    (4)

where I is an identity matrix of appropriate dimensions. The solution set of the above problem is a valid Riemannian manifold when the inner product is defined. It is called the Stiefel manifold, defined as M_l = {W_l ∈ R^{n_l × n_{l−1}} | W_l^⊤ W_l = I}, where n_l is the number of neurons in layer l, and it is assumed that n_l ≥ n_{l−1}. For most neural network architectures, this assumption holds. For a convolutional layer W_l ∈ R^{c_out × c_in × h × w}, we reshape it to W_l ∈ R^{c_out × (c_in · h · w)}.

The optimization of a differentiable cost function over a Stiefel manifold has been extensively studied in the literature [Absil et al., 2009, Bonnabel, 2013]. Here, we briefly summarize the two main steps of the optimization process and refer the reader to Absil et al. [2009] for further details. For a given point W_l on the Stiefel manifold, let T_{W_l} represent the tangent space at that point. Further, let g_l, a matrix, be the gradient of the loss function with respect to W_l. The first step of the optimization projects g_l onto T_{W_l} using the closed form Proj_{T_{W_l}}(g_l) = A W_l, where A is a skew-symmetric matrix given by (see Appendix B for the derivation)

    A = g_l W_l^⊤ − W_l g_l^⊤.    (5)

Once the gradient projection onto the tangent space is found, the second step is to generate a descent curve of the loss function on the manifold. The Cayley transform defines one such curve using a parameter τ ≥ 0, specifying the length of the curve, and a skew-symmetric matrix U [Nishimori and Akaho, 2005]:

    Y(τ) = (I + (τ/2) U)^{−1} (I − (τ/2) U) W_l.    (6)

It can be seen that the curve stays on the Stiefel manifold, i.e., Y(τ)^⊤ Y(τ) = I and Y(0) = W_l, and that its tangent vector at τ = 0 is Y′(0) = −U W_l. By setting U = A = g_l W_l^⊤ − W_l g_l^⊤, the curve becomes a descent curve for the loss function. Li et al. [2020] showed that one can bypass the expensive matrix inversion in (6) by following the fixed-point iteration of the Cayley transform,

    Y(τ) = W_l − (τ/2) A (W_l + Y(τ)).    (7)

Li et al. [2020] further showed that, under some mild continuity assumptions, (7) converges to the closed form (6) faster than other approximation algorithms. The overall optimization on the Stiefel manifold is illustrated in Fig. 2.
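The resulting update is only a few lines of code. The following NumPy sketch (our own, not the authors' implementation; the adaptive step size used later in Algorithm 1 is omitted) performs one descent step that keeps W on the Stiefel manifold via the iterative Cayley transform:

import numpy as np

def stiefel_step(W, g, tau=0.05, s=2):
    """One retraction-based descent step for a point W (n x p, W^T W = I).

    W: current point on the manifold; g: Euclidean gradient dL/dW;
    tau: curve length; s: number of fixed-point iterations of eq. (7).
    """
    A = g @ W.T - W @ g.T                    # skew-symmetric matrix of eq. (5)
    Y = W - tau * (A @ W)                    # initialize with a plain tangent step
    for _ in range(s):                       # fixed-point iteration of the
        Y = W - (tau / 2.0) * A @ (W + Y)    # Cayley transform, eq. (7)
    return Y

n, p = 16, 8
W, _ = np.linalg.qr(np.random.randn(n, p))   # a random point on the manifold
g = np.random.randn(n, p)                    # an arbitrary Euclidean gradient
W_next = stiefel_step(W, g)
# Deviation from orthonormality stays small and vanishes as s grows:
print(np.linalg.norm(W_next.T @ W_next - np.eye(p)))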
ORTHOG-SUBSPACE

We now describe our continual learning approach. Consider a feature extractor in the form of a feed-forward neural network consisting of L hidden layers, that takes an input x ∈ R^d and passes it through the recursion h_l = σ(W_l h_{l−1} + b_l), where σ(·) is a non-linearity, h_0 = x, and h_L = φ ∈ R^m. The network is followed by an application-specific head, e.g., a classifier in the case of a classification task. The network can be thought of as a mapping, Φ: X → H, from one vector space (X ⊆ R^d) to another (H ⊆ R^m). When the network is trained for more than one task, a shared vector space H can be learned if the model has simultaneous access to all the tasks. In continual learning, on the other hand, when tasks arrive in a sequence, learning a new task can interfere in the space where a previous task was learned. This can result in the catastrophic forgetting of the previous task if the new task is different from the previous one(s). In this work, we propose to learn tasks in orthogonal subspaces such that learning a new task has minimal interference with already learned tasks.

We assume that the network is sufficiently parameterized, which often is the case with deep networks, so that all the tasks can be learned in independent subspaces. We define a family of sets V that partitions H, such that: a) V does not contain the empty set (∅ ∉ V); b) the union of the sets in V is equal to H (∪_{V_t ∈ V} V_t = H); and c) the intersection of any two distinct sets in V is empty ((∀ V_i, V_j ∈ V) i ≠ j ⟹ V_i ∩ V_j = ∅). A set V_t ∈ V defines a subspace for task t. We obtain such a subspace by projecting the feature map φ = h_L ∈ R^m into an r-dimensional space, where r ≪ m, via a projection matrix P_t ∈ R^{m×m} of rank r, i.e., we obtain the features for task t by φ_t = P_t h_L, while ensuring:

    P_t^⊤ P_t = I,    P_t^⊤ P_k = 0,  ∀k ≠ t.    (8)

The said projection matrix can be easily constructed by first generating a set of m random orthonormal basis vectors in R^m (we generate a random matrix and apply the Gram-Schmidt process offline, before the continual learning experience begins), then picking r = ⌊m/T⌋ of those basis vectors (matrix columns) to form a matrix O_t, and finally obtaining the projections as P_t = O_t O_t^⊤. For different tasks, these bases form a disjoint set P = {P_1, ..., P_T}. If the total number of tasks exceeds T, one can potentially resize the m × m orthogonal matrix dynamically while maintaining the required properties; for example, to make space for 2T tasks, one can resize the original matrix to 2m × 2m with zero padding, and back up the previous matrix. This would entail dynamically resizing the second last layer of the network. The set P can be computed offline and stored in a hash table that can be readily fetched given a task identifier. The projection only adds a single matrix multiplication in the forward pass of the network, making it as efficient as standard training.

Next, let us examine the effect of the projection step on the backward pass of the backpropagation algorithm. For a task t, the gradient of the objective ℓ(·,·) at any intermediate layer h_l can be decomposed using the chain rule as

    g_l^t = ∂ℓ/∂h_l = (∂ℓ/∂h_L) ∂h_L/∂h_l = (∂φ_t/∂h_L · ∂ℓ/∂φ_t) ∂h_L/∂h_l
          = (P_t ∂ℓ/∂φ_t) Π_{k=l}^{L−1} ∂h_{k+1}/∂h_k = g_L^t Π_{k=l}^{L−1} D_{k+1} W_{k+1},    (9)

where D_{k+1} is a diagonal matrix representing the Jacobian of the pointwise nonlinearity σ_{k+1}(·). For a ReLU nonlinearity, and assuming that the non-linearity remains in the linear region during training [Serra et al., 2017, Arora et al., 2019], we take the Jacobian matrix to be an identity. It can be seen that for the projected layer L (the second last layer), the gradients of different tasks are orthogonal by construction, i.e., g_L^t ⟂ g_L^{k≠t} (cf. (8)). Hence the gradient interference is zero at layer L. However, according to (9), as the gradients are backpropagated to earlier layers they become less and less orthogonal (cf. Fig. 4). This results in interference among different tasks in the earlier layers, especially when the network is relatively deep.

Let us rewrite the gradient at an intermediate layer l during the training of task t as a linear transformation of the gradient at layer L, i.e., g_l^t = T(g_L^t). According to (9), and assuming the Jacobian matrix of the non-linearity to be identity (D_k = I), this transformation is given by

    T(u) = u Π_{k=l}^{L−1} W_{k+1}.    (10)
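A small numerical illustration of why the isometry matters: in the sketch below (our own, with D_k = I as above), two gradients that are orthogonal at the projected layer stay orthogonal under an orthonormal weight matrix, but not under an unconstrained one.

import numpy as np

m = 64
g_t = np.concatenate([np.random.randn(m // 2), np.zeros(m // 2)])  # task t gradient
g_k = np.concatenate([np.zeros(m // 2), np.random.randn(m // 2)])  # task k gradient
print(g_t @ g_k)                        # 0: orthogonal at the projected layer L

W_free = np.random.randn(m, m)                     # unconstrained weight matrix
W_orth, _ = np.linalg.qr(np.random.randn(m, m))    # orthonormal weight matrix

# One factor of the linear map T(u) = u W of eq. (10):
print((g_t @ W_free) @ (g_k @ W_free))  # generally far from 0: interference
print((g_t @ W_orth) @ (g_k @ W_orth))  # ~ 0: the isometry preserves orthogonality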
As noted earlier, g_L^t ⟂ g_L^{k≠t} by construction. To reduce the interference between any g_l^t and g_l^{k≠t}, the transformation T(·) in (10) therefore needs to preserve the inner product between T(g_L^t) and T(g_L^{k≠t}); in other words, T(·) needs to be an isometry (Definition 2.2). As discussed above, this is equivalent to ensuring that the weight matrices {W_l}_{l=1}^{L} are orthonormal matrices.

We learn orthonormal weights of a neural network by posing the network training as an optimization problem over a Stiefel manifold [Absil et al., 2009, Bonnabel, 2013]. More specifically, the network is initialized from random orthonormal matrices. A tiny replay buffer, storing examples of the past tasks (k < t), is maintained to compute the gradients {g_l^k}_{l=1}^{L}. The gradients on the current task t, {g_l^t}_{l=1}^{L}, are computed, and the weights in each layer l are updated as follows: a) first, the effective gradient g_l = g_l^t + g_l^k is projected onto the tangent space at the current estimate of the weight matrix W_l; b) then, the iterative Cayley transform (7) is used to retract the update to the Stiefel manifold. The projection onto the tangent space is carried out using the closed form described in (5). The resulting algorithm keeps the network weights orthonormal throughout the continual learning experience while reducing the loss using the projected gradients. Fig. 3 shows that this orthonormality reduces the inner product between the gradients of different tasks. We denote our approach ORTHOG-SUBSPACE and provide the pseudo-code in Alg. 1.

Algorithm 1 Training of ORTHOG-SUBSPACE on sequential data D = {D_1, ..., D_T}, with Θ = {W_l}_{l=1}^{L} initialized as orthonormalized matrices, P = {P_1, ..., P_T} orthogonal projections, learning rate α, and Cayley-transform hyper-parameters s, q, ε (s = 2).

1:  procedure ORTHOG-SUBSPACE(D, P, α, s, q, ε)
2:    M ← ({}, ..., {})                        ▷ One empty ring buffer per task
3:    for t ∈ {1, ..., T} do
4:      for (x_t, y_t) ∼ D_t do
5:        k ∼ {1, ..., t−1}                    ▷ Sample a past task from the replay buffer
6:        (x_k, y_k) ∼ M_k                     ▷ Sample data from the episodic memory
7:        g^t ← ∇_{Θ,θ} ℓ(f(x_t, t), y_t)      ▷ Gradient on the current task (projection P_t)
8:        g^k ← ∇_{Θ,θ} ℓ(f(x_k, k), y_k)      ▷ Gradient on the past task (projection P_k)
9:        g ← g^t + g^k
10:       for l = {1, ..., L} do               ▷ Layer-wise update on the Stiefel manifold
11:         A ← g_l W_l^⊤ − W_l g_l^⊤
12:         U ← A W_l                          ▷ Project the gradient onto the tangent space
13:         τ ← min(α, q / (‖W_l‖ + ε))        ▷ Select adaptive learning rate [Li et al., 2020]
14:         Y_0 ← W_l − τ U                    ▷ Iterative estimation of the Cayley transform
15:         for i = {1, ..., s} do
16:           Y_i ← W_l − (τ/2) A (W_l + Y_{i−1})
17:         end for
18:         W_l ← Y_s
19:       end for
20:       θ ← θ − α · g_{L+1}                  ▷ Update the classifier head
21:       M_t ← M_t ∪ {(x_t, y_t)}             ▷ Add the sample to a ring buffer
22:     end for
23:   end for
24:   return Θ, θ
25: end procedure
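For concreteness, a condensed PyTorch sketch of one inner-loop step of Algorithm 1 follows. This is our paraphrase under simplifying assumptions (every parameter of model is treated as a weight matrix, memory[k] is a list of stored (x, y) pairs, and the adaptive step size and memory update are omitted); the authors' reference implementation lives in the linked repository.

import random
import torch

def train_step(model, head, proj, memory, x_t, y_t, t, loss_fn, alpha=0.1, s=2):
    """One step of Algorithm 1: replay one past task, then a layer-wise
    Stiefel update of the feature extractor and plain SGD on the head."""
    batches = [(x_t, y_t, proj[t])]
    if t > 0:                                   # lines 5-6: sample a past task
        k = random.randrange(t)
        x_k, y_k = random.choice(memory[k])
        batches.append((x_k, y_k, proj[k]))

    model.zero_grad(); head.zero_grad()
    for x, y, P in batches:                     # lines 7-9: g = g^t + g^k
        feats = model(x) @ P                    # project the features: phi = P h_L
        loss_fn(head(feats), y).backward()      # gradients accumulate

    with torch.no_grad():
        for W in model.parameters():            # lines 10-19: Cayley update
            Wm = W.reshape(W.shape[0], -1)      # conv kernels flattened to 2-D
            g = W.grad.reshape_as(Wm)
            A = g @ Wm.T - Wm @ g.T             # skew-symmetric matrix, eq. (5)
            Y = Wm - alpha * (A @ Wm)
            for _ in range(s):                  # iterative Cayley transform (7)
                Y = Wm - (alpha / 2) * A @ (Wm + Y)
            W.copy_(Y.reshape(W.shape))
        for w in head.parameters():             # line 20: update the classifier
            w -= alpha * w.grad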
Experiments

We now report experiments on continual learning benchmarks in classification tasks.
We evaluate average accuracy (2) and forgetting (3) on four supervised classification benchmarks.
Permuted MNIST is a variant of the MNIST dataset of handwritten digits [LeCun, 1998] where each task applies a fixed random pixel permutation to the original dataset. Rotated MNIST is another variant of MNIST, where each task applies a fixed random image rotation (between 0 and 180 degrees) to the original dataset. Both of the MNIST benchmarks contain a sequence of tasks, each with samples from the 10 digit classes. Split CIFAR is a variant of the CIFAR-100 dataset [Krizhevsky and Hinton, 2009, Zenke et al., 2017], where each task contains the data pertaining to 5 random classes (sampled without replacement) out of the total 100 classes. Split miniImageNet is a variant of the ImageNet dataset [Russakovsky et al., 2015, Vinyals et al., 2016], containing a subset of images and classes from the original dataset. Similar to Split CIFAR, in Split miniImageNet each task contains the data from 5 random classes (sampled without replacement) out of the total 100 classes. Both the CIFAR-100 and miniImageNet streams contain 20 tasks, each with samples from each of the 5 classes.

Similar to Chaudhry et al. [2019a], for each benchmark the first three tasks are used for hyper-parameter tuning (the grids are available in Appendix D). The learner can perform multiple passes over the datasets of these three initial tasks. We assume that the continual learning experience begins after these initial tasks and ignore them in the final evaluation.
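As an illustration, task streams of this kind can be generated with one fixed transformation per task. The sketch below is our own construction following the text (torchvision and its tensor-based rotate are assumed available):

import torch
from torchvision import datasets, transforms
from torchvision.transforms import functional as TF

mnist = datasets.MNIST(root="data", download=True, transform=transforms.ToTensor())

def permuted_transform(seed):
    """A fixed random pixel permutation shared by all images of one task."""
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(28 * 28, generator=gen)
    return lambda img: img.reshape(-1)[perm].reshape(1, 28, 28)

def rotated_transform(seed):
    """A fixed random rotation (0-180 degrees) shared by all images of one task."""
    gen = torch.Generator().manual_seed(seed)
    angle = float(torch.rand(1, generator=gen) * 180.0)
    return lambda img: TF.rotate(img, angle)

# e.g., the data for task 3 of Permuted MNIST:
# task_3 = [(permuted_transform(seed=3)(x), y) for x, y in mnist]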
We compare ORTHOG-SUBSPACE against several baselines, which we describe next. Finetune is a vanilla model trained on the data stream, without any regularization or episodic memory. iCaRL [Rebuffi et al., 2017] is a memory-based method that uses knowledge distillation [Hinton et al., 2014] and episodic memory to reduce forgetting. EWC [Kirkpatrick et al., 2016] is a regularization-based method that uses the Fisher information matrix to record the importance of parameters. VCL [Nguyen et al., 2018] is another regularization-based method that uses variational inference to approximate the posterior distribution of the parameters, which is regularized during the continual learning experience. AGEM [Chaudhry et al., 2019a] is a memory-based method, similar to [Lopez-Paz and Ranzato, 2017], that uses the episodic memory as an optimization constraint to reduce forgetting on previous tasks. MER [Riemer et al., 2019] is another memory-based method that uses a first-order meta-learning formulation [Nichol and Schulman, 2018] to reduce forgetting on previous tasks. ER-Ring [Chaudhry et al., 2019b] is the strongest memory-based method, which jointly trains on the new task data and that of the previous tasks. Finally, Multitask is an oracle baseline that has access to all the data to optimize (1); it is useful for estimating an upper bound on the obtainable accuracy (2).
Except for VCL, all baselines use the same unified code base, which is made publicly available. For VCL [Nguyen et al., 2018], the official implementation is used, which only works on fully-connected networks. All baselines use the same neural network architectures: a fully-connected network with two hidden layers of 256 ReLU neurons in the MNIST experiments, and a standard ResNet18 [He et al., 2016] in the CIFAR and ImageNet experiments. All baselines do a single pass over the dataset of a task, except that the episodic memory can be replayed multiple times. The task identifiers are used to select the correct output head in the CIFAR and ImageNet experiments. The batch size is fixed across experiments and models. A tiny ring memory of one example per class per task is stored for the memory-based methods. For ORTHOG-SUBSPACE, episodic memory is not used in the MNIST experiments, and the same amount of memory as the other baselines is used in the CIFAR-100 and miniImageNet experiments. All experiments are run for five different random seeds, each corresponding to a different dataset ordering among tasks, which are fixed across baselines. Averages and standard deviations are reported across these runs.
Tab. 1 shows the overall results on all benchmarks. First, we observe that on relatively shallow networks (the MNIST benchmarks), even without memory and without preserving orthogonality during network training, ORTHOG-SUBSPACE outperforms the strong memory-based baselines by a large margin: absolute gains of over 7% and 9% in average accuracy, and substantial reductions in forgetting, compared to the strongest baseline (ER-Ring) on Permuted and Rotated MNIST, respectively. This shows that learning in orthogonal subspaces is an effective strategy for reducing interference among different tasks. Second, for deeper networks, when memory is used and orthogonality is preserved, ORTHOG-SUBSPACE improves upon ER-Ring considerably in both average accuracy and forgetting on CIFAR-100 and miniImageNet. While we focus on tiny episodic memory, in Tab. 3 of Appendix C we provide results for larger episodic memory sizes. Our conclusions for tiny memory hold; however, the gap between the performance of ER-Ring and ORTHOG-SUBSPACE shrinks as the episodic memory size is increased: the network can sufficiently mitigate forgetting by relearning on a large episodic memory.
Table 1: Accuracy (2) and Forgetting (3) results of the continual learning experiments. When used, episodic memories contain up to one example per class per task. The last row is a multi-task oracle baseline.

METHOD                             MEMORY   PERMUTED MNIST           ROTATED MNIST
                                            ACCURACY    FORGETTING   ACCURACY    FORGETTING
Finetune                           ✗        –           –            –           –
EWC [Kirkpatrick et al., 2016]     ✗        –           –            –           –
VCL [Nguyen et al., 2018]          ✗        –           –            –           –
VCL-Random [Nguyen et al., 2018]   ✓        –           –            –           –
AGEM [Chaudhry et al., 2019a]      ✓        –           –            –           –
MER [Riemer et al., 2019]          ✓        –           –            –           –
ER-Ring [Chaudhry et al., 2019b]   ✓        –           –            –           –
ORTHOG-SUBSPACE (ours)             ✗        – (±0.91)   – (±0.01)    – (±0.95)   – (±0.01)
Multitask                                   –                        –

METHOD                             MEMORY   SPLIT CIFAR              SPLIT MINIIMAGENET
                                            ACCURACY    FORGETTING   ACCURACY    FORGETTING
Finetune                           ✗        –           –            –           –
EWC [Kirkpatrick et al., 2016]     ✗        –           –            –           –
iCaRL [Rebuffi et al., 2017]       ✓        –           –            –           –
AGEM [Chaudhry et al., 2019a]      ✓        –           –            –           –
MER [Riemer et al., 2019]          ✓        –           –            –           –
ER-Ring [Chaudhry et al., 2019b]   ✓        –           –            –           –
ORTHOG-SUBSPACE (ours)             ✓        – (±0.59)   – (±0.01)    – (±1.44)   – (±0.01)
Multitask                                   –                        –
Table 2: Systematic evaluation of Projection, Memory and Orthogonalization in ORTHOG-SUBSPACE.

PROJECTION   ER   STIEFEL   SPLIT CIFAR              SPLIT MINIIMAGENET
                            ACCURACY    FORGETTING   ACCURACY    FORGETTING
✓            ✗    ✗         –           –            –           –
✗            ✓    ✗         –           –            –           –
✓            ✓    ✗         –           –            –           –
✓            ✓    ✓         –           –            –           –
Tab. 2 shows a systematic evaluation of the projection (8), the episodic memory, and the orthogonalization (7) in ORTHOG-SUBSPACE. First, without memory and orthogonalization, while a simple projection yields competitive results compared to various baselines (cf. Tab. 1), the performance is still a far cry from ER-Ring. However, when memory is used along with the subspace projection, one can already see better performance compared to ER-Ring. Lastly, when orthogonality is ensured by learning on a Stiefel manifold, the model achieves the best performance both in terms of accuracy and forgetting.

Finally, Fig. 3 shows the distribution of the inner products of gradients between the current and previous tasks, the latter stored in the episodic memory. Everything is kept the same, except that in one case the weight matrices are learned on the Stiefel manifold, while in the other no such constraint is placed on the weights. We observe that when the weights remain on the Stiefel manifold, the distribution is more peaked around zero. This empirically validates our hypothesis that by keeping the transformation (10) isometric, the gradients of different tasks remain near orthogonal to one another in all the layers.
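The diagnostic behind Fig. 3 is straightforward to reproduce. A sketch (ours) that collects the per-parameter inner products between the current-task gradient and the memory gradient is given below; histograms of these values over training, layer by layer, give plots of the kind shown in Fig. 3 (and Fig. 4 in the appendix):

import torch

def gradient_overlap(model, loss_task, loss_memory):
    """Per-parameter inner products <g_task, g_memory>, one value per tensor."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_task = torch.autograd.grad(loss_task, params, retain_graph=True)
    g_memory = torch.autograd.grad(loss_memory, params)
    return [float(torch.sum(gt * gm)) for gt, gm in zip(g_task, g_memory)]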
Related Work

In continual learning [Ring, 1997], also called lifelong learning [Thrun, 1998], a learner faces a sequence of tasks without storing the complete datasets of these tasks. This is in contrast to multitask learning [Caruana, 1997], where the learner can simultaneously access data from all tasks. The main challenge in continual learning is to avoid catastrophic forgetting [McCloskey and Cohen, 1989, McClelland et al., 1995, Goodfellow et al., 2013] on already seen tasks, so that the learner is able to learn new tasks quickly. The existing literature in continual learning can be broadly grouped into three categories.
Figure 3: Histograms of the inner products between the current-task and memory gradients in different layers on Split CIFAR, with ("Stiefel (Orthogonality)") and without ("No Orthogonality") the orthogonality constraint. The closer the distribution is to zero, the more orthogonal the gradients are, and the less the interference between the current and previous tasks.
First, regularization approaches reduce the drift in parameters important for past tasks [Kirkpatrick et al., 2016, Aljundi et al., 2018, Nguyen et al., 2018, Zenke et al., 2017]. For a large number of tasks, the parameter importance measures suffer from brittleness, as the locality assumption embedded in the regularization-based approaches is violated [Titsias et al., 2019]. Furthermore, Chaudhry et al. [2019a] showed that these approaches can only be effective when the learner can perform multiple passes over the dataset of each task, a scenario not assumed in this work. Second, modular approaches use different network modules that can be extended for each new task [Fernando et al., 2017, Aljundi et al., 2017, Rosenbaum et al., 2018, Chang et al., 2018, Xu and Zhu, 2018, Alet et al., 2018]. By construction, modular approaches have zero forgetting, but their memory requirements increase with the number of tasks [Rusu et al., 2016, Lee et al., 2017]. Third, memory approaches maintain and replay a small episodic memory of data from past tasks. In some of these methods [Li and Hoiem, 2016, Rebuffi et al., 2017], examples in the episodic memory are replayed and predictions are kept invariant by means of distillation [Hinton et al., 2014]. In other approaches [Lopez-Paz and Ranzato, 2017, Chaudhry et al., 2019a, Aljundi et al., 2019], the episodic memory is used as an optimization constraint that discourages increases in loss on past tasks. Some works [Hayes et al., 2018, Riemer et al., 2019, Rolnick et al., 2018, Chaudhry et al., 2019b, 2020] have shown that directly optimizing the loss on the episodic memory, also known as experience replay, is cheaper than constraint-based approaches and improves prediction performance. Recently, Prabhu et al. [2020] showed that training at test time, using a greedily balanced collection of episodic memory, improves performance on a variety of benchmarks. Similarly, Javed and White [2019] and Beaulieu et al. [2020] showed that learning transferable representations via meta-learning reduces forgetting when the model is trained on sequential tasks.

Similar in spirit to our work is OGD [Farajtabar et al., 2019], where the gradients of each task are learned in the orthogonal space of all the previous tasks' gradients. However, OGD differs significantly from our work in terms of memory and compute requirements. Unlike OGD, which maintains a memory of previous task gradients, equivalent to storing an n_t × S dimensional matrix for each task, where n_t is the number of examples in each task and S is the network size, we only store an m × r dimensional matrix O_t, where m is the dimension of the feature vector (m ≪ S) and r is the rank of the subspace, plus a tiny replay buffer for each task. For large network sizes, OGD is impractical to use. Furthermore, at each training step OGD subtracts the gradient projections from the space spanned by the gradients in memory, whereas we only project the feature extraction layer to a subspace and maintain orthogonality via learning on Stiefel manifolds.

Finally, learning orthonormal weight matrices has been extensively studied in the literature. Orthogonal matrices are used in RNNs for avoiding the exploding/vanishing gradient problem [Arjovsky et al., 2016, Wisdom et al., 2016, Jing et al., 2017]. While the weight matrices are assumed to be square in the earlier works, works including [Huang et al., 2018] considered learning non-square orthogonal matrices (called orthonormal matrices in this work) by optimizing over Stiefel manifolds.
More recently, Li et al. [2020] proposed an iterative version of the Cayley transform [Nishimori and Akaho, 2005], a key component in optimizing over Stiefel manifolds. Whereas optimizing over the Stiefel manifold ensures strict orthogonality in the weights, Jia et al. [2019] proposed an algorithm, Singular Value Bounding (SVB), for soft orthogonality constraints. We use strict orthogonality in this work and leave the exploration of soft orthogonality for future research.

Conclusion
We presented ORTHOG-SUBSPACE, a continual learning method that learns different tasks in orthogonal subspaces. The gradients in the projected layer are kept orthogonal in earlier layers by learning isometric transformations. The isometric transformations are learned by posing the network training as an optimization problem over the Stiefel manifold. The proposed approach improves considerably over strong memory replay-based baselines on standard continual learning benchmarks in image classification.
Broader Impact

Continual learning methods like the one we propose allow machine learning models to efficiently learn from new data without requiring constant retraining on previous data. This type of learning can be useful when the model is expected to perform in multiple environments and simultaneous retraining on all the environments is not feasible. However, one of the core assumptions in continual learning is that the model should have zero forgetting on previous data. In some scenarios, partially forgetting old data may be acceptable or even preferable, for example if older data was more biased (in any sense) than more recent data. A machine learning practitioner should be aware of this fact and use continual learning approaches only when suitable.
Acknowledgment
This work was supported by EPSRC/MURI grant EP/N019474/1, Facebook (DeepFakes grant), Five AI UK, and the Royal Academy of Engineering under the Research Chair and Senior Research Fellowships scheme. AC is funded by the Amazon Research Award (ARA) program.
References
P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, pages 7120–7129, 2017.

R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.

R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.

M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney. Learning to continually learn. arXiv preprint arXiv:2002.09571, 2020.

S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.

R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

M. Chang, A. Gupta, S. Levine, and T. L. Griffiths. Automatically composing representation transformations as a means for generalization. In ICML workshop Neural Abstract Machines and Program Induction v2, 2018.

A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, 2018.

A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019a.

A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019b.

A. Chaudhry, A. Gordo, P. K. Dokania, P. Torr, and D. Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. arXiv preprint arXiv:2002.08165, 2020.

M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. arXiv preprint arXiv:1910.07104, 2019.

C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

F. Alet, T. Lozano-Perez, and L. P. Kaelbling. Modular meta-learning. arXiv preprint arXiv:1806.10166v1, 2018.

I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

T. L. Hayes, N. D. Cahill, and C. Kanan. Memory efficient experience replay for streaming learning. arXiv preprint arXiv:1809.05922, 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS, 2014.

L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

D. Isele and A. Cosgun. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.

K. Javed and M. White. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pages 1820–1830, 2019.

K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, pages 1733–1741. JMLR.org, 2017.

J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 2016.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

J. Lee, J. Yun, S. Hwang, and E. Yang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

J. Li, F. Li, and S. Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. In International Conference on Learning Representations, 2020.

Z. Li and D. Hoiem. Learning without forgetting. In ECCV, pages 614–629, 2016.

D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In NIPS, 2017.

J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.

M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.

C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In ICLR, 2018.

A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

Y. Nishimori and S. Akaho. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67:106–135, 2005.

A. Prabhu, P. H. S. Torr, and P. K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, 2020.

S.-A. Rebuffi, A. Kolesnikov, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.

M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2019.

M. B. Ring. Child: A first step towards continual learning. Machine Learning, 28(1):77–104, 1997.

D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne. Experience replay for continual learning. CoRR, abs/1811.11682, 2018. URL http://arxiv.org/abs/1811.11682.

C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In ICLR, 2018.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.

A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.

H. D. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.

S. Thrun. Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer, 1998.

M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W. Teh. Functional regularisation for continual learning using Gaussian processes. arXiv preprint arXiv:1901.11356, 2019.

O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.

S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.

J. Xu and Z. Zhu. Reinforced continual learning. arXiv preprint arXiv:1805.12369v1, 2018.

F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.

Appendix
Section A provides a proof that an isometry preserves angles. Section B derives the closed form of the gradient projection onto the tangent space at a point on the Stiefel manifold. Section C gives further experimental results. Section D lists the grids considered for hyper-parameters.
A Isometry Preserves Angles
Theorem A.1. T is an isometry iff it preserves inner products.

Proof. Suppose T is an isometry. Then for any v, w ∈ V,

    ‖T(v) − T(w)‖ = ‖v − w‖
    ⟨T(v) − T(w), T(v) − T(w)⟩ = ⟨v − w, v − w⟩
    ‖T(v)‖² + ‖T(w)‖² − 2⟨T(v), T(w)⟩ = ‖v‖² + ‖w‖² − 2⟨v, w⟩.

Since ‖T(u)‖ = ‖u‖ for any u in V, all the squared-length terms in the last expression cancel out, and we get ⟨T(v), T(w)⟩ = ⟨v, w⟩.

Conversely, if T preserves inner products, then ⟨T(v − w), T(v − w)⟩ = ⟨v − w, v − w⟩, which implies ‖T(v − w)‖ = ‖v − w‖, and since T is linear, ‖T(v) − T(w)‖ = ‖v − w‖. This shows that T preserves distance.
This section closely follows the arguments of Tagare [2011]. Let {X ∈ R^{n×p} | X^⊤X = I} define a manifold in the Euclidean space R^{n×p}, where n > p. This manifold is called the Stiefel manifold. Let T_X denote the tangent space at X.

Lemma B.1. Any Z ∈ T_X satisfies Z^⊤X + X^⊤Z = 0, i.e., Z^⊤X is a skew-symmetric p × p matrix.

Note that X consists of p orthonormal vectors in R^n. Let X_⊥ be a matrix consisting of the additional n − p orthonormal vectors in R^n, i.e., X_⊥ lies in the orthogonal complement of X, X^⊤X_⊥ = 0. The concatenation of X and X_⊥, [X X_⊥], is an n × n orthonormal matrix. Then, any matrix U ∈ R^{n×p} can be represented as U = XA + X_⊥B, where A is a p × p matrix and B is an (n−p) × p matrix.

Lemma B.2. A matrix Z = XA + X_⊥B belongs to the tangent space T_X at a point on the Stiefel manifold iff A is skew-symmetric.

Let G ∈ R^{n×p} be the gradient computed at X, and let the projection of the gradient onto the tangent space be denoted by π_{T_X}(G).

Lemma B.3. Under the canonical inner product, the projection of the gradient onto the tangent space is given by π_{T_X}(G) = AX, where A = GX^⊤ − XG^⊤.

Proof. Express G = XG_A + X_⊥G_B. Let Z be any vector in the tangent space, expressed as Z = XZ_A + X_⊥Z_B, where Z_A is skew-symmetric according to Lemma B.2. The Euclidean inner product of G with Z is

    ⟨G, Z⟩_e = tr(G^⊤Z) = tr((XG_A + X_⊥G_B)^⊤(XZ_A + X_⊥Z_B)) = tr(G_A^⊤Z_A + G_B^⊤Z_B).    (11)

Writing G_A as G_A = sym(G_A) + skew(G_A), and noting that tr(sym(G_A)^⊤Z_A) = 0 for skew-symmetric Z_A, plugging into (11) gives

    ⟨G, Z⟩_e = tr(skew(G_A)^⊤Z_A + G_B^⊤Z_B).    (12)

Let U = XA + X_⊥B be the vector that represents the projection of G onto the tangent space at X. Then, under the canonical inner product,

    ⟨U, Z⟩_c = tr(U^⊤(I − (1/2)XX^⊤)Z)
             = tr((XA + X_⊥B)^⊤(I − (1/2)XX^⊤)(XZ_A + X_⊥Z_B))
             = tr((1/2)A^⊤Z_A + B^⊤Z_B).    (13)

By comparing (12) and (13), we get A = 2 skew(G_A) and B = G_B. Thus,

    U = 2X skew(G_A) + X_⊥G_B
      = X(G_A − G_A^⊤) + X_⊥G_B        (since skew(G_A) = (1/2)(G_A − G_A^⊤))
      = XG_A − XG_A^⊤ + G − XG_A       (since G = XG_A + X_⊥G_B)
      = G − XG_A^⊤
      = G − XG^⊤X                      (since G_A = X^⊤G)
      = GX^⊤X − XG^⊤X                  (since X^⊤X = I)
      = (GX^⊤ − XG^⊤)X.
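A quick numerical check of Lemma B.3 (our own sketch): the projected gradient U = (GX^⊤ − XG^⊤)X satisfies the tangent-space condition of Lemma B.1.

import numpy as np

n, p = 7, 3
X, _ = np.linalg.qr(np.random.randn(n, p))   # a point on the Stiefel manifold
G = np.random.randn(n, p)                    # an arbitrary Euclidean gradient

U = (G @ X.T - X @ G.T) @ X                  # projection from Lemma B.3
print(np.linalg.norm(U.T @ X + X.T @ U))     # ~ 0: U lies in the tangent space T_X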
C More Results

Table 3: Accuracy (2) and Forgetting (3) results of the continual learning experiments for larger episodic memory sizes (several samples per class per task are stored). The top table is for Split CIFAR; the bottom table is for Split miniImageNet.

METHOD             ACCURACY   FORGETTING
ER-Ring            –          –
ORTHOG-SUBSPACE    –          –

METHOD             ACCURACY   FORGETTING
ER-Ring            –          –
ORTHOG-SUBSPACE    –          –
D Hyper-parameter Selection
In this section, we report the hyper-parameter grids considered for the experiments. The best values for different benchmarks are given in parentheses.

• Multitask
  – learning rate: [0.003, 0.01, 0.03 (CIFAR, miniImageNet), 0.1 (MNIST perm, rot), 0.3, 1.0]
• Finetune
  – learning rate: [0.003, 0.01, 0.03 (CIFAR, miniImageNet), 0.1 (MNIST perm, rot), 0.3, 1.0]
• EWC
  – learning rate: [0.003, 0.01, 0.03 (CIFAR, miniImageNet), 0.1 (MNIST perm, rot), 0.3, 1.0]
  – regularization: [0.1, 1, 10 (MNIST perm, rot, CIFAR, miniImageNet), 100, 1000]
• AGEM
  – learning rate: [0.003, 0.01, 0.03 (CIFAR, miniImageNet), 0.1 (MNIST perm, rot), 0.3, 1.0]
• MER
  – learning rate: [0.003, 0.01, 0.03 (MNIST, CIFAR, miniImageNet), 0.1, 0.3, 1.0]
  – within batch meta-learning rate: [0.01, 0.03, 0.1 (MNIST, CIFAR, miniImageNet), 0.3, 1.0]
  – current batch learning rate multiplier: [1, 2, 5 (CIFAR, miniImageNet), 10 (MNIST)]
• ER-Ring
  – learning rate: [0.003, 0.01, 0.03 (CIFAR, miniImageNet), 0.1 (MNIST perm, rot), 0.3, 1.0]
• ORTHOG-SUBSPACE
  – learning rate: [0.003, 0.01, 0.03, 0.1 (MNIST perm, rot), 0.2 (miniImageNet), 0.4 (CIFAR), 1.0]

Figure 4: Histograms of the inner products between the current-task and memory gradients at each layer (L1 to L16), with ("Stiefel (Orthogonality)") and without ("No Orthogonality") the orthogonality constraint.