Stochastic Filter Groups for Multi-Task CNNs: Learning Specialist and Generalist Convolution Kernels
Felix J.S. Bragman, Ryutaro Tanno, Sebastien Ourselin, Daniel C. Alexander, M. Jorge Cardoso
Felix J.S. Bragman∗, University College London, UK ([email protected])
Ryutaro Tanno∗, University College London, UK ([email protected])
Sebastien Ourselin, King's College London ([email protected])
Daniel C. Alexander, University College London ([email protected])
M. Jorge Cardoso, King's College London ([email protected])
Abstract
The performance of multi-task learning in Convolutional Neural Networks (CNNs) hinges on the design of feature sharing between tasks within the architecture. The number of possible sharing patterns is combinatorial in the depth of the network and the number of tasks, so hand-crafting an architecture purely based on human intuitions about task relationships can be time-consuming and suboptimal. In this paper, we present a probabilistic approach to learning task-specific and shared representations in CNNs for multi-task learning. Specifically, we propose "stochastic filter groups" (SFG), a mechanism to assign convolution kernels in each layer to "specialist" or "generalist" groups, which are specific to or shared across different tasks, respectively. The SFG modules determine the connectivity between layers and the structures of task-specific and shared representations in the network. We employ variational inference to learn the posterior distribution over the possible groupings of kernels and network parameters. Experiments demonstrate that the proposed method generalises across multiple tasks and shows improved performance over baseline methods.
1. Introduction
Multi-task learning (MTL) aims to enhance learning efficiency and predictive performance by simultaneously solving multiple related tasks [1]. Recently, applications of convolutional neural networks (CNNs) in MTL have demonstrated promising results in a wide range of computer vision applications, ranging from visual scene understanding [2, 3, 4, 5, 6, 7] to medical image computing [8, 9, 10, 11]. A key factor for successful MTL neural network models is the ability to learn shared and task-specific representations [4]. A mechanism to understand the commonalities

∗ Both authors contributed equally.
Figure 1: The figure on the left illustrates a typical multi-task architecture, while the figure on the right shows an example architecture that can be learned with our method. We propose Stochastic Filter Groups, a principled way to learn the assignment of convolution kernels to task-specific and shared groups.

and differences between tasks allows the model to transfer information between tasks while tailoring the predictive model to describe the distinct characteristics of the individual tasks. The quality of such representations is determined by the architectural design of where model components such as features [12] and weights [13] are shared and separated between tasks. However, the space of possible architectures is combinatorially large, and the manual exploration of this space is inefficient and subject to human biases. For example, Fig. 1 shows a typical CNN architecture for MTL comprised of a shared "trunk" feature extractor and task-specific "branch" networks [11, 14, 15, 16, 6, 10]. The desired amount of shared and task-specific representations, and their interactions within the architecture, depend on the difficulty of the individual tasks and the relations between them, neither of which is a priori known in most cases [17]. This illustrates the challenge of handcrafting an appropriate architecture, and the need for an effective automatic method to learn it from data. In this paper, we propose
Stochastic Filter Groups (SFGs), a probabilistic mechanism to learn the amount of task-specific and shared representations needed in each layer of MTL architectures (Fig. 1). Specifically, the SFGs learn to allocate kernels in each convolution layer into either "specialist" groups or a "shared" trunk, which are specific to or shared across different tasks, respectively (Fig. 2). The SFG equips the network with a mechanism to learn inter-layer connectivity and thus the structures of task-specific and shared representations. We cast the learning of SFG modules as a variational inference problem.

We evaluate the efficacy of SFGs on a variety of tasks. In particular, we focus on two multi-task learning problems: 1) age regression and gender classification from face images on the UTKFace dataset [18] and 2) semantic regression (i.e. image synthesis) and semantic segmentation on a real-world medical imaging dataset, both of which require predictions over all pixels. Experiments show that our method achieves considerably higher prediction accuracy than baselines with no mechanism to learn connectivity structures, and either higher or comparable performance than a cross-stitch network [4], while being able to learn meaningful architectures automatically.
2. Related works
Our work is concerned with the goal of learning where to share neural network components across different tasks to maximise the benefit of MTL. The main challenge of such methods lies in designing a mechanism that determines how and where to share weights within the network. There are broadly two categories of methods that determine the nature of weight sharing and separation in MTL networks.

The first category is composed of methods that optimise the structures of weight sharing in order to maximise task-wise performance. These methods set out to learn a set of vectors that control which features are shared within a layer and how these are distributed across tasks [19, 13, 4, 12]. They start with a baseline CNN architecture in which they learn additional connections and pathways that define the final MTL model. For instance, Cross-Stitch networks [4] control the degree of weight sharing at each convolution layer, whilst Soft-Layer Ordering [13] goes beyond the assumption of parallel ordering of feature hierarchies to allow features to mix at different layers depending on the task. Routing Net [20] proposes an architecture in which each layer is a set of function blocks, and learns to decide which composition of blocks to use given an input and a task.

The second group of MTL methods focuses on weight clustering based on task similarity [21, 22, 23, 24, 25]. For example, [24] employed an iterative algorithm to grow a tree-like deep architecture that clusters similar tasks hierarchically, while [25] determines the degree of weight sharing based on statistical dependency between tasks.

Our method falls into the first category, and differentiates itself by performing "hard" partitioning of task-specific and shared features. By contrast, prior methods are based on "soft" sharing of features [4, 12] or weights [19, 13].
These methods generally learn a set of mixing coefficients that determine the weighted sum of features throughout the network, which does not impose connectivity structures on the architecture. On the other hand, our method learns a distribution over the connectivity of layers by grouping kernels. This allows our model to learn meaningful groupings of task-specific and shared features, as illustrated in Fig. 7.
3. Methods
We introduce a new approach for determining where to learn task-specific and shared representations in multi-task CNN architectures. We propose stochastic filter groups (SFG), a probabilistic mechanism to partition kernels in each convolution layer into "specialist" groups or a "shared" group, which are specific to or shared across different tasks, respectively. We employ variational inference to learn the distributions over the possible groupings of kernels and network parameters that determine the connectivity between layers and the shared and task-specific features. This naturally results in a learning algorithm that optimally allocates representation capacity across tasks via gradient-based stochastic optimization, e.g. stochastic gradient descent.
Figure 2: Illustration of filter assignment in an SFG module. Each kernel {w_k} in the given convolution layer is probabilistically assigned to one of the filter groups G_1, G_s, G_2 according to the sample drawn from the associated categorical distribution Cat(p_1, p_s, p_2).

SFGs introduce a sparse connection structure into the architecture of a CNN for multi-task learning in order to separate features into task-specific and shared components. Ioannou et al. [26] introduced filter groups to partition kernels in each convolution layer into groups, each of which acts only on a subset of the preceding features. They demonstrated that such sparsity reduces computational cost and the number of parameters without compromising accuracy. Huang et al. [27] proposed a similar concept, but differs in that the filter groups do not operate on mutually exclusive sets of features. Here we adapt the concept of filter groups to the multi-task learning paradigm and propose an extension with an additional mechanism for learning an optimal kernel grouping rather than pre-specifying it.

Figure 3: Illustration of possible grouping patterns learnable with the proposed method. Each set of blocks represents the ratio of the filter groups G_1, G_s and G_2. (i) denotes the case where all kernels are uniformly split. (ii) & (iii) are cases where the convolution kernels become more task-specific at deeper layers. (iv) shows an example with more heterogeneous splits across tasks.

Figure 4: Illustration of feature routing. The circles G_1, G_s, G_2 denote the task-specific and shared filter groups in each layer. (i) shows the directions of routing of activations between different filter groups, while (ii) shows the directions of the gradient flow from the task losses L_1 and L_2. The red and blue arrows denote the gradients that stem from L_1 and L_2, respectively. The task-specific groups G_1, G_2 are only updated based on the associated losses, while the shared group G_s is updated based on both.

For simplicity, we describe SFGs for the case of multi-task learning with two tasks, but the formulation trivially extends to a larger number of tasks. At the l-th convolution layer in a CNN architecture with K_l kernels {w^(l),k}, k = 1, ..., K_l, the associated SFG performs two operations:

1. Filter Assignment: each kernel w^(l),k is stochastically assigned to either i) the "task-1 specific group" G^(l)_1, ii) the "shared group" G^(l)_s or iii) the "task-2 specific group" G^(l)_2 with respective probabilities p^(l),k = [p^(l),k_1, p^(l),k_s, p^(l),k_2] ∈ [0, 1]^3. Convolving with the respective filter groups yields distinct sets of features F^(l)_1, F^(l)_s, F^(l)_2. Fig. 2 illustrates this operation and Fig. 3 shows different learnable patterns.

2. Feature Routing: as shown in Fig. 4 (i), the features F^(l)_1, F^(l)_s, F^(l)_2 are routed to the filter groups G^(l+1)_1, G^(l+1)_s, G^(l+1)_2 in the subsequent (l+1)-th layer in such a way as to respect the task-specificity and sharedness of the filter groups in the l-th layer.
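As an illustration, the filter-assignment step can be sketched in a few lines of NumPy; the helper name and array shapes below are our own, not from the paper:

```python
import numpy as np

def sample_filter_groups(p, rng):
    """Draw one one-hot group assignment per kernel from Cat(p^(l),k).

    p   : (K, 3) grouping probabilities, ordered
          (task-1 specific, shared, task-2 specific); rows sum to 1.
    rng : a numpy Generator.
    Returns a (K, 3) one-hot matrix z; z[k] selects the group of kernel k.
    """
    K, G = p.shape
    z = np.zeros((K, G))
    for k in range(K):
        z[k, rng.choice(G, p=p[k])] = 1.0
    return z

rng = np.random.default_rng(0)
# Hypothetical layer with 4 kernels: three near-deterministic, one uniform.
p = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9],
              [1/3, 1/3, 1/3]])
z = sample_filter_groups(p, rng)
assert z.shape == (4, 3)
assert np.all(z.sum(axis=1) == 1.0)
```

Each row of `z` places its kernel in exactly one group for the current forward pass; a fresh sample is drawn at every iteration.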
Specifically, we perform the following routing for l > 0:

F^(l+1)_1 = h^(l+1)([F^(l)_1 | F^(l)_s] ∗ G^(l+1)_1)
F^(l+1)_s = h^(l+1)(F^(l)_s ∗ G^(l+1)_s)
F^(l+1)_2 = h^(l+1)([F^(l)_2 | F^(l)_s] ∗ G^(l+1)_2)

where each h^(l+1) defines the choice of non-linear function, ∗ denotes the convolution operation and | denotes a merging operation on arrays (e.g. concatenation). At l = 0, the input image x is simply convolved with the first set of filter groups to yield F^(1)_i = h^(1)(x ∗ G^(1)_i), i ∈ {1, 2, s}. Fig. 4 (ii) shows that such sparse connectivity ensures the parameters of G^(l)_1 and G^(l)_2 are only learned based on the respective task losses, while G^(l)_s is optimised based on both tasks.

Fig. 5 provides a schematic of our overall architecture, in which each SFG module stochastically generates filter groups in each convolution layer and the resultant features are sparsely routed as described above. The merging modules, denoted as black circles, combine the task-specific and shared features appropriately, i.e. [F^(l)_i | F^(l)_s], i = 1, 2, and pass them to the filter groups in the next layer. Each white circle denotes the presence of additional transformations (e.g. convolutions or fully connected layers) in each h^(l+1), performed on top of the standard non-linearity (e.g. ReLU).

The proposed sparse connectivity is integral to ensure task performance and structured representations. In particular, one might argue that the routing of "shared" features F^(l)_s to the respective "task-specific" filter groups G^(l+1)_1 and G^(l+1)_2 is not necessary to ensure the separation of gradients across the task losses. However, this connection allows for learning more complex task-specific features at deeper layers in the network. For example, without this routing, having a large proportion of the "shared" filter group G_s at the first layer (Fig. 3 (ii)) substantially reduces the amount of features available for learning task-specific kernels in the subsequent layers; in the extreme case in which all kernels in one layer are assigned to G_s, the task-specific filter groups in the subsequent layers are effectively unused.

Another important aspect that needs to be highlighted is the varying dimensionality of feature maps. Specifically, the number of kernels in the respective filter groups G^(l)_1, G^(l)_s, G^(l)_2 can vary at each iteration of the training,
Figure 5: Schematic of the proposed multi-task architecture based on a series of SFG modules in the presence of two tasks. At each convolution layer, kernels are stochastically assigned to task-specific and shared filter groups G_1, G_s, G_2. Each input image is first convolved with the respective filter groups to yield three distinct sets of output activations, which are routed sparsely to the filter groups in the second layer. This process repeats in the remaining SFG modules in the architecture until the last layer, where the outputs of the final SFG module are combined into task-specific predictions ŷ_1 and ŷ_2. Each small white circle denotes an optional transformation (e.g. extra convolutions) and each black circle merges the incoming inputs (e.g. concatenation).

and thus, so does the depth of the resultant feature maps F^(l)_1, F^(l)_s, F^(l)_2. Instead of directly working with feature maps of varying size, we implement the proposed architecture by defining F^(l)_1, F^(l)_s, F^(l)_2 as sparse tensors. At each SFG module, we first convolve the input features with all kernels, and generate the output features from each filter group by zeroing out the channels that root from the kernels in the other groups, resulting in F^(l)_1, F^(l)_s, F^(l)_2 that are sparse at non-overlapping channel indices. In the simplest form with no additional transformation (i.e. the grey circles in Fig. 5 are identity functions), we define the merging operation [F^(l)_i | F^(l)_s], i = 1, 2, as pixel-wise summation. In the presence of more complex transforms (e.g. residual blocks), we concatenate the output features along the channel axis and perform a 1x1 convolution to ensure the number of channels in [F^(l)_i | F^(l)_s] is the same as in F^(l)_s.

Here we derive the method for simultaneously optimising the CNN parameters and grouping probabilities.
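The sparse-tensor implementation above (convolve with all kernels, zero out the other groups' channels, then merge) can be sketched as follows; the shapes and function name are illustrative only, and the merge shown is the simplest pixel-wise summation case:

```python
import numpy as np

def split_and_merge(F_full, z):
    """Split a full feature map into F1, Fs, F2 by zeroing the channels
    whose kernels were sampled into the other groups, then form the
    merged inputs [F1|Fs] and [F2|Fs] via pixel-wise summation.

    F_full : (C, H, W) output of convolving the input with all C kernels.
    z      : (C, 3) one-hot assignments, ordered (task-1, shared, task-2).
    Returns the routed inputs for the next layer's G1, Gs, G2 groups.
    """
    F1 = F_full * z[:, 0][:, None, None]
    Fs = F_full * z[:, 1][:, None, None]
    F2 = F_full * z[:, 2][:, None, None]
    return F1 + Fs, Fs, F2 + Fs

F_full = np.arange(24, dtype=float).reshape(4, 3, 2)
z = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
in1, in_s, in2 = split_and_merge(F_full, z)
# Shared channel 1 reaches all three routes; task-2 channel 2 is
# zeroed out of the task-1 route.
assert np.all(in1[1] == F_full[1]) and np.all(in_s[1] == F_full[1])
assert np.all(in1[2] == 0)
```

Because F1, Fs and F2 are sparse at non-overlapping channel indices, summation never mixes channels from different groups; the concat-plus-1x1 variant used with residual blocks would replace the final additions.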
We achieve this by extending the variational interpretation of binary dropout [28, 29] to the (T+1)-way assignment of each convolution kernel to the filter groups, where T is the number of tasks. As before, we consider the case T = 2.

Suppose that the architecture consists of L SFG modules, each with K_l kernels, where l is the layer index. As the posterior distribution over the convolution kernels in SFG modules p(W | X, Y^(1), Y^(2)) is intractable, we approximate it with a simpler distribution q_φ(W), where W = {W^(l),k}, k = 1, ..., K_l, l = 1, ..., L. Assuming that the posterior distribution factorizes over layers and kernels up to group assignment, we define the variational distribution as:

q_φ(W) = ∏_{l=1}^{L} ∏_{k=1}^{K_l} q_{φ_lk}(W^(l),k) = ∏_{l=1}^{L} ∏_{k=1}^{K_l} q_{φ_lk}(W^(l),k_1, W^(l),k_s, W^(l),k_2)

where {W^(l),k_1, W^(l),k_s, W^(l),k_2} denotes the k-th kernel in the l-th convolution layer after being routed into the task-specific groups G^(l)_1, G^(l)_2 and the shared group G^(l)_s. We define each q_{φ_lk}(W^(l),k_1, W^(l),k_2, W^(l),k_s) as:

W^(l),k_i = z^(l),k_i · M^(l),k,  for i ∈ {1, s, 2}   (1)
z^(l),k = [z^(l),k_1, z^(l),k_2, z^(l),k_s] ∼ Cat(p^(l),k)   (2)

where z^(l),k is the one-hot encoding of a sample from the categorical distribution over filter group assignments, and M^(l),k denotes the parameters of the pre-grouping convolution kernel. The set of variational parameters for each kernel in each layer is thus given by φ_lk = {M^(l),k, p^(l),k = [p^(l),k_1, p^(l),k_s, p^(l),k_2]}.

We minimize the KL divergence between the approximate posterior q_φ(W) and p(W | X, Y^(1), Y^(2)).
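Equations (1) and (2) say that the three routed versions of a kernel share the same underlying parameters M^(l),k, and the sampled one-hot z switches the kernel into exactly one group. A minimal sketch (illustrative shapes, our own helper name):

```python
import numpy as np

def route_kernels(M, z):
    """W_i = z_i * M for i in (task-1, shared, task-2), per Eq. (1).

    M : (K, C_in, kh, kw) pre-grouping kernels.
    z : (K, 3) one-hot samples from Cat(p), per Eq. (2).
    Returns the three routed kernel tensors W1, Ws, W2.
    """
    return tuple(M * z[:, i][:, None, None, None] for i in range(3))

M = np.ones((2, 3, 3, 3))
z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
W1, Ws, W2 = route_kernels(M, z)
# Each kernel lives in exactly one group; the groups partition M.
assert np.all(W1 + Ws + W2 == M)
assert np.all(W1[1] == 0) and np.all(Ws[0] == 0) and np.all(W2 == 0)
```

Only the grouping changes between samples; the kernel weights M are deterministic variational parameters shared across the three groups.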
Assuming that the joint likelihood over the two tasks factorizes, we have the following optimization objective:

L_MC(φ) = −(N/M) Σ_{i=1}^{M} [log p(y^(1)_i | x_i, W_i) + log p(y^(2)_i | x_i, W_i)] + Σ_{l=1}^{L} Σ_{k=1}^{K_l} KL(q_{φ_lk}(W^(l),k) || p(W^(l),k))   (3)

where M is the size of the mini-batch, N is the total number of training data points, and W_i denotes a set of model parameters sampled from q_φ(W). The last KL term regularizes the deviation of the approximate posterior from the prior p(W^(l),k) = N(0, I/l^2), where l > 0. Adapting the approximation presented in [28] to our scenario, we obtain:

KL(q_{φ_lk}(W^(l),k) || p(W^(l),k)) ∝ (l^2/2) ||M^(l),k||^2 − H(p^(l),k)   (4)

where H(p^(l),k) = −Σ_{i ∈ {1, 2, s}} p^(l),k_i log p^(l),k_i is the entropy of the grouping probabilities. While the first term performs the L2-weight norm, the second term pulls the grouping probabilities towards the uniform distribution. Plugging eq. (4) into eq. (3) yields the overall loss:

L_MC(φ) = −(N/M) Σ_{i=1}^{M} [log p(y^(1)_i | x_i, W_i) + log p(y^(2)_i | x_i, W_i)] + λ_1 · Σ_{l=1}^{L} Σ_{k=1}^{K_l} ||M^(l),k||^2 − λ_2 · Σ_{l=1}^{L} Σ_{k=1}^{K_l} H(p^(l),k)   (5)

where λ_1 > 0, λ_2 > 0 are regularization coefficients. We note that the discrete sampling operation during filter group assignment (eq. (2)) creates discontinuities, giving the first term in the objective function (eq. (5)) zero gradient with respect to the grouping probabilities {p^(l),k}. We therefore, as employed in [16] for the binary case, approximate each of the categorical variables Cat(p^(l),k) by the Gumbel-Softmax distribution GSM(p^(l),k, τ) [30, 31], a continuous relaxation which allows for sampling that is differentiable with respect to the parameters p^(l),k through a reparametrisation trick.
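The last two regularisation terms of Eq. (5) are straightforward to compute; a sketch with hypothetical per-layer parameter lists (the λ values here are arbitrary):

```python
import numpy as np

def sfg_regulariser(Ms, ps, lam1, lam2, eps=1e-12):
    """lam1 * sum ||M||^2  -  lam2 * sum H(p), as in Eq. (5).

    Ms : list of per-layer pre-grouping kernel arrays.
    ps : list of per-layer (K, 3) grouping-probability arrays.
    The negative-entropy term pulls each p towards uniform [1/3, 1/3, 1/3].
    """
    l2 = sum(float((M ** 2).sum()) for M in Ms)
    ent = sum(float(-(p * np.log(p + eps)).sum()) for p in ps)
    return lam1 * l2 - lam2 * ent

M = np.ones((2, 4))
peaked = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]])
uniform = np.full((2, 3), 1 / 3)
# With identical weights, uniform grouping probabilities incur the
# smaller penalty because their entropy is maximal.
assert sfg_regulariser([M], [uniform], 1.0, 1.0) < sfg_regulariser([M], [peaked], 1.0, 1.0)
```

In training this term would be added to the Monte Carlo estimate of the negative log-likelihood from Eq. (5).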
The temperature term τ adjusts the bias-variance tradeoff of the gradient approximation; as the value of τ approaches 0, samples from the GSM distribution become one-hot (i.e. lower bias) while the variance of the gradients increases. In practice, we start at a high τ and anneal it to a small but non-zero value as in [31, 29], as detailed in the supplementary materials.
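A Gumbel-Softmax sample can be drawn with the standard reparametrisation; this sketch is ours and omits the annealing-schedule details deferred to the supplementary materials:

```python
import numpy as np

def gumbel_softmax_sample(p, tau, rng, eps=1e-12):
    """Relaxed sample from GSM(p, tau): softmax((log p + g) / tau) with
    g ~ Gumbel(0, 1). As tau -> 0 samples approach one-hot draws from
    Cat(p); larger tau gives smoother, lower-variance samples.
    """
    g = -np.log(-np.log(rng.uniform(size=p.shape) + eps) + eps)
    y = (np.log(p + eps) + g) / tau
    y = np.exp(y - y.max())      # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
for tau in (5.0, 1.0, 0.1):      # e.g. annealed from high to low
    y = gumbel_softmax_sample(p, tau, rng)
    assert y.shape == (3,) and np.isclose(y.sum(), 1.0) and np.all(y >= 0)
```

In an autodiff framework the same expression is differentiable in p, which is what restores non-zero gradients for the grouping probabilities.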
4. Experiments
We tested stochastic filter groups (SFG) on two multi-task learning (MTL) problems: 1) age regression and gender classification from face images on the UTKFace dataset [18] and 2) semantic image regression (synthesis) and segmentation on a medical imaging dataset. Full details of the training and datasets are provided in Sec. A in the supplementary materials.
UTKFace dataset:
We tested our method on UTKFace [18], which consists of 23,703 cropped face images in the wild with labels for age and gender. We created a dataset with a 70/15/15% split. We created a secondary separate dataset containing only 10% of images from the initial set, so as to simulate a data-starved scenario.
Medical imaging dataset:
We used a medical imaging dataset to evaluate our method in a real-world multi-task problem where paucity of data is common and hard to mitigate. The goal of radiotherapy treatment planning is to maximise the radiation dose to the tumour whilst minimising the dose to the organs. To plan dose delivery, a Computed Tomography (CT) scan is needed, as CT voxel intensity scales with tissue density, thus allowing dose propagation simulations. An MRI scan is needed to segment the surrounding organs. Instead of acquiring both an MRI and a CT, algorithms can be used to synthesise a CT scan (task 1) and segment organs (task 2) given a single input MRI scan. For this experiment, we acquired 3D prostate cancer scans, each with respective CT and MRI scans and semantic 3D labels for organs (prostate, bladder, rectum and left/right femur heads) obtained from a trained radiologist. We created a training set from a subset of the patients, with the remainder used for testing. We trained our networks on 2D patches randomly sampled from axial slices, and reconstructed the 3D volumes at test time by stitching together the patch-wise predictions.

We compared our model against four baselines in addition to Cross-Stitch networks [4], trained end-to-end rather than sequentially for fair comparison. The four baselines considered are: 1) single-task networks, 2) a hard-parameter sharing multi-task network (MT-hard sharing), 3) SFG-networks with a constant allocated grouping (MT-constant mask) as per Fig. 3 (i), and 4) SFG-networks with constant grouping probabilities (MT-constant p). We train all the baselines in an end-to-end fashion for all the experiments.

We note that all four baselines can be considered special cases of an SFG-network. Two single-task networks are recovered when the shared grouping probability of kernels is set to zero. Considering Fig. 5, this would remove the diagonal connections and the shared network.
This may be important when faced with two unrelated tasks which share no contextual information. A hard-parameter sharing network arises when all shared grouping probabilities are set to one, leading to a scenario where all features are shared within the network up until the task-specific layers. The MT-constant mask network is illustrated in Fig. 3 (i), where 1/3 of the kernels are allocated to each of the task-1, task-2 and shared groups, yielding uniform splits across layers. This occurs when an equal number of kernels in each layer are given probabilities p^(l),k = [1, 0, 0], [0, 1, 0] and [0, 0, 1]. Lastly, the MT-constant p model represents the situation where the grouping is non-informative and each kernel has equal probability of being specific or shared, with p^(l),k = [1/3, 1/3, 1/3]. Training details for these models, including the hyper-parameter settings, are provided in Sec. B in the supplementary document.

UTKFace network:
We used the VGG-11 CNN architecture [32] for age and gender prediction. The network consists of a series of 3x3 convolutional layers interleaved with max pooling layers. In contrast to the original architecture, we replaced the final max pooling and fully connected layers with global average pooling (GAP) followed by a fully connected layer for prediction. Our model's version of VGG (SFG-VGG) replaces each convolutional layer in VGG-11 with an SFG layer, with max pooling applied to each feature map F^(l)_1, F^(l)_2, F^(l)_s. We applied GAP to each final feature map before the final merging operation and two fully connected layers for each task.

Medical imaging network:
We used the HighResNet architecture [33] for CT synthesis and organ segmentation. This network was developed for semantic segmentation in medical imaging and has been used in a variety of medical applications such as CT synthesis [10] and brain segmentation [33]. It consists of a series of residual blocks, which group two 3x3 convolutional layers with dilated convolutions. The baseline network is composed of a 3x3 convolutional layer followed by three sets of twice-repeated residual blocks with dilated convolutions using factors d = [1, 2, 4]. There is a convolutional layer between each set of repeated residual blocks. The network ends with two final convolutional layers and either one or two 1x1 convolutional layers for single- and multi-task predictions. In our model, we replace each convolutional layer with an SFG module. After the first SFG layer, three distinct repeated residual blocks are applied to F^(l=0)_1, F^(l=0)_2, F^(l=0)_s. These are then merged according to the feature routing methodology, followed by a new SFG layer and subsequent residual layers. Our model concludes with two successive SFG layers followed by 1x1 convolutional layers applied to the merged features F^(l=L)_1 and F^(l=L)_2.
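As noted above, several of the baselines correspond to fixed settings of the grouping probabilities. A sketch of these special cases (the helper name and layout are ours, not from the paper):

```python
import numpy as np

def baseline_grouping_probs(K, kind):
    """Fixed grouping probabilities [p1, ps, p2] for K kernels that
    recover some of the baselines as special cases of an SFG network."""
    if kind == "hard-sharing":       # everything shared until the heads
        return np.tile([0.0, 1.0, 0.0], (K, 1))
    if kind == "constant-mask":      # deterministic uniform 1/3 split
        p = np.zeros((K, 3))
        for k in range(K):
            p[k, k % 3] = 1.0
        return p
    if kind == "constant-p":         # non-informative grouping
        return np.full((K, 3), 1.0 / 3.0)
    raise ValueError(kind)

p = baseline_grouping_probs(6, "constant-mask")
assert np.all(p.sum(axis=1) == 1.0)
assert np.all(p.sum(axis=0) == 2.0)   # 6 kernels split 2/2/2 across groups
```

The single-task baseline is the remaining special case: the shared probability p_s is clamped to zero so that no kernels ever enter G_s.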
5. Results
Results on age prediction and gender classification on both datasets are presented in Tab. 1a and 1b. Our model (MT-SFG) achieved the best performance in comparison to the baselines in both data regimes. In both sets of experiments, our model outperformed the hard-parameter sharing (MT-hard sharing) and constant allocation (MT-constant mask) baselines. This demonstrates the advantage of learning to allocate kernels. In the MT-constant mask model, kernels are equally allocated across groups. In contrast, our model is able to allocate kernels in varying proportions across different layers in the network (Fig. 6, SFG-VGG11) to maximise inductive transfer. Moreover, our method performed better than a model with constant, non-informative grouping probabilities (MT-constant p = [1/3, 1/3, 1/3]), displaying the importance of learning structured representations and connectivity across layers to yield good predictions.

(a) Full training data
Method | Age (MAE) | Gender (Accuracy)
One-task (VGG11) [32] | –.32 | 90.–
MT-hard sharing | –.92 | 90.–
MT-constant mask | –.67 | 89.–
MT-constant p = [1/3, 1/3, 1/3] | –.34 | 92.–
VGG11 Cross-Stitch [4] | –.78 | 90.–
MT-SFG (ours) | –.00 | 92.–

(b) Small training data
Method | Age (MAE) | Gender (Accuracy)
One-task (VGG11) [32] | –.79 | 85.–
MT-hard sharing | –.19 | 85.–
MT-constant mask | –.02 | 85.–
MT-constant p = [1/3, 1/3, 1/3] | –.15 | 86.–
VGG11 Cross-Stitch [4] | –.85 | 83.–
MT-SFG (ours) | –.54 | 87.–

Table 1: Age regression and gender classification results on UTKFace [18] with (a) the full and (b) the limited training set. The best and second-best results are shown in red and blue. The mean absolute error (MAE) is reported for age prediction and classification accuracy for gender prediction. For our model, we performed stochastic forward passes at test time by sampling the kernels from the approximate posterior q_φ(W). We calculated the average age per subject and obtained the gender prediction using the mode of the test-time predictions.

Results on CT image synthesis and organ segmentation from input MRI scans are detailed in Tab. 2. Our method obtains equivalent (not statistically significantly different) results to the Cross-Stitch network [4] on both tasks. We have, however, observed the best synthesis performance in the bone regions (femur heads and pelvic bone region) in our model when compared against all the baselines, including Cross-Stitch. The bone voxel intensities are the most difficult to synthesise from an input MR scan, as task uncertainty in the MR-to-CT mapping at the bone is often highest [10]. Our model was able to disentangle features specific to the bone intensity mapping (Fig. 7) without supervision of the pelvic
(a) CT Synthesis (PSNR)
Method | Overall | Bones | Organs | Prostate | Bladder | Rectum
One-task (HighResNet) [33] | 25.76 (0.80) | 30.35 (0.58) | 38.04 (0.94) | 51.38 (0.79) | 33.34 (0.83) | 34.19 (0.31)
MT-hard sharing | 26.31 (0.76) | 31.25 (0.61) | 39.19 (0.98) | 52.93 (0.95) | 34.12 (0.82) | 34.15 (0.30)
MT-constant mask | –.– (–.57) | 29.– (–.46) | 37.– (–.86) | 50.– (–.73) | 32.– (–.01) | 33.– (–.–)
MT-constant p = [1/3, 1/3, 1/3] | 26.64 (0.54) | 31.05 (0.55) | 39.11 (1.00) | 53.20 (0.86) | 34.34 (1.35) | 35.61 (0.35)
Cross-Stitch [4] | 27.86 (1.05) | 32.27 (0.55) | 40.45 (1.27) | 54.51 (1.01) | 36.81 (0.92) | 36.35 (0.38)
MT-SFG (ours) | 27.74 (0.96) | 32.29 (0.59) | 39.93 (1.09) | 53.01 (1.06) | 35.65 (0.44) | 35.65 (0.37)

(b) Segmentation (DICE)
Method | Overall | Left Femur Head | Right Femur Head | Prostate | Bladder | Rectum
One-task (HighResNet) [33] | – | – | – | – | – | –
MT-constant p = [1/3, 1/3, 1/3] | – | – | – | – | – | –

Table 2: Performance on the medical imaging dataset, with the best results in red and the second-best results in blue. The PSNR is reported for the CT synthesis (synCT) across the whole volume (overall), at the bone regions, across all organ labels and individually at the prostate, bladder and rectum. For the segmentation, the average DICE score per patient across all semantic labels is computed. The standard deviations are computed over the test subject cohort. For our model, we perform stochastic forward passes at test time by sampling the kernels from the approximate posterior distribution q_φ(W). We compute the average of all passes to obtain the synCT and calculate the mode of the segmentation labels for the final segmentation.

location, which allowed it to learn a more accurate mapping of an intrinsically difficult task.

Analysis of the grouping probabilities of a network embedded with SFG modules permits visualisation of the network connectivity and thus the learned MTL architecture. To analyse the group allocation of kernels at each layer, we computed the sum of class-wise probabilities per layer. Learned groupings for both the SFG-VGG11 network trained on UTKFace and the SFG-HighResNet network trained on prostate scans are presented in Fig. 6. These figures illustrate increasing task specialisation in the kernels with network depth. At the first layer, all kernels are classified as shared (p = [0, 1, 0]), as low-order features such as edge or contrast descriptors are generally learned in earlier layers. In deeper layers, higher-order representations are learned, which describe various salient features specific to the tasks. This coincides with our network allocating kernels as task-specific, as illustrated in Fig. 7, where activations are stratified by allocated class per layer. Density plots of the learned kernel probabilities and trajectory maps displaying training dynamics, along with more examples of feature visualisations, are in Supp. Sec. C and D.
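The layer-wise analysis above (summing the class-wise probabilities over the kernels in a layer, as used to produce Fig. 6) amounts to:

```python
import numpy as np

def layer_group_proportions(p):
    """Expected proportion of task-1 / shared / task-2 kernels in one
    layer: sum the class-wise grouping probabilities over its kernels
    and normalise by the kernel count.

    p : (K, 3) grouping probabilities for one layer.
    """
    return p.sum(axis=0) / p.shape[0]

# Hypothetical deep layer: mostly task-specific, one shared kernel.
p = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8],
              [0.7, 0.2, 0.1],
              [0.0, 1.0, 0.0]])
props = layer_group_proportions(p)
assert props.shape == (3,)
assert np.isclose(props.sum(), 1.0)
```

Stacking these per-layer proportions over depth yields the kind of bar plots shown in Fig. 6.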
The corresponding results in the case of duplicate tasks (two duplicates of the same task) are also provided in Supp. Sec. E.

Notably, the learned connectivity of both models shows striking similarities to hard-parameter sharing architectures commonly used in MTL. Generally, there is a set of shared layers, which aim to learn a feature set common to both tasks. Task-specific branches then learn a mapping from this feature space for task-specific predictions. Our models are able to automatically learn this structure whilst allowing asymmetric allocation of task-specific kernels, with no priors on the network structure.

p initialisation: Fig. 8 shows the layer-wise proportion of the learned kernel groups on the UTKFace dataset for four different initialization schemes of the grouping probabilities p: (i) "dominantly shared", where the initial probability mass is concentrated on the shared group, (ii) "dominantly task-specific", where it is concentrated on the task-specific groups, (iii) "random", where p is drawn from Dirichlet(1, 1, 1), and (iv) "start with MT-constant mask", where an equal number of kernels in each layer are set to probabilities p = [1, 0, 0], [0, 1, 0] and [0, 0, 1]. In all cases, the same set of hyper-parameters, including the annealing rate of the temperature term in the GSM
Figure 6:
Learned kernel grouping in (a) the SFG-VGG11 network on UTKFace and (b) the SFG-HighResNet network on medical scans. The proportions of the task-1, shared and task-2 filter groups are shown in blue, green and pink. For SFG-VGG11, task 1 is age regression and task 2 is gender classification. For SFG-HighResNet, task 1 is CT synthesis and task 2 is organ segmentation.
Figure 7: Activation maps from example kernels in the learned task-specific and shared filter groups, G_1^(l), G_2^(l), G_s^(l) (enclosed in blue, green and pink funnels), in the first, second-last and last convolution layers of the SFG-HighResNet model trained on the medical imaging dataset. The results from convolution kernels with low entropy (i.e. high "confidence") of the group assignment probabilities p^(l) are shown for the respective layers.

approximation and the coefficient of the entropy regularizer H(p), were used during training. We observe that the kernel groupings of the respective layers in (i), (ii) and (iii) all converge to a configuration very similar to that observed in Sec. 5.3, highlighting the robustness of our method to different initialisations of p. In case (iv), the learning of p was much slower than in the remaining cases, due to weaker gradients, and we speculate that a higher entropy regularisation weight is necessary to facilitate its convergence.
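As a concrete illustration of the Gumbel-Softmax (GSM) mechanism and temperature annealing discussed above, the sketch below samples a relaxed one-hot group assignment for a single kernel over the three groups (task 1, shared, task 2). The schedule follows the form τ = max(τ_min, exp(−rt)) used in Supp. Sec. A.1; the floor value 0.5 and the rate r = 1e-4 are illustrative placeholders (the exact constants are elided in the text), and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(log_p, tau):
    """Draw a relaxed one-hot sample over the three groups
    (task-1, shared, task-2) via the Gumbel-Softmax trick."""
    g = -np.log(-np.log(rng.uniform(size=log_p.shape)))  # Gumbel(0,1) noise
    y = (log_p + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()  # softmax; approaches one-hot as tau -> 0

def tau_schedule(t, r=1e-4, tau_min=0.5):
    """Anneal the temperature towards tau_min over training iterations t."""
    return max(tau_min, np.exp(-r * t))

p = np.array([0.1, 0.8, 0.1])  # grouping probabilities of one kernel
z = gumbel_softmax_sample(np.log(p), tau_schedule(t=0))
```

At high temperature the sample z is close to the probabilities p (soft sharing of the kernel across groups); as τ is annealed, z approaches a hard one-hot assignment, which is what makes the discrete grouping differentiable during training.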
6. Discussion
In this paper, we have proposed stochastic filter groups (SFGs) to disentangle task-specific and generalist features. SFGs probabilistically define the grouping of kernels, and thus the connectivity of features, in a CNN. We use variational inference to approximate the distribution
Figure 8:
Effect of the initial values of the grouping probabilities p on the learned kernel allocation after convergence.

over connectivity given the training data, and sample over possible architectures during training. Our method can be considered a probabilistic form of multi-task architecture learning [34], as the learned posterior embodies the optimal MTL architecture given the data.

Our model learns structure in the representations. The learned shared (generalist) features may be exploited in either a transfer learning or a continual learning scenario. As seen in [35], an effective prior learned from multiple tasks can be a powerful tool for learning new, unrelated tasks. Our model consequently offers the possibility of exploiting the learned task-specific and generalist features when faced with situations where a third task is needed, which may suffer from unbalanced or limited training data. This is particularly relevant in the medical field, where training data is expensive and laborious to acquire. We will investigate this in future work.

Lastly, a network composed of SFG modules can be seen as a superset of numerous MTL architectures. Depending on the data and the analysed problem, SFGs can recover many different architectures, such as single-task networks, traditional hard-parameter sharing, equivalent allocation across tasks, and asymmetrical grouping (Fig. 3). Note, however, that the proposed SFG module only learns connectivity between neighbouring layers. Non-parallel ordering of layers, a crucial concept in MTL models [13, 12], was not investigated. Future work will investigate the applicability of SFG modules for learning connections across grouped kernels between non-neighbouring layers.

Acknowledgments

FB and MJC were supported by CRUK Accelerator Grant A21993. RT was supported by a Microsoft Scholarship. DA was supported by EU Horizon 2020 Research and Innovation Programme Grant 666992, EPSRC Grants M020533, R014019 and R006032, and the NIHR UCLH BRC.
We thank NVIDIA Corporation for the hardware donation.
References

[1] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[2] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR), 2014.
[3] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
[4] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6129–6138, 2017.
[6] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2019.
[7] Hakan Bilen and Andrea Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in Neural Information Processing Systems, pages 235–243, 2016.
[8] Pim Moeskops, Jelmer M. Wolterink, Bas H.M. van der Velden, Kenneth G.A. Gilhuijs, Tim Leiner, Max A. Viergever, and Ivana Išgum. Deep learning for multi-task medical image segmentation in multiple modalities. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 478–486, 2016.
[9] Sihong Chen, Dong Ni, Jing Qin, Baiying Lei, Tianfu Wang, and Jie-Zhi Cheng. Bridging computational features toward multiple semantic features with multi-task regression: A study of CT pulmonary nodules. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 53–60. Springer, 2016.
[10] Felix Bragman, Ryutaro Tanno, Zach Eaton-Rosen, Wenqi Li, David Hawkes, Sebastien Ourselin, Daniel Alexander, Jamie McClelland, and M. Jorge Cardoso. Uncertainty in multitask learning: Joint representations for probabilistic MR-only radiotherapy planning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 3–11, 2018.
[11] Ryutaro Tanno, Antonios Makropoulos, Salim Arslan, Ozan Oktay, Sven Mischkewitz, Fouad Al-Noor, Jonas Oppenheimer, Ramin Mandegaran, Bernhard Kainz, and Mattias P. Heinrich. AutoDVT: Joint real-time classification for vein compressibility analysis in deep vein thrombosis ultrasound diagnostics. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 905–912, 2018.
[12] Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Latent multi-task architecture learning. In Proceedings of AAAI, 2019.
[13] Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. In International Conference on Learning Representations (ICLR), 2018.
[14] Junshi Huang, Rogerio S. Feris, Qiang Chen, and Shuicheng Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1062–1070, 2015.
[15] Brendan Jou and Shih-Fu Chang. Deep cross residual learning for multitask visual recognition. In Proceedings of the 24th ACM International Conference on Multimedia, pages 998–1007. ACM, 2016.
[16] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[18] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] Mingsheng Long and Jianmin Wang. Learning multiple tasks with deep relationship networks. In Advances in Neural Information Processing Systems, 2017.
[20] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations (ICLR), 2018.
[21] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8(Jan):35–63, 2007.
[22] Laurent Jacob, Jean-Philippe Vert, and Francis R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, 2009.
[23] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 521–528. Omnipress, 2011.
[24] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogério Schmidt Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] Youssef A. Mejjati, Darren Cosker, and Kwang In Kim. Multi-task learning by maximizing statistical dependence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3465–3473, 2018.
[26] Yani Ioannou, Duncan Robertson, Roberto Cipolla, and Antonio Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[27] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2752–2761, 2018.
[28] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[29] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
[30] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
[31] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR), 2017.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[33] Wenqi Li, Guotai Wang, Lucas Fidon, Sebastien Ourselin, M. Jorge Cardoso, and Tom Vercauteren. On the compactness, efficiency, and representation of 3D convolutional networks: Brain parcellation as a pretext task. In International Conference on Information Processing in Medical Imaging (IPMI), 2017.
[34] Jason Liang, Elliot Meyerson, and Risto Miikkulainen. Evolutionary architecture search for deep multitask networks. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 466–473. ACM, 2018.
[35] Alexandre Lacoste, Boris Oreshkin, Wonchang Chung, Thomas Boquet, Negar Rostamzadeh, and David Krueger. Uncertainty in multitask transfer learning. In Advances in Neural Information Processing Systems, 2018.
[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[37] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
[38] Nicholas J. Tustison, Brian B. Avants, Philip A. Cook, Yuanjie Zheng, Alexander Egan, Paul A. Yushkevich, and James C. Gee. N4ITK: Improved N3 bias correction. IEEE Transactions on Medical Imaging, 29(6):1310–1320, 2010.
[39] Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F. Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian Wirkert, and Klaus H. Maier-Hein. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. arXiv:1809.10486, 2018.
[40] L.G. Nyul, J.K. Udupa, and Xuan Zhang. New variants of a method of MRI scale standardization. IEEE Transactions on Medical Imaging, 19(2):143–150, 2000.
[41] Eli Gibson, Wenqi Li, Carole Sudre, Lucas Fidon, Dzhoshkun I. Shakir, Guotai Wang, Zach Eaton-Rosen, Robert Gray, Tom Doel, Yipeng Hu, Tom Whyntie, Parashkev Nachev, Marc Modat, Dean C. Barratt, Sébastien Ourselin, M. Jorge Cardoso, and Tom Vercauteren. NiftyNet: A deep-learning platform for medical imaging. Computer Methods and Programs in Biomedicine, 158:113–122, 2018.
[42] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[45] Marc Harper. python-ternary: Ternary plots in Python, 2015.

A. Training and implementation details
A.1. Optimisation, regularisation and initialisation
All networks were trained with the ADAM optimiser [36] with a fixed initial learning rate and β parameters, and fixed values of the weight and entropy regularisation factors λ in Equation (5). All stochastic filter group (SFG) modules were initialised with grouping probabilities p = [1/3, 1/3, 1/3] for every convolution kernel. Positivity of the grouping probabilities p is enforced by passing the output through a softplus function f(x) = ln(1 + e^x), as in [37]. The schedule τ = max(0.5, exp(−rt)) recommended in [31] was used to anneal the Gumbel-Softmax temperature τ, where r is the annealing rate and t is the current training iteration.

Hyper-parameters for the annealing rate and the entropy regularisation weight were obtained by analysing network performance on a secondary random train/validation/test split of the UTKFace dataset. They were then applied to all trained models (the large and small UTKFace datasets and the medical imaging dataset).

A.2. UTKFace
For training the VGG networks (Section 4.1, UTKFace network), we used the root-mean-squared error (RMSE) loss for age regression and the cross-entropy loss for gender classification. The age labels were divided by a constant factor prior to training. The input RGB images were normalised channel-wise to zero mean and unit variance prior to training and testing. No augmentation was applied. We monitored performance during training using the validation set (n = 3554), performing validation iterations at regular intervals. Performance on the validation set was analysed, and the iteration at which the Mean Absolute Error (MAE) was minimised and the classification accuracy was maximised was chosen for the test set.

A.3. Medical imaging dataset
We used T2-weighted Magnetic Resonance Imaging (MRI) scans (3T, 2D spin echo, TE/TR: 80/2500 ms, voxel size 1.46×1.46×5 mm) and Computed Tomography (CT) scans (140 kVp, voxel size 0.98×0.98×1.5 mm). The MR and CT scans were resampled to isotropic resolution (1.46 mm). We performed intensity non-uniformity correction on the MR scans [38].

In the HighResNet networks (Section 4.1, medical imaging network), we used the RMSE loss for the regression task and the Dice + cross-entropy loss [39] for the segmentation task. The CT voxel intensities were normalised with an affine transformation of the form CT/c + 1, with the background set to the minimum of the intensity range. The input MRI scans were first normalised using histogram normalisation based on fixed lower and upper percentiles [40], and then normalised to zero mean and unit variance. At test time, input MRI scans were normalised using the histogram normalisation transformation obtained from the training set, and then normalised to zero mean and unit variance.

We sub-sampled random patches from random axial slices, sampling from all axial slices in the volume (n = 62). We applied augmentation to the randomly sampled patches using random scaling factors and random rotation angles. The sampled patches were zero-padded to increase their size; however, the loss during training was only calculated in the non-padded regions.

The inference iteration for the test set was selected as the point at which the performance metrics on the training set (Mean Absolute Error and accuracy) first started to converge. In our model, where the grouping probabilities are learned, the iteration at which convergence in the update of the grouping probabilities was first observed was selected, since performance generally increased as the grouping probabilities were updated.

A.4. Implementation details
We used TensorFlow and implemented our models within the NiftyNet framework [41]. Models were trained on NVIDIA Titan Xp, P6000 and V100 GPUs. All networks were trained in the stochastic filter group paradigm. Single-task networks were trained by hard-coding the allocation of kernels: 100% of the kernels in each layer were allocated to task 1 or to task 2, with constant probabilities p = [1, 0, 0] and p = [0, 0, 1], respectively. The multi-task hard-parameter sharing (MT hard-sharing) network was trained by hard-coding the allocation of kernels to the shared group, i.e. 100% of the kernels in each layer were allocated to the shared group with constant probability p = [0, 1, 0]. The cross-stitch (CS) [4] networks were implemented in a similar fashion to the single-task networks, with CS modules applied to the outputs of the task-specific convolutional layers. The other baselines (MT-constant mask and MT-constant p = [1/3, 1/3, 1/3]) were trained similarly.

We used batch normalisation [42] to help stabilise training. We observed that the deviation between population statistics and batch statistics can be high, and thus we did not use population statistics at test time. Rather, we normalised using batch statistics, which consistently led to better predictive performance. We also used the Gumbel-Softmax approximation [31] at test time, using the temperature value τ corresponding to the selected inference iteration in the τ annealing schedule.

B. CNN architectures and details
We include schematics and details of the single-task VGG11 [32] and HighResNet [33] networks in Fig. 9. In this work, we constructed multi-task architectures by augmenting these networks with the proposed SFG modules. We used the PReLU activation function [43] in all networks. For the residual blocks used in the HighResNet networks in Fig. 9 (ii), we applied PReLU and batch-norm as pre-activations [44] to the convolutional layers. The SFG module was used to cluster the kernels in every coloured layer in Fig. 9, and distinct sets of additional transformations (pooling operations for VGG and high-res blocks for HighResNet) were applied to the outputs of the respective filter groups G_1, G_2, G_s. For a fair comparison, the CS units [4] were added to the same set of layers.

For clarification, SFG layer number n (e.g. SFG layer 2) denotes the n-th layer with an SFG module. In SFG-VGG11, every convolutional layer uses SFGs, so the SFG layer number coincides with the layer number in the network. In SFG-HighResNet, not every convolutional layer uses SFGs (e.g. those within residual blocks do not). Consequently, SFG layer 1 corresponds to layer 1, SFG layer 2 to layer 6, SFG layer 3 to layer 11, SFG layer 4 to layer 16, and SFG layer 5 to layer 17.

C. Learned grouping probability plots
In this section, we illustrate density plots of the learned grouping probabilities p for each trained network (Fig. 10 and Fig. 11). We also plot the training trajectories of the grouping probabilities p of all kernels in each layer. These are colour-coded by iteration number: blue for low and yellow for high iteration numbers. This shows that some grouping probabilities are learned quickly in comparison to others.

Fig. 10 and Fig. 11 show that most kernels are in the shared group in the earlier layers of the network, where mostly low-order generic features are learned (as illustrated in Fig. 12, SFG layer 1). They converge quickly to the shared vertex of the 2-simplex, as evidenced by the colour of the trajectory plots. As network depth increases, task specialisation in the kernels increases (see Fig. 12, SFG layers 4 and 5). This is illustrated by high-density clusters at the task-specific vertices and by the trajectory plots.

D. Extra visualisation of activations
Here we visualise the activation maps of additional specialist and generalist kernels on the medical imaging dataset. To classify each kernel according to group (task 1, task 2 or shared), we selected the group with the maximum assignment probability. The corresponding activation maps for various input images in the medical imaging dataset can be viewed in Fig. 12 and Fig. 13.

We first analysed the activation maps generated by kernels with low entropy of p (i.e. highly confident group assignment). At the first layer, all kernels are classified as shared, and the examples in Fig. 12 support that these kernels tend to account for low-order features such as the edges and contrast of the images. At deeper layers, on the other hand, higher-order representations are learned, which describe various salient features specific to the tasks, such as organs for segmentation and bones for CT synthesis. Note that the bones are generally the most difficult region in which to synthesise CT intensities from an input MR scan [10].

Secondly, we looked at activation maps from kernels with high entropy of p (i.e. highly uncertain group assignment) in Fig. 13. In contrast to Fig. 12, the learned features do not appear to capture any meaningful structures for either the synthesis or the segmentation task. Of particular note is the dead kernel in the top row of the figure, showing that high uncertainty in group allocation correlates with non-informative features.

E. Learned filter groups on duplicate tasks
We analysed the dynamics of a network with SFG modules when trained on two duplicates of the same CT regression task (instead of two distinct tasks). Fig. 14 visualises the learned grouping and the trajectories of the grouping probabilities during training. In the first SFG layers, all the kernels are grouped as shared. In the penultimate SFG layer, kernels are either grouped as shared or assigned probability p = [1/2, 0, 1/2], signifying that they can belong to either task. The final SFG layer shows that most kernels have probabilities p = [1/3, 1/3, 1/3]: kernels are thus equally likely to be task-specific or shared. This is expected, as we are training on duplicate tasks and the kernels are therefore equally likely to be useful across all groups.

[Fig. 9 schematic: (i) VGG11 uses repeated 3×3 convolution blocks with 64, 128, 256 and 512 kernels (PReLU, batch norm, 2×2 max pooling with stride 2), followed by global average pooling and a fully connected layer; (ii) HighResNet uses 3×3 convolution layers with 16, 32 and 64 kernels and residual blocks with dilation factors 2 and 4 (batch norm, PReLU).]
Figure 9: Illustration of the single-task architectures, (i) VGG11 and (ii) HighResNet, used for the UTKFace and medical imaging datasets, respectively. In each architecture, the coloured components indicate the layers to which SFG or cross-stitch (CS) modules are applied when extended to the multi-task learning scenario, whilst the components in black denote the additional transformations applied to the outputs of the respective filter groups or CS operations (see the description of the black circles in the schematic provided in Fig. 5 of the main text).
Figure 10: Density plots and trajectory plots of the learned grouping probabilities for the SFG-VGG11 architecture. The density plots represent the final learned probabilities per layer for each kernel. The trajectory plots represent how the grouping probabilities are learned during training, and thus how the connectivity is determined. Histograms of the grouping probabilities were smoothed with a Gaussian kernel with σ = 1. The densities are mapped to, and visualised in, the 2-simplex using python-ternary [45].
Figure 11: Density plots and trajectory plots of the learned grouping probabilities for the SFG-HighResNet architecture. The density plots represent the final learned probabilities per layer for each kernel. The trajectory plots represent how the grouping probabilities are learned during training, and thus how the connectivity is determined.
Figure 12: Example activations for kernels with low entropy of p (i.e. group assignment with high confidence) for three input MR slices in the SFG-HighResNet multi-task network. Columns "Shared", "Task 1" and "Task 2" display the results from the shared, CT-synthesis-specific and organ-segmentation-specific filter groups in the respective layers. We illustrate activations stratified by group in layer 1 (SFG layer 1), layer 16 (SFG layer 4) and layer 17 (SFG layer 5).