OmniNet: A unified architecture for multi-modal multi-task learning
Subhojeet Pramanik (IBM Cloud), Priyanka Agrawal (IBM Research), and Aman Hussain (University of Amsterdam)
Abstract.
Transformer is a popularly used neural network architecture, especially for language understanding. We introduce an extended and unified architecture that can be used for tasks involving a variety of modalities like image, text, videos, etc. We propose a spatio-temporal cache mechanism that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input modalities as well as asynchronous multi-task learning, thus we refer to it as OmniNet. For example, a single instance of OmniNet can concurrently learn to perform the tasks of part-of-speech tagging, image captioning, visual question answering and video activity recognition. We demonstrate that training these four tasks together results in an about three times compressed model while retaining the performance in comparison to training them individually. We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering. This illustrates the generalization capacity of the self-attention mechanism on the spatio-temporal cache present in OmniNet.

Keywords: multi-modal, multi-task learning, transformer, spatio-temporal, attention networks, neural networks
1 Introduction

Transformer [38] is currently one of the best performing models for sequence transduction tasks, especially those involving natural language. It is originally designed for a single task at a time. In fact, most of the generic deep learning architectures [3, 37, 39] that have been designed and developed are able to learn, albeit very well, a single task and handle one task-specific input domain like image, text or audio. Furthermore, with these models we often rely on the generalization capability of the trained network to guarantee performance on unseen examples. Transfer learning [12, 33] is another popular paradigm used to adapt the model to learn a related task with a similar input domain. The success of neural networks across these challenges is known to be due to their ability to learn effective representations of the data. For example, the self-attention mechanism in Transformers can capture the global temporal dependence in sequential data very well. Naturally, the question arises whether we can extend these architectures, like the Transformer, to be able to learn shared representations from multiple input domains and to be able to attend on these representations to perform a multitude of tasks concurrently.

The research into multi-task models that learn to solve varied tasks across a multitude of input domains is not new. Work done in [25] demonstrates an architecture capable of learning a shared representation across audio and video modalities. Similarly, in [6] a convolutional architecture has been designed to support a variety of NLP tasks. However, most of these architectures are designed to learn a specific set of tasks with known input domains. To the best of our knowledge, there does not exist a single unified architecture that works out of the box for any combination of multi-modal inputs.

To address this gap, we extend Transformer towards a unified architecture, namely
OmniNet, which enables a single model to support tasks with multiple input modalities and asynchronous multi-task learning. We consider that most real-life data like image, text, speech, video, etc. is a direct conjunction of spatial and temporal components. Therefore, we employ a spatio-temporal cache mechanism to learn a shared representation of the input data across the spatial (space) and temporal (time) dimensions. Using a generalized encode() function, OmniNet can process and store a spatio-temporal representation for each of the input domains and then decode() predictions across a multitude of tasks. In our experiments, we train a single instance of OmniNet to solve a number of tasks spanning multiple domains, such as part-of-speech tagging, image captioning, visual question answering and video activity recognition. To make our work reproducible, open to scrutiny and further development, we will open source a demonstration of our system implemented using PyTorch [27].
2 Related Work

Multi-task learning has been extensively studied in the literature, with applications to a wide set of problems ranging from natural language processing (NLP) [6, 7, 9, 13] to speech recognition [18, 30] to vision [1, 4, 26, 40]. It has also found its use in combinations of diverse tasks like image captioning and text translation and parsing [22, 29, 41]. However, most of these architectures assume the set of tasks to be known in advance. Similarly, multi-modal learning has been essential for solving a broad range of interesting problems such as Visual Question Answering [15, 16] and Video Question Answering [20]. Again, the state-of-the-art models are highly specific to the objective at hand and not easily adaptable to different tasks or domains. [14] proposed the MultiModel architecture for learning multiple tasks, but it lacks support for multi-modal tasks with more than one input domain, such as visual question answering.
3 Proposed Architecture

We propose a unified architecture, namely OmniNet, to enable learning multi-modal tasks with multiple input domains and to support generic multi-tasking for any set of tasks. The OmniNet architecture consists of multiple sub-networks, called peripheral networks, connected to a common central neural network called the Central Neural Processor (CNP) (Figure 1). Each peripheral network is used to encode the domain-specific input into feature representations. In this work, we describe image, text and video peripherals (Section 3.1). One can add more, say a speech peripheral, depending on the task. The output representation of a peripheral network is always a spatio-temporal tensor $x \in \mathbb{R}^{t \times s \times d_{model}}$, where $t$ and $s$ are the temporal and spatial dimensions of the input respectively, and $d_{model}$ is the model dimension input to the CNP.

Fig. 1. OmniNet performing image captioning, visual question answering and POS tagging at once.
The spatio-temporal representations generated by the peripheral networks corresponding to each input domain are then processed by the CNP. The CNP uses a fully attention based encoder-decoder [2, 5, 36] model for sequence transduction, similar to the Transformer architecture [38], which is the state-of-the-art for multiple language modeling tasks (Section 3.2). During the encoding stage, the CNP implements a generic encode($x$, $D$) function to first process and store the spatio-temporal representations of the input, where $x \in \mathbb{R}^{t \times s \times d_{model}}$ is the spatio-temporal tensor produced by the peripheral networks, $D \in \mathbb{Z}: 0 \le D < D_{len}$ is the domain id, and $D_{len}$ is the maximum number of domains supported by the CNP. The encode() function is called multiple times, once for each multi-modal input from the respective peripheral. During the decoding stage, a decode($y_{shifted}$, $\tau$) function is used to decode predictions as softmax probabilities, where $y_{shifted} \in \mathbb{Z}^{N-1}$ are the target outputs shifted one time-step to the right, $N$ is the length of the output sequence, $\tau \in \mathbb{Z}: 0 \le \tau < \tau_{len}$ is the task id, and $\tau_{len}$ is the total number of supported tasks. The decoding step is similar to [38], modified to incorporate a two-step attention mechanism over the spatial and temporal cache.

3.1 Peripheral Networks

First, we elaborate on how we support multiple input domains using peripheral networks. A peripheral network can use a pre-trained model from existing literature to ultimately encode a given domain input to a standardized feature representation $x \in \mathbb{R}^{t \times s \times d_{model}}$, where $t$ and $s$ are the temporal and spatial dimensions of the input respectively, and $d_{model}$ is the model dimension input to the Central Neural Processor. Here we detail text and vision peripherals; one can add more peripherals or alter the peripheral design depending on the task.
Vision peripheral: This peripheral uses a convolutional neural network to encode image and video inputs in the tasks. For an image of dimension $h \times w \times n_c$, this peripheral down-samples it to $h' \times w' \times n'_c$, where $h$, $w$, $n_c$ are the height, width and number of input channels respectively. For a video, each frame is input to the peripheral to produce $F \times h' \times w' \times n'_c$, where $F$ is the total number of frames in the video. The encoding vectors are then projected to dimension $d_{model}$ using a fully connected layer. The output is then reshaped into a spatio-temporal tensor $x \in \mathbb{R}^{t \times h'w' \times d_{model}}$, where $t = 1$ for an image and $t = F$ for a video. In our experiments, we use the pre-trained ResNet-152 model, a variant of ResNet [10] consisting of 152 convolutional layers. We remove the final fully connected and avg-pooling layers to generate spatial feature representations for a given image/video.
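As a concrete illustration, the snippet below is a minimal sketch of such a vision peripheral in PyTorch. It follows the description above (ResNet-152 trunk with the final average-pooling and fully connected layers removed, a linear projection to $d_{model}$, and a reshape into a $(t, s, d_{model})$ tensor); the class name and default dimensions are ours for illustration and the released implementation may differ.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisionPeripheral(nn.Module):
    """Sketch of a vision peripheral: images/videos -> (t, s, d_model)."""
    def __init__(self, d_model=512):  # d_model = 512 as in the Transformer base model
        super().__init__()
        resnet = models.resnet152(pretrained=True)  # newer torchvision uses weights=...
        # Drop the final avg-pool and fc layers to keep spatial feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():  # the ResNet layers are kept frozen
            p.requires_grad = False
        self.project = nn.Linear(2048, d_model)  # ResNet-152 outputs 2048 channels

    def forward(self, frames):
        # frames: (t, 3, H, W); t = 1 for a single image, t = F for a video.
        feats = self.backbone(frames)             # (t, 2048, h', w')
        t, c, h, w = feats.shape
        feats = feats.flatten(2).transpose(1, 2)  # (t, h'*w', 2048)
        return self.project(feats)                # (t, s, d_model) with s = h'*w'
```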
Language peripheral:
The Language peripheral uses byte-pair encoding [31] to generate subwords for a given input sentence. The subwords are passed to an embedding layer to generate subword embeddings of dimension $d_{emb}$ and projected to dimension $d_{model}$ using a fully connected layer. The output is then reshaped into a spatio-temporal tensor $x \in \mathbb{R}^{t \times 1 \times d_{model}}$, where $t$ is equal to the number of subwords in the input sentence. As we do not have any spatial dimension in textual data, the spatial dimension of $x$ from a Language peripheral is always 1. In our experiments, we used pre-trained subword embeddings with $d_{emb} = 300$ and vocabulary size 25000 from [11], which includes pre-trained subword embeddings for over 275 languages, to initialize the weights of the embedding matrix.

3.2 Central Neural Processor

To process the spatio-temporal information in the input data, the CNP implements a spatial cache $C_s$, a temporal cache $C_t$ and a link array $L$. The spatial and temporal cache and the link array are lists of elements, initialized as empty before the encoding process. During the encoding stage, an encode() routine takes as input the tensor $x$ generated from the peripheral and the corresponding domain/peripheral id $D$. This function processes the spatial and temporal information in the input $x$, stores them into the spatial cache $C_s$ and the temporal cache $C_t$ respectively, and stores their dimensions $t$ and $s$ in the link array. For a given task, this encode() routine is called $K$ times, where $K$ is the number of inputs in the task. Note that these inputs can belong to the same or different domains.
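To make the bookkeeping explicit, the following is a minimal sketch (our own illustrative code, not the released implementation) of how the CNP can hold the two caches and the link array as plain Python lists, with encode() called once per input of a task:

```python
class SpatioTemporalCache:
    """Sketch of the CNP state: spatial cache C_s, temporal cache C_t, link array L."""
    def __init__(self):
        self.spatial = []    # C_s: list of spatial feature vectors (d_model each)
        self.temporal = []   # C_t: list of temporal encodings (d_model each)
        self.links = []      # L: list of (t, s) pairs, one per encoded input

    def reset(self):
        # Caches are re-initialized as empty before each encoding pass.
        self.spatial, self.temporal, self.links = [], [], []

# For a task with K inputs (hypothetical driver loop):
#   cache.reset()
#   for x_k, domain_id in task_inputs:   # x_k has shape (t_k, s_k, d_model)
#       cnp.encode(x_k, domain_id)       # appends to cache.spatial / .temporal / .links
#   probs = cnp.decode(y_shifted, task_id)
```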
Fig. 2. Left: TemporalEncoder architecture; right: OmniNet decode() architecture.
Encode(x, D): For a given input $x \in \mathbb{R}^{t \times s \times d_{model}}$ and domain identifier $D$, the encode() routine is described in Algorithm 1. Since inputs can come from multiple peripherals, the algorithm first concatenates the input with the domain embedding to ensure a domain-aware encoding of the input (Steps 2 to 3). Steps 4 to 7 process the spatial information in $x$ by unrolling the time dimension and adding these unrolled vectors into the spatial cache. Steps 8 to 10 process the temporal information in $x$ by averaging the spatial dimension of $x$ and then passing the averaged tensor to a self-attention based TemporalEncoder. This TemporalEncoder, similar to the encoder used in [38] as shown in Figure 2, is used to calculate temporal embeddings of the input sequence. The output from the TemporalEncoder is appended to the temporal cache.
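Since the TemporalEncoder mirrors the encoder of [38], it can be sketched with PyTorch's built-in Transformer encoder modules; the hyperparameter defaults below are the Transformer base settings referred to later, and the class is our own minimal rendering rather than the released implementation.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch: self-attention encoder over the time-averaged input, as in [38]."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff, dropout=dropout)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, T):
        # T: (t, d_model), the spatial average of the peripheral output x.
        # nn.TransformerEncoder expects (seq_len, batch, d_model) by default.
        # (Positional encodings over the t steps are omitted in this sketch.)
        out = self.encoder(T.unsqueeze(1))   # (t, 1, d_model)
        return out.squeeze(1)                # (t, d_model): appended to the temporal cache
```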
Algorithm 1 encode(): Encodes spatial and temporal representations into the spatial and temporal cache
Require: $x \in \mathbb{R}^{t \times s \times d_{model}}$, $D$, $C_s$, $L$, $C_t$
1: $L \leftarrow L \cup (t \rightarrow s)$
2: $D_{emb} \leftarrow \mathrm{EmbedLayer}(D)$
3: $x \leftarrow FC(\mathrm{Concat}(x, D_{emb}), d_{model})$
4: if $s > 1$ then
5:   $S \leftarrow \mathrm{Reshape}(x, (ts, d_{model}))$  {where output $S = [S_1, \ldots, S_{ts}]$ s.t. $S_i \in \mathbb{R}^{d_{model}}$ is a spatial feature vector}
6:   $C_s \leftarrow C_s \cup [S_1, \ldots, S_{ts}]$  {Append spatial representations to the spatial cache}
7: end if
8: $T \leftarrow (\sum_{i=1}^{s} x[:, i, :]) / s$
9: $T \leftarrow \mathrm{TemporalEncoder}(T)$  {where output $T = [T_1, \ldots, T_t]$ s.t. $T_j \in \mathbb{R}^{d_{model}}$ is the encoding of the temporal dimension in $x$}
10: $C_t \leftarrow C_t \cup [T_1, \ldots, T_t]$  {Append temporal representations to the temporal cache}

The above encoding routine keeps appending spatio-temporal information to $C_t$ and $C_s$ for each input $x^k \in \mathbb{R}^{t_k \times s_k \times d_{model}}$. Note the superscript $k$ to denote correspondence to the $k$-th input of the task, where $k \in 1, \ldots, K$. After $K$ calls, we have the temporal cache $C_t = [T_1, \ldots, T_R]$, where $R = \sum_{r=1}^{K} t_r$; the spatial cache $C_s = [S_1, \ldots, S_P]$, where $P = \sum_p t_p s_p$ over $p \in 1, \ldots, K$ with $s_p > 1$; and the link array $L = [(t_1 \rightarrow s_1), \ldots, (t_K \rightarrow s_K)]$. Note that $C_s$ can also be empty in case encode() is only called with inputs with $s_k = 1$ for all $k$. Next, we use the decode() routine to generate predictions as softmax probabilities.

Decode($y_{shifted}$, $\tau$): The architecture of the decode() function is shown in Figure 2. The decode() function takes as arguments the output labels $y_{shifted}$, shifted one time step to the right, and a task id $\tau$, and generates predictions by attending over the spatial and temporal cache. The decode() function is structured similar to the decoder used in the Transformer architecture [38] (for brevity, we reuse the notation of [38] in this description) and jointly attends over the vectors stored in the temporal and spatial cache. Similar to [38], the decoding first starts by attending over the output embeddings using masked multi-head scaled dot-product attention. The attention layer for the temporal cache uses scaled dot-product attention with multiple heads as specified in [38]. The attention layer for the spatial cache uses gated multi-head attention to attend over the elements of the spatial cache. For inputs with both time and space dimensions (e.g. video), we want the spatial attention layer to attend more on frames which have relatively high attention scores in the temporal cache attention layer. Therefore, the attention score output from the temporal cache multi-head attention layer, $A \in \mathbb{R}^{n_h \times N \times R}$, is used to calculate the tensor $G \in \mathbb{R}^{n_h \times N \times P}$ used for gating the attention score output in the spatial attention layer, where $n_h$ is the number of heads in multi-head attention as described in [38]. The tensor $G$ is calculated using $A$ and $L$ as detailed in Algorithm 2. Given $Q$: the matrix of queries, $K$: keys of dimension $d_k$, and $V$: values of dimension $d_v$, the scaled dot-product attention for the spatial layer is modified as:

$$\mathrm{Attention}(Q, K, V, G) = \Big(\mathrm{Softmax}\Big(\frac{QK^T}{\sqrt{d_v}}\Big) \odot G\Big) V \qquad (1)$$

In order to use the same CNP for multiple tasks with varying output vocabularies, we use multiple output embedding layers $\mathrm{OutputEmbedLayer}_1, \ldots, \mathrm{OutputEmbedLayer}_{\tau_{len}}$ to generate the output embeddings for each task. At the final layer, we use multiple classification layers $(FC + \mathrm{Softmax})_1, \ldots, (FC + \mathrm{Softmax})_{\tau_{len}}$, one for each task. We also calculate a task embedding vector using $\tau$ and always start decoding using the task embedding vector.
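Before turning to how $G$ is constructed (Algorithm 2 below), the gated attention of Equation (1) can be sketched as follows. This is our own minimal rendering: the scaled dot-product scores over the spatial cache are gated element-wise by $G$ before being applied to the values.

```python
import math
import torch

def gated_spatial_attention(Q, K, V, G):
    """Sketch of Eq. (1): (softmax(Q K^T / sqrt(d)) * G) V.

    Q: (n_h, N, d)   queries from the decoder
    K: (n_h, P, d)   keys over the P spatial cache entries
    V: (n_h, P, d)   values over the P spatial cache entries
    G: (n_h, N, P)   gating tensor built from the temporal attention scores (Algorithm 2)
    """
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)  # (n_h, N, P)
    weights = torch.softmax(scores, dim=-1) * G                   # element-wise gating
    return torch.matmul(weights, V)                               # (n_h, N, d)
```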
Algorithm 2 Calculate $G$ using output scores from temporal attention and link array
Require: $L$, $A$
1: $idx \leftarrow 1$
2: $G \leftarrow [\,]$
3: for each $(t, s)$ in $L$ do
4:   if $s > 1$ then
5:     $A' \leftarrow A[:, :, idx : idx + t]$
6:     $A' \leftarrow \mathrm{Expand}(A', (n_h, N, t, s))$  {where Expand(tensor, dimension) expands the tensor according to a given dimension}
7:     $A' \leftarrow \mathrm{Reshape}(A', (n_h, N, ts))$
8:     $G \leftarrow G \cup A'$  {Append the respective temporal attention scores to $G$}
9:   end if
10:   $idx \leftarrow idx + t$
11: end for
12: $G \leftarrow \mathrm{Stack}(G)$  {Stack the list of tensors to construct tensor $G$ of dimension $(n_h, N, P)$}

In order to train a single model simultaneously on multiple tasks, we use the HogWild training approach as described in [28]. Similar to the approach described in [24], the main process holds a global copy of the model. We create separate worker processes for each task, where each process maintains a local copy of the model. At each training iteration, each process starts by synchronizing its local model with the global copy. This is done through forward and backward propagation on its local copy and then copying the locally computed gradients to the global model asynchronously. Each process then calls the global model optimizer asynchronously to update the weights of the global model. Instead of storing the model in CPU as in [24], we always store the local copies across multiple GPUs.
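The asynchronous multi-task training loop can be sketched roughly as below. This is a simplified, CPU-oriented illustration of the HogWild/A3C-style pattern described above (the paper keeps the local copies on multiple GPUs); the helper names (build_omninet, get_batch, num_tasks) are ours and the model is assumed to return its training loss.

```python
import copy
import torch
import torch.multiprocessing as mp

def task_worker(task_id, global_model, optimizer, get_batch, steps=1000):
    """One HogWild-style worker: trains the shared model on a single task."""
    local_model = copy.deepcopy(global_model)   # local copy for this task
    for _ in range(steps):
        # 1. Synchronize the local copy with the current global parameters.
        local_model.load_state_dict(global_model.state_dict())
        # 2. Forward/backward on a batch of this worker's task.
        inputs, targets = get_batch(task_id)
        loss = local_model(inputs, targets, task_id=task_id)
        local_model.zero_grad()
        loss.backward()
        # 3. Copy locally computed gradients into the shared global model.
        for gp, lp in zip(global_model.parameters(), local_model.parameters()):
            gp.grad = lp.grad.clone()
        # 4. Lock-free (Hogwild-style) update of the global weights.
        #    Note: optimizer state is per-process in this sketch; a shared
        #    optimizer variant is typically used in practice.
        optimizer.step()
        optimizer.zero_grad()

# Orchestration (illustrative only):
#   global_model = build_omninet()      # hypothetical constructor
#   global_model.share_memory()         # global copy lives in shared memory
#   optimizer = torch.optim.Adam(global_model.parameters())
#   workers = [mp.Process(target=task_worker,
#                         args=(tid, global_model, optimizer, get_batch))
#              for tid in range(num_tasks)]
#   for w in workers: w.start()
#   for w in workers: w.join()
```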
4 Tasks and Experimental Setup

To evaluate the effectiveness of our proposed framework for tasks spanning diverse modalities, we choose a set covering all possible spatio-temporal data archetypes: Image Captioning, Part-of-Speech (POS) Tagging, Visual Question Answering (VQA) and Video Activity Recognition. Each of these tasks explores a unique potential spatio-temporal configuration of the input containing different values of $t$ and $s$, where $t$ and $s$ are the temporal and spatial dimensions of the input respectively. This enables us to perform a comprehensive study of the multi-modal and multi-input capabilities of the system. We further elaborate on the properties of these tasks below.

For training, we always use a cross-entropy loss with the Adam optimizer [17] and schedule the learning rate using the Noam scheduler [32], similar to [38]. The hyperparameter values used for $n_h$, $d_{model}$, $N_{layers}$, $d_k$, $d_v$ are the same as those specified in the Transformer base model [38]. In the vision peripheral, we freeze the layers of the pre-trained ResNet model; the remaining parts of the architecture (peripherals and CNP) are kept trainable. In this section, we provide details on the datasets used and the model setup for each of these tasks.
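For reference, the learning-rate schedule referred to above follows the warmup-then-decay rule of [38]; a minimal sketch is given below, where the warmup value and the LambdaLR wrapping are our assumptions.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate of [38]: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# Typical PyTorch use (with the optimizer's base lr set to 1.0), e.g.:
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: noam_lr(s + 1))
```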
Part-of-Speech (POS) Tagging: To illustrate the task with only a temporal modality ($t > 1$ and $s = 1$, where $t$ and $s$ are the temporal and spatial dimensions of the input respectively), we consider the POS tagging problem. Given an input sequence of words, the model should produce a sequence of POS tags corresponding to each word. We use the Penn Treebank [23] (https://catalog.ldc.upenn.edu/LDC99T42; we use splits 0-18 as training, 19-21 as development and 22-24 as test sets), which contains gold annotations on English WSJ articles by experts. During the encoding stage, each input sentence is processed by the language peripheral to generate a spatio-temporal tensor $x \in \mathbb{R}^{t \times 1 \times d_{model}}$, where $t$ is the sequence length of subwords in the input. The CNP encode() function is then used to encode $x$ into the temporal cache. Note that the spatial cache is empty for text inputs as $s = 1$. Therefore, the decoding stage is the same as that of the Transformer, used here to predict the sequence of POS tags.

Image Captioning:
This task represents the ones with inputs containing only a spatial modality ($t = 1$ and $s > 1$). The captioning model is required to predict a text caption for a given image. We use the MSCOCO 2014 dataset [21] for training and present results on the COCO validation set. During the encoding stage, the input image is resized to 224 × 224 and processed by the vision peripheral containing the pre-trained ResNet-152 to produce image embeddings $x \in \mathbb{R}^{1 \times h'w' \times d_{model}}$. $x$ is then input to the encode() function, which populates the corresponding spatial and temporal cache. The decoding stage uses the decode() function with an output vocabulary size of 25000 to generate the captions.
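A minimal sketch of the corresponding image preprocessing with torchvision is shown below; the ImageNet normalization constants are our assumption, matching what the pre-trained ResNet-152 expects.

```python
from torchvision import transforms

# Resize to 224x224 and normalize as expected by an ImageNet-pretrained ResNet.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# image_tensor = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224)
```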
Visual Question Answering: For the task with inputs from multiple domains, such that each contains either a spatial or a temporal modality (either $t > 1$ and $s = 1$, or $t = 1$ and $s > 1$ for each input), we choose the task of visual question answering. Given a question over an image as input, the model is supposed to predict the correct answer label. We use the recently introduced VQA v2.0 dataset [8] for this purpose and perform evaluation on the VQA test-dev set. All the images are resized to dimension 224 × 224 before training. The encoding stage of this task utilizes two peripherals: the vision peripheral is used to generate a tensor $x_1 \in \mathbb{R}^{1 \times h'w' \times d_{model}}$ for the input image, and the language peripheral is used to encode the question into $x_2 \in \mathbb{R}^{t \times 1 \times d_{model}}$, where $t$ is equal to the number of subwords in the question. The encode() function is then called two times, first with $x_1$ and second with $x_2$ as input. Finally, decode() with an output vocabulary size of 3500 is used to generate the answers as softmax probabilities in a single decoding step.
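Putting the pieces together for this task, a hypothetical forward pass could look as follows; the peripheral/CNP objects, method names and domain/task identifiers are illustrative, not the released API.

```python
def vqa_forward(vision_peripheral, language_peripheral, cnp, image, question, y_shifted):
    """Hypothetical VQA forward pass: two encode() calls, one decode() step."""
    IMAGE_DOMAIN, TEXT_DOMAIN, VQA_TASK = 0, 1, 2   # illustrative identifiers
    x1 = vision_peripheral(image)         # (1, h'*w', d_model): spatial-only input
    x2 = language_peripheral(question)    # (t, 1, d_model):     temporal-only input
    cnp.reset_cache()                     # start from empty spatial/temporal caches
    cnp.encode(x1, IMAGE_DOMAIN)          # first encode() call (image)
    cnp.encode(x2, TEXT_DOMAIN)           # second encode() call (question)
    return cnp.decode(y_shifted, VQA_TASK)  # softmax over the 3500 answer labels
```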
Video Activity Recognition: For tasks which contain both spatial and temporal modalities in a single input ($t > 1$ and $s > 1$), we consider the action recognition task on videos. For this purpose, we use the HMDB dataset [19]. The dataset consists of over 5000 short-length clips of real-life actions with 51 classes. We present our results on train-test split 1. We use 16 frames per video and resize each of them to 224 × 224, so that the vision peripheral produces a tensor $x \in \mathbb{R}^{16 \times h'w' \times d_{model}}$, which is then used as input to the encode() function. Finally, decode() with an output vocabulary size of 51 is used to predict the action as softmax probabilities in a single decoding step.

5 Results

We present the evaluation on (a) tasks of various modalities illustrated in Section 4, (b) the multi-tasking setup for these tasks (Table 1), and (c) reuse of the multi-task model for an unseen task (Figure 3). In addition, we also provide some ablation studies on the architecture (Table 2).
Performance of the proposed architecture on individual tasks:
We choose a set of four tasks with diverse input modalities and combinations, as described in Section 4. We train the OmniNet model independently on each of the above tasks. Each of the tasks demonstrates unique capabilities of this generic architecture. More specifically, in Table 1 we compare our results with the following state-of-the-art: POS tagging [35]; image captioning and VQA [1]; and HMDB [34]. (Since most of these tasks are popular challenges, we compare with state-of-the-art models which are generically applicable for the respective task rather than specific to the challenge dataset.) It is important to note that we do not perform any hyperparameter optimization. We believe that, with more computational power, fine-tuning the hyperparameters for these tasks should result in performance comparable to or even better than the state-of-the-art. These results can indeed be used as a baseline for any future work which aims at using a single architecture across the various possible spatio-temporal archetypes. It is interesting to note that the model is extensible to a new domain without any modification to the CNP, as long as one can add a specific peripheral to convert domain inputs into spatio-temporal tensors. This aspect of the architecture makes it applicable to several popular multi-modal tasks.
         POS    Captioning       Visual Question Answering          HMDB    # Params
         Acc.   BLEU-4  Meteor   Overall  Y/N    Num.   Other       Acc.
SOTA     97.44  36.2    27.0     63.2     80.3   42.8   55.8        59.4    -
IND      95.61  28.9    25.2     55.31    74.09  35.17  46.35       55.29   450 m
MULT-3   95.82  28.8    25.2     56.79    76.75  35.82  47.16       -       ~149 m
MULT-4   95.44  27.4    24.5     55.76    75.49  35.64  46.08       54.44   ~149 m

Table 1. Performance of OmniNet on a diverse set of tasks. IND: model trained individually for each of the given tasks; MULT-3: multi-task model trained on POS, Captioning and VQA; MULT-4: multi-task model trained across all four tasks.
Effect of training a diverse set of tasks together:
We trained two multi-task models: (1) MULT-3 (POS + VQA + Captioning) and (2) MULT-4 (POS + VQA + Captioning + HMDB), using the HogWild approach. While the MULT-3 model attains similar and sometimes better performance, the final MULT-4 model attains slightly reduced performance when compared to the independent task scores. We believe this is due to the skewness in the size of the HMDB dataset, which contains only 5000 training samples. However, as a tradeoff, adding the HMDB task shows the interesting zero-shot results demonstrated below. Using a multi-task model also results in a three times reduction in the total number of parameters. That is, when a separate model is used for each task, we have a total of over 450 million parameters, whereas during multi-tasking, since a single model is shared, we have a total of about 149 million parameters while achieving similar performance. Interestingly, the model is able to attend on spatio-temporal components of the inputs from different tasks and concurrently generate predictions across them, thus demonstrating the generalization capability of our architecture.

Towards zero-shot learning: reuse of pre-trained network for unseen task:
Sharing representations across multiple tasks provides the benefit of transferring useful knowledge across multiple domains. Since image and video are processed by the same vision peripheral, we conducted an experiment to see whether our model pre-trained on all four tasks (MULT-4) can perform video captioning and video question answering without any explicit training on these tasks, i.e. zero-shot learning. The results of the evaluation on randomly picked instances from the HMDB test split 1 are shown in Figure 3. Interestingly, the model performs quite well on related actions that were present in the COCO and VQA training sets, such as captions related to horse riding and baseball, or questions related to concepts present in VQA. Without training on any video captioning or video QA instance, the model could use the information learned from image captioning, image QA (VQA) and video action recognition (HMDB) and apply it to videos to generate meaningful predictions, hence demonstrating the capability of the model to transfer knowledge across related multi-modal tasks. However, on concepts that are not present in the training datasets, the model either describes the environment in the video or substitutes alternate known concepts. This case study, although not comprehensive, shows the capability of the model to learn shared representations and its ability to transfer knowledge across domains. We believe that adding more tasks and domains will lead to more interesting zero-shot learning results in the future across a wide range of problems.

Fig. 3. Results of zero-shot video captioning and video question-answering.
Impact of individual architectural components:
In order to support different input domains, our architecture introduces the spatial cache and link array components to the original Transformer architecture (which only consists of mechanisms to handle temporal data). We conducted an ablation study on each of these components to verify their importance across various tasks, as shown in Table 2. The ablation was conducted on the independent (IND) as well as the multi-tasking model (MULT-4). The second row ablates the link array from our architecture, i.e. removes the multiplication by G in Equation 1. The link array was designed to assist in tasks with inputs such as video, containing both a spatial as well as a temporal modality in a single input. The total number of spatial components becomes very large as the number of frames in the video increases, thereby making it difficult to attend on various spatial regions throughout the video. Using the link array, the spatial attention layer can attend more on specific important frames in the video. Therefore, removal of the link array leads to a large reduction in performance on HMDB compared to the other tasks, as they do not have both spatio-temporal modalities in any single input. Removal of the spatial cache, on the other hand, has a significant effect on performance across all tasks containing a spatial modality. Since image captioning contains primarily a spatial modality, the BLEU score drops significantly after ablation. As other tasks utilize the temporal cache for prediction, in the multi-task setting the captioning task learns to utilize the spatial average of the image stored in the temporal cache for prediction, and hence retains some performance after ablation of the spatial cache. On the other hand, when trained independently on captioning, the network learns to utilize the information in the spatial cache only, and hence the score drops to zero after ablation. VQA, which leverages both spatial information from the image and temporal information from the question, retains some performance through the use of the temporal cache. Note that the POS tagging task is not affected by ablation of any of the components since it only has a temporal modality in the input.
Captioning (BLEU-4)
VQA (Overall)
HMDB (Acc.)IND MULT-4 IND MULT-4 IND MULT-4 IND MULT-4
OmniNet
Table 2.
Ablation study on the effect of proposed architectural components.
6 Conclusion

We present a unified neural network architecture, OmniNet, capable of learning tasks with multiple inputs of varying modalities. The architecture can further be adopted for multi-task learning across any set of tasks containing spatio-temporal data. Sharing one model across multiple tasks also results in a significant reduction in the total number of parameters. We further demonstrate that this shared model can learn robust representations from various spatio-temporal inputs which are reusable for unseen tasks. We believe that the proposed architecture has wide applicability to any task with spatio-temporal inputs. To extend its usability, we would like to introduce new peripherals supporting more domains such as speech. We are also keen on exploring other aspects of the data beyond temporal and spatial dimensions, such as graphs and relational data. Further, it would be interesting to investigate scheduling mechanisms for optimizing the multi-tasking framework.
References
1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6077–6086 (June 2018). https://doi.org/10.1109/CVPR.2018.00636
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1409.0473
3. Battenberg, E., Chen, J., Child, R., Coates, A., Gaur, Y., Li, Y., Liu, H., Satheesh, S., Seetapun, D., Sriram, A., Zhu, Z.: Exploring neural transducers for end-to-end speech recognition. CoRR abs/1707.07413 (2017), http://arxiv.org/abs/1707.07413
4. Chen, Y., Zhao, D., Lv, L., Zhang, Q.: Multi-task learning for dangerous object detection in autonomous driving. Information Sciences, 559–571 (2018). https://doi.org/10.1016/j.ins.2017.08.035
5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1179
6. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning. pp. 160–167. ICML '08, ACM, New York, NY, USA (2008). https://doi.org/10.1145/1390156.1390177
7. Dong, D., Wu, H., He, W., Yu, D., Wang, H.: Multi-task learning for multiple language translation. In: ACL (2015)
8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
9. Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: Growing a neural network for multiple NLP tasks. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1923–1933. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017). https://doi.org/10.18653/v1/D17-1206
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016). https://doi.org/10.1109/CVPR.2016.90
11. Heinzerling, B., Strube, M.: BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (May 2018)
12. Hu, L., Kan, M., Shan, S., Chen, X.: Duplex generative adversarial network for unsupervised domain adaptation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
13. Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 339–351 (2017)
14. Kaiser, L., Gomez, A.N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., Uszkoreit, J.: One model to learn them all. CoRR abs/1706.05137 (2017), http://arxiv.org/abs/1706.05137
15. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems 31, pp. 1564–1574. Curran Associates, Inc. (2018), http://papers.nips.cc/paper/7429-bilinear-attention-networks.pdf
16. Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual QA. In: Advances in Neural Information Processing Systems 29, pp. 361–369. Curran Associates, Inc. (2016), http://papers.nips.cc/paper/6446-multimodal-residual-learning-for-visual-qa.pdf
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6980
18. Krishna, K., Toshniwal, S., Livescu, K.: Hierarchical multitask learning for CTC-based speech recognition. CoRR abs/1807.06234 (2018), http://arxiv.org/abs/1807.06234
19. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: Proceedings of the International Conference on Computer Vision (ICCV) (2011)
20. Lei, J., Yu, L., Bansal, M., Berg, T.L.: TVQA: Localized, compositional video question answering. CoRR abs/1809.01696 (2018), http://arxiv.org/abs/1809.01696
21. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014)
22. Luong, T., Le, Q.V., Sutskever, I., Vinyals, O., Kaiser, L.: Multi-task sequence to sequence learning. In: International Conference on Learning Representations (2016)
23. Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., Schasberger, B.: The Penn Treebank: Annotating predicate argument structure. In: Proceedings of the Workshop on Human Language Technology. pp. 114–119. HLT '94, Association for Computational Linguistics, Stroudsburg, PA, USA (1994). https://doi.org/10.3115/1075812.1075835
24. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Harley, T., Lillicrap, T.P., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning. pp. 1928–1937. ICML'16, JMLR.org (2016), http://dl.acm.org/citation.cfm?id=3045390.3045594
25. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning. pp. 689–696. ICML'11, Omnipress, USA (2011), http://dl.acm.org/citation.cfm?id=3104482.3104569
26. Pasunuru, R., Bansal, M.: Multi-task video captioning with video and entailment generation. CoRR abs/1704.07489 (2017), http://arxiv.org/abs/1704.07489
27. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017)
28. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems 24, pp. 693–701. Curran Associates, Inc. (2011), http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
29. Ruder, S., Bingel, J., Augenstein, I., Søgaard, A.: Latent multi-task architecture learning (2017)
30. Seltzer, M.L., Droppo, J.: Multi-task learning in deep neural networks for improved phoneme recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6965–6969 (May 2013). https://doi.org/10.1109/ICASSP.2013.6639012
31. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1715–1725. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1162
32. Shazeer, N., Stern, M.: Adafactor: Adaptive learning rates with sublinear memory cost. CoRR abs/1804.04235 (2018), http://arxiv.org/abs/1804.04235
33. Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. CoRR abs/1802.08735 (2018)
34. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems 27, pp. 568–576. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf
35. Spoustová, D.j., Hajič, J., Raab, J., Spousta, M.: Semi-supervised training for the averaged perceptron POS tagger. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009). pp. 763–771. Association for Computational Linguistics, Athens, Greece (Mar 2009)
36. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
37. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261 (2016), http://arxiv.org/abs/1602.07261
38. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
39. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., et al.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016), http://arxiv.org/abs/1609.08144
40. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. pp. 94–108 (2014)
41. Zhao, W., Wang, B., Ye, J., Yang, M., Zhao, Z., Luo, R., Qiao, Y.: A multi-task learning approach for image captioning. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. pp. 1205–1211. International Joint Conferences on Artificial Intelligence Organization (Jul 2018). https://doi.org/10.24963/ijcai.2018/168