Learn to Combine Modalities in Multimodal Deep Learning
Kuan Liu†,‡∗ [email protected]   Yanen Li† [email protected]   Ning Xu† [email protected]   Prem Natarajan‡ [email protected]
†Snap Research, Snap Inc., Venice, CA   ‡Computer Science Department & Information Sciences Institute, University of Southern California
∗The majority of the work in this paper was carried out while the author was affiliated with Snap Research, Snap Inc.
Abstract
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches. However, it is challenging to fully leverage different modalities due to practical challenges such as varying levels of noise and conflicts between modalities. Existing methods do not adopt a joint approach to capturing synergies between the modalities while simultaneously filtering noise and resolving conflicts on a per sample basis. In this work we propose a novel deep neural network based technique that multiplicatively combines information from different source modalities. The model training process thus automatically focuses on information from more reliable modalities while reducing emphasis on the less reliable modalities. Furthermore, we propose an extension that multiplicatively combines not only the single-source modalities, but also mixtures of source modalities, to better capture cross-modal signal correlations. We demonstrate the effectiveness of our proposed technique by presenting empirical results on three multimodal classification tasks from different domains. The results show consistent accuracy improvements on all three tasks.
1 Introduction

Signals from different modalities often carry complementary information about different aspects of an object, event, or activity of interest. Therefore, learning-based methods that combine information from multiple modalities are, in principle, capable of more robust inference. For example, a person's visual appearance and the type of language he uses both carry information about his age. In the context of user profiling in a social network, it helps to predict users' gender and age by modeling both users' profile pictures and their posts. A natural generalization of this idea is to aggregate signals from all available modalities and build learning models on top of the aggregated information, ideally allowing the learning technique to figure out the relative emphases to be placed on different modalities for a specific task. This idea is ubiquitous in existing multimodal techniques, including early and late fusion [42, 15], hybrid fusion [1], model ensembles [7], and, more recently, joint training methods based on deep neural networks [38, 45, 37]. In these methods, features (or intermediate features) are put together and jointly modeled to make a decision. We call them additive approaches due to the type of aggregation operation. Intuitively, they are able to gather useful information and make predictions collectively.

However, it is practically challenging to learn to combine different modalities. Given multiple input modalities, artifacts such as noise may be a function of the sample as well as the modality; for example, a clear, high-resolution photo may lead to a more confident estimation of age than a lower quality photo. Also, either signal noise or classifier vulnerabilities may result in decisions that lead to conflicts between modalities. For instance, in the example of user profiling, some users' gender and age can be accurately predicted from a clear profile photo, while others with a noisy or otherwise unhelpful (e.g., cartoon) profile photo may instead have the most relevant information encoded in their social network engagement, such as posts and friend interactions. In such a scenario, we refer to the affected modality, in this case the image modality, as a weak modality. We emphasize that this weakness can be sample-dependent, and is thus not easily controlled with global bias parameters. An ideal algorithm should be robust to the noise from those weak modalities and pick out the relevant information from the strong modalities on a per sample basis, while at the same time capturing the possible complementariness among modalities.

We would like to point out that the existing additive approaches do not fully address the challenges mentioned earlier. Their basic assumptions are 1) every modality is always potentially useful and should be aggregated, and 2) the models (e.g., a neural network) on top of aggregated features can be trained well enough to recover the complex function mapping to a desired output. While in theory the second assumption should hold, i.e., the learned models should be able to determine the quality of each modality per sample given a sufficiently large amount of data, such models are, in practice, difficult to train and regularize due to the finiteness of available data.

In this work, we propose a new multiplicative multimodal method which explicitly models the fact that on any particular sample not all modalities may be equally useful.
The method first makes decisions on each modality independently. The multimodal combination is then done in a differentiable and multiplicative fashion. This multiplicative combination suppresses the cost associated with the weak modalities and encourages the discovery of truly important patterns from informative modalities. In this way, on a particular sample, inferences from weak modalities get suppressed in the final output. And, perhaps even more importantly, they are not forced to generate a correct prediction (from noise!) during training. This accommodation of weak modalities helps to reduce model overfitting, especially to noise. As a consequence, the method effectively achieves an automatic selection of the more reliable modalities and ignores the less reliable ones. The proposed method is also end-to-end and enables jointly training model components on different modalities.

Furthermore, we extend the multiplicative method with the ideas of additive approaches to increase model capacity. The motivation is that certain unknown mixtures of modalities may be more useful than a good single modality. The new method first creates different mixtures of modalities as candidates, and these candidates make decisions independently. The multiplicative combination then automatically selects the more appropriate candidates. In this way, the selection operates on "modality mixtures" instead of just single modalities. This mixture-based approach enables structured discovery of the possible correlations and complementariness across modalities and increases the model capacity in the first step. A similar selection process applied in the second step ignores irrelevant and/or redundant modality mixtures. This again helps control model complexity and avoid excessive overfitting.

We validate our approach on classification tasks in three datasets from different domains: image recognition, physical process classification, and user profiling. Each task provides more than one modality as input. Our methods consistently outperform the existing, state-of-the-art multimodal methods.

In summary, the key contributions of this paper are as follows:

• The multimodal classification problem is considered with a focus on addressing the challenge of weak modalities.

• A novel deep learning combination method that automatically selects strong modalities per sample and ignores weak modalities is proposed and experimentally evaluated. The method works with different neural network architectures and is jointly trained in an end-to-end fashion.

• A novel method to automatically select mixtures of modalities is presented and evaluated. This method increases model capacity to capture possible correlations and complementariness across modalities.
• Experimental evaluations on three real-world datasets from different domains show that the new methods consistently outperform existing multimodal methods.

Figure 1: Illustration of different deep neural network based multimodal methods. (a) A gender prediction example with text (a fake userid) and dense feature (fake profile information) modality inputs. (b) Additive combination methods train neural networks on top of aggregated signals from different modalities; equal errors are back-propagated to the different modality models. (c) Multiplicative combination selects a decision from a more reliable modality; errors back-propagated to the weaker modality are suppressed. (d) Multiplicative modality mixture combination first additively creates mixture candidates and then selects useful modality mixtures with the multiplicative combination procedure.
2 Background

To set the context for our work, we now describe two existing types of popular multimodal approaches in this section. We begin with notation, then describe traditional approaches, followed by existing deep learning approaches.
Notations
We use $M$ to indicate the number of modalities available in total. We denote each input modality/signal as a dense vector $v_m \in \mathbb{R}^{d_m}$, for $m = 1, \ldots, M$. For example, given $M = 3$ modalities in the user profiling task, $v_1$ is the profile image represented as a vector, $v_2$ is the posted text representation, and $v_3$ encodes the friend network information. We consider a $K$-way classification setting where $y$ denotes the labels. $p_m^k$ denotes the prediction probability of the $k$-th class from the $m$-th modality, and $p^k$ denotes the model's final prediction probability of the $k$-th class. Throughout the paper, superscripts are used as indices for classes and subscripts are used for modalities.

Early fusion

Early fusion methods create a joint representation of input features from multiple modalities. Next, a single model is trained to learn the correlations and interactions between low-level features of each modality. We denote the single model as $h$. The final prediction can be written as

$p = h([v_1, \ldots, v_M]),$   (1)

where we use concatenation here as a commonly seen example of jointly representing modality features. Early fusion can be seen as an initial attempt at multimodal learning. The training pipeline is simple, as only one model is involved. It usually requires the features from different modalities to be highly engineered and preprocessed so that they align well or are similar in their semantics. Furthermore, it uses one single model to make predictions, which assumes that the model is well suited for all the modalities.

Late fusion

Late fusion uses unimodal decision values and fuses them with a fusion mechanism $F$ (such as averaging [41], voting [35], or a learned model [11, 40]). Suppose model $h_i$ is used on modality $i$ ($i = 1, \ldots, M$); the final prediction is

$p = F(h_1(v_1), \ldots, h_M(v_M)).$   (2)

Late fusion allows the use of different models on different modalities, thus allowing more flexibility. It is also easier to handle a missing modality, as the predictions are made separately. However, because late fusion operates on inferences and not the raw inputs, it is not effective at modeling signal-level interactions between modalities.

Due to their superior performance and computationally tractable representation capability (in vector spaces) in multiple domains such as visual, audio, and text, deep neural networks have gained tremendous popularity in multimodal learning tasks [38, 39, 44]. Typically, domain-specific neural networks are used on different modalities to generate their representations, and the individual representations are merged or aggregated. Finally, the prediction is made on top of the aggregated representation, usually with another neural network, to capture the interactions between modalities and learn a complex function mapping between input and output. Addition (or averaging) and concatenation are two common aggregation methods, i.e.,

$u = \sum_{m} f_m(v_m)$   (3)

or

$u = [f_1(v_1), \ldots, f_M(v_M)],$   (4)

where each $f_m : \mathbb{R}^{d_m} \to \mathbb{R}^{d}$ ($m = 1, \ldots, M$) is a domain-specific neural network. Given the combined vector output $u \in \mathbb{R}^{d}$ (or $\mathbb{R}^{\sum_m d_m}$), another network $g$ computes the final output:

$p = g(u), \quad g : \mathbb{R}^{d} \to \mathbb{R}^{K}.$   (5)

The network structure is illustrated in Figure 1(b). The arrows are function mappings or computing operations. The dotted boxes are representations of single and combined modality features.
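For concreteness, the following Python (PyTorch) sketch shows one common way to realize the additive combination in (3)-(5): per-modality encoders $f_m$ map each input into a shared space, the results are summed or concatenated, and a single network $g$ produces the prediction. This is an illustration rather than any specific system from the literature; the layer sizes, activations, and class count are assumptions.

import torch
import torch.nn as nn


class AdditiveFusion(nn.Module):
    def __init__(self, input_dims, d=128, num_classes=10, mode="concat"):
        super().__init__()
        self.mode = mode
        # one domain-specific encoder f_m per modality (Eqs. 3-4)
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(dm, d), nn.ReLU()) for dm in input_dims]
        )
        g_in = d * len(input_dims) if mode == "concat" else d
        # a single network g on top of the aggregated representation (Eq. 5)
        self.g = nn.Sequential(nn.Linear(g_in, d), nn.ReLU(), nn.Linear(d, num_classes))

    def forward(self, inputs):
        # inputs: list of per-modality tensors v_m, each of shape (batch, d_m)
        hs = [f(v) for f, v in zip(self.encoders, inputs)]
        u = torch.cat(hs, dim=-1) if self.mode == "concat" else torch.stack(hs).sum(dim=0)
        return self.g(u)  # logits; a softmax over them gives p in Eq. (5)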
We call these additive combinations because their critical step is to add modality hidden vectors (although often in a nonlinear way).

In Section 5, we present related work in areas such as learning joint multimodal representations using a shared semantic space. Those approaches are not directly applicable to our task, where we aim to predict latent attributes rather than merely the observed identities of the sample.

3 Multiplicative combination of modalities

The additive approaches discussed above make no assumptions regarding the reliability of different modality inputs. As such, their performance critically relies on the single network $g$ to figure out the relative emphases to be placed on different modalities. From a modeling perspective, the aim is to recover the function mapping between the combined representation $u$ and the desired outputs. This function can be complex in real scenarios. For instance, when the signals are similar or complementary to each other, $g$ is supposed to merge them to make a strengthened decision; when signals conflict with each other, $g$ should filter out the unreliable ones and make a decision based primarily on the more reliable modalities. While in theory $g$, often parameterized as a deep neural network, has the capability to recover an arbitrary function given a sufficiently large (essentially unlimited) amount of data, it can be, in practice, very difficult to train and regularize given the data constraints of real applications. As a result, model performance degrades significantly.

Our aim is to design a more (statistically) efficient method by explicitly assuming that some modalities are not as informative as others on a particular sample. As a result, they should not all be fed into a single network for training. Intuitively, it is easier to train a model on the input of a good modality rather than on a mix of good ones and bad ones. Here we differentiate modalities into informative modalities (good) and weak modalities (bad). Note that the labels informative and noisy are applied with respect to each particular sample.

To begin, let every modality make its own independent decision with its modality-specific model (e.g., $p_i = g_i(v_i)$). Their decisions are combined by taking an average. Specifically, we have the following objective function,

$L_{ce} = \ell^{y}_{ce}, \qquad \ell^{y}_{ce} = -\sum_{i=1}^{M} \log p_i^{y},$   (6)

where $y$ denotes the true class index, and we call $\ell^{y}$ a class loss as it is the part of the loss function associated with a particular class. In the testing stage, the model predicts the class with the smallest class loss, i.e.,

$\hat{y} = \arg\min_y \ell^{y}_{ce}.$   (7)

This relatively standard approach allows us to train one model per modality. However, when weak modalities exist, the objective (6) increases significantly. Minimizing (6) forces every model, based on its own modality, to perform well on the training data. This can lead to severe overfitting, as a noisy modality simply does not contain the information required to make a correct prediction, yet the loss function penalizes it heavily for incorrect predictions.

To mitigate this overfitting problem, we propose a mechanism to suppress the penalty incurred on noisy signals from certain modalities. The cost on a modality is down-weighted when there exist other good modalities for this example. Specifically, a modality is good (or bad) when it assigns a high (or low) probability to the correct class. A higher probability indicates more informative signals and stronger confidence.
With that in mind, we design a down-weighting factor as follows,

$q_i = \Big[\prod_{j \neq i} (1 - p_j)\Big]^{\beta/(M-1)},$   (8)

where we omit the class index superscripts on $p$ and $q$ for brevity; $\beta$ is a hyper-parameter that controls the strength of the down-weighting and is chosen by cross-validation. The new training criterion then becomes

$L_{mul} = \ell^{y}_{mul}, \qquad \ell^{y}_{mul} = -\sum_{i=1}^{M} q_i^{y} \log p_i^{y}.$   (9)

The scaling factor $[\prod_{j \neq i} (1 - p_j)]^{\beta/(M-1)}$ represents the average prediction quality of the remaining modalities. This term is close to 0 when some $p_j$ are close to 1. When those other modalities ($j \neq i$) have confident predictions on the correct class, the term has a small value, thus suppressing the cost on the current modality ($p_i$). Intuitively, when other modalities are already good, the current modality ($p_i$) does not have to be equally good. This down-weighting reduces the training requirement on all modalities and reduces overfitting. [24] uses this term to ensemble different layers of a convolutional network in an image recognition task. We introduce the important hyper-parameter $\beta$ to control the strength of these factors: larger values give a stronger suppressing effect and vice versa. During testing, we follow a criterion similar to (7), replacing $\ell_{ce}$ with $\ell_{mul}$.

We call this strategy a multiplicative combination due to the use of multiplicative operations in (8). During training, the process always tries to select some modalities that give the correct prediction and tolerates mistakes made by other modalities. This tolerance encourages each modality to work best in its own areas instead of on all examples.

We emphasize that $\beta$ implements a trade-off between ensembling and non-smoothed multiplicative combination. When $\beta = 0$, we have $q = 1$ and predictions from different modalities are averaged; when $\beta = 1$, there is no smoothing at all on the $(1 - p_j)$ terms, so that a good modality will strongly down-weight the losses from the other modalities.

The proposed combination can be implemented as the last layer of a combination neural network, as it is differentiable. Errors in (9) can be back-propagated to the different components of the model so that the model can be trained jointly.

Despite providing a mechanism to selectively combine good and bad modalities, the multiplicative layer as configured above has some limitations. Specifically, there is a mismatch between minimizing the objective function and maximizing the desired accuracy. To illustrate this, we take a step back and look at the standard cross-entropy objective function (6) (with $M = 1$). We have $\exp(-\ell^{1}) + \exp(-\ell^{2}) = p^{1} + p^{2} = 1$ when $K = 2$; let us call it normalized. It then makes intuitive sense to minimize only $\ell^{y}$ in the training phase so that we have $\ell^{y} < \ell^{y'}$ ($y' \neq y$), thus maximizing the accuracy. However, if we look at (9), the same normalization no longer applies, due to the complication of multiple modalities ($M > 1$) and the introduction of the down-weighting factors $q_i$. Therefore, minimizing $\ell^{y}$ does not guarantee driving $\ell^{y} < \ell^{y'}$, or vice versa. There are two important consequences of this mismatch. First, the method may stop minimizing the class loss on the correct class while the prediction is still incorrect. Second, it may keep reducing class losses that already yield correct predictions.
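To make the layer concrete, here is a minimal Python (PyTorch) sketch of Eqs. (6)-(9) as described above. It reflects our reading of the formulas rather than a released implementation; the epsilon clamp and the question of whether gradients should flow through the factor $q$ are implementation choices we assume here.

import torch


def mul_class_losses(probs, beta=0.5, eps=1e-8):
    """probs: (M, B, K) per-modality class probabilities for M modalities, batch size B,
    and K classes. Returns the (B, K) multiplicative class losses of Eqs. (8)-(9)."""
    M = probs.shape[0]
    one_minus = (1.0 - probs).clamp_min(eps)
    # log prod_{j != i} (1 - p_j^k), computed as the sum over all j minus the i-th term
    log_excl = one_minus.log().sum(dim=0, keepdim=True) - one_minus.log()
    q = torch.exp(log_excl * beta / max(M - 1, 1))        # down-weighting factor, Eq. (8)
    return -(q * probs.clamp_min(eps).log()).sum(dim=0)   # Eq. (9), summed over modalities


def mul_loss_and_prediction(probs, y, beta=0.5):
    """y: (B,) true class indices. Returns the training loss and the rule-(7) prediction."""
    losses = mul_class_losses(probs, beta)                   # (B, K) class losses
    train_loss = losses[torch.arange(y.shape[0]), y].mean()  # minimize the true-class loss
    y_hat = losses.argmin(dim=-1)                            # predict the smallest class loss
    return train_loss, y_hat

With beta = 0 the factor q equals 1 and the objective reduces to the averaged cross-entropy of (6), matching the trade-off described above.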
A tempting naive approach

When addressing this issue, a tempting approach is to normalize the class losses, similar to normalizing a probability vector. A deeper consideration reveals the pitfall inherent in that temptation: normalizing class losses does not make sense, because the class losses in an objective function are error surrogates that usually serve as upper bounds on the training errors. While it makes sense to minimize the surrogates on the correct classes, it is pointless, and perhaps counterproductive, to maximize the losses on the wrong classes. What regular normalization techniques do is maximize the gap between the losses on the correct and wrong classes, effectively minimizing the correct ones and maximizing the wrong ones. Experimental results validate this analysis.
Boosting extension
We propose a modification of the objective function in (9) to address the issue. Rather than always placing a loss on the correct class, we place a penalty only when the class loss value is not the smallest among all the classes. This creates a connection to the prediction mechanism in (7). If the prediction is correct, there is no need to further reduce the class loss on that instance; if the prediction is wrong, the class loss should be reduced even if the loss value is already relatively small. To increase robustness, we add a margin formulation in which the loss on the correct class should be smaller by a margin. Thus, the objective we use is as follows,

$L = \ell^{y}_{mul} \Big(1 - \prod_{\forall y' \neq y} \mathbb{1}\big[\ell^{y}_{mul} + \delta < \ell^{y'}_{mul}\big]\Big),$   (10)

where the bracketed product on the right-hand side of (10) indicates whether the loss associated with the correct class is the smallest (by a margin). The margin $\delta$ is chosen in the experiments by cross-validation. The new objective function only aims to minimize the class losses that still need improvement. For examples that already have a correct classification, the loss is counted as zero. Therefore, the objective only adjusts the losses that lead to wrong predictions. It makes model training and the desired prediction accuracy better aligned.

Boosting connection
There is a clear connection between the new objective (10) and boosting ideas if we consider the examples on which (7) makes wrong predictions as hard examples and the others as easy examples. The objective (10) looks only at the hard examples and directs effort toward improving their losses. The hard examples change during the training process, and the algorithm adapts to that. Therefore, we call the new training criterion boosted multiplicative training.
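Continuing the sketch above, a hedged reading of the boosted objective (10) is shown below. It reuses mul_class_losses from the earlier sketch; the default beta and delta values are placeholders rather than tuned settings.

import torch


def boosted_mul_loss(probs, y, beta=1.0, delta=0.05):
    """Eq. (10): keep the true-class loss only for 'hard' examples, i.e. examples whose
    true-class loss is not smaller than every other class loss by the margin delta."""
    losses = mul_class_losses(probs, beta)            # (B, K), from the earlier sketch
    B = y.shape[0]
    true_loss = losses[torch.arange(B), y]            # l^y_mul
    others = losses.clone()
    others[torch.arange(B), y] = float("inf")         # exclude the true class
    correct_by_margin = true_loss + delta < others.min(dim=-1).values  # indicator in (10)
    return (true_loss * (~correct_by_margin).float()).mean()  # zero loss on easy examples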
4 Multiplicative combination of modality mixtures

The multiplicative combination layer explicitly assumes that modalities are noisy and automatically selects good modalities. One limitation is that the models $g_i$ ($i = 1, \ldots, M$) are trained primarily based on a single modality (although they do receive back-propagated errors from the other modalities through joint training). This prevents the method from fully capturing synergies across modalities. In a Twitter example, a user's follower network and followee network are two modalities that are different but closely related; they jointly contribute to predictions concerning the user's interests. The multiplicative combination in Section 3 is not ideal for capturing such correlations. On the other hand, additive methods are able to capture such correlations more easily by design (although they do not explicitly handle modality noise and conflicts).

Given the complementary qualities of the additive and multiplicative approaches, it is desirable to harness the advantages of both. To achieve that goal, we propose a new method. At a high level, we want our method to first have the capability to capture all possible interactions between different modalities and then to filter out noise and pick useful signals.

In order to model interactions of different modalities, we first create different mixtures of modalities. Specifically, we enumerate all possible mixtures from the power set of the set of modality features. On each mixture, we apply the additive operation to extract a higher-level feature representation as follows,

$u_c = \sum_{k \in M_c} f_k(v_k), \qquad M_c \subset \{1, 2, \ldots, M\},$   (11)

where $M_c$ contains one or more modalities. Thus $u_c$ is the representation of the mixture of modalities in set $M_c$; it gathers signals from all the modalities in $M_c$. Since there are $2^M - 1$ different non-empty sets $M_c$, there are $2^M - 1$ vectors $u_c$, and each $u_c$ looks into a different modality mixture. We call each $u_c$ a mixture candidate, as we believe not every mixture is equally useful; some mixtures may be very helpful to model training while others could even be harmful.

Given the generated candidates, we make predictions based on each of them independently. Concretely, as in the additive approach, a neural network is used to make a prediction $p_c$ as follows,

$p_c = g_c(u_c),$   (12)

where $p_c$ is the prediction result from an individual mixture. Different $p_c$ may not agree with each other, so a mechanism is still needed to select which one to believe or how to combine the $p_c$.

Among the combination candidates generated above, it is not clear which mixtures are strong and which are weak, due to the way the proposals are enumerated. One simple option is to average the predictions from all candidates. However, this loses the ability to discriminate between different modalities and again treats all modalities as equally useful. From the modeling perspective, it is similar to simply applying the additive approach to the modalities in the first place. Our goal is to automatically select strong candidates and ignore weak ones.

To achieve that, we apply the multiplicative combination layer (9) from Section 3 to the selection of mixture candidates in (12), i.e.,

$\ell^{y} = -\sum_{c} q_c^{y} \log p_c^{y},$   (13)

where the sum runs over all mixture candidates and $q_c$ is defined analogously to (8). Equation (13) follows (9) except that each model here is based on a mixture candidate instead of a single modality.

With (10), (11), (12), and (13), our method pipeline can be illustrated as in Fig. 1(d). It first additively creates modality mixture candidates.
Such candidates can be features from one single modality or mixed features from multiple modalities. These candidates by design make it more straightforward to consider signal correlation and complementariness across modalities. However, it is unknown which candidate is good for a given example; some candidates can be redundant and noisy. The method then combines the predictions of the different mixtures multiplicatively. The multiplicative layer enables candidate selection in an automatic way, where strong candidates are picked while weak ones are ignored without dramatically increasing the entire objective function. As a whole, the model is able to pick the most useful modalities and modality mixtures with respect to our prediction task.
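One possible realization of the mixture construction and selection in (11)-(13) is sketched below in the same Python (PyTorch) style. The encoders, heads, dimensionalities, and class count are illustrative assumptions; the stacked candidate probabilities are meant to be passed to the mul_class_losses sketch from Section 3, with candidates playing the role of modalities.

from itertools import chain, combinations

import torch
import torch.nn as nn


class MulMix(nn.Module):
    def __init__(self, input_dims, d=128, num_classes=10):
        super().__init__()
        M = len(input_dims)
        # one encoder f_k per modality, mapping into a shared d-dimensional space
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(dm, d), nn.ReLU()) for dm in input_dims]
        )
        # all 2^M - 1 non-empty modality subsets M_c (Eq. 11)
        self.subsets = list(
            chain.from_iterable(combinations(range(M), r) for r in range(1, M + 1))
        )
        # one prediction network g_c per mixture candidate (Eq. 12)
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for _ in self.subsets])

    def forward(self, inputs):
        # inputs: list of per-modality tensors v_m, each of shape (batch, d_m)
        hs = [f(v) for f, v in zip(self.encoders, inputs)]
        probs = []
        for subset, head in zip(self.subsets, self.heads):
            u_c = torch.stack([hs[k] for k in subset]).sum(dim=0)  # additive mixture u_c
            probs.append(torch.softmax(head(u_c), dim=-1))         # p_c = g_c(u_c)
        return torch.stack(probs)  # (2^M - 1, B, K), consumed by the multiplicative layer (Eq. 13)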
5 Related work

Multimodal learning

Traditional multimodal learning methods include early fusion (i.e., feature based), late fusion (i.e., decision based), and hybrid fusion [1]. They also include model-based fusion such as multiple kernel learning [13, 4, 9] and graphical model based approaches [36, 10, 16]. Deep neural networks are very actively explored for multimodal fusion [38]. They have been used to fuse information for audio-visual emotion classification [45], gesture recognition [37], affect analysis [25], and video description generation [23]. While the modalities used, the architectures, and the optimization techniques might differ, the general idea of fusing information in a joint hidden layer of a neural network remains the same.
Multiplicative combination technique
Multiplicative combination is widely explored in machine learning methods. [6] uses an OR graphical model to combine similarity probabilities across different feature components: the probabilities of dissimilarity between pairs of objects are multiplied to generate the final probability of being dissimilar, thus picking out the most optimistic component. [24] ensembles multiple layers of a convolutional network with a down-weighting objective function that is a specialized instance of our (8). Our objective is more general and flexible; furthermore, we develop a boosted training strategy and modality mixture combination to address multimodal classification challenges. [30] develops the focal loss to address class imbalance; in its single-modality setting, it down-weights every class loss by one minus its own probability.

Attention techniques [2, 5, 33] can also be treated as multiplicative methods for combining multiple modalities: features from different modalities are dynamically weighted before being mixed together. The multiplicative operation is performed at the feature level instead of the decision level.
Other multimodal tasks

There are other multimodal tasks where the ultimate goal is not classification, such as various image captioning tasks. In [43], a CNN image representation is decoded using an LSTM language model. In [22], gLSTM incorporates the image data together with sentence decoding at every time step, fusing visual and sentence data in a joint representation. Joint multimodal representation learning is also used for visual and media question answering [8, 32, 46], visual integrity assessment [21], and personalized recommendation [19, 31].
6 Experiments

We validate our methods on three datasets from different domains: image recognition, physical process classification, and user profiling. On these tasks, we are given more than one input modality and try to best use these modalities to achieve good generalization performance. Our code is publicly available at https://github.com/skywaLKer518/MultiplicativeMultimodal.

Table 1: Datasets and modalities.

Datasets    Modalities
CIFAR100    features output from 3 ResNet units
HIGGS       low-level, high-level features
gender      first name, userid, engagement

The CIFAR-100 dataset [28] contains 50,000 and 10,000 color images of size 32 ×
32 from 100 classes, for training and testing purposes, respectively. As observed by [47, 17], different layers of a convolutional neural network (CNN) contain different signals of an image (different abstraction levels) that may be useful for classification on different examples. [24] takes three layers of a network-in-network (NIN) [29] model and demonstrates recognition accuracy improvements. In our experiments, the features from three different layers of a CNN are regarded as three different modalities.

Network architecture

We use ResNet [18] as the network in our experiments, as it significantly outperforms NINs on this task; ResNet also has a block structure that makes our choice of modalities easier and more natural. We experimented with the Resnet-32 and Resnet-110 architectures. In both networks there are three residual units, and we take the hidden states of the three units as modalities. We follow [24] and weight the losses of the different layers by (0.3, 0.3, 1.0). Our implementations are based on [12].
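As an illustration only (the actual implementation builds on [12]), the following sketch shows how the three residual-unit outputs could be exposed as per-"modality" probability vectors for the multiplicative layer. The backbone interface is an assumption, the channel widths (16, 32, 64) are those of standard CIFAR ResNets, and the per-unit loss weights (0.3, 0.3, 1.0) would be applied to the corresponding terms of the combination loss.

import torch
import torch.nn as nn


class ResNetUnitHeads(nn.Module):
    def __init__(self, backbone, unit_channels=(16, 32, 64), num_classes=100):
        super().__init__()
        # backbone is assumed to return the three residual-unit feature maps for an input batch
        self.backbone = backbone
        self.heads = nn.ModuleList([nn.Linear(c, num_classes) for c in unit_channels])

    def forward(self, x):
        feats = self.backbone(x)  # list of three feature maps, shapes (B, C_i, H_i, W_i)
        probs = []
        for f, head in zip(feats, self.heads):
            pooled = f.mean(dim=(2, 3))                       # global average pooling
            probs.append(torch.softmax(head(pooled), dim=-1))
        return torch.stack(probs)  # (3, B, num_classes): per-unit class probabilities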
Methods
We experimented with different methods: (1) Vanilla Resnet ("Base") [12] predicts the image class based only on the last layer's output, i.e., there is only one modality. (2) Resnet-Add ("Add") concatenates the hidden nodes of the three layers and builds fully connected neural networks (FCNNs) on top of the concatenated features; we tuned the network structure and found that a two layer network with 256 hidden nodes gives the best result. (3) Resnet-Mul ("Mul") multiplicatively combines predictions from the hidden nodes of the three layers. (4) Resnet-MulMix ("MulMix") uses multiplicative modality mixture combination on the three hidden layers, with the default β value of 0.5. (5) Resnet-MulMix* ("MulMix*") is the same as MulMix except that β is tuned between 0 and 1.

Training details
We strictly follow [12] to train Resnet and all other models. Specifically, we use SGD with momentum and a fixed learning rate schedule of {0.1, 0.01, 0.001}, terminating at 80,000 iterations. We use a batch size of 100 and a weight decay of 0.0002 on all network weights.
HIGGS [3] is a binary classification problem: distinguish between a signal process which produces Higgs bosons and a background process which does not. The data has been produced using Monte Carlo simulations. We have two feature modalities, low-level and high-level features. The low-level features are 21 kinematic properties measured by the particle detectors in the accelerator. The high-level features are another 7 features that are functions of the first 21; they were derived by physicists to help discriminate between the two classes. Details of the feature names are in the original paper [3]. We follow the setup in [3] and use the last 500,000 examples as a test set.

HIGGS-small and HIGGS-full

To investigate the algorithm behaviors under different data scales, we also randomly down-sample 1/3 of the examples from the entire training split. This creates another subset which we call "HIGGS-small."
Network architecture
We use feed-forward deep networks on each modality. We follow [3] in using 300 hidden nodes in our networks and tried different numbers of layers. L2 weight decay is used with coefficient 0.00002. Dropout is not used, as it hurts performance in our experiments. Network weights are randomly initialized, and SGD with 0.9 momentum is used during training.
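To make the setup concrete, here is a brief sketch of the two per-modality branches. The text fixes only the 300 hidden nodes and the 21/7 feature split, so the number of hidden layers below is an illustrative assumption.

import torch
import torch.nn as nn


def branch(in_dim, hidden=300, depth=3, num_classes=2):
    """A plain feed-forward branch with `depth` hidden layers of `hidden` units."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, num_classes))
    return nn.Sequential(*layers)


class HiggsMul(nn.Module):
    def __init__(self):
        super().__init__()
        self.low = branch(21)   # kinematic features measured by the particle detectors
        self.high = branch(7)   # physicist-derived high-level features

    def forward(self, v_low, v_high):
        p_low = torch.softmax(self.low(v_low), dim=-1)
        p_high = torch.softmax(self.high(v_high), dim=-1)
        return torch.stack([p_low, p_high])  # (2, B, 2), consumed by the multiplicative layer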
Methods
We experimented with single modality prediction, late fusion of the two modalities, and the modality combination methods described above for CIFAR100.
Gender prediction

The dataset we use contains 7.5 million users from the Snapchat app, with registered userids and user activity logs (e.g., story posts, friend networks, messages sent, etc.). It also contains inferred user first names produced by an internal tool. The task is to predict user gender. Users' inputs in the Bitmoji app are used as the ground truth. We randomly shuffle the data and use 6 million samples for training, 500K for development, and 1 million for testing.

There are three modalities in this dataset: the userid as a short text, the inferred user first name as a letter string, and dense features extracted from user activities (e.g., the count of messages sent, the number of male or female friends, the count of stories posted).
Gender-6 and Gender-22
We experimented with two versions of the dataset. The versions differed in the richness of the user activity features: the first has 6 highly engineered features (gender-6) and the other has 22 features (gender-22).
Network architecture
We use FCNNs to model the dense features. We tune the architecture and eventually use a 2 layer network with 2000 hidden nodes. We use character based Long Short-Term Memory networks (LSTMs) [20] to model the text strings. The text string is fed into the network one character at a time, and the hidden representation of the last character is connected with FCNNs to predict the gender. We find that vanilla single layer LSTMs outperform or match other variants, including multi-layer, bidirectional [14], and attention-based LSTMs; we believe this is due to the fact that we have a sufficiently large amount of data. We also experimented with character based Convolutional Neural Networks (char-CNN) [26, 48] and CNNs+LSTMs for text modeling and found that LSTMs perform slightly better.

Training details: our tuned LSTM has 1 layer with hidden size 512. It is trained with ADAM [27] using learning rate 0.0001 and learning rate decay 0.99. Gradients are truncated at 5.0. We stop model training when there is no improvement on the development set for a number of consecutive evaluations.

Table 2: Test error rates/AUC comparisons on CIFAR100, HIGGS, and gender tasks. MulMix uses the default β value 0.5. MulMix* tunes β between 0 and 1. Experimental results are from 5 random runs. The best and second best results in each row are in bold and italic, respectively.

                           Base   Fuse   Add [38]   Mul   MulMix   MulMix*
cifar100, resnet-32   Err  30.3
cifar100, resnet-110  Err  26.5
higgs-small           Err  23.3   22.5   22.3
                      AUC  84.8   85.9   86.2
higgs-full            Err  21.7   20.6   20.0
                      AUC  86.6   88.0   88.6
gender-6              Err  15.4   7.97   6.07
gender-22             Err  10.1   5.15   3.85

Methods
In addition to the methods described for CIFAR100 and HIGGS, we also experimented with an attention based combination method [34].
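For illustration, the two kinds of per-modality branches described above could be sketched as follows. The vocabulary size, character embedding size, and the single linear head on the final LSTM state are assumptions not specified in the text; sequences are assumed to be unpadded so that the last time step corresponds to the last character.

import torch
import torch.nn as nn


class CharLSTMBranch(nn.Module):
    """Character-level LSTM over a string modality (userid or inferred first name)."""
    def __init__(self, vocab_size=128, embed_dim=32, hidden=512, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, char_ids):                 # (B, T) integer character ids
        out, _ = self.lstm(self.embed(char_ids))
        last = out[:, -1]                        # hidden representation of the last character
        return torch.softmax(self.head(last), dim=-1)


class DenseBranch(nn.Module):
    """FCNN over the engineered activity features (22 features for gender-22)."""
    def __init__(self, in_dim=22, hidden=2000, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)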
The test error and AUC comparisons are reported in Table 2.
CIFAR100
Compared to the vanilla Resnet model (Base), additive modality combination (Add) does not necessarily improve the test error. In particular, it helps Resnet-32 but not Resnet-110; this might be due to overfitting on Resnet-110, as it already has many more parameters. Multiplicative training (Mul), on the contrary, helps reduce error rates for both models, demonstrating a better capability of extracting signals from the different modalities. Further, MulMix and MulMix*, which are designed to combine the strengths of additive and multiplicative combination, give a significant boost in accuracy on both models.
HIGGS
Both the fusion model and additive combination give significant error rate reductions compared to a single modality. This is expected, as it is very intuitive to aggregate the low- and high-level feature modalities. Compared to Add, multiplicative combination has clearly better results on higgs-small but slightly worse results on higgs-full. This can be explained by the fact that models are more prone to overfitting on smaller datasets, and multiplicative training does reduce the overfitting. Finally, MulMix and MulMix* give a significant boost on both the small and the full dataset.
Gender
Combining multiple modalities gives the most dramatic improvements here, due to the high level of noise in each modality. Add achieves less than half the error rate of the best single modality.

As the new approach, especially MulMix (or MulMix*), introduces additional parameters in the fusion networks of each mixture, one natural question is whether the improvements simply come from the increased number of parameters. We answer this question by running additional experiments with additive combination (Add) models with deeper fusion networks.

The results are plotted in Figure 2. We see that on CIFAR100 and gender, networks with increased depth lead to worse results; this can be due either to increased optimization difficulty or to overfitting. On HIGGS, increased depth first leads to slight improvements and then the error rates go up again. Even the results at the optimal network depth are not as good as those of our approach. Overall, the figures show that it is the design rather than the depth of the fusion networks that holds back their performance. In contrast, our approaches are explicitly designed to extract signals selectively and collectively across modalities.

Figure 2: Comparisons to results from deeper networks. Error rates and standard deviations from fusion networks with different hidden layer structures are reported and compared to our models (i.e., MulMix and MulMix*). Simply going deeper in networks does not necessarily help improve generalization. Experimental results are from 5 random runs. (a) CIFAR100 (resnet110); (b) HIGGS (full); (c) gender (22).

Table 3: Error rate results of boosted training (MulMix) on HIGGS-full and gender-22.

              β*     β = 1.0 (vanilla)   β = 1.0 (boosted)
HIGGS (full)  19.5
gender-22
Our loss function in (8) implements a trade-off between model averaging and multiplicative combination. β = 0 makes it a model averaging of different modalities (or modality mixtures), while β = 1 makes the model a non-smoothed multiplicative combination. To understand the exact working mechanism and to achieve the best results, we tune β between 0 and 1 and plot the corresponding error rates on the different tasks in Figure 3.

We observe that the optimal results do not appear at either end of the spectrum. On the contrary, smoothed multiplicative combinations with optimal β achieve significantly better results than pure ensembling or pure multiplicative combination. On CIFAR100 and HIGGS, we see optimal β values of 0.3 and 0.8, respectively, and they are consistent across the Mul and MulMix models. On gender, Mul clearly favors β close to 1, as each single modality is very noisy and it makes less sense to evenly average predictions from different modalities.

We do not have a clear theory for how to choose β automatically. Our hypothesis is that smaller β leads to stronger regularization (due to smoothed scaling factors) while larger β gives more flexibility in modeling (highly non-linear combination). As a result, we recommend choosing smaller β when the original models overfit and larger β when the models underfit.

Figure 3: Error rates and standard deviations under different β values. Optimal results do not appear at either 0 or 1. Experimental results are from 5 random runs. (a) CIFAR100 (resnet110); (b) HIGGS (full); (c) gender (22).

Table 4: Gender-22 error analysis: mistakes that multimodal methods make where individual modalities do not (which we call "over-learning"); Mul and MulMix* improve on Add in this respect. The overall improvement is very close to the improvement on the "over-learn" examples.

             Add    Mul    MulMix*   improve
overall      3.85   3.83   3.66      0.19
Over-learn   2.90   2.87   2.72      0.18

We also validate the effectiveness of the boosted training technique. We find that when β in (8) is not tuned, boosted training significantly improves the results. Table 3 shows MulMix (β = 1.0) test errors on HIGGS and gender. Boosted training helps MulMix (β = 1.0) achieve almost identical results to MulMix*. It is interesting to see that the second and fourth columns have very close numbers. We conjecture that the smoothing effect of β makes the "mismatch" issue discussed in Section 3.2 less severe.

We are also interested in seeing where the improvements are made on this prediction task. It is known that ensemble-like methods help correct predictions on examples where individual classifiers make wrong predictions. However, they also make mistakes on examples where individual classifiers are correct. This is in general due to overfitting, and we call it "over-learning." We expect our methods to reduce "over-learning" errors due to their regularization mechanism: we tolerate incorrect predictions from a weak modality while preserving its correct predictions. We analyze the errors only on the examples where individual modalities could make correct predictions; more precisely, we evaluate the errors on the examples for which at least one single modality predicts correctly.

The result is reported in Table 4. We see that Mul and MulMix* both make fewer "over-learning" mistakes. Interestingly, the improvement of MulMix* here (0.18) is very close to the improvement on the entire dataset. This suggests our new methods do prevent individual modalities from "over-learning."
Compared to attention models
We also tried attention methods [34], where attention modules are applied to each modality before the modalities are additively combined. We experimented on gender prediction because missing modalities are most common on this task. The results are reported in Table 5. We do not observe clear improvements.

Table 5: Test errors of attention models on gender tasks.

            Add    Add-Attend
gender-6    6.07
gender-22

Compared to CLDL [24] on CIFAR100

Specific to the image recognition domain, CLDL [24] is one specialization of our "Mul" approach based on NIN [29]. We implemented CLDL on Resnet. The error rates on the two models are 29.6 ± … and … , respectively.

7 Conclusion

This paper investigates new ways to combine multimodal data that account for the heterogeneity of signal strength across modalities, both in general and at a per-sample level. We focus on addressing the challenge of "weak modalities": some modalities may provide better predictors on average, but worse for a given instance. To exploit these facts, we propose multiplicative combination techniques that tolerate errors from the weak modalities and help combat overfitting. We further propose multiplicative combination of modality mixtures to combine the strengths of the proposed multiplicative combination and existing additive combination. Our experiments on three different domains demonstrate consistent accuracy improvements over the state of the art in those domains, thereby demonstrating that our new framework represents a general advance that is not limited to a specific domain or problem.
References

[1] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010.
[2] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[3] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
[4] S. S. Bucak, R. Jin, and A. K. Jain. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1354–1369, 2014.
[5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4960–4964. IEEE, 2016.
[6] S. Changpinyo, K. Liu, and F. Sha. Similarity component analysis. In Advances in Neural Information Processing Systems, pages 1511–1519, 2013.
[7] T. G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer, 2000.
[8] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
[9] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 221–228. IEEE, 2009.
[10] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In Advances in Neural Information Processing Systems, pages 472–478, 1996.
[11] M. Glodek, S. Tschechne, G. Layher, M. Schels, T. Brosch, S. Scherer, M. Kächele, M. Schmidt, H. Neumann, G. Palm, et al. Multiple classifier systems for the classification of audio-visual emotional states. In Affective Computing and Intelligent Interaction, pages 359–368. Springer, 2011.
[12] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, 2017.
[13] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
[14] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
[15] H. Gunes and M. Piccardi. Affect recognition from face and body: early fusion vs. late fusion. In Systems, Man and Cybernetics, 2005 IEEE International Conference on, volume 4, pages 3437–3443. IEEE, 2005.
[16] M. Gurban, J.-P. Thiran, T. Drugman, and T. Dutoit. Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition. In Proceedings of the 10th International Conference on Multimodal Interfaces, pages 237–240. ACM, 2008.
[17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 241–248. ACM, 2016.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] A. Jaiswal, E. Sabir, W. AbdAlmageed, and P. Natarajan. Multimedia semantic integrity assessment using joint embedding of images and text. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1465–1471. ACM, 2017.
[22] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars. Guiding the long-short term memory model for image caption generation. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2407–2415. IEEE, 2015.
[23] Q. Jin and J. Liang. Video description generation using audio and visual cues. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages 239–242. ACM, 2016.
[24] X. Jin, Y. Chen, J. Dong, J. Feng, and S. Yan. Collaborative layer-wise discriminative learning in deep neural networks. In European Conference on Computer Vision, pages 733–749. Springer, 2016.
[25] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al. Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2):99–111, 2016.
[26] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[29] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
[31] K. Liu and P. Natarajan. A batch learning framework for scalable personalized ranking. arXiv preprint arXiv:1711.04019, 2017.
[32] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 1–9. IEEE Computer Society, 2015.
[33] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[34] S. Moon, L. Neves, and V. Carvalho. Multimodal named entity recognition for short social media posts. arXiv preprint arXiv:1802.07862, 2018.
[35] E. Morvant, A. Habrard, and S. Ayache. Majority vote of diverse classifiers for late fusion. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 153–162. Springer, 2014.
[36] A. V. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy. A coupled HMM for audio-visual speech recognition. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 2, pages II–2013. IEEE, 2002.
[37] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Moddrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1692–1706, 2016.
[38] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
[39] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329–2336, 2014.
[40] G. A. Ramirez, T. Baltrušaitis, and L.-P. Morency. Modeling latent discriminative dynamic of multi-dimensional affective signals. In Affective Computing and Intelligent Interaction, pages 396–406. Springer, 2011.
[41] E. Shutova, D. Kiela, and J. Maillard. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170, 2016.
[42] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402. ACM, 2005.
[43] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE, 2015.
[44] D. Wang, P. Cui, M. Ou, and W. Zhu. Deep multimodal hashing with orthogonal regularization. In IJCAI, volume 367, pages 2291–2297, 2015.
[45] M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan. Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling. In Proc. INTERSPEECH 2010, Makuhari, Japan, pages 2362–2365, 2010.
[46] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[47] S. Yang and D. Ramanan. Multi-scale recognition with DAG-CNNs. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 1215–1223. IEEE, 2015.
[48] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 2015.