A Universal Model for Cross Modality Mapping by Relational Reasoning
Zun Li, Congyan Lang, Liqian Liang, Tao Wang, Songhe Feng, Jun Wu, Yidong Li
Abstract—With the aim of matching a pair of instances from two different modalities, cross modality mapping has attracted growing attention in the computer vision community. Existing methods usually formulate the mapping function as the similarity measure between the pair of instance features, which are embedded into a common space. However, we observe that the relationships among the instances within a single modality (intra relations) and those between the pair of heterogeneous instances (inter relations) are insufficiently explored in previous approaches. Motivated by this, we redefine the mapping function with relational reasoning via graph modeling, and further propose a GCN-based Relational Reasoning Network (RR-Net) in which inter and intra relations are efficiently computed to universally resolve the cross modality mapping problem. Concretely, we first construct two kinds of graphs, i.e., Intra Graph and Inter Graph, to respectively model intra relations and inter relations. Then RR-Net updates all the node features and edge features in an iterative manner for learning intra and inter relations simultaneously. Finally, RR-Net outputs the probabilities over the edges which link a pair of heterogeneous instances to estimate the mapping results. Extensive experiments on three example tasks, i.e., image classification, social recommendation and sound recognition, clearly demonstrate the superiority and universality of our proposed model.
Index Terms—Cross modality mapping, Graph modeling, Relational reasoning, GCN
I. INTRODUCTION
With the explosive growth of multimedia information, cross modality mapping has attracted much attention in the computer vision community; its goal is to accurately associate a pair of instances from two different modalities. This research topic has shown great potential in many applications, such as image caption generation [1][2], visual question answering [3][4][5], dimension reduction [6][7], and domain adaptation [8], to name a few. Existing cross modality mapping methods rely on the similarity measure between a pair of instances from two modalities to formulate the mapping function. Therefore, how to learn a discriminative feature embedding to represent such similarity for the pair of instances (e.g., image vs. label in image classification) plays a crucial role in the conventional cross modality mapping task. In terms of this, most approaches [9][10][11][12][13][14][15][16][17][18] learn the embedding by projecting two modalities into a common latent space.

Zun Li, Congyan Lang, Liqian Liang, Tao Wang, Songhe Feng, Jun Wu and Yidong Li are with the School of Computer and Information Technology at Beijing Jiaotong University, Beijing, 100044, China. E-mail: ([email protected], [email protected], [email protected], [email protected], [email protected], [email protected], and [email protected]). Congyan Lang is the corresponding author.
Fig. 1: Illustration of latent-common-space feature embedding methods (a)(b) and the proposed method (c) for cross modality mapping.

Early studies [9][10][11][12][13][14] only employ linear projection without taking any account of the intrinsic relations among the instances within each single modality (hereinafter called intra relations) or between heterogeneous instances from two modalities (hereinafter called inter relations), as shown in Fig. 1 (a). Recent studies [15][16][17][18] turn to incorporating the intra relations for learning the common space embedding by designing two bimodal auto-encoders, as shown in Fig. 1 (b). However, this line of approaches pays less attention to inter relations, which are critical for supplementing the intra-modality information. There emerge several other methods [19][20] that separately investigate intra relations and inter relations for learning the embedding in a specific task, i.e., the image-text retrieval task. Nevertheless, since their inter relations are learned without the help of intra relation information from the two modalities, their performance is still heavily limited by the heterogeneous gap between different data modalities.

Above all, it is critical to explore the intra relations and inter relations simultaneously in a more effective manner for the problem of cross modality mapping. For any two modalities, we observe that intra relations can be modeled as a structural relationship among instances within a single modality, while inter relations can be seen as a reasoning relationship between the pair of instances from the two modalities. Based on this observation, we naturally leverage graphs to model these two relationships, where each instance from the two modalities is treated as a node. Specifically, two kinds of edges, named intra-edges and inter-edges, are employed to respectively represent intra relations and inter relations. Thus we redefine the mapping function via relational reasoning instead of the standard similarity measure, which can be implemented by estimating the existence of the inter-edges. To the best of our knowledge, unlike task-specific previous arts, we are the first to resolve cross modality mapping with relational reasoning and to consider a universal solution that can jointly represent the intra relations and inter relations via graph modeling.

In this work, inspired by the superiority of the graph convolutional network (GCN), we propose a GCN-based Relational Reasoning Network (RR-Net), a universal model to resolve the problem of cross modality mapping. Concretely, we first construct two kinds of graphs: Intra Graph and Inter Graph. Intuitively, the former includes two graphs lying in each modality, while the latter links the instances across the two modalities, as shown in Fig. 1 (c). Each Intra Graph takes every instance from the same modality as a node (intra-node) and assigns intra-edges via a clustering algorithm, e.g., KNN. As for the Inter Graph, the inter-edges link candidate pairs from the two modalities with high initial confidence according to the specific task. On top of the constructed graphs, our RR-Net first employs an encoder to map the raw features to a desired space, then simultaneously learns all the node features and edge features in an iterative manner via the core component, i.e., a relational GCN module implemented by stacking several GCN units.
Finally, RR-Net utilizes a decoder to output the probabilities over the inter-edges to search for the most likely cross modality mapping pairs. Note that we derive two kinds of GCN units corresponding to the Intra Graph and the Inter Graph, i.e., the intra GCN unit and the inter GCN unit, each containing one edge convolutional layer (intra- or inter-edge layer) and one node convolutional layer (intra- or inter-node layer). The difference between the two kinds of GCN units lies in the intra- and inter-edge layers, which are exploited respectively for learning the intra relations and inter relations. In particular, the inter-edge layer takes the output of its former intra-node layer as input and utilizes a weight matrix as a kernel when performing aggregation of inter-edge features.

We conduct extensive experiments on three tasks, i.e., sound recognition, image classification and social recommendation, to verify the universality and effectiveness of our proposed model. The main contributions of this paper are summarized as follows:
• We are the first to resolve cross modality mapping with relational reasoning and propose a task-agnostic universal solution to learn both intra and inter relations simultaneously via graph modeling.
• We propose a GCN-based Relational Reasoning Network (RR-Net) to jointly learn all the node and edge features with multiple intra and inter GCN units.
• On several different cross modality mapping tasks with public benchmark datasets, the proposed RR-Net improves the performance significantly over state-of-the-art competitors.
II. RELATED WORK
In this section, we first give a brief review of approaches to cross modality learning in Sec. II-A. We then introduce studies of relational reasoning and graph neural networks that are closely related to this work in Sec. II-B and Sec. II-C, respectively.
A. Cross Modality Learning
Most of the existing cross modality algorithms can be classified into two categories, that is, joint embedding learning and coordinated embedding learning. Below, we briefly review these two categories of approaches.
1) Joint Embedding Learning:
This kind of method embeds data from two modalities together into a common feature space and performs the cross modality similarity measure there. Studies [11][12][21][22] directly concatenate the features of different modalities to form the common feature space. Unlike such straightforward methods, some approaches [23][19][24][25][26][9][10][13][27] first convert all of the modalities into different representations and then concatenate multiple representations together into a joint feature space. For example, Ngiam et al. [26] stacked several auto-encoders for individually learning the representation of each modality, and then fused those representations into a common embedding space. Srivastava et al. [24] introduced a multimodal DBM to fuse multimodal representations. Following the DBM, Suk et al. [25] utilized the multi-modal DBM representation to perform Alzheimer's disease classification from positron emission tomography and magnetic resonance imaging data. Afterwards, Wang et al. [28] jointly learned several projection matrices to map multi-modal data into a common subspace and measured the similarities of different data modalities. Recently, Wu et al. [27] factorized images and their descriptions into different levels to learn a joint space of visual representation and textual semantics. However, these approaches only consider the common feature space embedding for each modality and ignore the structural interactions between the two modalities, and thus they lack the capacity to represent complicated heterogeneous modality data.
2) Coordinated Embedding Learning:
Instead of projecting the data modalities into a joint space, coordinated embedding learning methods separately learn the representations for each modality but coordinate them through a constraint, typically using metric learning [29], linear transfer [16][18][20], a margin ranking loss [15][17], a pairwise similarity loss [30], etc. For instance, Andrew et al. [16] mapped the multi-modal features into a shared space by learning two linear transforms, and jointly maximized the correlation across the two modalities to compute their similarities. WSABIE [15] and DeViSE [17] learned to linearly transform both image and text features into a joint feature space with a margin ranking loss. Yu et al. [30] proposed a dual-path neural network model to learn both image and text feature representations and then learned their correlation with a pairwise similarity loss. Although these approaches have achieved great improvements in learning the cross modality mapping, they merely consider intra relations within a single modality and ignore the inter relations between two modalities.
Fig. 2: Illustration of the overall framework. We first construct three graphs: two Intra Graphs and one Inter Graph. Our RR-Net then models the graphs with two components: an encoder-decoder and the core Relational GCN Module. The Relational GCN Module is implemented by stacking two kinds of GCN units, i.e., the intra GCN unit and the inter GCN unit, in an iterative manner. Finally, RR-Net feeds the updated inter-edge features into a decoder to produce a set of probabilities used to search for the most likely cross modality mapping pairs.

Moreover, their learned representations lack distinctiveness and comprehensiveness, leading to a severe degradation of performance. In [20], Huang et al. proposed a joint embedding model to combine social relations for representation learning of multimodal contents. However, this method is particularly designed for social images and is not suitable for other types of data media. In this work, we represent each data modality as a graph, and profoundly mine both the intra and inter relations by jointly learning the node features and edge features of different data modalities.
B. Relational Reasoning
Relational reasoning aims to infer certain relationships between different entities. It plays an important role in many computer vision tasks such as activity recognition [31], text detection [32], video understanding [33], and visual question answering [34][35].
For learning the intuitive interactions between entities, many relational approaches [32][33][36][35][37][38][39] have been developed. For example, Zhang et al. [32] reasoned about the linkage relationships between text components by exploiting a spectral-based graph convolutional network. Zhou et al. [33] designed a temporal relational network (TRN) to reason about the interactions between video frames at varying scales. Yi et al. [39] disentangled reasoning from image and language understanding by first extracting symbolic representations from images and text, and then executing symbolic programs over them. Gao et al. [37] dynamically fused visual features and question words with intra- and inter-modality information flow, reasoning about their relations by alternately passing information within and across modalities. These relational reasoning methods are commonly split into two stages: the first extracts structured sets of representations that are intended to correspond to entities in the raw data, while the second utilizes those representations to reason about their intrinsic relationships.

Our work mainly focuses on how to utilize the raw representations of data modalities to model both intra and inter relations in cross modality mapping. For any two data modalities, once we model the intra relations as a structural relationship among instances within a single modality, and view the inter relations as a reasoning relationship between the pair of instances from the two modalities, the problem of cross modality mapping can be cast as a structural and relational reasoning problem. To our knowledge, we are the first to reason about the structural relations within both single and multiple data modalities simultaneously; such relations are common and important, yet ignored by most existing cross modality mapping studies. With the exploitation of structural relations, our model is able to learn the mapping relation between different data modalities more comprehensively.
C. Graph Neural Network
Recently, graph neural networks [40][41][42][43][44], especially the graph convolutional network (GCN), have made obvious progress because of their expressive power in handling graph relational structures. A GCN can express complex interactions among data instances by performing feature aggregation from neighbors via message passing. Studies [45][46] learned visual relationships among images by applying graph reasoning models. In [47], Michael et al. proposed a relational GCN to learn a specific contextual transformation for each relation type. Chen et al. [18] decomposed data modalities into hierarchical semantic levels and generated the corresponding embeddings via a hierarchical graph reasoning network. More recently, Wang et al. [48] proposed a spectral-based GCN to solve the problem of clustering faces, where the designed GCN can rationally link different face instances belonging to the same person in
complex situations. Motivated by these studies, we model cross modality mapping with relational reasoning via graph modeling, representing each data modality as an Intra Graph and constructing an Inter Graph on top of those Intra Graphs. Upon these graphs, we further propose a GCN-based relational reasoning network in which inter and intra relations are efficiently learned to universally resolve the cross modality mapping problem.
III. METHODOLOGY
In this section, we first present the overall architecture of the proposed network in Sec. III-A. Then we give some preliminaries as well as the schemes of graph construction for our method in Sec. III-B. Details of the proposed RR-Net are introduced in Sec. III-C. Finally, we describe the loss function for training RR-Net in Sec. III-D.
A. Framework Overview
Fig. 2 shows the information flow of our proposed method. Two Intra Graphs and one Inter Graph are first initialized on top of the raw representations extracted for each data modality. Taking these constructed graphs as inputs, RR-Net then transfers the node and edge attributes of the graphs into latent representations through the encoder module. Next, RR-Net updates the representations by jointly learning the node features and edge features in an iterative manner via the core relational GCN module. After that, we cast the updated edge features into the decoder to produce a set of probabilities over the inter-edges, with the goal of obtaining the most likely cross modality mapping pairs.
B. Graph Construction
1) Preliminary:
Generally, an attributed graph can be represented as G = (V, E, ν, ε), where
• V = {v_1, ..., v_n} denotes the node set, in which n is the number of nodes;
• E ⊆ V × V denotes the edge set;
• ν = {v_i | v_i ∈ R^{d_V}, i = 1, ..., n} denotes the node attribute set, where d_V indicates the dimension of node features;
• ε = {e_i | e_i ∈ R^{d_E}, i = 1, ..., |E|} denotes the edge attribute set, where d_E indicates the dimension of edge features and |E| refers to the number of edges.
Given two modalities, we represent them as two Intra Graphs, i.e., G_1 = (V_1, E_1, ν_1, ε_1) and G_2 = (V_2, E_2, ν_2, ε_2). On the basis of G_1 and G_2, we further construct an Inter Graph G_A = (V_A, E_A, ν_A, ε_A). Our goal is to infer a probability set P over E_A to predict whether the candidate pairs exist.
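For concreteness, the following minimal sketch (our own illustrative code, not the authors' implementation) shows one way such an attributed graph could be held in memory; the class and field names are assumptions.

```python
from dataclasses import dataclass
import torch


@dataclass
class AttributedGraph:
    """Container for an attributed graph G = (V, E, nu, eps)."""
    edges: torch.Tensor      # (|E|, 2) integer tensor of (sender, receiver) node indices
    node_attr: torch.Tensor  # (n, d_V) node attribute matrix, one row per node
    edge_attr: torch.Tensor  # (|E|, d_E) edge attribute matrix, one row per edge

    @property
    def num_nodes(self) -> int:
        return self.node_attr.shape[0]

    @property
    def num_edges(self) -> int:
        return self.edges.shape[0]
```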
2) Intra Graph Construction:
Given the raw feature representations of all instances in each single modality (extracted from a pre-trained network, e.g., a CNN [49][50] for visual features or word2vec [51] for text features), we initialize the Intra Graph G_t (t = 1, 2) by treating each instance i as an intra-node v_i^t, and then generate intra-edges using task-dependent strategies, such as the KNN algorithm in image classification (Sec. IV-B) and the natural social relations in recommendation systems (Sec. IV-C). Raw intra-node attributes v_i^t are directly derived from the raw feature representations of instance i, while intra-edge attributes e_i^t are initialized by concatenating the attributes of the two associated intra-nodes, i.e., e_i^t = v_{s_i}^t © v_{r_i}^t, where s_i and r_i denote the sender node and the receiver node respectively, and © denotes the concatenation operation.
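As an illustration, a KNN-based intra-edge construction consistent with this description might look as follows; this is a sketch reusing the AttributedGraph container above, and the cosine similarity metric is our assumption, since the paper does not specify the distance used by KNN.

```python
import torch
import torch.nn.functional as F


def build_intra_graph(feats: torch.Tensor, k: int) -> AttributedGraph:
    """Link each instance to its k nearest neighbours and initialise every
    intra-edge attribute as the concatenation of its endpoint attributes."""
    normed = F.normalize(feats, dim=1)
    sim = normed @ normed.t()                      # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-loops
    nbrs = sim.topk(k, dim=1).indices              # (n, k) neighbour indices
    senders = torch.arange(feats.size(0)).repeat_interleave(k)
    receivers = nbrs.reshape(-1)
    edges = torch.stack([senders, receivers], dim=1)
    edge_attr = torch.cat([feats[senders], feats[receivers]], dim=1)  # e_i = v_s (c) v_r
    return AttributedGraph(edges=edges, node_attr=feats, edge_attr=edge_attr)
```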
3) Inter Graph Construction:
On top of the two Intra Graphs G_1 and G_2, we construct an Inter Graph G_A to model the inter relations between the two heterogeneous modalities. Specifically, we take all the nodes of the two Intra Graphs as the inter-node set V_A = V_1 ∪ V_2. Each inter-node attribute is inherited from the intra-node attribute sets ν_1 and ν_2. For the inter-edge generation, a naive way is to build all edges across the two Intra Graphs G_1 and G_2. However, this strategy not only increases the computational cost and memory burden, but also introduces too much noise for inferring the inter relations between the two modalities. In this paper, for each inter-node v_i^A, we generate only a few inter-edges associated with it, selected with high confidences that are computed according to domain knowledge. Similarly, we represent each inter-edge attribute e_i^A by concatenating the attributes of its associated inter-nodes, e_i^A = v_{s_i}^A © v_{r_i}^A. Each inter-edge indicates a candidate mapping between instances of the two modalities, and we develop a deep graph network to learn to select reliable inter-edges from the built graphs.
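A possible top-K inter-edge selection, assuming the baseline model provides an initial confidence matrix between the two modalities, is sketched below; the function and variable names are illustrative, not the authors' code.

```python
import torch


def build_inter_edges(confidence: torch.Tensor, top_k: int) -> torch.Tensor:
    """confidence[i, j]: the baseline's initial score that instance i of modality 1
    maps to instance j of modality 2.  Keep only the top_k candidates per instance."""
    n1, _ = confidence.shape
    cand = confidence.topk(top_k, dim=1).indices        # (n1, top_k) candidate partners
    senders = torch.arange(n1).repeat_interleave(top_k)
    receivers = cand.reshape(-1) + n1                   # modality-2 nodes are indexed after modality-1 nodes
    return torch.stack([senders, receivers], dim=1)     # candidate inter-edges of G_A
```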
C. RR-Net

Taking the constructed graphs as input, RR-Net learns to form structured representations for all nodes and edges simultaneously via relational reasoning. RR-Net contains two modules: the Encoder-Decoder Module and the core Relational GCN Module, which are elaborated in Sec. III-C1 and Sec. III-C2, respectively.
1) Encoder-Decoder Module:
The encoder module aims to transfer the edge and node attributes of G_1, G_2 and G_A into latent representations, exploiting two parametric update functions Ψ_e and Ψ_v. Similar to the studies in [10][52], we design the two functions as multi-layer perceptrons (MLPs). For each graph, the encoder module updates the attributes by applying Ψ_v to all nodes and Ψ_e to all edges:

v_i^1 ← Ψ_v(v_i^1),  e_i^1 ← Ψ_e(e_i^1),
v_i^2 ← Ψ_v(v_i^2),  e_i^2 ← Ψ_e(e_i^2),
v_i^A ← Ψ_v(v_i^A),  e_i^A ← Ψ_e(e_i^A).    (1)

After that, we pass all the graphs to the subsequent Relational GCN Module for joint learning of intra and inter relations. The decoder module aims to predict a probability vector P ∈ R^{|E_A|} over all the inter-edges. Like the encoder, we employ one MLP, implemented with one parametric update function φ, to transform the inter-edge attributes e^A into the desired space:

P = φ(e^A).    (2)
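A minimal PyTorch sketch of the encoder and decoder under these definitions follows; the hidden sizes and the final sigmoid are our assumptions, as the paper only states that Ψ_v, Ψ_e and φ are MLPs.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class EncoderDecoder(nn.Module):
    def __init__(self, d_node: int, d_edge: int, hidden: int = 16, latent: int = 64):
        super().__init__()
        self.psi_v = mlp(d_node, hidden, latent)                 # node update function Psi_v
        self.psi_e = mlp(d_edge, hidden, latent)                 # edge update function Psi_e
        self.phi = nn.Sequential(mlp(latent, hidden, 1), nn.Sigmoid())  # decoder phi

    def encode(self, graph: "AttributedGraph") -> "AttributedGraph":
        graph.node_attr = self.psi_v(graph.node_attr)            # Eq. (1), applied to all nodes
        graph.edge_attr = self.psi_e(graph.edge_attr)            # Eq. (1), applied to all edges
        return graph

    def decode(self, inter_edge_attr: torch.Tensor) -> torch.Tensor:
        return self.phi(inter_edge_attr).squeeze(-1)             # Eq. (2): P over the inter-edges
```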
2) Relational GCN Module:
This module is the core component of RR-Net, aiming to learn all the node features and edge features simultaneously in an iterative manner.
The Relational GCN Module is implemented by stacking L copies of two kinds of GCN units, the intra GCN unit and the inter GCN unit, which correspond to the Intra Graphs and the Inter Graph respectively. Each GCN unit contains an edge convolutional layer (intra-edge layer or inter-edge layer) and a following node convolutional layer (intra-node layer or inter-node layer). The intra-node layer and inter-node layer are the same in all the GCN units, while the intra-edge layer and inter-edge layer are derived in different forms for learning the intra relations and inter relations respectively, considering the heterogeneous and interconnected characteristics of cross-modality data.

Both kinds of GCN units consist of two steps: message aggregation and message regeneration. The forward propagation of our model alternately updates the intra-node attributes and intra-edge attributes through the intra GCN unit, and then updates the inter-node attributes and inter-edge attributes through the inter GCN unit. Below, we describe the learning process of each unit in detail.

(i) Intra-edge convolutional layer. Taking the Intra Graphs G_t (t = 1, 2) as input, the intra-edge layer first employs an aggregation function φ_intra^e that aggregates the information of the associated nodes for each intra-edge e_i^1 in G_1 and e_j^2 in G_2. Formally, for e_i^1 and e_j^2, we define the message aggregation as:

ê_i^1 ← φ_intra^e(v_{s_i}^1, v_{r_i}^1),  ê_j^2 ← φ_intra^e(v_{s_j}^2, v_{r_j}^2),    (3)

where v_{s_i}^1 ∈ ν_1 and v_{r_i}^1 ∈ ν_1 are the attributes of the two connected nodes of edge e_i^1. Since the nodes in an Intra Graph all come from one single data modality, we design the aggregation function φ_intra^e by directly concatenating the two node attributes associated with the current edge,

φ_intra^e(v_i, v_j) = v_i © v_j,    (4)

where © is the concatenation operation of two vectors. Taking the aggregated information, e.g., ê_i^1 and ê_j^2, as input, the intra-edge layer adopts a regeneration function ζ_intra^e to generate new features and uses them to update the intra-edge attributes:

e_i^1 ← ζ_intra^e(ê_i^1, e_i^1),  e_j^2 ← ζ_intra^e(ê_j^2, e_j^2).    (5)

Like [52], we implement the regeneration function ζ_intra^e as an MLP that outputs an updated intra-edge attribute.

(ii) Inter-edge convolutional layer. This layer updates the inter-edge attributes via two functions: an aggregation function φ_inter^e which incorporates the associated inter-node attributes, and an update function ζ_inter^e that generates a new inter-edge attribute. For each inter-edge e_i^A with its associated sender node v_{s_i}^A and receiver node v_{r_i}^A, we define the operators in the inter-edge layer as:

ê_i^A ← φ_inter^e(v_{s_i}^A, v_{r_i}^A),  e_i^A ← ζ_inter^e(ê_i^A, e_i^A).    (6)

Different from the intra-edge layer, we specify the aggregation function φ_inter^e as:

φ_inter^e(v_{s_i}^A, v_{r_i}^A) = W(v_{s_i}^A © v_{r_i}^A),    (7)

where W is a learnable weight matrix that can be interpreted as a kernel to balance the heterogeneous gap between the two modalities. For the update function ζ_inter^e, we similarly specify it as an MLP that takes the concatenated vector ê_i^A © e_i^A as input and outputs an updated inter-edge attribute.

(iii) Node convolutional layer. Following the edge convolutional layer, the node convolutional layer collects the attributes of all edges adjacent to the centering node to update its attribute. In our model, we design this layer with two functions: an aggregation function φ_v and an update function ζ_v.
Similar to the edge convolutional layer, for each node v_i^k in graph G_k (k ∈ {1, 2, A}), we update its attribute as follows:

v̂_i^k ← φ_v(E_i^k),  v_i^k ← ζ_v(v̂_i^k, v_i^k),    (8)

where E_i^k denotes the set of all edges associated with v_i^k. Similar to the studies [52][10], the aggregation function φ_v is non-parametric, and the update function ζ_v is parameterized by an MLP.
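Under Eqs. (3)–(8), one GCN unit might be sketched as follows (a simplified single-graph view reusing the mlp helper above; mean aggregation over incident edges in the node layer is our assumption, since the paper only states that φ_v is non-parametric, and `inter=True` switches on the kernel W of Eq. (7)):

```python
import torch
import torch.nn as nn


class GCNUnit(nn.Module):
    """One edge convolutional layer followed by one node convolutional layer."""

    def __init__(self, dim: int, inter: bool = False):
        super().__init__()
        self.W = nn.Linear(2 * dim, 2 * dim, bias=False) if inter else None  # kernel of Eq. (7)
        self.zeta_e = mlp(3 * dim, dim, dim)   # edge update, input [e_hat (c) e]
        self.zeta_v = mlp(2 * dim, dim, dim)   # node update, input [v_hat (c) v]

    def forward(self, node_attr, edge_attr, edges):
        s, r = edges[:, 0], edges[:, 1]
        # Edge convolution: aggregate the two endpoint attributes (Eqs. (3)/(4) or (6)/(7)).
        e_hat = torch.cat([node_attr[s], node_attr[r]], dim=1)
        if self.W is not None:
            e_hat = self.W(e_hat)
        edge_attr = self.zeta_e(torch.cat([e_hat, edge_attr], dim=1))          # Eq. (5)/(6)
        # Node convolution: average the attributes of all incident edges (Eq. (8)).
        v_hat = torch.zeros_like(node_attr)
        deg = torch.zeros(node_attr.size(0), 1, device=node_attr.device)
        for idx in (s, r):
            v_hat = v_hat.index_add(0, idx, edge_attr)
            deg = deg.index_add(0, idx, torch.ones(idx.size(0), 1, device=node_attr.device))
        v_hat = v_hat / deg.clamp(min=1.0)
        node_attr = self.zeta_v(torch.cat([v_hat, node_attr], dim=1))
        return node_attr, edge_attr
```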
D. Loss Function

After L iterations of node and edge feature updates, RR-Net outputs the probabilities P ∈ R^{|E_A|} over the inter-edges from the final decoder module, corresponding to the most likely cross modality mapping pairs. Then, given the ground-truth mapping Y ∈ {0, 1}^{|E_A|} of the cross modality data, we evaluate the difference between the predicted mapping P and the annotation Y with a cross-entropy loss:

L = − Σ_{i=1}^{|E_A|} { Y_i log(P_i) + (1 − Y_i) log(1 − P_i) }.    (9)
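In PyTorch, Eq. (9) corresponds directly to a binary cross-entropy with sum reduction (a small sketch; Y is assumed to be given as a 0/1 tensor over the candidate inter-edges):

```python
import torch
import torch.nn.functional as F


def mapping_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (9): summed binary cross-entropy between the predicted inter-edge
    probabilities p and the ground-truth mapping indicators y."""
    return F.binary_cross_entropy(p, y.float(), reduction="sum")
```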
IV. EXPERIMENTS

In this section, we first study the key properties of the proposed RR-Net on the sound recognition task (Sec. IV-A) and the image classification task (Sec. IV-B). To examine whether our proposed model generalizes well to tasks lacking intra relations, we further verify the proposed model on the social recommendation task (Sec. IV-C).
A. Sound Recognition
This task aims to recognize the type of sound events in an audio stream. Here, we verify the effectiveness of RR-Net on learning the mapping between the audio and textual modalities.
1) Dataset:
We evaluate the performance of the proposed RR-Net on the environmental sound recognition task using two datasets with different scales: ESC-10 [53] and ESC-50 [53]. The ESC-50 dataset contains 2000 audio clips of 5 s each. It equally divides all the clips into 50 fine-grained categories in 5 major groups: animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds, and exterior/urban noises. The ESC-10 dataset is a selection of 10 classes from the ESC-50 dataset and comprises 400 audio clips of 5 s each. In our experiments, we divide each dataset into 5 folds and adopt leave-one-fold-out evaluation to compute the mean accuracy. For a fair comparison, we remove completely silent sections in which the value is equal to 0 at the beginning or end of samples, and then convert all sound files to monaural 16-bit WAV files, following [54][55].
TABLE I: Sound recognition accuracy (%) on both the ESC-10 and ESC-50 datasets. § indicates that the method is trained with strong augmentation and between-class samples.
2) Implementation Details:
We first construct the Intra Graphs, i.e., the audio graph G_1 and the text graph G_2, and the Inter Graph G_A, using the schemes in Sec. III. For G_1, we take each audio clip as an intra-node and represent its attribute with the 512-dimensional audio representation extracted from the baseline model, i.e., EnvNet [54]. Similarly, G_2 takes each text label as an intra-node and uses the word2vec model to extract a 300-dimensional text representation as the attribute of each node. For each intra-node, its intra-edges are assigned via the KNN algorithm, which searches for its nearest neighbor nodes. For an inter-node in G_A, we build its connected edges by selecting the top-K nodes that have high initial mapping probabilities according to the baseline model. We empirically set the number of nearest neighbors of each intra-node on ESC-10 as 5 in G_1 and 2 in G_2, and on ESC-50 as 10 in G_1 and 2 in G_2. Besides, the top-K parameter in G_A is set to 20 on ESC-50 and 10 on ESC-10, respectively. We employ one hidden layer in the encoder MLPs, where the number of neurons is empirically set to 16. To better explore the relational reasoning of the graph model, we stack one intra GCN unit and one inter GCN unit, so the total number of GCN units equals 2. To train RR-Net, we use a momentum SGD optimizer and set the initial learning rate to 0.01 on ESC-10 and 0.1 on ESC-50, momentum to 0.9, and weight decay to 5e-4.
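For illustration, the graph construction for ESC-10 with the settings above could be wired up as follows, reusing the sketches from Sec. III; the feature tensors here are random placeholders standing in for EnvNet and word2vec outputs.

```python
import torch

audio_feats = torch.randn(400, 512)      # placeholder: 400 ESC-10 clips, 512-d EnvNet features
text_feats = torch.randn(10, 300)        # placeholder: 10 class names, 300-d word2vec vectors
baseline_scores = torch.randn(400, 10)   # placeholder: initial audio-to-label confidences

audio_graph = build_intra_graph(audio_feats, k=5)            # 5 nearest neighbours per audio node
text_graph = build_intra_graph(text_feats, k=2)              # 2 nearest neighbours per text node
inter_edges = build_inter_edges(baseline_scores, top_k=10)   # top-10 candidate labels per clip
```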
3) Comparison with State-of-the-arts:
We compare our RR-Net with 8 state-of-the-art sound recognition methods, including M18 [56], LESM [57], DMCU [58], EnvNet [54], EnvNet-v2 [59], SoundNet8+SVM [60], AReN [61] and EnvNet-v2 with strong augmentation [59] (denoted EnvNet-v2§). Tab. I reports their accuracy on both the ESC-10 and ESC-50 datasets. We observe that our proposed model achieves the best performance, with 96.5% and 80.8% accuracy on ESC-10 and ESC-50, respectively. It brings a significant improvement of 2.0% on ESC-10 and 1.0% on ESC-50 compared with the second best method DMCU [58]. Note that for EnvNet-v2§, although the authors pre-process the training data with strong augmentation and between-class audio samples, its accuracies are still lower than ours by 7.4% and 2.0% on ESC-10 and ESC-50, respectively. It is also worth noting that, compared with human performance, our model improves the recognition accuracy by 0.8% on ESC-10; since the accuracy achieved by humans is already quite high, this improvement is indeed significant. More specifically, compared with the baseline model EnvNet [54], our model yields accuracy boosts of around 9.0% and 10.0% on ESC-10 and ESC-50, respectively, as shown in Tab. I. These results clearly illustrate the effectiveness of RR-Net for solving the mapping between the audio and textual modalities.

B. Image Classification
The previous section illustrates the effectiveness of our RR-Net on audio-textual modality mapping. In this section, we further verify the universality and effectiveness of our model on learning the mapping between the image and textual modalities. Taking image classification as an example, our aim is not to obtain state-of-the-art results on this task, but to explore the potential accuracy and robustness gains of a universal cross modality mapping model. Below, using different networks, including ResNet18 [49], ResNet50 [49] and MobileNetV2 [62], as baseline models, we first reproduce them for image classification, and then evaluate our model on top of those baselines to illustrate its universality.
1) Dataset:
We adopt two image classification datasets of different scales for our evaluation: CIFAR-10 [63] and CIFAR-100 [63]. CIFAR-10 consists of 60,000 32×32 color images belonging to 10 categories, with 6,000 images per category; it is split into 50,000 training images and 10,000 test images. CIFAR-100 is just like CIFAR-10, except that it has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses, and each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). A data augmentation strategy including random crop and random flipping is used during training, following [49][64].
2) Implementation Details:
Similar to the previous task, we construct two Intra Graphs, the image graph G_1 and the text graph G_2, and one Inter Graph G_A. The Intra Graphs respectively take each image and each text label as an intra-node. G_1 builds the attribute of each intra-node by extracting 512-dimensional image features from the baseline model, while G_2 adopts the word2vec model to extract 300-dimensional textual features for representing the attribute of each node. For each intra-node, we assign its intra-edges using the KNN algorithm to find its connected nodes. For an inter-node in G_A, we build its connected edges by selecting the top-K nodes with high initial probabilities according to the baseline model. We empirically set the number of nearest neighbors for each intra-node on CIFAR-10 as 10 in G_1 and 2 in G_2, and as 20 and 3 for the CIFAR-100 setting. The top-K parameter in G_A is set to 15 on CIFAR-100 and 10 on CIFAR-10, respectively.
TABLE II: Image classification top-1 accuracy (%) of different baseline methods and RR-Net on the CIFAR-10 and CIFAR-100 datasets.
Similar to the previous tasks, we employ one hidden layer in the encoder MLPs, where the number of neurons is empirically set to 16 for image classification. We stack two intra GCN units and three inter GCN units, so the total number of GCN units equals 5. Finally, we train the baselines and RR-Net using the SGD optimizer with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 5e-4, shuffling the training samples.
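As a sketch of the feature extraction described above, 512-dimensional image node attributes can be obtained from a ResNet18 backbone by dropping its classifier head; this uses torchvision and ImageNet weights purely for illustration, whereas the paper first trains the baselines on CIFAR.

```python
import torch
import torchvision

backbone = torchvision.models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()        # remove the classifier; keep the 512-d pooled features
backbone.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 32, 32)   # placeholder batch of CIFAR-sized images
    node_feats = backbone(images)        # shape (8, 512): intra-node attributes for G_1
```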
3) Comparison with Baselines:
Tab. II presents the comparison of top-1 accuracy between our model and the baseline models. We obtain clear improvements over the different baseline models on both datasets. In particular, RR-Net improves the classification accuracy over ResNet18, ResNet50 and MobileNetV2 by 2.59%, 1.68% and 1.79% on CIFAR-10, and by 2.01%, 1.23% and 3.23% on CIFAR-100, respectively. These results clearly demonstrate the effectiveness of the proposed model for image-textual modality mapping. Tab. II shows that better image classification performance can be achieved by using stronger backbones such as ResNet50 and MobileNetV2, but thanks to the relational reasoning ability of RR-Net, our model further improves their performance by a large margin. Besides, when different baselines with different feature representation quality are used to initialize the graphs in our model, RR-Net consistently improves their accuracy, demonstrating the generalization ability of our proposed model.
C. Social Recommendation
We further verify the generalization of our model on social recommendation, which aims to provide personalized item suggestions to each user according to the user-item rating records. In this task, a social network over users is explicitly provided along with the rating records, but no relation information exists between items. Thus, we evaluate our model on this task under a lack of intra relations.
1) Dataset:
We evaluate the performance of our model on the social recommendation task using two public datasets: FilmTrust [65] and Ciao. Details of these two datasets are presented in Tab. III. For Ciao, we filter out all the user nodes and item nodes whose id length is larger than 99999, since these are confidently unreliable id records, and we accordingly remove their connected social links and rating links. To ensure the generalization of our model, we randomly split each dataset into training, validation and testing sets. The final performance is obtained by averaging the results of five runs of joint training and testing on the corresponding dataset.

TABLE III: Dataset statistics. The rating information exists in R and the social information is available in S.

TABLE IV: Comparison of MAE and RMSE values on the FilmTrust and Ciao datasets. Our method consistently achieves better performance than the previous state-of-the-art approaches.

Methods          FilmTrust [65]      Ciao
                 MAE↓    RMSE↓       MAE↓    RMSE↓
SoReg [66]       0.674   0.878       1.306   1.547
SVD++ [67]       0.659   0.846       0.844   1.188
SocialMF [68]    0.638   0.837       0.946   1.254
TrustMF [69]     0.650   0.833       0.937   1.212
TrustSVD [70]    0.649   0.832       0.925   1.202
LightGCN [71]    0.669   0.893       0.796   1.037
RR-Net (Ours)    0.646   0.824       0.825   1.050
2) Evaluation Metrics: We evaluate our model with two widely used metrics, namely the mean absolute error (MAE) and the root mean square error (RMSE). Formally, these metrics are defined as:

MAE = ( Σ_{u,j} |r̂_{u,j} − r_{u,j}| ) / N,
RMSE = sqrt( Σ_{u,j} (r̂_{u,j} − r_{u,j})² / N ),    (10)

where r_{u,j} ∈ R is the rating record of user u on item j, r̂_{u,j} is the predicted rating of user u on item j, and N is the number of rating records. Smaller values of MAE and RMSE indicate better performance.
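A direct implementation of Eq. (10) (a small sketch):

```python
import torch


def mae_rmse(pred: torch.Tensor, target: torch.Tensor):
    """Eq. (10): mean absolute error and root mean square error over N rating records."""
    diff = pred - target
    mae = diff.abs().mean()
    rmse = diff.pow(2).mean().sqrt()
    return mae.item(), rmse.item()
```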
3) Implementation Details:
To predict the potential rating between a specific user and item, we again construct two Intra Graphs, the user graph G_1 and the item graph G_2, and one Inter Graph G_A on each of the two datasets. For the intra-nodes, G_1 and G_2 take each user and each item in the dataset, respectively. We then generate the embedding of each user and item from a uniform distribution over [0, 1) with dimension 128, which is used to represent the attribute of each intra-node in G_1 and G_2. The intra-edges in G_1 follow the users' social relations S provided by the dataset. Since items are represented only by a set of id records, no relations between items can be exploited, so we build G_2 without any intra-edges. For the Inter Graph G_A, we build inter-edges by linking each user to all the items on FilmTrust, while on Ciao we build inter-edges by linking each item to the users appearing in its rating links R. Similar to the sound recognition task, one hidden layer with 16 neurons is used in the encoder MLPs of RR-Net. Two intra GCN units and three inter GCN units are stacked for exploring the relational reasoning in our Relational GCN Module. Finally, we train our model using the SGD optimizer with an initial learning rate of 0.01, momentum of 0.9, weight decay of 5e-4, and 500 epochs on both datasets until convergence.
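For illustration, the node attributes and Ciao-style inter-edges described above could be initialized as follows; the sizes and rating pairs are placeholders, not dataset statistics.

```python
import torch

num_users, num_items = 1000, 2000
user_attr = torch.rand(num_users, 128)   # uniform in [0, 1), 128-d, as described above
item_attr = torch.rand(num_items, 128)

# On Ciao, inter-edges follow the observed rating records R; item indices are offset after users.
ratings = torch.tensor([[0, 3], [0, 17], [5, 3]])   # placeholder (user, item) rating pairs
inter_edges = torch.stack([ratings[:, 0], ratings[:, 1] + num_users], dim=1)
```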
Fig. 3: Comparison of different numbers of neurons in the latent space (i.e., the number of MLP neurons between the encoder and decoder) on the ESC-10, FilmTrust and CIFAR-10 datasets, respectively.

Fig. 4: Comparison of different numbers of GCN units in our RR-Net on the ESC-10, FilmTrust and CIFAR-10 datasets, respectively.
4) Comparison with State-of-the-arts:
We compare our model with 6 state-of-the-art social recommendation methods, including SoReg [66], SVD++ [67], TrustSVD [70], SocialMF [68], TrustMF [69] and LightGCN [71]. For a fair comparison, we reproduce these methods for rating prediction and report their results in Tab. IV. From the results, it can be seen that our model outperforms most state-of-the-art methods on FilmTrust, despite the lack of intra relations in the item graph. In particular, compared with the recent method LightGCN [71], RR-Net reduces the error by around 6% and 2% in terms of RMSE and MAE on FilmTrust, respectively. On Ciao, although no intra relations can be exploited during relational reasoning, our RR-Net still achieves comparable performance without using any prior knowledge for intra-relation reasoning. This shows that our model is effective and general for learning the mapping between different data modalities.
D. Universality Analysis
TABLE V: Accuracy for the top-K nearest neighbor nodes of one inter-node in the Inter Graph, for different values of K.

To further prove the universality of the presented RR-Net, we also analyze some common characteristics across the above cross modality mapping tasks. Firstly, RR-Net is trained with nearly the same learning rate of 0.01 for the different tasks, which makes training stable and robust across different data domains. Secondly, we vary the number of neurons in the encoder-decoder module (i.e., the number of MLP neurons before the decoder output), and find that our model achieves better performance with the same number of neurons across different tasks. Fig. 3 gives the corresponding comparison curves. We can see that RR-Net consistently performs best under the same setting, i.e., 16 neurons, on the ESC-10, FilmTrust and CIFAR-10 datasets, which shows that RR-Net is not affected by the dimension of the latent space. Moreover, we notice that RR-Net is not sensitive to the total number of GCN units used for relational reasoning, as illustrated in Fig. 4. From this figure, it can be seen that our model performs best when the total number of GCN units is set to 2, 5 and 5 for the ESC-10, FilmTrust and CIFAR-10 datasets, respectively, which illustrates that our model is effective when the number of GCN units is within the range [2, 5] for different tasks. A similar situation can be found in Tab. V, where we adopt different numbers of candidate inter-edges per inter-node to construct the Inter Graph on both the sound recognition and image classification tasks. The results in Tab. V show that our model reaches better performance with the number of top-K nearest neighbors ranging from 15 to 20, which further illustrates the universality of our model for cross modality mapping learning. Interestingly, performance consistently decreases when inter-edges are built by fully connecting the intra-nodes of the two Intra Graphs. This is mainly because too much noise is introduced into the Inter Graph, which limits the
reasoning ability of RR-Net for inferring the inter relations between the two modalities. On the contrary, our model exhibits the best performance with only several high-confidence inter-edges in the Inter Graph.
V. CONCLUSION
In this paper, we resolve the cross modality mapping problem with relational reasoning via graph modeling and propose a universal RR-Net to learn both intra relations and inter relations simultaneously. Specifically, we first construct the Intra Graphs and the Inter Graph. On top of the constructed graphs, RR-Net mainly takes advantage of the Relational GCN Module to update the node features and edge features in an iterative manner, implemented by stacking multiple GCN units. Extensive experiments on different types of cross modality mapping clearly demonstrate the superiority and universality of our proposed RR-Net.
REFERENCES
[1] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-rnn)," arXiv:1412.6632, 2014.
[2] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. ACM International Conference on Machine Learning, 2015, pp. 2048–2057.
[3] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 21–29.
[4] H. Xu and K. Saenko, "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering," in Proc. IEEE European Conference on Computer Vision, 2016, pp. 451–466.
[5] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[6] J. Zhang, J. Yu, and D. Tao, "Local deep-feature alignment for unsupervised dimension reduction," IEEE Transactions on Image Processing, no. 5, pp. 2420–2432, 2018.
[7] X. Wang, R. Chen, Z. Zeng, C. Hong, and F. Yan, "Robust dimension reduction for clustering with local adaptive learning," vol. 30, no. 3, pp. 657–669, 2019.
[8] W. Zhang, D. Xu, J. Zhanga, and W. Ouyang, "Progressive modality cooperation for multi-modality domain adaptation," pp. 1–1, 2021.
[9] H. Hu, I. Misra, and L. van der Maaten, "Evaluating text-to-image matching using binary image selection (bison)," in Proc. IEEE International Conference on Computer Vision Workshop, 2019.
[10] M. Wray, D. Larlus, G. Csurka, and D. Damen, "Fine-grained action retrieval through multiple parts-of-speech embeddings," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 450–459.
[11] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang, "Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7181–7189.
[12] Y. Huang, Q. Wu, and L. Wang, "Learning semantic concepts and order for image and sentence matching," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 663–6171.
[13] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," arXiv:1411.2593, 2014.
[14] F. Yan and K. Mikolajczyk, "Deep correlation for matching images and text," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3441–3450.
[15] J. Weston, S. Bengio, and N. Usunier, "Wsabie: Scaling up to large vocabulary image annotation," in International Joint Conference on Artificial Intelligence, 2011, pp. 2764–2770.
[16] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," in Proc. ACM International Conference on Machine Learning, 2013, pp. 1247–1255.
[17] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean, "Devise: A deep visual-semantic embedding model," in Annual Conference on Neural Information Processing Systems, 2013, pp. 2121–2129.
[18] S. Chen, Y. Zhao, Q. Jin, and Q. Wu, "Fine-grained video-text retrieval with hierarchical graph reasoning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10638–10647.
[19] Y. Peng, X. Huang, and J. Qi, "Cross-media shared representation by hierarchical learning with multiple deep networks," in International Joint Conference on Artificial Intelligence, 2016, pp. 3846–3853.
[20] F. Huang, X. Zhang, J. Xu, Z. Zhao, and Z. Li, "Multimodal learning of social image representation by exploiting social relations," in IEEE Transactions on Neural Networks and Learning Systems, 2013, pp. 1–13.
[21] J. M. Manno, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, "Multimodal similarity-preserving hashing," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 824–830, 2014.
[22] D. Wang, P. Cui, M. Ou, and W. Zhu, "Learning compact hash codes for multimodal representations using orthogonal deep structure," IEEE Transactions on Multimedia, pp. 1404–1416, 2015.
[23] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[24] N. Srivastava and R. Salakhutdinov, "Multimodal learning with deep boltzmann machines," in Annual Conference on Neural Information Processing Systems, 2012, pp. 2231–2239.
[25] H.-I. Suk, S.-W. Lee, and D. Shen, "Modeling text with graph convolutional network for cross-modal information retrieval," in Pacific-Rim Conference on Multimedia, 2014.
[26] J. Ngiam, A. Khosla, M. Kim, and J. Nam, "Multimodal deep learning," in Proc. ACM International Conference on Machine Learning, 2011.
[27] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W.-Y. Ma, "Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[28] K. Wang, R. He, L. Wang, W. Wang, and T. Tan, "Joint feature selection and subspace learning for cross-modal retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2010–2023, 2016.
[29] Y. Zhen and D.-Y. Yeung, "Co-regularized hashing for multimodal data," in Annual Conference on Neural Information Processing Systems, 2012, pp. 1376–1384.
[30] J. Yu, Y. Lu, Z. Qin, Y. Liu, J. Tan, L. Guo, and W. Zhang, "Hierarchical feature representation and multimodal fusion with deep learning for ad/mci diagnosis," arXiv:1802.00985, 2018.
[31] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Annual Conference on Neural Information Processing Systems, vol. 1, 2014.
[32] S.-X. Zhang, X. Zhu, J.-B. Hou, C. Liu, C. Yang, H. Wang, and X.-C. Yin, "Deep relational reasoning graph network for arbitrary shape text detection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9699–9708.
[33] B. Zhou, A. Andonian, and A. Torralba, "Temporal relational reasoning in videos," pp. 803–818, 2018.
[34] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, "Murel: Multimodal relational reasoning for visual question answering," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1989–1998.
[35] Y. Feng, X. Chen, B. Y. Lin, P. Wang, J. Yan, and X. Ren, "Scalable multi-hop relational reasoning for knowledge-aware question answering," arXiv:2005.00646, 2020.
[36] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, "A simple neural network module for relational reasoning," in Annual Conference on Neural Information Processing Systems, 2017.
[37] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li, "Dynamic fusion with intra- and inter-modality attention flow for visual question answering," pp. 6639–6648, 2019.
[38] Q. Huang, H. He, A. Singh, Y. Zhang, S. N. Lim, and A. Benson, "Better set representations for relational reasoning," 2020.
[39] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. Tenenbaum, "Neural-symbolic vqa: Disentangling reasoning from vision and language understanding," pp. 1039–1050, 2018.
[40] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks and Learning Systems, pp. 61–80, 2019.
[41] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning and Representation, 2018.
[42] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning and Representation, 2017.
[43] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Annual Conference on Neural Information Processing Systems, 2017.
[44] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Annual Conference on Neural Information Processing Systems, 2016.
[45] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, "Visual semantic reasoning for image-text matching," in Proc. IEEE International Conference on Computer Vision, 2019, pp. 4654–4662.
[46] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. van den Hengel, "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1960–1968.
[47] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. D. Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in Extended Semantic Web Conference, 2018, pp. 593–607.
[48] Z. Wang, L. Zheng, Y. Li, and S. Wang, "Linkage based face clustering via graph convolution network," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1117–1125.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2016.
[50] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015, vol. abs/1409.1556.
[51] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proc. ACM International Conference on Machine Learning, 2014, pp. 1188–1196.
[52] T. Wang, H. Liu, Y. Li, Y. Jin, X. Hou, and H. Ling, "Learning combinatorial solver for graph matching," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2020, pp. 7568–7577.
[53] H. Xu and K. Saenko, "Esc: Dataset for environmental sound classification," in ACM International Conference on Multimedia, 2015.
[54] Y. Tokozume and T. Harada, "Learning environmental sounds with end-to-end convolutional neural network," in Proc. IEEE International Conference on Acoustics, Speech and SP, 2017, pp. 2721–2725.
[55] Y. Tokozume, Y. Ushiku, and T. Harada, "Learning from between-class examples for deep sound recognition," 2018.
[56] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, "Very deep convolutional neural networks for raw waveforms," in Proc. IEEE International Conference on Acoustics, Speech and SP, 2017.
[57] B. Zhu, C. Wang, F. Liu, J. Lei, Z. Lu, and Y. Peng, "Learning environmental sounds with multi-scale convolutional neural network," arXiv:1803.10219, 2018.
[58] D. Hu, F. Nie, and X. Li, "Deep multimodal clustering for unsupervised audiovisual learning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[59] Y. Tokozume, Y. Ushiku, and T. Harada, "Learning from between-class examples for deep sound recognition," in International Conference on Learning and Representation, 2018.
[60] Y. Aytar, C. Vondrick, and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," in Annual Conference on Neural Information Processing Systems, 2016.
[61] A. Greco, N. Petkov, A. Saggese, and M. Vento, "Aren: A deep learning approach for sound event recognition using a brain inspired representation," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3610–3624, 2020.
[62] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," arXiv:1801.04381, 2018.
[63] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," in Handbook Syst. Autoimmune. Diseases., 2009.
[64] H. Kai, W. Yunhe, X. Yixing, X. Chunjing, W. Enhua, and X. Chang, "Training binary neural networks through learning with noisy supervision," in Proc. ACM International Conference on Machine Learning, 2020.
[65] G. Guo, J. Zhang, and N. Yorke-Smith, "A novel bayesian similarity measure for recommender systems," in International Joint Conference on Artificial Intelligence, 2013, pp. 2619–2625.
[66] H. Ma, D. Zhou, and C. Liu, "Recommender systems with social regularization," in ACM International Conference on Web Search and Data Mining, 2011, pp. 287–296.
[67] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model," in Proc. ACM Knowledge Discovery and Data Mining, 2008, pp. 426–434.
[68] M. Jamali and M. Ester, "A matrix factorization technique with trust propagation for recommendation in social networks," in Proc. ACM Conference on Recommender Systems, 2010, pp. 135–142.
[69] B. Yang, Y. Lei, D. Liu, and J. Liu, "Social collaborative filtering by trust," in International Joint Conference on Artificial Intelligence, 2013, pp. 2747–2753.
[70] G. Guo, J. Zhang, and N. Yorke-Smith, "Trustsvd: Collaborative filtering with both the explicit and implicit influence of user trust and of item ratings," in AAAI Conference on Artificial Intelligence, 2015, pp. 1482–1489.
[71] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, "Lightgcn: Simplifying and powering graph convolution network for recommendation," in