Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data
Amila Silva, Ling Luo, Shanika Karunasekera, Christopher Leckie
School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia
{amila.silva@student., ling.luo@, karus@, caleckie@}unimelb.edu.au

Abstract
With the rapid evolution of social media, fake news has become a significant social problem that cannot be addressed in a timely manner using manual investigation. This has motivated numerous studies on automating fake news detection. Most studies explore supervised training models with different modalities (e.g., text, images, and propagation networks) of news records to identify fake news. However, the performance of such techniques generally drops if news records come from different domains (e.g., politics, entertainment), especially for domains that are unseen or rarely seen during training. As motivation, we empirically show that news records from different domains have significantly different word usage and propagation patterns. Furthermore, due to the sheer volume of unlabelled news records, it is challenging to select news records for manual labelling so that the domain coverage of the labelled dataset is maximized. Hence, this work: (1) proposes a novel framework that jointly preserves domain-specific and cross-domain knowledge in news records to detect fake news from different domains; and (2) introduces an unsupervised technique to select a set of unlabelled informative news records for manual labelling, which can ultimately be used to train a fake news detection model that performs well for many domains while minimizing the labelling cost. Our experiments show that the integration of the proposed fake news model and the selective annotation approach achieves state-of-the-art performance for cross-domain news datasets, while yielding notable improvements for rarely-appearing domains in news datasets.
Introduction
Motivation.
Today, social media is considered one of the leading and fastest media for seeking news information online. Thus, social media platforms provide an ideal environment for spreading fake news (i.e., disinformation). The cost and damage due to fake news are often high, and early detection to stop the spread of such information is important. For example, it has been estimated that at least 800 people died and 5,800 were admitted to hospital as a result of false information related to the COVID-19 pandemic, e.g., believing that alcohol-based cleaning products are a cure for the virus. Due to the high volume of news generated on a daily basis, it is not practical to identify fake news using manual fact checking. Therefore, automatic detection of fake news has recently become a significant problem attracting immense research effort.

Challenges.
Nevertheless, most existing fake news detection techniques fail to identify fake news in a real-world news stream for the following reasons. First, most existing techniques (Silva et al. 2020; Zhou et al. 2020; Shu et al. 2019, 2020b; Ruchansky et al. 2017) are trained and evaluated using datasets (Shu et al. 2020a; Cui et al. 2020) that are limited to a single domain such as politics, entertainment, or healthcare. However, a real news stream typically covers a wide variety of domains. We have empirically found that existing fake news detection techniques perform poorly for such cross-domain news datasets despite yielding good results for domain-specific news datasets. This observation may be due to two reasons: (1) domain-specific word usage; and (2) domain-specific propagation patterns. For example, Figure 1 adopts two datasets from different domains, PolitiFact for politics and GossipCop for entertainment, which are two widely used labelled datasets for training fake news detection models. Fig. 1 shows that there are significant differences in the frequently used words and propagation patterns of these two datasets. To address this challenge, some previous works (Wang et al. 2018; Castelo et al. 2019) learned models to overlook such domain-specific information and rely only on cross-domain information (e.g., web-markup and readability features) for fake news detection. However, domain-specific knowledge could be useful for accurate identification of fake news. As a solution, this work aims to address how to preserve domain-specific and cross-domain knowledge in news records to detect fake news in cross-domain news datasets.
Second, the studies in (Han et al. 2020; Janicka et al. 2019) show that most fake news detection techniques are not good at identifying fake news records from domains that are unseen or rarely seen during training. As a solution, fake news detection models can be learned using a dataset that covers as many domains as possible. Here we assume that the fake news detection model requires supervision, as supervised techniques are known to be substantially better at identifying fake news than unsupervised methods (Yang et al. 2019a). In such a supervised learning setting, each training (i.e., labelled) data point has an associated labelling cost. Thus, the total labelling budget constrains the number of data instances that can be selected for manual labelling. Due to the sheer volume of unlabelled news records available, there is a need to identify informative news records to annotate such that the labelled dataset ultimately covers many domains while avoiding any selection biases.

Figure 1: (a) Word clouds for the top 20 words in PolitiFact and GossipCop. (b) Two-sample t-test results conducted using different graph-level features extracted from the propagation networks in PolitiFact and GossipCop:

Feature: Wiener Index | Network Depth | Maximum Outdegree | Propagation Speed
p-value: 1.81e-2 | 5.81e-19 | 4.11e-4 | 3.42e-29
Contribution.
To address the aforementioned challenges, this work makes the following contributions:
• We propose a multimodal fake news detection technique for cross-domain news datasets that learns domain-specific and cross-domain information of news records using two independent embedding spaces, which are subsequently used to identify fake news records. Our experiments show that the proposed framework outperforms state-of-the-art fake news detection models by as much as . in F1-score.
• We propose an unsupervised technique to select a given number of news records from a large data pool such that the selected dataset maximizes the domain coverage. By using such a dataset to train a fake news detection model, we show that the model achieves around F1-score improvements for rarely-appearing domains in news datasets.
Related Work
Fake news detection methods mainly rely on different attributes (text, image, social context) of news records to determine their veracity. Text content-based approaches (Yang et al. 2016; Volkova et al. 2017; Pérez-Rosas et al. 2018; Pennebaker et al. 2015) mainly explore word usage and linguistic styles in the headline and body of news records to identify fake news. Some works analyse the images in news records along with the text content for fake news detection. For example, the studies in (Jin et al. 2017; Wang et al. 2018; Khattar et al. 2019) use pre-trained image models (e.g., VGG-19, ResNet) to extract features from images, which are integrated with text features to identify fake news. Also, some works consider the social context of a news record, i.e., how the record is propagated across social media, as another modality to differentiate fake news records from real ones. (We define multimodality as information acquired from different sources/attributes, following (Zhang et al. 2017), instead of restricting it to sensory media such as text and images.) Existing work in this line mostly applies various machine learning techniques to extract features from propagation patterns, including Propagation Tree Kernels (Ma et al. 2017), Recurrent Neural Networks (Wu et al. 2018; Liu et al. 2018), and Graph Neural Networks (Monti et al. 2019). However, all these modalities (i.e., text, propagation patterns) generally show notable differences (see Figure 1) for news records in different domains. Thus, most existing techniques perform poorly for cross-domain news datasets due to their inability to capture such domain-specific variations. Our model also relies on the text content and social context of news. However, the main objective of our model is to capture such domain-specific variations of news records.
Domain-agnostic Fake News Detection.
Several previous works have attempted to perform fake news detection using cross-domain datasets. In (Wang et al. 2018), an event discriminator is learned along with a multimodal fake news detector to overlook domain-specific information in news records. The study in (Castelo et al. 2019) carefully selects a set of features (e.g., psychological features, readability features) from news records that are domain-invariant. These techniques rely only on cross-domain information in news records. In contrast, Han et al. (2020) consider cross-domain fake news detection as a continual learning task, which learns a model for a large number of tasks sequentially. This work adopts Graph Neural Networks to detect fake news using their propagation patterns, and applies the well-known continual learning approaches Elastic Weight Consolidation (Kirkpatrick et al. 2017) and Gradient Episodic Memory (Lopez-Paz et al. 2017) to address the cross-domain fake news detection problem. This approach has two limitations: (1) it assumes that the news records from different domains arrive sequentially, though this is not always true for real-world streams; and (2) it requires the domain of news records to be known, which is not generally available. In contrast, our approach exploits both domain-specific and cross-domain knowledge of news records without knowing the actual domain of news records.
Active Learning for Fake News Detection.
Almost all the aforementioned models are supervised. Although there are unsupervised fake news detection techniques (Yang et al. 2019b; Hosseinimotlagh et al. 2018), they are generally inferior to the supervised approaches in terms of accuracy. However, the training of supervised models requires large labelled datasets, which are costly to collect. Therefore, how to obtain fresh and high-quality labelled samples for a given labelling budget is challenging. Some works (Wang et al. 2020; Bhattacharjee et al. 2017) adopt conventional active learning frameworks to select high-quality samples, in which the model is initially trained using a small randomly selected dataset. Then, the beliefs derived from the initial model are used to select subsequent instances to annotate. This approach has two limitations: (1) it requires a
Figure 2: Overview of the proposed framework. In the illustrated embedding spaces, each data point's colour and shape denote its domain label and veracity label (i.e., triangle for fake news and circle otherwise), respectively.

Table 1: Descriptive statistics of the PolitiFact, GossipCop and CoAID datasets.

Dataset | PolitiFact | GossipCop | CoAID
Problem Statement
Let R be a set of news records. Each record r ∈ R is represented as a tuple ⟨t_r, W_r, G_r⟩, where (1) t_r is the timestamp when r is published online; (2) W_r is the text content of r; and (3) G_r is the propagation network of r for a time bound ΔT. We keep ΔT low (= five hours) in our experiments to evaluate early detection performance. Each propagation network G_r is an attributed directed graph (V_r, E_r, X_r), where the nodes V_r represent the tweets/retweets of r and the edges E_r represent the retweet relationships among them. X_r is the set of attributes of the nodes (i.e., tweets) in G_r. More details about E_r and G_r are given in (Silva et al. 2021).

Our problem consists of two sub-tasks: (1) select a set of instances R_L from R to label while adhering to the given labelling budget B, which constrains the number of instances in R_L; the labelling process assigns a binary label y_r to each record r, where y_r is 1 if r is false and 0 otherwise; and (2) learn an effective model using R_L to predict the label y_r of unlabelled news records r ∈ R_U as false or real. In this work, R (= R_L ∪ R_U) is not constrained to a specific domain. To emulate such a domain-agnostic dataset, we combine three publicly available datasets: (1) PolitiFact (Shu et al. 2020a), which consists of news related to politics; (2) GossipCop (Shu et al. 2020a), a set of news related to entertainment stories; and (3) CoAID (Cui et al. 2020), a news collection related to COVID-19. All three datasets provide labelled news records and all the tweets related to each news item. The statistics of the datasets are shown in Table 1.

Our Approach
As shown in Fig. 2, the proposed fake news detection model consists of two main components: (1) unsupervised domain embedding learning (Module A); and (2) supervised domain-agnostic news classification (Module B). These two components are integrated to identify fake news while exploiting domain-specific and cross-domain knowledge in the news records. In addition, the proposed instance selection approach (Module C) adopts the same domain embedding learning component to select informative news records for labelling, which eventually yields a labelled dataset that maximizes the domain coverage.
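At a glance, the interplay of the three modules can be sketched as the following skeleton. Every function body here is a hypothetical stand-in for illustration only; the actual modules are detailed in the following sections, and only the data flow (Module A feeding Modules B and C) follows the description above.

```python
# Illustrative skeleton of the framework in Fig. 2; all names and bodies are
# stand-ins, not the paper's implementation.

def domain_embedding(record):
    # Module A (unsupervised): map a record to a domain vector.
    # Stand-in: membership in three toy "communities", keyed by a word.
    return [float(w in record["text"]) for w in ("vote", "movie", "virus")]

def select_for_labelling(records, budget):
    # Module C: pick records whose domain embeddings differ, up to the budget.
    seen, chosen = set(), []
    for r in records:
        key = tuple(domain_embedding(r))
        if key not in seen and len(chosen) < budget:
            seen.add(key)
            chosen.append(r)
    return chosen

def classify(record):
    # Module B (supervised): a trained domain-agnostic classifier would
    # consume the multimodal representation f_input(r) here.
    return 0  # stand-in prediction: "real"

records = [{"text": "senate vote"}, {"text": "movie star"}, {"text": "virus cure"}]
labelled = select_for_labelling(records, budget=2)   # sent for manual annotation
predictions = [classify(r) for r in records]
```

The point of the skeleton is only the wiring: the same unsupervised embedding serves both the instance selection and, later, the supervised classifier.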
Unsupervised Domain Discovery
For a given news record r, assume that its domain label is not available. The proposed unsupervised domain embedding learning technique exploits the multimodal content (e.g., text, propagation network) of r to represent the domain of r as a low-dimensional vector f_domain(r). Our approach is motivated by: (1) the tendency of users to form groups containing people with similar interests (i.e., homophily) (McPherson et al. 2001), which results in different domains having distinct user bases; and (2) the significant differences in domain-specific word usage, as shown in Figure 1a.

We exploit the aforementioned motivations by constructing a heterogeneous network whose nodes are both the users tweeting the news items and the words in the news titles, using the following steps (Lines 1-9 in Algorithm 1): (1) create a set S_r for each news record r by adding all the users U_r in the propagation network G_r and all the words appearing in the news title W_r (tokenized using whitespace); (2) for each pair of items in S_r, build a weighted edge e linking the two items in the graph; and (3) repeat Steps 1 and 2 for all the news records until we obtain the final network G.

Algorithm 1: Domain Embedding Learning
Input: A collection of news records R
Output: Domain embeddings f_domain(r) of r ∈ R
// Network construction
Initialize an empty graph G;
for r ∈ R do
    S_r ← U_r ∪ W_r;
    for each pair (s_1, s_2) ∈ S_r do
        e ← ({s_1, s_2}, 1);
        if edge e exists in graph G then
            Increment the weight of e in G by 1;
        else
            Add edge e to graph G;
// Community Detection
C ← Find communities in G using Louvain;
// Embedding Learning
for r ∈ R do
    Compute f_domain(r) using Eq. 2;
Return f_domain(r) of r ∈ R.

Then, we adopt the Louvain algorithm (Blondel et al. 2008) to identify communities in G. Here, we select the Louvain algorithm as it was shown to be one of the best performing parameter-free community detection algorithms in (Fortunato 2010). At the end of this step, we obtain a set of communities/clusters C, each having either a highly connected set of users or of words. As the nodes of G contain both users and words, such communities may have formed either due to a set of users engaging with similar news records or a set of words appearing only within a fraction of news records. Following the aforementioned motivations, this work assumes each community in C belongs to a single domain.

In the next step, we compute the soft membership p(r ∈ c) of r in a cluster c using the following equation:

p(r ∈ c) = Σ_{v ∈ c ∩ r} v_deg / Σ_{c' ∈ C} Σ_{v ∈ c' ∩ r} v_deg   (1)

Here p(r ∈ c) is proportional to the number of common users or words that r and c have. Each node (i.e., user or word) v is weighted by its degree v_deg in G (i.e., its number of occurrences) to reflect its varying importance for the corresponding community.
Finally, we produce the domain embedding f_domain(r) ∈ R^|C| of r as the concatenation of r's likelihoods of belonging to the communities in C:

f_domain(r) = p(r ∈ c_1) ⊕ p(r ∈ c_2) ⊕ ... ⊕ p(r ∈ c_|C|)   (2)

where ⊕ denotes concatenation. (Please see the Supplementary Material in (Silva et al. 2021) for detailed pseudo code of the Louvain algorithm.)
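On a toy example, the network construction and embedding steps (Algorithm 1, Eqs. 1-2) can be sketched as follows. The records, node sets, and the hard-coded community partition are illustrative assumptions standing in for the Louvain output:

```python
from collections import Counter
from itertools import combinations

# Toy records: S_r contains the users and title words of each record (all
# names are made up for the sketch).
records = {
    "r1": {"alice", "bob", "poll", "senate"},
    "r2": {"alice", "carol", "poll", "budget"},
    "r3": {"dave", "erin", "film", "award"},
}

# Network construction: weighted co-occurrence edges and node degrees.
edge_weight = Counter()
for nodes in records.values():
    for u, v in combinations(sorted(nodes), 2):
        edge_weight[(u, v)] += 1
degree = Counter()
for (u, v), w in edge_weight.items():
    degree[u] += w
    degree[v] += w

# Communities C, as a community detection step (Louvain in the paper) might
# return them on this graph; hard-coded here for the sketch.
communities = [
    {"alice", "bob", "carol", "poll", "senate", "budget"},
    {"dave", "erin", "film", "award"},
]

def domain_embedding(r):
    # Eq. 1: soft membership = degree mass shared with each community,
    # normalised over all communities; Eq. 2: concatenate the memberships.
    masses = [sum(degree[v] for v in c & records[r]) for c in communities]
    total = sum(masses)
    return [m / total for m in masses]
```

Here `domain_embedding("r1")` places all mass on the first community and `"r3"` on the second, while a record sharing users or words with both communities would receive fractional memberships — exactly the information a hard domain label would lose.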
Figure 3: t-SNE visualization of domain embeddings from: (a) the user-based domain discovery algorithm in (Chen et al. 2020) and (b) the multimodal domain discovery approach proposed in this work.

In Figure 3, we adopt t-SNE (Maaten et al. 2008) to visualize the domain embedding space of the proposed approach and of the user-based domain discovery algorithm proposed in (Chen et al. 2020). Due to space limitations, we present more details about the baseline in (Silva et al. 2021). As can be seen in Figure 3, the proposed approach yields a clearer separation between the domains than the baseline. This may be mainly due to the ability of our approach to jointly exploit multiple modalities, both the users and the text of news records, to discover their domains. In addition, most previous works on domain discovery ultimately assign hard domain labels to news records, which could lead to substantial information loss. For example, some news records may belong to multiple domains, which cannot be captured using hard domain labels. Hence, by representing the domain of a news record as a low-dimensional vector, our approach can preserve such knowledge.

Domain-agnostic News Classification
In our news classification model, each news record r is represented as a vector f_input(r) using the textual content W_r and the propagation network G_r of r (elaborated in the Experiments section). Then, our classification model maps f_input(r) into two different subspaces such that one preserves the domain-specific knowledge, f_specific : f_input(r) → R^d, and the other preserves the cross-domain knowledge, f_shared : f_input(r) → R^d, of r. Here d is the dimension of the subspaces. Then, the concatenation of f_specific(r) and f_shared(r) is used to recover the label y_r and the input representation f_input(r) of r during training via two decoder functions, g_pred and g_recon respectively:

ŷ_r = g_pred(f_specific(r) ⊕ f_shared(r))
f̂_input(r) = g_recon(f_specific(r) ⊕ f_shared(r))

L_pred = BCE(ŷ_r, y_r)   (3)
L_recon = ||f̂_input(r) − f_input(r)||²   (4)

where ŷ_r and f̂_input(r) denote the predicted label and the predicted input representation respectively, and BCE stands for the Binary Cross-Entropy loss function. We minimize L_pred and L_recon to find the optimal parameters of (f_specific, f_shared, g_pred, g_recon).

However, L_pred and L_recon do not leverage domain differences in news records. Hence, we now discuss how the mapping functions for the subspaces, f_specific and f_shared, are further learned to preserve the domain-specific and cross-domain knowledge in news records.

Leveraging Domain-specific Knowledge
To preserve the domain-specific knowledge, we introduce an auxiliary loss term L_specific, which trains a new decoder function g_specific to recover the domain embedding f_domain(r) of r from the domain-specific representation f_specific(r). We minimize L_specific to find the optimal parameters of (f_specific, g_specific), so that the domain-specific knowledge is captured by f_specific. This process can be defined as follows:

L_specific = ||f_domain(r) − g_specific(f_specific(r))||²
(ĝ_specific, f̂_specific) = argmin_{(g_specific, f_specific)} L_specific   (5)

Leveraging Cross-domain Knowledge
In contrast, we learn f_shared to overlook the domain-specific knowledge of the news records. Consequently, f_shared preserves the cross-domain knowledge in the news records. Here, we train a decoder function g_shared to accurately predict the domain of r from f_shared(r). Meanwhile, we learn f_shared to fool the decoder g_shared by maximizing the loss of g_shared. Such a formulation forces f_shared to rely only on cross-domain knowledge, which is useful for transferring knowledge across domains. This process can be defined as a minimax game between g_shared and f_shared as follows:

L_shared = ||g_shared(f_shared(r)) − f_domain(r)||²
(ĝ_shared, f̂_shared) = argmin_{f_shared} argmax_{g_shared} (−L_shared)   (6)

Integrated Model
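To make the four loss terms concrete before they are assembled into the final objective, the following toy computation evaluates Eqs. 3-6 for a single record. All vectors are illustrative stand-ins for the learned encoder/decoder outputs, and the λ weights are those reported in the Parameter Settings section:

```python
import math

# Stand-in values for one record r (all numbers are illustrative).
f_input  = [0.2, 0.8, 0.5]         # multimodal input representation
f_domain = [0.9, 0.1]              # domain embedding from Module A
y_true   = 1.0                     # ground-truth veracity label (1 = fake)

y_hat         = 0.7                # g_pred output: predicted P(fake)
f_input_hat   = [0.25, 0.7, 0.55]  # g_recon output
f_domain_spec = [0.85, 0.15]       # g_specific output (from f_specific)
f_domain_shar = [0.5, 0.5]         # g_shared output (from f_shared)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

L_pred     = -(y_true * math.log(y_hat)
               + (1 - y_true) * math.log(1 - y_hat))  # Eq. 3: BCE
L_recon    = sq_dist(f_input_hat, f_input)            # Eq. 4
L_specific = sq_dist(f_domain, f_domain_spec)         # Eq. 5
L_shared   = sq_dist(f_domain_shar, f_domain)         # Eq. 6

# Combined objective: L_shared enters negatively because f_shared is trained
# to FOOL g_shared (the minimax game), while g_shared itself minimises it.
lam1, lam2, lam3 = 1.0, 10.0, 5.0
L_final = L_pred + lam1 * L_recon + lam2 * L_specific - lam3 * L_shared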
Then the final loss function of the model is formulated as:

L_final = L_pred + λ_1 L_recon + λ_2 L_specific − λ_3 L_shared   (7)

where λ_1, λ_2 and λ_3 control the importance given to each loss term relative to L_pred (i.e., the main task). To learn the minimax game in L_shared, the final loss function L_final is sequentially optimized using the following two steps:

θ̂_1 = argmin_{θ_1} L_final(θ_1, θ_2)   (8)
θ̂_2 = argmax_{θ_2} L_final(θ̂_1, θ_2)   (9)

where θ_1 and θ_2 denote the parameters of (f_specific, f_shared, g_specific, g_pred, g_recon) and of g_shared respectively. The empirically studied convergence properties of the proposed optimization scheme are presented in (Silva et al. 2021).

LSH-based Instance Selection
The aforementioned model is able to exploit the domain-specific and cross-domain knowledge in news records to identify their veracity. Nevertheless, if the model is used to identify fake news records in domains that are unseen or rarely appear during training, we empirically observe that its performance substantially drops. This observation is expected and is consistent with the findings in (Castelo et al. 2019), and could be due to the domain-specific word usage and propagation patterns shown in Fig. 1. Hence, we propose an unsupervised technique to construct a labelled training dataset for a given labelling budget B such that it covers as many domains as possible. The ultimate objective of this technique is to learn a model using such a dataset that performs well for many domains.

Our approach initially represents each news record r ∈ R using its domain embedding f_domain(r). Then, we propose a Locality-Sensitive Hashing (LSH) algorithm based on random projection to select a set of records in R that are distant in the domain embedding space, which can be elaborated using the following steps:

1. Create |H| different hash functions H_i(r) = sgn(h_i · f_domain(r)), where i ∈ {0, 1, ..., |H|−1}, h_i is a random vector, and sgn(.) is the sign function. The random vectors h_i are generated using the following probability distribution, as such a distribution was shown to perform well for random projection-based techniques (Achlioptas 2001):

h_{i,j} = √3 × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }   (10)

2. Construct an |H|-dimensional hash value for each news record r as H_0(r) ⊕ H_1(r) ⊕ ... ⊕ H_{|H|−1}(r), where ⊕ defines the concatenation operation. According to the Johnson-Lindenstrauss lemma (Johnson et al. 1984), such hash values approximately preserve the distances between the news records in the original embedding space with high probability. Hence, neighbouring records in the domain embedding space are mapped to similar hash values.

3. Group the news records with similar hash values to construct a hash table.

4. Randomly pick a record from each bin in the hash table and add it to the selected dataset pool.

5. Repeat Steps 1-4 until the size of the dataset pool reaches the labelling budget B.

In Figure 4a, we compare the composition of the dataset selected using the proposed approach and random selection. As can be seen, random selection follows the empirical distribution of the datasets in Table 1 and picks few instances from rarely appearing domains (e.g., fake/real news in PolitiFact, fake news in CoAID). Thus, a model trained on such a dataset may perform poorly on rarely appearing domains. In contrast, the proposed approach provides a significant number of samples even from rarely occurring domains. In addition, the proposed approach is efficient (O(|H||R|) complexity) compared to naive farthest point selection algorithms (e.g., k-Means (Lloyd 1982) with O(|R|²) complexity, where |R| >> |H|).

To measure the domain coverage of the instances selected by the proposed instance selection approach, we adopt the metric introduced in (Laib et al. 2017), which, for a given set of records r_1, r_2, ..., r_n represented using their domain embeddings, can be computed as:

λ = (1/δ̄) √( (1/n) Σ_{i=1}^n (δ_i − δ̄)² )

where δ_i = min_k L2_norm(f_domain(r_i), f_domain(r_k)) and δ̄ = Σ_i δ_i / n. If the coverage is high, λ is small. Hence, the proposed approach yields better domain coverage than random instance selection, as shown in Figure 4b.

Figure 4: Statistics of datasets selected using random selection (Rand) and the proposed LSH-based technique (LSH). (a) Number of fake and real news records selected from each domain when B/|R| = 0. , and (b) the domain-coverage measure λ (lower λ is better) for different B/|R| values.

Experiments
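A minimal sketch of the selection procedure described above (steps 1-5), on toy domain embeddings; the Achlioptas-style projections follow Eq. 10, while the number of hash functions, the embedding dimension, and the seed are arbitrary choices for the sketch:

```python
import random

def random_vector(dim, rng):
    # Eq. 10: entries are sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}.
    return [3 ** 0.5 * rng.choice([1, 0, 0, 0, 0, -1]) for _ in range(dim)]

def signature(x, projections):
    # Step 2: |H|-dimensional hash value, one sign bit per projection.
    return tuple(1 if sum(h_j * x_j for h_j, x_j in zip(h, x)) >= 0 else -1
                 for h in projections)

def lsh_select(embeddings, budget, n_hashes=4, seed=0):
    rng = random.Random(seed)
    selected = set()
    while len(selected) < budget:                 # step 5: repeat until budget met
        projections = [random_vector(len(next(iter(embeddings.values()))), rng)
                       for _ in range(n_hashes)]  # step 1: fresh hash functions
        bins = {}
        for rid, emb in embeddings.items():       # steps 2-3: hash and group
            bins.setdefault(signature(emb, projections), []).append(rid)
        for members in bins.values():             # step 4: one record per bin
            fresh = [m for m in members if m not in selected]
            if fresh and len(selected) < budget:
                selected.add(rng.choice(fresh))
    return selected

# Nine records from three toy domains (one-hot domain embeddings).
embeddings = {f"r{i}": [float(i % 3 == j) for j in range(3)] for i in range(9)}
chosen = lsh_select(embeddings, budget=3)
```

Because records with similar embeddings tend to collide into the same bin, drawing one record per bin spreads the budget across domains instead of following the (skewed) empirical domain distribution.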
Experimental Setup
Encoding and Decoding Functions
In our model, each record r is initially represented as a low-dimensional vector f_input(r) using its text content and propagation network. We adopt RoBERTa-base, a robustly optimized BERT pretraining model (Liu et al. 2019), to learn the text-based representation f_text(r) of r. The propagation network-based representation f_network(r) of r is obtained using the unsupervised network representation learning technique proposed in (Silva et al. 2020). Then, the final input representation f_input(r) is constructed as f_text(r) ⊕ f_network(r), where ⊕ denotes concatenation. All the other encoding and decoding functions, (f_specific, f_shared, g_specific, g_shared, g_pred, g_recon), are modelled as 2-layer feed-forward networks with sigmoid activation.

Dataset
We combine three disinformation datasets: (1) PolitiFact; (2) GossipCop; and (3) CoAID, to produce a cross-domain news dataset. (Here we do not consider existing datasets on rumour detection (Kochkina et al. 2018; Ma et al. 2017), as they are not consistent with the fake news definition, i.e., disinformation.) Then, we randomly choose 75% of the dataset as the candidate data pool R_pool for training and the remaining 25% for testing. For a given labelling budget B, we select B instances from R_pool to train the model. The same process is performed for 3 different training and test splits and the average performance is reported. We evaluate the performance for each domain separately using the testing instances from each domain. For the evaluation, we adopt four metrics: (1) Accuracy (Acc); (2) Precision (Prec); (3) Recall (Rec); and (4) F1 Score (F1). (More details about implementations and parameter selections are presented in the Supplementary Material in (Silva et al. 2021).)
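The four metrics reduce to the usual confusion-matrix quantities, treating label 1 (fake) as the positive class; the toy label vectors below are illustrative:

```python
def metrics(y_true, y_pred):
    # Confusion-matrix counts, with "fake" (label 1) as the positive class.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    acc  = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    f1   = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

acc, prec, rec, f1 = metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```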
Baselines
In Table 2, we compare our approach with seven widely used fake news detection techniques and their variants.

Parameter Settings
After performing a grid search, we set the hyper-parameters of our model as: λ_1 = 1, λ_2 = 10, λ_3 = 5, and d = 512. To satisfy the Johnson–Lindenstrauss lemma, we set |H| = 10 (>> log(|R|)). For the specific parameters of the baselines, we use the default values mentioned in their original papers.

Results
Quantitative Results for Fake News Detection
As shown in Table 2, the proposed approach yields substantially better results for all three domains, outperforming the best baseline by as much as . in F1-score. The best baseline, EANN-Multimodal, also adopts domain information when determining fake news. This observation shows the importance of having domain knowledge of news records when identifying fake news in cross-domain datasets. In addition to the architectural differences of the models, EANN-Multimodal differs from our approach in two ways: (1) EANN-Multimodal only preserves cross-domain knowledge in news records; thus, it overlooks domain-specific knowledge, which is shown to be useful in our ablation study in Table 2; and (2) EANN-Multimodal adopts a hard label (i.e., exclusive membership) to represent the domain of a news record, whereas our approach uses a vector, which can accurately represent the likelihood of each record belonging to different domains. These differences may explain the advantage of our approach over the best baseline.

Among the baselines, the multimodal approaches (except HPNF+LIWC) generally achieve better results than the uni-modal approaches. Thus, we can conclude that each modality (i.e., propagation network and text) of news records provides unique knowledge for fake news detection. In HPNF+LIWC, each news record is represented using a set of hand-crafted features. In contrast, the other multimodal approaches, including ours, learn data-driven latent representations for news records, which may capture latent and complex information in news records that is useful for determining fake news. These observations further support two main design decisions in our model: (1) to exploit multiple modalities of news records; and (2) to adopt a representation learning-based technique.

Ablation Study
Our ablation study in Table 2 shows that without the domain-specific loss (Eq. 5) and the cross-domain loss (Eq. 6), the F1-score of the model substantially drops by around and for the PolitiFact dataset, which is the smallest domain of the training dataset. Hence, it is important to have a domain-specific layer to preserve the domain-specific knowledge and a separate cross-domain layer to transfer common knowledge between domains. To check whether our model actually learns the aforementioned intuition behind each embedding layer, we visualize each embedding layer using t-SNE in Figure 5.

Table 2: Results for fake news detection of different methods, which are classified under three categories: (1) text content-based approaches (T); (2) social context-based approaches (S); and (3) multimodal approaches (M). Each method is evaluated on PolitiFact, GossipCop and CoAID using Acc, Prec, Rec and F1.

Method | Type | PolitiFact (Acc/Prec/Rec/F1) | GossipCop (Acc/Prec/Rec/F1) | CoAID (Acc/Prec/Rec/F1)
LIWC (Pennebaker et al. 2015) | T | … | … | …
Our Approach (B = 50%|R_pool|) | M | … | … | …
Our Approach (B = 100%|R_pool|) | M | … | … | …
(−) Domain-shared loss | M | … | … | …
(−) Domain-specific loss | M | … | … | …
(−) Network modality | M | … | … | …
(−) Text modality | M | … | … | …

Figure 5: t-SNE visualization of the (a) domain-specific and (b) cross-domain embedding spaces.

As can be seen, the domain-specific embedding layer preserves the domain of the news records by mapping different domains into different clusters. In contrast, we cannot identify the domain labels of news records from the cross-domain embedding space. Hence, this embedding space is useful for sharing common knowledge between records from different domains.

Furthermore, we analyse the contribution of each modality. It can be seen that the network modality is more useful for determining fake news in GossipCop, while the text modality is the most informative one for CoAID. This observation further signifies the importance of multimodal approaches to train models that generalize to multiple domains.

Evaluation of LSH-based Instance Selection
As shown in Table 2, our model outperforms the baselines even with a constrained budget B for selecting training data using the LSH-based instance selection technique. To verify its significance further, Figure 6 compares the proposed LSH-based instance selection approach with random instance selection for different B values. The proposed approach substantially outperforms random instance selection for the rarely-appearing or highly imbalanced domains: it increases the F1-score for both PolitiFact and CoAID at small B/|R_pool| values.

Figure 6: F1-scores for the fake news detection task with different instance selection strategies (random vs. LSH-based) over varying B/|R_pool|, for PolitiFact, GossipCop, and CoAID.

This may be due to the ability of our approach to maximize the coverage of domains when selecting instances (see Figure 4), instead of biasing towards a domain with a larger number of records.

Conclusion
In this work, we proposed a novel fake news detection framework, which exploits domain-specific and cross-domain knowledge in news records to determine fake news from different domains. We also introduced a novel unsupervised approach to select informative instances for manual labelling from a large pool of unlabelled news records. The selected data pool is subsequently used to train a model that can perform equally well across different domains. The integration of these two contributions yields a model that, even with low labelling budgets, outperforms existing fake news detection techniques in F1-score.

For future work, we intend to extend our model as an online learning framework to determine fake news in a real-world news stream, which typically covers a large number of domains. This setting introduces new challenges such as capturing newly emerging domains and handling temporal changes in domains. Another interesting direction is how to use the alignment of multimodal information to weakly guide the learning process of the proposed model, which may further reduce the labelling cost in a conventional supervised learning setting.

Acknowledgments
This research was financially supported by a Melbourne Graduate Research Scholarship and a Rowden White Scholarship. We would like to especially thank Yi Han for his insightful comments and suggestions on this work. We are also grateful for the time and effort of the reviewers in providing valuable feedback on our manuscript.
References
Achlioptas, D. 2001. Database-friendly Random Projections. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 274–281.
Bhattacharjee, S. D.; Talukder, A.; and Balantrapu, B. V. 2017. Active Learning Based News Veracity Detection with Feature Weighting and Deep-shallow Fusion. In Proceedings of the IEEE International Conference on Big Data (Big Data), 556–565.
Blondel, V. D.; Guillaume, J.-L.; Lambiotte, R.; and Lefebvre, E. 2008. Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment.
Companion Proceedings of the World Wide Web Conference, 975–980.
Chen, Z.; and Freire, J. 2020. Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds. In Companion Proceedings of the World Wide Web Conference, 584–592.
Cui, L.; and Lee, D. 2020. CoAID: COVID-19 Healthcare Misinformation Dataset. arXiv e-prints arXiv:2006.00885.
Fortunato, S. 2010. Community Detection in Graphs. Physics Reports.
arXiv e-prints arXiv:2007.03316.
Hosseinimotlagh, S.; and Papalexakis, E. E. 2018. Unsupervised Content-based Identification of Fake News Articles with Tensor Decomposition Ensembles. In Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2).
Janicka, M.; Pszona, M.; and Wawer, A. 2019. Cross-Domain Failures of Fake News Detection. Computación y Sistemas.
Proceedings of the ACM International Conference on Multimedia, 795–816.
Johnson, W. B.; and Lindenstrauss, J. 1984. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics.
Proceedings of The World Wide Web Conference, 2915–2921.
Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1746–1751.
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences.
Proceedings of the International Conference on Computational Linguistics, 3402–3413.
Laib, M.; and Kanevski, M. 2017. Unsupervised Feature Selection Based on Space Filling Concept. arXiv preprint arXiv:1706.08894.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv e-prints arXiv:1907.11692.
Liu, Y.; and Wu, Y.-f. B. 2018. Early Detection of Fake News on Social Media Through Propagation Path Classification with Recurrent and Convolutional Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 354–361.
Lloyd, S. 1982. Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
Proceedings of the Conference on Advances in Neural Information Processing Systems, 6467–6476.
Ma, J.; Gao, W.; and Wong, K.-F. 2017. Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 708–717.
Maaten, L. v. d.; and Hinton, G. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9: 2579–2605.
McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology.
arXiv e-prints arXiv:1902.06673.
Pennebaker, J. W.; Boyd, R. L.; Jordan, K.; and Blackburn, K. 2015. The Development and Psychometric Properties of LIWC2015. Technical report. URL https://repositories.lib.utexas.edu/handle/2152/31333.
Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; and Mihalcea, R. 2018. Automatic Detection of Fake News. In
Proceedings of the International Conference on Computational Linguistics, 3391–3401.
Ruchansky, N.; Seo, S.; and Liu, Y. 2017. CSI: A Hybrid Deep Model for Fake News Detection. In Proceedings of the ACM on Conference on Information and Knowledge Management, 797–806.
Shu, K.; Cui, L.; Wang, S.; Lee, D.; and Liu, H. 2019. DEFEND: Explainable Fake News Detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 395–405.
Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; and Liu, H. 2020a. FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media. Big Data.
Proceedings of the International AAAI Conference on Web and Social Media, 626–637.
Silva, A.; Han, Y.; Luo, L.; Karunasekera, S.; and Leckie, C. 2020. Embedding Partial Propagation Network for Fake News Early Detection. Proceedings of the International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM 2020.
Silva, A.; Luo, L.; Karunasekera, S.; and Leckie, C. 2021. Supplementary Materials for Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data. URL https://drive.google.com/drive/folders/1JRWxtAwd52Uibw0AHYWwcIAdN-aWK813?usp=sharing.
Volkova, S.; Shaffer, K.; Jang, J. Y.; and Hodas, N. 2017. Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter. In
Proceedings of the Annual Meeting of the Association for Computational Linguistics, 647–653.
Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; and Gao, J. 2018. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 849–857.
Wang, Y.; Yang, W.; Ma, F.; Xu, J.; Zhong, B.; Deng, Q.; and Gao, J. 2020. Weak Supervision for Fake News Detection via Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 516–523.
Wu, L.; and Liu, H. 2018. Tracing Fake-News Footprints: Characterizing Social Media Messages by How They Propagate. In Proceedings of the ACM International Conference on Web Search and Data Mining, 637–645.
Yang, S.; Shu, K.; Wang, S.; Gu, R.; Wu, F.; and Liu, H. 2019. Unsupervised Fake News Detection on Social Media: A Generative Approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5644–5651.
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.
Zhang, C.; Zhang, K.; Yuan, Q.; Tao, F.; Zhang, L.; Hanratty, T.; and Han, J. 2017. ReAct: Online Multimodal Embedding for Recency-aware Spatiotemporal Activity Modeling. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 245–254.
Zhou, X.; Wu, J.; and Zafarani, R. 2020. SAFE: Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 354–367.

Supplementary Material for Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multimodal Data
Amila Silva, Ling Luo, Shanika Karunasekera, Christopher Leckie
School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia
{amila.silva@student., ling.luo@, karus@, caleckie@}unimelb.edu.au

Abstract
This is the supplementary material for the paper titled "Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multimodal Data".
Louvain Algorithm for Community Detection
This section presents more details about the Louvain algorithm (Blondel et al. 2008), which is used in the proposed domain embedding learning approach to identify communities in a network. As shown in Algorithm 1, the Louvain algorithm identifies the communities in a network using the following steps:

1. Each vertex is placed in its own community (Line 1 in Algo. 1);
2. Each vertex is either retained in its own cluster or merged with an immediate neighbour such that the modularity score of the network is maximised (Lines 3-15 in Algo. 1). The modularity gain of moving an isolated vertex i into a community is computed as:

\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]

where \Sigma_{in} and \Sigma_{tot} represent the total weight of all links inside the community and the total weight of all links incident to the community, respectively. Similarly, the terms k_i and k_{i,in} denote the total weight of all links to i and the total weight of links from i to vertices within the community. Lastly, m denotes the total weight of all links in the network graph;
3. Build a new network where vertices in the same community are combined into a single vertex;
4. Repeat Steps 2 and 3 until there are no more mergings between communities.

At the end of this algorithm, we obtain a set of communities of the provided network such that the modularity score of the network is maximised.

Algorithm 1: Louvain Algorithm
Input: G = (V, E), where V and E are the vertices and edges of the network G
Output: A = (V, C): assignment of vertices V into communities C
1: assign each vertex v to its own community
2: repeat
3:   for v in V do
4:     MaxModularity <- -infinity
5:     MaxModNeighbour <- NULL
6:     for each neighbour v_n of v do
7:       ShiftMod <- modularity score of shifting v to v_n's community
8:       if ShiftMod > MaxModularity then
9:         MaxModularity <- ShiftMod
10:        MaxModNeighbour <- v_n
11:    OriginalMod <- modularity score of v in its original community
12:    if MaxModularity > OriginalMod then
13:      shift v to the community of MaxModNeighbour
14:    else
15:      keep v in its original community
16: until A stabilises (i.e., no more shifts)

We selected this algorithm in our model because it is known (Lim, Karunasekera, and Harwood 2017) to generate a relatively small number of communities compared to other parameter-free community detection algorithms such as Infomap (Rosvall and Bergstrom 2008) and Label Propagation (Raghavan, Albert, and Kumara 2007).

Multimodal Input Representation
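As a concrete illustration of the modularity-gain test used in Step 2 of the Louvain algorithm described above, the standard gain formula can be sketched in a few lines (a minimal sketch; the function name and the toy numbers are illustrative, not taken from the paper):

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Gain in modularity from moving an isolated vertex i into a community.

    sigma_in : total weight of links inside the community
    sigma_tot: total weight of links incident to the community
    k_i      : total weight of links attached to vertex i
    k_i_in   : total weight of links from i into the community
    m        : total weight of all links in the graph
    """
    two_m = 2.0 * m
    # Modularity contribution after moving i into the community ...
    after = (sigma_in + k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    # ... minus the contribution of the community and i kept separate.
    before = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return after - before
```

In the local-move loop (Lines 6-10 of Algorithm 1), each neighbour's community is scored with this gain and the vertex moves to the best-scoring one if the gain beats staying put.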
In our model, each news record r is inputted as a low-dimensional vector f_input(r) using its text content (i.e., news title) and propagation network (i.e., social context). Initially, we construct two independent representations for r using its text content, f_text(r), and its propagation network, f_network(r). Then, these two representations are concatenated to produce the final representation of r: f_input(r) = f_text(r) ⊕ f_network(r). This process is elaborated in this section.

Figure 1: Multimodal Input Representation. Tweets posted within the detection deadline form the propagation network around the source node (news record); the RoBERTa text embedding is combined with the global and local network embeddings via node-level aggregation and concatenation.

Text Representation
In this work, the text content of a news record is represented using RoBERTa (Liu et al. 2019), a robustly optimized BERT pre-training model. For a given textual content {w_1, w_2, ..., w_n} of a news record r, the RoBERTa model returns the text-based latent representation f_text(r) ∈ R^{d_t} of r. Out of the different variants of pretrained RoBERTa models, we adopt the roberta-large model available at https://pytorch.org/hub/pytorch_fairseq_roberta/, where d_t = 1024.

Propagation Network Representation
We explore two types of features of the propagation network G_r = (V_r, E_r, X_r): global-level features (global) and node-level features (local), to generate the network-based representation f_network(r) of a record r.

Propagation Network Construction
We consider all the tweets/retweets related to r as the nodes V_r of G_r. There is an extra node (i.e., the source node) in G_r to represent the news, which links the different information cascades of r. The edges E_r of G_r represent how a news item spreads from one person to another, as shown in Fig. 1. Specifically, there is an edge from node i to node j if: (1) the user of tweet i mentions the user of tweet j; or (2) tweet i is public and tweet j is posted within the detection deadline (= five hours) after tweet i.

Global Representation
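The two edge rules for the propagation network construction described above can be sketched as follows (a minimal sketch over an in-memory list of tweets; the `Tweet` class and field names are illustrative, and the extra source node linking the cascades is omitted):

```python
from dataclasses import dataclass, field

DEADLINE = 5 * 3600  # detection deadline in seconds (five hours)

@dataclass
class Tweet:
    tweet_id: str
    user: str
    timestamp: int                              # posting time in seconds
    is_public: bool = True
    mentions: set = field(default_factory=set)  # user names mentioned in the tweet

def build_propagation_edges(tweets):
    """Directed edges of the propagation network G_r for one news record r.

    There is an edge i -> j if (1) the user of tweet i mentions the user of
    tweet j, or (2) tweet i is public and tweet j is posted within the
    detection deadline after tweet i.
    """
    edges = set()
    for i in tweets:
        for j in tweets:
            if i is j:
                continue
            mentioned = j.user in i.mentions
            in_window = i.is_public and 0 < j.timestamp - i.timestamp <= DEADLINE
            if mentioned or in_window:
                edges.add((i.tweet_id, j.tweet_id))
    return edges
```

A full implementation would also add the synthetic source node with edges to the root of each cascade, so that Algorithm 2 can aggregate all cascades into one representation.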
We use the following features as global-level features: (1) Wiener index (g_1); (2) number of nodes (g_2); (3) network depth (g_3); (4) number of nodes at different hops (g_4); and (5) branching factor at different levels (g_5). Finally, all these features are concatenated together to formulate the global-level network representation f_global(r) of a record r.

Table 1: Node-level Features
Type: user. Features: whether the user is verified (n_1), the number of followers (n_2), the number of friends (n_3), the number of lists (n_4), and the number of favourites (n_5).
Type: text. Features: the sentiment score computed using VADER on the text content of the tweet (n_6), the proportion of positive words (n_7), the proportion of negative words (n_8), the number of mentions (n_9), and the number of hashtags (n_10).
Type: temporal. Features: the time difference with the source node (n_11); the time difference with the immediate predecessor (n_12); the average time difference with the immediate successors (n_13); and the user account timestamp (n_14).

Algorithm 2:
Local Network Representation
Input: propagation network G_r = (V_r, E_r, X_r); source node v_s ∈ V_r of r
Output: the local representation f_local(r)

h_v^0 <- x_v, for all v in V_r
for t in 1, 2, ..., k do
  for v in V_r do
    h_v^t <- h_v^{t-1} + ( sum over (v,u) in E_r of h_u^{t-1} ) / |{(v,u) in E_r}|
f_local(r) <- h_{v_s}^k
return f_local(r)

Local Representation
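The aggregation in Algorithm 2 above, where each node repeatedly adds the mean of its successors' previous representations to its own, can be sketched as follows (a minimal sketch with plain Python lists; the function and argument names are illustrative):

```python
def aggregate_local(features, out_edges, source, k=2):
    """Algorithm 2: propagate node-level features to the source node.

    features : dict node -> list of floats (initial node features x_v)
    out_edges: dict node -> list of successor nodes (edges (v, u) of G_r)
    source   : the source node v_s representing the news record
    k        : number of aggregation steps

    Each step computes h_v^t = h_v^{t-1} + mean of h_u^{t-1} over
    successors u of v; the source node's final vector is f_local(r).
    """
    h = {v: list(x) for v, x in features.items()}
    for _ in range(k):
        h_prev = {v: list(x) for v, x in h.items()}  # freeze step t-1
        for v, succs in out_edges.items():
            if not succs:
                continue  # leaves keep their representation
            for d in range(len(h[v])):
                h[v][d] = h_prev[v][d] + sum(h_prev[u][d] for u in succs) / len(succs)
    return h[source]
```

With this formulation, information from tweets deeper in the cascades reaches the source node after k steps, which is why the source node's final vector serves as the record's local representation.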
For the node-level features, we extract three types of features: (1) text-based; (2) user-based; and (3) temporal-based, which are listed in Table 1. For a given propagation network G_r of a record r, all the features in Table 1 are extracted to represent each vertex (i.e., tweet) in G_r. Then, we adopt the node-level aggregation approach proposed in (Silva et al. 2020) to propagate the aforementioned node-level features to the source node, as elaborated in Algo. 2. This algorithm returns the final representation of the source node (see Fig. 1) of G_r as the local representation f_local(r) of r.

Finally, the network-based representation is formulated as:

f_network(r) = f_global(r) ⊕ f_local(r)   (1)

where ⊕ denotes concatenation.

Note: We standardise each dimension of f_network(r) before inputting it to the model to stabilise the learning process of our model.

Encoding and Decoding Functions
In our fake news detection classifier, we have six encoding and decoding functions: (f_specific, f_shared, g_specific, g_shared, g_pred, g_recon). In this work, all these functions are modelled as 2-layer feed-forward networks with sigmoid activation. Formally, we can define an encoding/decoding function f that maps an input x ∈ R^{d_input} to an output z ∈ R^{d_output} as:

z = σ(A_2 (σ(A_1 x + b_1)) + b_2)

where A_1 ∈ R^{(d_hidden, d_input)}, A_2 ∈ R^{(d_output, d_hidden)}, b_1 ∈ R^{d_hidden}, and b_2 ∈ R^{d_output} are trainable parameters, and σ denotes the sigmoid activation. We set d_hidden to max(d_input, d_output)/2. For example, assume that f takes inputs of 1024 dimensions and produces outputs of 128 dimensions; then the size of the hidden layer is 512. We leave the optimal neural architecture search for each encoding and decoding function in our model as future work.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Domain Discovery Baseline
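The 2-layer encoder/decoder defined above, z = σ(A_2 σ(A_1 x + b_1) + b_2) with d_hidden = max(d_input, d_output)/2, can be sketched as follows (a minimal forward-pass sketch in plain Python; weights are randomly initialised only to show the shapes, and training against the paper's losses is omitted):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TwoLayerCoder:
    """2-layer feed-forward network with sigmoid activations:
    z = sigmoid(A2 @ sigmoid(A1 @ x + b1) + b2),
    with d_hidden = max(d_input, d_output) // 2."""

    def __init__(self, d_input, d_output, seed=0):
        rng = random.Random(seed)
        d_hidden = max(d_input, d_output) // 2
        self.A1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_input)]
                   for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        self.A2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)]
                   for _ in range(d_output)]
        self.b2 = [0.0] * d_output

    def __call__(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.A1, self.b1)]
        return [sigmoid(sum(w * hi for w, hi in zip(row, hidden)) + b)
                for row, b in zip(self.A2, self.b2)]
```

For instance, a coder from 1024 to 128 dimensions gets a 512-unit hidden layer, matching the example in the text.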
We compare our domain discovery approach with the baseline proposed in (Chen and Freire 2020), which assigns hard domain labels to news records based on the users engaged with each news record. For visualization purposes, we convert these hard domain labels (i.e., one-hot vectors) to domain embeddings, as they preserve pairwise domain similarity between records (Shu, Wang, and Liu 2019). The steps that we followed to generate the domain embeddings using this baseline are as follows:

1. Initially, we construct a network by considering each news record as a node.
2. Each news record r (i.e., node) is represented using the list of the users U_r tweeting the particular news record.
3. The pairwise similarity of two nodes r_1 and r_2 is computed as:

similarity(r_1, r_2) = |U_{r_1} ∩ U_{r_2}| / |U_{r_1} ∪ U_{r_2}|

Then r_1 and r_2 are connected in the graph if similarity(r_1, r_2) > α, where α is set following the original paper (Chen and Freire 2020).
4. The Louvain algorithm is used to identify the communities C = {c_1, c_2, ...} in the constructed graph, which yields a hard cluster (considered as a domain) assignment for each node.
5. Each node r can then be represented as a one-hot vector I_r ∈ R^{|C|}, in which I_{r,i} := 1 if r ∈ c_i and 0 otherwise.
6. Finally, we construct the domain embedding f_domain(r) ∈ R^{|R|} of r by concatenating the cosine similarity scores of I_r with those of the other news records:

f_domain(r) = (I_r · I_{r_1}) ⊕ (I_r · I_{r_2}) ⊕ ... ⊕ (I_r · I_{r_{|R|-1}})

where ⊕ denotes the concatenation operation.

Since this approach considers news records as the nodes of the constructed graph, it is difficult to extend it to learn domain embeddings for new records. In contrast, the approach proposed in this paper constructs its knowledge network using words and users as nodes. Thus, we can generate the domain embeddings for a new record using the words and users related to that record. Also, our approach considers both the text and user information of news records to identify their domain labels.

Fake News Detection Baselines
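Steps 2 and 3 of the baseline's record-graph construction described above can be sketched as follows (a minimal sketch; the function names are illustrative, and since the threshold value is not stated here, `alpha=0.2` is only a placeholder, not the value used by Chen and Freire (2020)):

```python
def jaccard(users_a, users_b):
    """similarity(r1, r2) = |U_r1 intersect U_r2| / |U_r1 union U_r2|."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def build_record_graph(record_users, alpha=0.2):
    """Connect two news records when the Jaccard similarity of their
    engaging-user sets exceeds alpha.

    record_users: dict record id -> set of engaging user ids
    Returns the set of undirected edges (r1, r2).
    """
    records = list(record_users)
    edges = set()
    for idx, r1 in enumerate(records):
        for r2 in records[idx + 1:]:
            if jaccard(record_users[r1], record_users[r2]) > alpha:
                edges.add((r1, r2))
    return edges
```

Running Louvain on the resulting graph (Step 4) then gives the hard domain assignment per record, from which the one-hot vectors of Step 5 follow.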
We compare our fake news detection model with seven widely used baselines and their variants:
• LIWC (Pennebaker et al. 2015) (i.e., Linguistic Inquiry and Word Count) learns feature vectors from the text content of news records by counting the number of lexicons falling into different psycho-linguistic categories. Then, a logistic regression model is used as the classifier to predict fake news using the LIWC feature vectors.
• text-CNN (Kim 2014) uses Convolutional Neural Networks (CNN) to model the text content of news records at different granularity levels with the help of multiple convolutional filters and multiple CNN layers.
• HAN (Yang et al. 2016) adopts a hierarchical attention neural network framework to model the text content of news records, which can assign varying importance to words and sentences when making final predictions via word-level and sentence-level attention.
• EANN (Wang et al. 2018) produces a latent representation for each news record using its different modalities (e.g., text, network) such that the domain-specific knowledge in news records is ignored in the latent space. Subsequently, the latent representation is used to predict the label of the news record. We compare our model with two variants of EANN:
  – EANN-Unimodal only considers the text modality of a news record to generate the latent representation; and
  – EANN-Multimodal considers both the text and network modalities of a news record to produce the latent embedding.
For a fair comparison of the models, we adopt the same text and network representation techniques as in our model to encode the input modalities of EANN.
• HPNF (Shu et al. 2019) extracts various features (e.g., structural features, temporal features) from the propagation network of a news record to generate its feature representation. Then, a logistic regression model is used to classify news records using the extracted propagation network-based features.
In HPNF+LIWC, we concatenate the feature vectors from HPNF and LIWC together to construct the feature representation for news records.
• AE (Silva et al. 2020) adopts an auto-encoder architecture to learn a latent representation for each news record based on its propagation network. Subsequently, the latent representations are used to determine fake news records.
• SAFE (Zhou, Wu, and Zafarani 2020) proposes a multimodal approach for fake news detection. For a given news record, this model learns separate latent representations for each modality. Also, it jointly learns another representation to capture cross-modality knowledge, which is consistent across modalities. Finally, all three representations are concatenated and fed to a classifier to predict the label of the record. The original work of this model considers the text and image modalities of news records. For a fair comparison with our model, here we use the text and network modalities of news records for this baseline, and we adopt the same text and network representation techniques as in our model to encode its input modalities.

https://liwc.wpengine.com/
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://github.com/yoonkim/CNN_sentence
https://github.com/tqtg/hierarchical-attention-networks

Figure 2: F1-scores for the fake news detection task with different hyper-parameters: λ_1, λ_2, and λ_3 (each panel varies one λ while fixing the other two), for PolitiFact, GossipCop, and CoAID.

Figure 3: F1-scores (overall for all three datasets) for the fake news detection task with different hyper-parameters: d, epochs, and batch size.

Parameter Sensitivity
This section evaluates how changes to the hyper-parameters of the model affect its performance on the fake news detection task.

In Figure 2, we analyse the performance of our model for different λ_1, λ_2 and λ_3 values (see Eq. 7 in the paper), which vary the importance assigned to each loss term in our model. Setting a very high or a very low value for λ_1 tends to drop the performance consistently for all three datasets. This means that the L_recon loss term should be included in our model with moderate importance compared to the other loss terms. The performance of the model for the PolitiFact and CoAID domains drops substantially for low λ_2 and high λ_3 values. By setting a low λ_2 or a high λ_3 value, our model assigns more importance to the cross-domain embedding space, which could be dominated by frequently appearing domains (GossipCop in this dataset). Thus, by assigning more importance to the cross-domain embedding space, the model could perform poorly for small domains, e.g., PolitiFact and CoAID in this dataset, as shown in Fig. 2. This observation further signifies the importance of having domain-specific knowledge of news items to identify fake news.

We also examine the sensitivity of the model's performance to the other parameters: the latent dimension (d), the number of epochs, and the batch size. Overall, the model yields consistent performance for sufficiently large d and numbers of epochs, and for sufficiently small batch sizes.

Figure 4: Domain-coverage measure λ (lower λ is better) of the dataset selected using the LSH-based instance selection with different |H| (number of hash functions) values, at a fixed B/|R|.

There is only one hyper-parameter in the proposed LSH-based instance selection approach, which is the number of hash functions (|H|) used for the random projections. As shown in Figure 4, the domain coverage of the proposed approach reduces (the λ measure increases) for high |H| values. This is intuitive because a high |H| value (lengthy hash codes) could map even very close neighbours in the embedding space into different bins; thus, the instances selected from different bins could be close neighbours. In contrast, low |H| values increase the domain coverage. Nevertheless, having a very low |H| value increases the time complexity, as it requires many iterations of the hashing step to meet a given labelling budget.

Figure 5: Convergence properties of the loss function (L_final, L_pred, λ_1 L_recon, λ_2 L_specific, λ_3 L_shared) over training epochs.

In summary, we adopt the following hyper-parameter values for the results reported in the paper: (1) λ_1 = 1; (2) λ_2 = 10; (3) λ_3 = 5; (4) |H| = 10; (5) d = 512; (6) epochs = 300; (7) batch size = 64. We use the Adam optimizer for the optimization. For the parameters of the optimizer (e.g., learning rate, moments), the default parameters in Keras are used. Due to the randomness involved in the training and testing dataset splitting process, we conducted all our experiments using three random state values, and the average performance is reported in the paper.

Convergence Analysis
In Figure 5, we examine the convergence properties of the loss function of our model. Our loss function consists of four terms: the prediction loss (L_pred); the reconstruction loss (L_recon); the domain-specific loss (L_specific); and the cross-domain loss (L_shared). As can be seen in Fig. 5, each loss term converges within around 250 epochs. Since L_shared is trained as a minimax game, the convergence of L_shared in Fig. 5 empirically verifies the convergence of the proposed minimax game to exploit cross-domain knowledge in news records. Moreover, L_recon, L_specific and L_shared are mean-squared-error-based loss terms, whereas L_pred is based on binary cross-entropy. Hence, the typical value range of the non-converged L_pred differs from that of the other loss terms. This also shows the importance of having λ_1, λ_2, and λ_3 to penalise such differences between loss functions.

https://keras.io/api/optimizers/adam/

References
Blondel, V. D.; Guillaume, J.-L.; Lambiotte, R.; and Lefebvre, E. 2008. Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment.
Chen, Z.; and Freire, J. 2020. Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds. In Companion Proceedings of the World Wide Web Conference, 584–592.
Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1746–1751.
Lim, K. H.; Karunasekera, S.; and Harwood, A. 2017. Clustop: A Clustering-based Topic Modelling Algorithm for Twitter using Word Networks. In Proceedings of the IEEE International Conference on Big Data (Big Data), 2009–2018. IEEE.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv e-prints arXiv:1907.11692.
Pennebaker, J. W.; Boyd, R. L.; Jordan, K.; and Blackburn, K. 2015. The Development and Psychometric Properties of LIWC2015. Technical report. URL https://repositories.lib.utexas.edu/handle/2152/31333.
Raghavan, U. N.; Albert, R.; and Kumara, S. 2007. Near Linear Time Algorithm to Detect Community Structures in Large-scale Networks. Physical Review E.
Rosvall, M.; and Bergstrom, C. T. 2008. Maps of Random Walks on Complex Networks Reveal Community Structure. Proceedings of the National Academy of Sciences.
arXiv e-prints arXiv:1903.09196.
Shu, K.; Wang, S.; and Liu, H. 2019. Beyond News Contents: The Role of Social Context for Fake News Detection. In Proceedings of the ACM International Conference on Web Search and Data Mining, 312–320.
Silva, A.; Han, Y.; Luo, L.; Karunasekera, S.; and Leckie, C. 2020. Embedding Partial Propagation Network for Fake News Early Detection. Proceedings of the International Workshop on Mining Actionable Insights from Social Networks (MAISoN 2020), co-located with CIKM 2020.
Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; and Gao, J. 2018. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 849–857.
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.
Zhou, X.; Wu, J.; and Zafarani, R. 2020. SAFE: Similarity-Aware Multi-modal Fake News Detection. In