Attributed Graph Alignment
Ning Zhang, University of British Columbia, Vancouver, BC
Weina Wang, Carnegie Mellon University, Pittsburgh, PA
Lele Wang, University of British Columbia, Vancouver, BC
Abstract: Motivated by various data science applications including de-anonymizing user identities in social networks, we consider the graph alignment problem, where the goal is to identify the vertex/user correspondence between two correlated graphs. Existing work mostly recovers the correspondence by exploiting the user-user connections. However, in many real-world applications, additional information about the users, such as user profiles, might be publicly available. In this paper, we introduce the attributed graph alignment problem, where additional user information, referred to as attributes, is incorporated to assist graph alignment. We establish sufficient and necessary conditions for recovering vertex correspondence exactly, where the conditions match for a wide range of practical regimes. Our results recover existing tight information-theoretic limits for models where only the user-user connections are available, spanning the full spectrum between these models and models where only attribute information is available.
I. INTRODUCTION
The graph alignment problem, also known as the graph matching problem or the noisy graph isomorphism problem, has received growing attention in recent years, brought into prominence by applications in a wide range of areas [1, 2, 3]. For instance, in social network de-anonymization [4, 5], one is given two graphs, each of which represents the user relationship in a social network (e.g., Twitter, Facebook, Flickr, etc.). One graph is anonymized and the other graph has user identities as public information. Then the graph alignment problem, whose goal is to find the best correspondence of two graphs with respect to a certain criterion, can be used to de-anonymize users in the anonymous graph by finding the correspondence between them and the users with public identities in the other graph.

The graph alignment problem has been studied under various random graph models, among which the most popular one is the
Erdős–Rényi graph pair model (see, e.g., [6, 7]). In particular, two Erdős–Rényi graphs on the same vertex set, $G_1$ and $G_2$, are generated in a way such that their edges are correlated. Then $G_1$ and an anonymous version of $G_2$, denoted as $G_2'$, are made public, where $G_2'$ is modeled as a vertex-permuted $G_2$ with an unknown permutation. Under this model, typically the goal is to achieve the so-called exact alignment, i.e., recovering the unknown permutation and thus revealing the correspondence for all vertices exactly.

A fundamental question in the graph alignment problem is: when is exact alignment possible? More specifically, what are the conditions on graph statistics for achieving exact alignment when given unbounded computational resources?
Such conditions, usually referred to as information-theoretic limits, have been established for the Erdős–Rényi graph pair in a line of work [6, 7, 8]. In the most recent study [8], Cullina and Kiyavash established matching sufficient and necessary conditions of exact alignment for a large range of parameters.

In many real-world applications, additional information about the anonymized vertices might be available. For example, Facebook has user profiles on their website about each user's age, birthplace, hobbies, etc. Such associated information is referred to as attributes (or features), which, unlike user identities, are often publicly available. A natural question to ask is:
Can the attribute information help recover the vertex correspondence? If so, can we quantify the amount of benefit brought by the attribute information?
In the example of aligning Netflix and IMDb databases, where each user's ratings on movies are considered as attributes of that user, Narayanan and Shmatikov [9] successfully recovered some of the user identities in the anonymized Netflix dataset by only comparing attributes of users, without knowing the relationship network among users. In this paper, we incorporate attribute information to generalize the graph alignment problem. We call this problem the attributed graph alignment problem.

To study attributed graph alignment, we extend the current Erdős–Rényi graph pair by adding a set of vertices publicly labeled by attributes. We refer to this set of vertices as attribute vertices and assume they are aligned between the graph pair. For distinction, we refer to the original set of vertices in the Erdős–Rényi graph pair as user vertices. Edges between user vertices and attribute vertices represent their relationship, and there are no edges between attribute vertices. Similar to the Erdős–Rényi graph pair, user-attribute edges in $G_1$ and $G_2$ are also correlated. Then a random permutation is applied on the user vertices of $G_2$ to create the anonymized graph $G_2'$. This new model is referred to as the attributed Erdős–Rényi graph pair. The goal of attributed graph alignment is to recover the unknown permutation from $G_1$ and $G_2'$.

In this paper, we focus on characterizing the information-theoretic limits for graph alignment under the attributed Erdős–Rényi graph pair. We establish sufficient and necessary conditions for achieving exact alignment, where the conditions match for regimes that are typical and interesting in practice. These achievability and converse results allow us to better understand how the attribute information can be integrated with the structural information of the user relationship network, and then to quantify the benefit brought by the attribute information.
Our results span the full spectrum between the traditional Erdős–Rényi pair model where only the user relationship network is available and models where only attribute information is available [10, 11], unifying existing results in each of these settings.

Figure 1: Example of attributed Erdős–Rényi graph pair: Graphs $G_1$ and $G_2$ are generated on the same set of vertices. Anonymized graph $G_2'$ is obtained through applying $\Pi^* = (1)(2, 3)$ only on $\mathcal{V}_u$ of $G_2$ (permutation $\Pi^*$ is written in cycle notation). Legend: user vertex, attribute vertex, user-user edge, user-attribute edge.

We comment that the proposed attributed alignment problem can also be viewed as a graph alignment problem with part of the vertices correctly pre-aligned, known as the seeded graph alignment problem. Efficient algorithms and feasible regions for seeded graph alignment have been studied in [5, 12]. However, the model assumptions in these existing works are typically not directly comparable with our model assumptions.

II. MODEL
In this section, we describe the attributed Erdős–Rényi graph pair model. We propose and formally define the attributed graph alignment problem under this model. An illustration of the model is given in Figure 1.
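The generative model described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names `sample_pair_value` and `sample_attributed_pair`, the dictionary encoding of the joint distributions, the attribute labels `a1, a2, ...`, and the choice to treat user-user pairs as unordered are all hypothetical conventions introduced here.

```python
import random
from itertools import combinations


def sample_pair_value(dist, rng):
    """Draw (G1(e), G2(e)) for one vertex pair e.

    `dist` maps each outcome (1,1), (1,0), (0,1), (0,0) to its probability,
    i.e. the entries p11, p10, p01, p00 (or q11, q10, q01, q00).
    """
    outcomes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_attributed_pair(n, m, p, q, seed=None):
    """Sample G1, G2, the anonymized G2', and the hidden permutation.

    Assumption: user vertices are 1..n, attribute vertices are the strings
    'a1'..'am', and each graph is stored as a set of vertex pairs.
    """
    rng = random.Random(seed)
    users = list(range(1, n + 1))
    attrs = [f"a{k}" for k in range(1, m + 1)]

    # User-user pairs use distribution p; user-attribute pairs use q.
    pairs = [(e, p) for e in combinations(users, 2)]
    pairs += [((i, a), q) for i in users for a in attrs]

    g1, g2 = set(), set()
    for e, dist in pairs:  # vertex pairs are sampled independently
        x1, x2 = sample_pair_value(dist, rng)
        if x1:
            g1.add(e)
        if x2:
            g2.add(e)

    # Uniformly random permutation applied to user vertices only.
    image = users[:]
    rng.shuffle(image)
    pi_star = dict(zip(users, image))

    g2_anon = set()
    for u, v in g2:
        if v in pi_star:  # user-user edge: relabel both endpoints
            a, b = pi_star[u], pi_star[v]
            g2_anon.add((min(a, b), max(a, b)))
        else:  # user-attribute edge: attribute label stays aligned
            g2_anon.add((pi_star[u], v))
    return g1, g2, g2_anon, pi_star
```

A call such as `sample_attributed_pair(100, 20, p, q, seed=1)` returns the two correlated graphs, the observable anonymized graph, and the ground-truth permutation, which makes it convenient for small simulation experiments.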
User vertices and attribute vertices.
We first generate two graphs, $G_1$ and $G_2$, on the same vertex set $\mathcal{V}$. The vertex set $\mathcal{V}$ consists of two disjoint sets of vertices, the user vertex set $\mathcal{V}_u$ and the attribute vertex set $\mathcal{V}_a$, i.e., $\mathcal{V} = \mathcal{V}_u \cup \mathcal{V}_a$. Assume that the user vertex set $\mathcal{V}_u$ consists of $n$ vertices, labeled as $[n] \triangleq \{1, 2, 3, \ldots, n\}$. Assume that the attribute vertex set $\mathcal{V}_a$ consists of $m$ vertices, and $m$ scales as a function of $n$.

Correlated edges.
To describe the probabilistic model for edges in $G_1$ and $G_2$, we first consider the set of user-user vertex pairs $\mathcal{E}_u \triangleq \mathcal{V}_u \times \mathcal{V}_u$ and the set of user-attribute vertex pairs $\mathcal{E}_a \triangleq \mathcal{V}_u \times \mathcal{V}_a$. Then for each vertex pair $e \in \mathcal{E}_u \cup \mathcal{E}_a$, we write $G_1(e) = 1$ (resp. $G_2(e) = 1$) if there is an edge connecting the two vertices in the pair in $G_1$ (resp. $G_2$), and write $G_1(e) = 0$ (resp. $G_2(e) = 0$) otherwise. Since we often consider the same vertex pair in both $G_1$ and $G_2$, we write $(G_1, G_2)(e)$ as a shortened form of $(G_1(e), G_2(e))$.

The edges of $G_1$ and $G_2$ are then generated in a correlated way as follows. For each user-user vertex pair $e \in \mathcal{E}_u$, $(G_1, G_2)(e)$ follows the joint distribution specified by

$$(G_1, G_2)(e) = \begin{cases} (1, 1) & \text{w.p. } p_{11}, \\ (1, 0) & \text{w.p. } p_{10}, \\ (0, 1) & \text{w.p. } p_{01}, \\ (0, 0) & \text{w.p. } p_{00}, \end{cases} \qquad (1)$$

where $p_{11}, p_{10}, p_{01}, p_{00}$ are probabilities that sum up to 1. For each user-attribute vertex pair $e \in \mathcal{E}_a$, $(G_1, G_2)(e)$ follows the joint probability distribution specified by

$$(G_1, G_2)(e) = \begin{cases} (1, 1) & \text{w.p. } q_{11}, \\ (1, 0) & \text{w.p. } q_{10}, \\ (0, 1) & \text{w.p. } q_{01}, \\ (0, 0) & \text{w.p. } q_{00}, \end{cases} \qquad (2)$$

where $q_{11}, q_{10}, q_{01}, q_{00}$ are probabilities that sum up to 1. The correlation between $G_1(e)$ and $G_2(e)$ is measured by the correlation coefficient defined as

$$\rho \triangleq \frac{\mathrm{Cov}(G_1(e), G_2(e))}{\sqrt{\mathrm{Var}[G_1(e)]}\,\sqrt{\mathrm{Var}[G_2(e)]}},$$

where $\mathrm{Cov}(G_1(e), G_2(e))$ is the covariance between $G_1(e)$ and $G_2(e)$, and $\mathrm{Var}[G_1(e)]$ and $\mathrm{Var}[G_2(e)]$ are the variances. Across different vertex pairs $e$, the $(G_1, G_2)(e)$'s are independent. Finally, recall that there are no edges between attribute vertices in our model.

For compactness of notation, we represent the joint distributions in (1) and (2) in the following matrix form:

$$\boldsymbol{p} = \begin{pmatrix} p_{11} & p_{10} \\ p_{01} & p_{00} \end{pmatrix} \quad \text{and} \quad \boldsymbol{q} = \begin{pmatrix} q_{11} & q_{10} \\ q_{01} & q_{00} \end{pmatrix}.$$

We refer to the graph pair $(G_1, G_2)$ as an attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}, m, \boldsymbol{q})$.

Relation to subsampling model.
Another random graph model that is commonly used in the graph alignment literature is the subsampling model [6]. An attributed version of the subsampling model can be described as follows. We first generate a base graph, $G$, on the vertex set $\mathcal{V} = \mathcal{V}_u \cup \mathcal{V}_a$, by generating its edges independently in a "stochastic-block" fashion. In particular, an edge between two user vertices is generated with probability $p$ and an edge between a user vertex and an attribute vertex is generated with probability $q$. Again, there are no edges between attribute vertices. We then subsample the edges of $G$ twice, independently, to generate two subgraphs $G_1$ and $G_2$. Each user-user edge in $G$ is kept with probability $s_1$ in $G_1$ (resp. $s_2$ in $G_2$), and each user-attribute edge in $G$ is kept with probability $s_1'$ in $G_1$ (resp. $s_2'$ in $G_2$). The subsampling procedures for different edges are independent.

This attributed subsampling model can be viewed as an attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}, m, \boldsymbol{q})$ with $\boldsymbol{p}$ and $\boldsymbol{q}$ given below:

$$\begin{pmatrix} p_{11} & p_{10} \\ p_{01} & p_{00} \end{pmatrix} = \begin{pmatrix} p s_1 s_2 & p s_1 (1 - s_2) \\ p (1 - s_1) s_2 & p (1 - s_1)(1 - s_2) + 1 - p \end{pmatrix},$$

$$\begin{pmatrix} q_{11} & q_{10} \\ q_{01} & q_{00} \end{pmatrix} = \begin{pmatrix} q s_1' s_2' & q s_1' (1 - s_2') \\ q (1 - s_1') s_2' & q (1 - s_1')(1 - s_2') + 1 - q \end{pmatrix}.$$

Anonymization and exact alignment.
In the attributed graph alignment problem, we are given $G_1$ and an anonymized version of $G_2$, denoted as $G_2'$. The anonymized graph $G_2'$ is generated by applying a random permutation $\Pi^*$ on the user vertex set of $G_2$, where the permutation $\Pi^*$ is unknown. More explicitly, each user vertex $i$ in $G_2$ is re-labeled as $\Pi^*(i)$ in $G_2'$. The permutation $\Pi^*$ is chosen uniformly at random from $\mathcal{S}_n$, where $\mathcal{S}_n$ is the set of all permutations on $[n]$. Since $G_1$ and $G_2'$ are observable, we refer to $(G_1, G_2')$ as the observable pair generated from the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}, m, \boldsymbol{q})$.

Then the graph alignment problem, i.e., the problem of recovering the identities/original labels of user vertices in the anonymized graph $G_2'$, can be formulated as a problem of estimating the underlying permutation $\Pi^*$. The goal of graph alignment is to design an estimator $\hat{\pi}(G_1, G_2')$ as a function of $G_1$ and $G_2'$ to best estimate $\Pi^*$. We say $\hat{\pi}(G_1, G_2')$ achieves exact alignment if $\hat{\pi}(G_1, G_2') = \Pi^*$. The probability of error for exact alignment is defined as $P(\hat{\pi}(G_1, G_2') \neq \Pi^*)$. We say exact alignment is achievable with high probability (w.h.p.) if there exists $\hat{\pi}$ such that $\lim_{n \to \infty} P(\hat{\pi}(G_1, G_2') \neq \Pi^*) = 0$.

Reminder of the Landau notation.
Notation: Definition
$f(n) = \omega(g(n))$: $\lim_{n \to \infty} \frac{|f(n)|}{g(n)} = \infty$
$f(n) = o(g(n))$: $\lim_{n \to \infty} \frac{|f(n)|}{g(n)} = 0$
$f(n) = O(g(n))$: $\limsup_{n \to \infty} \frac{|f(n)|}{g(n)} < \infty$
$f(n) = \Omega(g(n))$: $\liminf_{n \to \infty} \frac{|f(n)|}{g(n)} > 0$
$f(n) = \Theta(g(n))$: $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$

III. MAIN RESULT
Now we are ready to state our results. To better illustrate the benefit of attribute information in graph alignment, we present in Theorem 1 a simplified version of our achievability result by adding mild assumptions on user-user edges motivated by practical applications. This simplified result also makes it easier to compare to the converse result in Theorem 2, which will be illustrated in Figure 2. Note that these additional assumptions are not needed for technical proofs. The more general achievability results without these assumptions are presented in Lemmas 2 and 3 in Section IV.

In a typical social network, the degree of a vertex is much smaller than the total number of users. Based on this observation, we assume that the marginal probabilities of an edge in both $G_1$ and $G_2$ are not going to 1, i.e.,

$$1 - (p_{11} + p_{10}) = \Theta(1), \quad 1 - (p_{11} + p_{01}) = \Theta(1). \qquad (3)$$

Moreover, two social networks on the same set of users are typically highly correlated. Based on this, we assume that the correlation coefficient of user-user edges is not vanishing, i.e.,

$$\rho_u = \Theta(1). \qquad (4)$$

Theorem 1 (Achievability). Consider the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$ under conditions (3) and (4). If

$$n p_{11} + m \psi_a - \log n \to \infty, \qquad (5)$$

where $\psi_a = (\sqrt{q_{11} q_{00}} - \sqrt{q_{10} q_{01}})^2$, then there exists an algorithm that achieves exact alignment w.h.p.

[Figure 2, two panels with axes $p_{11}$ and $q_{11}$ and axis marks $\frac{\log n}{n}$ and $\frac{\log n}{m}$: (a) $m = \Omega((\log n)^3)$; (b) $m = o((\log n)^3)$.]

Figure 2: Feasible region (1, 2 and 3) and infeasible region (4). Those regions are specified by 1: $p_{11} \geq \frac{\log n + \omega(1)}{n}$; 2: $p_{11} + \frac{m}{n} q_{11} \geq \frac{\log n + \omega(1)}{n}$ and $q_{11} = O(m^{-2/3})$; 3: $p_{11} + \frac{m}{n} \psi_a \geq \frac{\log n + \omega(1)}{n}$ and $q_{11} = \omega(m^{-2/3})$; 4: $p_{11} + \frac{m}{n} q_{11} \leq \frac{\log n - \omega(1)}{n}$. In (b), the gap between 3 and 4 is $\Theta(q_{11}^{3/2})$, which comes from the difference between $\psi_a$ and $q_{11}$.

Theorem 2 (Converse). Consider the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}, m, \boldsymbol{q})$. If

$$n p_{11} + m q_{11} - \log n \to -\infty, \qquad (6)$$

then no algorithm guarantees exact alignment w.h.p.

Now to better compare the achievability and the converse, we further assume

$$1 - (q_{11} + q_{10}) = \Theta(1), \quad 1 - (q_{11} + q_{01}) = \Theta(1), \quad \rho_a = \Theta(1), \qquad (7)$$

where $\rho_a$ is the correlation coefficient of user-attribute edges.

Corollary 1.
Consider the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$ under conditions (3), (4), and (7). If $m = \Omega((\log n)^3)$ and

$$n p_{11} + m q_{11} - \log n \to \infty, \qquad (8)$$

then there exists an algorithm that achieves exact alignment w.h.p. If $m = o((\log n)^3)$ and

$$n p_{11} + m q_{11} - O(m q_{11}^{3/2}) - \log n \to \infty, \qquad (9)$$

then there exists an algorithm that achieves exact alignment w.h.p.

Figure 2 illustrates the achievability conditions (8) and (9). Figure 2a shows the tightness of condition (8) when $m = \Omega((\log n)^3)$. Figure 2b demonstrates the difference between the achievability condition (9) and the converse condition (6) when $m = o((\log n)^3)$.

Note that in the traditional Erdős–Rényi pair without attributes, the tight achievability condition for exact alignment is $n p_{11} - \log n \to \infty$. Now in our attributed Erdős–Rényi pair, the additional attribute information allows us to relax the achievability condition to (8) or (9). We illustrate how this expands the achievability region in Figure 2.

When setting $q_{00} = 1$, i.e., specializing the attributed Erdős–Rényi pair to the traditional Erdős–Rényi pair, our achievability result in Theorem 1 (and its more general version in Lemmas 2 and 3) and converse result in Theorem 2 recover the state-of-the-art information-theoretic limits in [6, 7, 8]. When setting $p_{00} = 1$, i.e., removing the user relationship network, our results provide information-theoretic limits for the graph alignment problem on bipartite random graphs.

IV. GENERAL ACHIEVABILITY
In this section, we present the general achievability results. We obtain exact alignment by applying the optimal maximum a posteriori probability (MAP) estimator. Lemma 1 states that the MAP estimator simplifies to a minimum weighted distance estimator. Due to space limitations, we present the achievability in Lemmas 2 and 3 without proofs. Key proof techniques include tools from enumerative combinatorics, such as generating functions, which are inspired by [7, 8].
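To make the minimum weighted distance form of the MAP estimator concrete, the following Python sketch searches over all permutations by brute force. It is an illustration only, feasible for very small $n$, and not an algorithm proposed in the paper. The function names `log_odds_weight` and `map_estimate`, the edge-set encoding, and the treatment of user-user pairs as unordered are assumptions made here; the sketch also assumes all four entries of each joint distribution are strictly positive so that the logarithmic weights are finite.

```python
import math
from itertools import combinations, permutations


def log_odds_weight(d):
    """Weight log(d11 * d00 / (d10 * d01)) for a 2x2 joint distribution.

    Assumption: every entry of `d` is strictly positive.
    """
    return math.log(d[(1, 1)] * d[(0, 0)] / (d[(1, 0)] * d[(0, 1)]))


def map_estimate(g1, g2_anon, n, attrs, p, q):
    """Exhaustively minimize w1 * Delta_u + w2 * Delta_a over S_n.

    g1, g2_anon: sets of vertex pairs; user-user pairs are stored as (i, j)
    with i < j, user-attribute pairs as (i, a). Runs in O(n!) time.
    """
    users = list(range(1, n + 1))
    w1, w2 = log_odds_weight(p), log_odds_weight(q)
    best_cost, best_pi = math.inf, None
    for image in permutations(users):
        pi = dict(zip(users, image))
        # Delta_u: user-user pairs on which G1 and the relabeled G2' disagree.
        delta_u = sum(
            ((i, j) in g1)
            != ((min(pi[i], pi[j]), max(pi[i], pi[j])) in g2_anon)
            for i, j in combinations(users, 2)
        )
        # Delta_a: user-attribute pairs on which the two graphs disagree.
        delta_a = sum(
            ((i, a) in g1) != ((pi[i], a) in g2_anon)
            for i in users
            for a in attrs
        )
        cost = w1 * delta_u + w2 * delta_a
        if cost < best_cost:
            best_cost, best_pi = cost, pi
    return best_pi
```

Ties are broken in favor of the first permutation enumerated, which is an arbitrary choice of this sketch. The factorial running time is consistent with the information-theoretic viewpoint of the section, where computational resources are unbounded.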
Lemma 1 (MAP estimator). Let $(G_1, G_2')$ be an observable pair generated from the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$. The MAP estimator of the permutation $\Pi^*$ based on $(G_1, G_2')$ simplifies to

$$\hat{\pi}_{\mathrm{MAP}}(G_1, G_2') = \operatorname*{argmin}_{\pi \in \mathcal{S}_n} \left\{ w_1 \Delta_u(G_1, \pi^{-1}(G_2')) + w_2 \Delta_a(G_1, \pi^{-1}(G_2')) \right\},$$

where $w_1 = \log\left(\frac{p_{11} p_{00}}{p_{10} p_{01}}\right)$, $w_2 = \log\left(\frac{q_{11} q_{00}}{q_{10} q_{01}}\right)$, and

$$\Delta_u(G_1, \pi^{-1}(G_2')) = \sum_{e \in \mathcal{E}_u} \mathbb{1}\{G_1(e) \neq G_2'(\pi(e))\},$$

$$\Delta_a(G_1, \pi^{-1}(G_2')) = \sum_{e \in \mathcal{E}_a} \mathbb{1}\{G_1(e) \neq G_2'(\pi(e))\}.$$

Lemma 2 (General achievability). Consider the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$. If

$$\frac{n \psi_u}{2} + m \psi_a - \log n = \omega(1), \qquad (10)$$

where $\psi_u = (\sqrt{p_{11} p_{00}} - \sqrt{p_{10} p_{01}})^2$ and $\psi_a = (\sqrt{q_{11} q_{00}} - \sqrt{q_{10} q_{01}})^2$, then the MAP estimator achieves exact alignment w.h.p.

Lemma 3 (Achievability in sparse region). Consider the attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$. If

$$p_{11} = O\left(\frac{\log n}{n}\right), \quad p_{10} + p_{01} = O\left(\frac{1}{\log n}\right), \quad \frac{p_{10} p_{01}}{p_{11} p_{00}} = O\left(\frac{1}{(\log n)^3}\right),$$

$$n p_{11} + m \psi_a - \log n = \omega(1),$$

then the MAP estimator achieves exact alignment w.h.p.

V. PROOF OF CONVERSE
In this section, we give a detailed proof for Theorem 2. Let $(G_1, G_2)$ be an attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$. In this proof, we will focus on the intersection graph of $G_1$ and $G_2$, denoted as $G_1 \wedge G_2$, which is the graph on the vertex set $\mathcal{V} = \mathcal{V}_u \cup \mathcal{V}_a$ whose edge set is the intersection of the edge sets of $G_1$ and $G_2$. We say a permutation $\pi$ on the vertex set $\mathcal{V}$ is an automorphism of $G_1 \wedge G_2$ if a vertex pair $(i, j)$ is in the edge set of $G_1 \wedge G_2$ if and only if $(\pi(i), \pi(j))$ is in the edge set of $G_1 \wedge G_2$, i.e., if $\pi$ is edge-preserving. Note that the identity permutation is always an automorphism. Let $\mathrm{Aut}(G_1 \wedge G_2)$ denote the set of automorphisms of $G_1 \wedge G_2$. By Lemma 4 below, exact alignment cannot be achieved w.h.p. if $\mathrm{Aut}(G_1 \wedge G_2)$ contains permutations other than the identity permutation. This allows us to establish conditions for not achieving exact alignment w.h.p. by analyzing automorphisms of $G_1 \wedge G_2$.

Lemma 4.
Let $(G_1, G_2)$ be an attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$. Given $|\mathrm{Aut}(G_1 \wedge G_2)|$, the probability that the MAP estimator succeeds is at most $\frac{1}{|\mathrm{Aut}(G_1 \wedge G_2)|}$.

In the proof of Theorem 2, we will further focus on the automorphisms given by swapping two user vertices. To this end, we first define the following equivalence relation between a pair of user vertices. We say two user vertices $i$ and $j$ ($i \neq j$) are indistinguishable in $G_1 \wedge G_2$, denoted as $i \equiv j$, if $(G_1 \wedge G_2)((i, v)) = (G_1 \wedge G_2)((j, v))$ for all $v \in \mathcal{V}$. It is not hard to see that swapping two indistinguishable vertices is an automorphism of $G_1 \wedge G_2$, and thus $|\mathrm{Aut}(G_1 \wedge G_2) \setminus \{\text{identity permutation}\}| \geq |\{\text{indistinguishable vertex pairs}\}|$. Therefore, in the proof below, we show that the number of such indistinguishable vertex pairs is positive with a large probability, which suffices for proving Theorem 2.

Proof of Theorem 2.
Let $G_1$ and $G_2$ be an attributed Erdős–Rényi pair $\mathcal{G}(n, \boldsymbol{p}; m, \boldsymbol{q})$ and let $G = G_1 \wedge G_2$. Let $X$ denote the number of indistinguishable user vertex pairs in $G$, i.e.,

$$X = \sum_{i < j} \mathbb{1}\{i \equiv j\}.$$

[...]

$$-\log\left(\frac{2}{n(n-1) P(i \equiv j)}\right) = 2 \log n + (n - 2) \log(1 - 2 p_{11} + 2 p_{11}^2) + m \log(1 - 2 q_{11} + 2 q_{11}^2) + O(1)$$
$$\geq 2 \log n - 2 n p_{11} - 2 m q_{11} + O(1) \qquad (16)$$
$$= \omega(1). \qquad (17)$$

Here (16) follows from the inequality $\log(1 - 2x + 2x^2) \geq -2x$ for any $x \in [0, 1]$, which can be verified by showing that the function $f(x) = \log(1 - 2x + 2x^2) + 2x$ is monotone increasing in $[0, 1]$ and thus $f(x) \geq f(0) = 0$. Equation (17) follows from the condition (6) in Theorem 2. Therefore, the first term in (14), $\frac{2}{n(n-1) P(i \equiv j)}$, converges to 0 as $n \to \infty$.

Next, for the second term $\frac{4(n-2)}{n(n-1)} \frac{P(i \equiv j \equiv k)}{P(i \equiv j)^2}$ in (14), we have

$$-\log\left(\frac{4(n-2)}{n(n-1)} \frac{P(i \equiv j \equiv k)}{P(i \equiv j)^2}\right) = \log n - (n - 2) \log\left(\frac{1 - 3 p_{11} + 3 p_{11}^2}{(1 - 2 p_{11} + 2 p_{11}^2)^2}\right) - m \log\left(\frac{1 - 3 q_{11} + 3 q_{11}^2}{(1 - 2 q_{11} + 2 q_{11}^2)^2}\right) + O(1)$$
$$\geq \log n - n p_{11} - m q_{11} + O(1) \qquad (18)$$
$$= \omega(1). \qquad (19)$$

Here (18) follows from the inequality $\log\left(\frac{1 - 3x + 3x^2}{(1 - 2x + 2x^2)^2}\right) \leq x$ for any $x \in [0, 1]$, which can be verified by showing that the function $f(x) = \log\left(\frac{1 - 3x + 3x^2}{(1 - 2x + 2x^2)^2}\right) - x$ is monotone decreasing in $[0, 1]$ and thus $f(x) \leq f(0) = 0$. Equation (19) follows from the condition (6) in Theorem 2. Hence, the second term in (14) also converges to 0 as $n \to \infty$, which completes the proof for $P(X = 0) \to 0$ as $n \to \infty$.

Now we derive an upper bound on the probability of exact alignment under the MAP estimator, which is also an upper bound for any estimator since MAP minimizes the probability of error. Note that by Lemma 4, $P(\pi_{\mathrm{MAP}} = \Pi^* \mid X = x) \leq \frac{1}{x + 1}$, which is at most $1/2$ when $x \geq 1$. Therefore,

$$P(\pi_{\mathrm{MAP}} = \Pi^*) = P(\pi_{\mathrm{MAP}} = \Pi^* \mid X = 0) P(X = 0) + P(\pi_{\mathrm{MAP}} = \Pi^* \mid X \geq 1) P(X \geq 1)$$
$$\leq P(X = 0) + \frac{1}{2} P(X \geq 1) = \frac{1}{2} + \frac{1}{2} P(X = 0),$$

which goes to $1/2$ as $n \to \infty$ and thus is bounded away from 1. This completes the proof that no algorithm can guarantee exact alignment w.h.p.

REFERENCES
[1] R. Singh, J. Xu, and B.
Berger, “Global alignment of multiple protein interaction networks with application to functional orthology detection,” Proceedings of the National Academy of Sciences, vol. 105, no. 35, pp. 12763–12768, 2008.
[2] M. Cho and K. M. Lee, “Progressive graph matching: Making a move of graphs via probabilistic voting,” in Proc. IEEE Comput. Vision and Pattern Recognit., 2012, pp. 398–405.
[3] A. D. Haghighi, A. Y. Ng, and C. D. Manning, “Robust textual inference via graph matching,” in Human Lang. Technol. and Empirical Methods in Natural Lang. Process., 2005.
[4] A. Narayanan and V. Shmatikov, “De-anonymizing social networks,” in Proc. IEEE Symp. Security and Privacy, 2009, pp. 173–187.
[5] N. Korula and S. Lattanzi, “An efficient reconciliation algorithm for social networks,” Proc. VLDB Endow., vol. 7, no. 5, pp. 377–388, Jan. 2014.
[6] P. Pedarsani and M. Grossglauser, “On the privacy of anonymized networks,” in Proc. Ann. ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD), 2011, pp. 1235–1243.
[7] D. Cullina and N. Kiyavash, “Improved achievability and converse bounds for Erdős-Rényi graph matching,” ACM SIGMETRICS Perform. Evaluation Rev., vol. 44, no. 1, pp. 63–72, 2016.
[8] D. Cullina and N. Kiyavash, “Exact alignment recovery for correlated Erdős-Rényi graphs,” arXiv:1711.06783 [cs.IT], 2017.
[9] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets,” in Proc. IEEE Symp. Security and Privacy, 2008, pp. 111–125.
[10] D. Cullina, P. Mittal, and N. Kiyavash, “Fundamental limits of database alignment,” in Proc. IEEE Int. Symp. Information Theory, 2018, pp. 651–655.
[11] F. Shirani, S. Garg, and E. Erkip, “A concentration of measure approach to database de-anonymization,” in Proc. IEEE Int. Symp. Information Theory, 2019, pp. 2748–2752.
[12] E. Mossel and J. Xu, “Seeded graph matching via large neighborhood statistics,”