LinkLouvain: Link-Aware A/B Testing and Its Application on Online Marketing Campaign
Tianchi Cai, Daxi Cheng, Chen Liang, Ziqi Liu, Lihong Gu, Huizhi Xie, Zhiqiang Zhang, Xiaodong Zeng, Jinjie Gu
Ant Financial Services Group, China
{tianchi.ctc,daxi.cdx}@antfin.com, [email protected]

Abstract.
Many online marketing campaigns aim to promote user interaction. The average treatment effect (ATE) of campaign strategies needs to be monitored throughout the campaign. A/B testing is usually conducted for this purpose, but user interaction can introduce interference into standard A/B testing. With the help of link prediction, we design a network A/B testing method, LinkLouvain, that minimizes graph interference and gives an accurate and sound estimate of the campaign's ATE. In this paper, we analyze the network A/B testing problem in a real-world online marketing campaign, describe our proposed LinkLouvain method, and evaluate it on real-world data. Our method outperforms the compared methods and is deployed in the online marketing campaign.
Keywords: Graph neural networks · Graph partitioning · Graph clustering · Network A/B testing.
Recently, Alipay launched an online marketing campaign that encourages users to invite others to join the campaign, so that they all receive discounts or cash rewards. Such user interaction-promoting services (IPS) are commonly used to increase user engagement. Various strategies are developed for this campaign, and designing an A/B testing solution to quantify their average treatment effects is crucial. However, standard A/B testing solutions are unsuitable for IPS because edges (user invitations) exist between different test groups and bias the ATE estimate; A/B testing that addresses such interference is called network A/B testing. Based on a thorough analysis of real-world graphs, we develop a graph clustering method, LinkLouvain, for network A/B testing and deploy it in the online marketing campaign. LinkLouvain has the following strengths:

1. Scalability. It runs on graphs of billions of nodes and tens of billions of edges in 10 hours.
2. Simplicity. It is a static method that runs only once before the online marketing campaign. There is no need for additional streaming support.
3. Effectiveness. It reduces network interference and the heterogeneity of test groups throughout the campaign lifecycle (7 days). We develop two metrics, estimator bias and estimator variance, to measure network interference and heterogeneity, respectively. Results show that LinkLouvain outperforms the other methods.

Fig. 1: Visualization of our online marketing campaign. Coupons are handed out to users (colored red in the left figure), users can invite their friends to join the marketing campaign, and they all receive a cash coupon (colored red in the right figure).

* Tianchi Cai, Daxi Cheng, and Chen Liang contributed equally.
For consumer-facing online products, encouraging user interaction is a common practice to increase user engagement. Examples include ‘People You May Know’ on Facebook, ‘Connections You May Know’ on LinkedIn, and online marketing campaigns on Alipay where coupons can be shared with others. Such services, referred to as interaction-promoting services (IPS), are designed to encourage user interaction and thereby benefit user engagement with the product.

All users and their interactions on the Alipay platform form a real-world social network with billions of nodes and tens of billions of edges. Users (nodes) and user invitations (edges) in our online marketing campaign form a subgraph of this full social network. The numbers of engaged nodes and edges increase throughout the campaign, and this time-evolving graph is always a subgraph of the social network.

In our paper, we analyze the growth properties of the network and the interference patterns from the campaign to build a sound understanding of the following network A/B testing problem.
For IPS, users who do not receive new services may still be affected through interactions with those who do. This introduces interference into user-level A/B testing; thus, a direct estimate of the ATE is no longer unbiased. Network A/B testing solutions are therefore of great interest.

Fig. 2: Visualization of estimation bias in different A/B testing schemes. Left: In user-level randomization, users are randomly assigned to the treatment group (colored yellow) and the control group (colored cyan). However, the online marketing campaign in the treatment group may affect users in the control group. In this case, both the treatment group and the control group have the same number of invited users, and the A/B test misleadingly concludes that the treatment does not make any difference. Right: Network A/B testing clusters users with interference together, and the cluster-level metrics show that the treatment group has more invited users.

There are two main approaches to obtaining an unbiased estimate of the ATE under network interference. The first is afterward correction. For example, [4] assumes the interference is linear-additive, estimates the exposure probability, and weights the estimate accordingly. The performance of this type of approach relies on choosing the right form of interference. We analyze the interference in a real social network, and in our case the linear-additive model is over-simplified and no one-size-fits-all solution exists.

The other approach is to perform randomization at the cluster level. That is, clusters of users, instead of users themselves, are used as randomization units. This approach assumes no or low interference between clusters. Our method falls into this category.
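The scenario in Fig. 2 can be reproduced with a toy simulation. The sketch below is illustrative only: the four-pair friendship graph, the response model (a user converts if they or any friend is treated), and the names `outcome` and `diff_in_means` are assumptions made for this example, not part of the campaign data or the deployed system.

```python
# Toy friendship graph (hypothetical): four disjoint pairs of friends.
edges = [(0, 1), (2, 3), (4, 5), (6, 7)]
users = sorted({u for e in edges for u in e})
neighbors = {u: set() for u in users}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def outcome(user, treated):
    """Assumed response model: a user converts (1) if they are treated
    or if any friend is treated and invites them; otherwise 0."""
    return int(user in treated or any(n in treated for n in neighbors[user]))


def average(values):
    values = list(values)
    return sum(values) / len(values)


def diff_in_means(treated):
    """Difference-in-means estimate of the ATE for a given assignment."""
    control = [u for u in users if u not in treated]
    y_treated = average(outcome(u, treated) for u in treated)
    y_control = average(outcome(u, treated) for u in control)
    return y_treated - y_control


# Ground truth under this model: everyone treated vs. nobody treated.
true_ate = (average(outcome(u, set(users)) for u in users)
            - average(outcome(u, set()) for u in users))

# User-level assignment that splits every pair across the two groups.
user_level = {0, 2, 4, 6}
# Cluster-level assignment: whole pairs go to the same group.
cluster_level = {0, 1, 2, 3}

print(true_ate, diff_in_means(user_level), diff_in_means(cluster_level))
```

Under this assumed model, splitting every pair across groups hides the effect entirely, while assigning whole pairs to one group recovers it.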
Many clustering algorithms have been studied to reduce the interference between their resulting clusters. [11] proposes a clustering algorithm, r-net. Label propagation and modularity maximization algorithms are also studied in [4], which suggests that modularity maximization outperforms label propagation. However, these approaches usually assume that their graphs are restricted-growth graphs (formally defined later) in order to perform well, an assumption that our analysis of real social networks shows is hard to meet. Later, we introduce our LinkLouvain approach built on Louvain (a fast and parallel approximation for modularity maximization).

We also consider graph partitioning methods to generate balanced test groups with minimal edges between groups. Dynamic graphs at scale impose great challenges for graph partitioning. Most existing algorithms cannot scale to billions of nodes. The optimal min-cut graph partitioning task targeted by graph-theory-based algorithms has been proven NP-hard. Classical graph partitioning methods such as Metis [5] also have high computational complexity. Classical methods are too inefficient to handle rapidly evolving graphs, and dynamic graph partitioning algorithms [7,8] have been proposed that constantly update labels as the graph structure changes, which requires additional streaming support.

Our method also focuses on rapidly evolving graphs, but in a more static manner. Unlike other static methods [9,10] that run periodically to obtain continuous partitioning results, we make a guess about the graph structure in the future (e.g., in a week) and partition the predicted graph only once. In the beginning, we obtain an ‘omniscient’ view of all users and all possible interactions between them. We also have a campaign graph from the early stage of the campaign. We then predict possible edges with graph neural networks (GNNs) to gain knowledge of a future snapshot of the campaign graph.
The current snapshot and the future snapshot are formed by invitations in the campaign, while the omniscient graph is independent of specific applications. The predicted edges form a guess of the future snapshot, which is then clustered by an efficient graph clustering method with linearithmic time complexity, such as Louvain. Finally, the clusters are randomly merged into $p$ desired test groups for A/B testing.

In this paper, we are interested in the task of network A/B testing. More specifically, we aim to estimate a precise and sound ATE. Estimating the ATE when launching or updating IPS, however, is non-trivial. In the absence of interactions, user-level A/B testing is commonly used to estimate potential effects [6]. The estimate is unbiased if the Stable Unit Treatment Value Assumption (SUTVA) holds. This assumption requires the response of a unit (in this case, a user) to be invariant to treatments assigned to other units [1]. With this assumption, the average treatment effect (ATE) of a new service can be defined as

$$ATE = \frac{1}{N} \sum_i \left( y_1(i) - y_0(i) \right),$$

where $y_0(i)$ is the outcome for user $i$ if not treated and $y_1(i)$ is the outcome for user $i$ if treated. $N$ represents the number of users.

However, the ground-truth ATE in real-world network A/B testing is impossible to obtain. Our work designs an estimator of the ATE in the presence of network interference by splitting the graph into clusters. The estimator is formulated as

$$\widehat{ATE} = \frac{1}{M} \sum_i \sum_j y(q_{ij}) - \frac{1}{N} \sum_i \sum_j y(c_{ij}),$$

where $q_{ij}$ is the $i$-th user in the $j$-th cluster of the treatment group $Q$ and $c_{ij}$ is the $i$-th user in the $j$-th cluster of the control group $C$. $M$ and $N$ represent the numbers of users in $Q$ and $C$, respectively.

Our goal is to design an ATE estimator that minimizes the estimation bias and variance, so that the estimated ATE can guide business decisions.

In our online marketing campaign, we have access to two graphs: a stable social graph $G = (V, E)$ and a time-evolving label graph $L = (V_L, E_L)$. We collect the social graph $G$ containing all users of Alipay as nodes $V$ and their historical interactions as edges $E$. It contains billions of nodes and tens of billions of edges and lays the foundation for predicting users' future interactions.

Additionally, as the new online marketing campaign goes on, we collect a label graph $L$, where users who participate in the online marketing campaign form the node set $V_L$ and user invitations form the edge set $E_L$. $L_0$ and $L_T$ represent the label graph in its early stage and at the end of a campaign of lifecycle $T$, respectively. It is called a label graph because the interaction data provides a strong hint about the form of interference. Previously, labeled data has been less discussed because users who have already participated cannot join a new round of A/B testing.
The novelty of LinkLouvain is that it uses link prediction to generalize the form of interference from this label graph to all users in the social graph $G$, and predicts an “estimator bias” for all edges $E$.

Various properties of the two collected graphs are analyzed in Section 4.1.

To cluster a rapidly evolving graph, we train a GNN-based link prediction model to predict possible edges in the evolving graph. Then we apply a traditional graph clustering algorithm such as Louvain to split the graph into small clusters. To use these clusters in A/B testing, we randomly combine them into the desired $p$ test groups. The procedure is shown in Figure 3. Labels come from the edges (positive labels) in the current campaign graph and the non-edges (negative labels) that exist only in the social graph $G$.

Fig. 3: Processing pipeline of the proposed LinkLouvain framework.
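The final step of the pipeline, randomly combining clusters into $p$ test groups, might look like the following minimal sketch. The function name `merge_clusters`, the fixed seed, and the round-robin assignment over a shuffled cluster order are assumptions for illustration; the text only states that clusters are merged randomly.

```python
import random


def merge_clusters(clusters, p, seed=0):
    """Randomly assign whole clusters to p test groups.

    clusters: list of lists of user ids (e.g. output of a clustering step).
    Returns a dict mapping user id -> group index in [0, p).
    """
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)  # random order of clusters
    assignment = {}
    for position, cluster_idx in enumerate(order):
        group = position % p  # assumed: round-robin over the shuffled order
        for user in clusters[cluster_idx]:
            assignment[user] = group
    return assignment


# Hypothetical clusters of user ids.
example_clusters = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
example_groups = merge_clusters(example_clusters, p=2, seed=42)
print(example_groups)
```

Because a cluster is never split, every edge inside a cluster stays within one test group; only edges between clusters can cross groups.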
GNNs are a family of deep learning architectures that aggregate information from nodes' neighbors using neural networks. Deeper layers aggregate more distant neighbors, and the $k$-th layer embedding of node $v$ is

$$h^k_v = \sigma\left( W_k \cdot \mathrm{AGG}\left( h^{k-1}_u, \forall u \in N(v) \cup \{v\} \right) \right),$$

where the initial embedding $h^0_v = x_v$ is its node feature vector, $\sigma$ is a non-linear function, and AGG is an aggregation function that differs across GNN architectures.

Figure 4 shows a naive GNN-based link prediction algorithm with a twin-tower architecture. Each target node of an edge aggregates its own neighbors $K$ times. After aggregation of $K$-hop neighbors, the final embeddings $h^K_A$ and $h^K_B$ of the two target nodes $A$ and $B$ are concatenated and fed to the final dense layer.

Fig. 4: Model architecture for link prediction. $G$: 1-hop neighborhood of a node; $X$: edge features; $h$: GNN embeddings; one-hot vectors: node labeling.

Moreover, we add structural features called node labeling [12] to naive link prediction. Node labeling assigns a one-hot vector to each node in the $K$-hop neighborhood of the two target nodes $A$ and $B$. It marks the nodes' different roles in this neighborhood. For example, the left graph in Figure 4 has 5 nodes in $A$ and $B$'s 1-hop neighborhood. There are three roles in the neighborhood: $A$ and $B$ are target nodes; $C$ and $D$ are nodes connecting to both target nodes; $E$ is a node that connects to only one target. The node labeling vector is appended to each node's original feature vector and tells the GNN the node's relative location around the edge to be predicted. It helps the GNN predict link existence more accurately.

Comparison of Link Prediction Models in Online Marketing Campaign
In the early stage of the online marketing campaign, we collect and sample user interactions as positive training samples and non-invitation relations as negative training samples. All training samples exist in our social graph $G$. There are 1.5 million positive edges and 1.5 million negative edges. Each edge has 128 features representing user interaction history. We compare the following models for the link prediction task:

– DNN: a dense neural network of five layers with layer sizes [512, ...].
– NG-LP: a naive GNN link prediction method with 2-hop neighbors ($K = 2$) and embedding size 64.
– NL-LP: a node labeling link prediction method with 2-hop neighbors ($K = 2$) and embedding size 64.

The main results are summarized in Table 1. F1, KS, and AUC are widely used binary classification metrics. NL-LP performs the best by taking structural information into account.

Table 1: Link prediction task comparison.

       DNN    NG-LP   NL-LP
F1     0.88   0.89
KS     0.74   0.79
AUC    0.91   0.92
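The node labeling idea used by NL-LP can be illustrated with the five-node example from the text. The sketch below is a simplified three-role encoding written for this illustration; the function name `node_labeling` and the role order are assumptions, and the labeling scheme of [12] may differ in detail.

```python
def node_labeling(edges, a, b):
    """Assign a one-hot role vector to each node in the 1-hop
    neighborhood of target nodes a and b.

    Roles (following the example in the text):
      index 0 = target node,
      index 1 = connected to both targets,
      index 2 = connected to exactly one target.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    near_a = neighbors.get(a, set()) - {a, b}
    near_b = neighbors.get(b, set()) - {a, b}
    one_hot = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]}
    labels = {a: one_hot[0], b: one_hot[0]}
    for node in near_a | near_b:
        role = 1 if (node in near_a and node in near_b) else 2
        labels[node] = one_hot[role]
    return labels


# The example neighborhood: C and D touch both targets, E touches only A.
example_edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("A", "E")]
example_labels = node_labeling(example_edges, "A", "B")
for name in sorted(example_labels):
    print(name, example_labels[name])
```

In a full model, each one-hot vector would be appended to the node's feature vector before the GNN aggregation step.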
The output scores on edges represent the probabilities of future online interactions. We filter out less probable edges and set the prediction score as the edge weight. Graph filtering is crucial for a billion-node graph for two reasons:

– Computation resources are limited for graphs of such size.
– Clustering algorithms like Louvain tend to generate unbalanced clusters when handling densely connected graphs, which heavily undermines A/B testing performance. Removing unnecessary edges helps prevent long tails in the resulting cluster sizes.

However, if we set the threshold ($\gamma$) for keeping or dropping an edge too high, we could drop too many possible edges, which introduces great bias into the ATE estimates. We choose $\gamma$ by considering the trade-off between efficiency and effectiveness.

In the online marketing campaign, we set the threshold to 0.5, and clustering the remaining graph takes 0.6 hours.

Metric references: F1, https://en.wikipedia.org/wiki/F1_score; KS, https://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test; AUC, https://en.wikipedia.org/wiki/Receiver_operating_characteristic
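The graph filtering step can be sketched as a simple threshold on the predicted scores. The names `filter_graph` and `max_degree`, the toy scores, and the choice to keep edges whose score is at least $\gamma$ (rather than strictly greater) are assumptions for illustration.

```python
def filter_graph(scored_edges, gamma):
    """Keep edges whose predicted score is at least gamma and use the
    score as the edge weight. scored_edges: iterable of (u, v, score)."""
    return {(u, v): score for u, v, score in scored_edges if score >= gamma}


def max_degree(edge_pairs):
    """Largest node degree in an undirected edge list."""
    degree = {}
    for u, v in edge_pairs:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return max(degree.values(), default=0)


# Hypothetical link prediction scores; "hub" has several unlikely edges.
scored = [
    ("hub", "a", 0.9),
    ("hub", "b", 0.2),
    ("hub", "c", 0.1),
    ("a", "b", 0.7),
    ("c", "d", 0.6),
]
weighted = filter_graph(scored, gamma=0.5)
print(len(weighted), "edges kept")
print("max degree before:", max_degree((u, v) for u, v, _ in scored))
print("max degree after:", max_degree(weighted))
```

A higher `gamma` removes more edges and speeds up clustering, at the cost of dropping edges that may later carry real invitations.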
Fig. 5: Left: The vertex degree distribution of our real social network $G$ at different growth levels, as well as $G$ after graph filtering by LinkLouvain. Right: Since the treatment effect of a user depends on his/her neighborhood's treatment status, there exists interference. Moreover, this influence is non-linear in the fractional exposure level and cannot easily be corrected afterward.
To generate clusters of users as randomization units, we use Louvain to cluster the filtered graph $G'$. Clustering algorithms are well studied. In [4], researchers investigate several distributed clustering algorithms, such as label propagation and Louvain. Their results show that Louvain performs better at preserving intra-cluster edges and reducing network interference. The experimental results in the next section (Table 2) also support this conclusion.

The resulting clusters are finally randomly merged into the desired number $p$ of partitions. These are the $p$ test groups in A/B testing.

Though $G$ is a large social network that does not change frequently, the size of $L$ grows quickly as users join our campaign. As the number of nodes in $L$ reaches 1, 10, 40, and 160 million, we construct a subgraph of $G$ with all nodes in $L$ and keep all edges between these nodes. Hence we can analyze the growth property of our social network retrospectively. We compare the graph properties of these four subgraphs of $G$, as well as the full graph $G$, which contains more than 1 billion nodes.

Maximum Degree Growth is Unbounded
In Figure 5 (left), we compare the degree distribution of $G$ at different growth levels. We find that as the network grows larger, customers build more connections with each other, and the degree distribution shifts right. The long right tails of all five series suggest that the degree of this social network has a right-skewed distribution regardless of the network size. Moreover, in contrast to the bounded maximum degree assumption of [11], the maximum degree grows almost linearly with the number of nodes in the graph and is thus unbounded.

In Figure 5 (left), we also plot the degree distribution of the full graph $G$ after graph filtering by LinkLouvain. The degree distribution is clearly less skewed than the original distribution of the full social graph $G$. Not all edges in $G$ have the same influence on our online marketing campaign. Therefore, with LinkLouvain we can eliminate many edges that are unlikely to carry interactions and hence reduce interference in our cluster-level randomization scheme.

Network Interference
We examine network interference patterns on our social network by estimating the ATE at different neighborhood fractional exposure levels [4] (the share of a user's neighbors that are in the treatment group). We divide users into subgroups according to their fractional exposure levels and plot the estimated ATE of each group as a curve in Figure 5 (right). We can draw two main conclusions. First, the treatment effect of a user depends on his/her neighborhood's treatment status, which means that interference exists. Second, the interference does not follow a linear-additive pattern; in other words, the ATE is not linear in the fractional exposure level.

This explains the difficulty of using the afterward correction approach: there is no universal assumption about the form of interference that is suitable for all cases. The true form of interference might be complicated, and the linear-additive assumption might be over-simplified.
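The exposure analysis described above can be sketched as follows. The function names, the equal-width bucketing, and the per-bucket difference in means are assumptions made for illustration; the text only states that users are grouped by fractional exposure level and that an ATE is estimated per group.

```python
def fractional_exposure(neighbors, treated):
    """Share of each user's neighbors that are in the treatment group.
    Users without neighbors are skipped."""
    exposure = {}
    for user, nbrs in neighbors.items():
        if nbrs:
            exposure[user] = sum(n in treated for n in nbrs) / len(nbrs)
    return exposure


def ate_by_exposure(neighbors, treated, outcome, n_buckets=4):
    """Treated-minus-control mean outcome within each exposure bucket.
    Buckets lacking either arm are omitted."""
    buckets = {}
    for user, level in fractional_exposure(neighbors, treated).items():
        b = min(int(level * n_buckets), n_buckets - 1)
        arm = "t" if user in treated else "c"
        buckets.setdefault(b, {"t": [], "c": []})[arm].append(outcome[user])
    result = {}
    for b, arms in sorted(buckets.items()):
        if arms["t"] and arms["c"]:
            result[b] = (sum(arms["t"]) / len(arms["t"])
                         - sum(arms["c"]) / len(arms["c"]))
    return result


# Hypothetical four-user graph, assignment, and outcomes.
toy_neighbors = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
toy_treated = {1, 2}
toy_outcome = {1: 3.0, 2: 5.0, 3: 2.0, 4: 1.0}
print(fractional_exposure(toy_neighbors, toy_treated))
print(ate_by_exposure(toy_neighbors, toy_treated, toy_outcome, n_buckets=2))
```

Plotting the per-bucket values against the bucket index would give a curve of the kind shown in Figure 5 (right); a straight line would be consistent with a linear-additive interference model.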
Lower estimator bias and variance indicate more accurate and sound estimates. Here we introduce how to measure them.

Estimator Bias is measured by the degree of network interference between test groups. Clusters of users are randomly merged into $p$ desired A/B test groups. Some edges of graph $L_T$ run between test groups, and their interference is denoted as

$$I = \frac{|E^-|}{|E^T_L|},$$

where $E^T_L$ is the set of all edges (invitations in the campaign) in graph $L_T$, and $E^-$ is the set of edges in graph $L_T$ connecting nodes across test groups.

Estimator Variance represents the statistical power of the designed estimators. To get higher statistical power, our estimator should generate clusters for which the ATE metric has a lower variance, which means that online experiments are more sensitive. We use the following formula from [2] to calculate the variance of clusters in an estimator,

$$\mathrm{Var}(\bar{Y}) \approx \frac{1}{K \mu_N^2} \left( \sigma_S^2 - 2 \frac{\mu_S}{\mu_N} \sigma_{SN} + \frac{\mu_S^2}{\mu_N^2} \sigma_N^2 \right),$$

where $\bar{Y}$ is the total estimated conversion rate in this A/B testing group, $K$ is the number of clusters in the group, and $S$ and $N$ are the random variables of the per-cluster sum of individual conversions and number of individuals, respectively. $\mu$ and $\sigma$ denote the mean and variance/covariance of the corresponding random variables. We evaluate the metric variance with the same group size (1% of the total traffic).

The methods in our comparative evaluation are as follows.

– Geo: the classical strategy of clustering users by their geographic locations.
– LinkLouvain: our proposed method with graph filtering threshold $\gamma$.
– Louvain: an ablation study that removes the link prediction stage and the graph filtering stage.
– HRLouvain: an ablation study that replaces the link prediction stage and the graph filtering stage by removing hotspots (nodes with more than $\theta$ neighbors).
– LinkLouvain-UW: an ablation study that replaces the link prediction edge weight by 1 in our proposed method.
– LinkLabel: an ablation study with Louvain replaced by label propagation for graph clustering.
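The two evaluation metrics, the interference ratio $I$ and the delta-method variance of a cluster-level ratio metric, can be computed as in the sketch below. The function names and the use of sample (co)variances with a $K - 1$ denominator are assumptions for illustration, not details confirmed by the text.

```python
from statistics import mean


def interference(edges, group_of):
    """I = |E^-| / |E_L^T|: share of campaign edges whose endpoints
    fall into different test groups."""
    cross = sum(1 for u, v in edges if group_of[u] != group_of[v])
    return cross / len(edges)


def delta_method_variance(cluster_sums, cluster_sizes):
    """Delta-method approximation of Var(Y_bar) for the ratio metric
    Y_bar = sum(S) / sum(N) over K clusters."""
    k = len(cluster_sums)
    mu_s, mu_n = mean(cluster_sums), mean(cluster_sizes)
    var_s = sum((s - mu_s) ** 2 for s in cluster_sums) / (k - 1)
    var_n = sum((n - mu_n) ** 2 for n in cluster_sizes) / (k - 1)
    cov_sn = sum((s - mu_s) * (n - mu_n)
                 for s, n in zip(cluster_sums, cluster_sizes)) / (k - 1)
    ratio = mu_s / mu_n
    return (var_s - 2 * ratio * cov_sn + ratio ** 2 * var_n) / (k * mu_n ** 2)


# Hypothetical campaign edges and test-group assignment.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
toy_groups = {1: 0, 2: 0, 3: 1, 4: 1}
print(interference(toy_edges, toy_groups))

# Hypothetical per-cluster conversion sums and cluster sizes.
print(delta_method_variance([0, 1, 2], [2, 2, 2]))
```

When every cluster has the same conversion rate, the three terms cancel and the estimated variance is zero, which is a quick sanity check on the signs in the formula.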
Table 2 summarizes the evaluation results of all the methods on our campaign. Metrics include the estimator bias and estimator variance described in Section 4.2, as well as computation time. The number of clusters is also reported for reference.

In general, LinkLouvain is effective in delivering precise and sound estimates and efficient enough to run within 6 hours.
Consistency

We compare three values of the threshold $\gamma$ (0.2, 0.3, and 0.5) for LinkLouvain, and their key metrics are consistent. This makes tuning easier during experiments.

Computational Performance

We run the clustering methods with 40 workers on GRAPE [3]. Table 2 summarizes the computation time; LinkLouvain and HRLouvain with $\theta = 40$ are the most efficient among the graph-based methods.

Table 2: Evaluation summary. (Louvain runs for more than one day and drains computational resources. Its results are not available.)

Methods                  #Clusters   I (%)   Var(Ȳ) (10^-)   Time (h)
Geo                      346
LinkLouvain, γ = 0.2
LinkLouvain, γ = 0.3
LinkLouvain, γ = 0.5                 52      1.11            5.6
Louvain                  -           -       -               > 1 day
HRLouvain, θ = 40        367M        82
HRLouvain, θ = 100       145M        67      32.45           12.1
HRLouvain, θ = 200       98M         66      232.62          12.6
LinkLouvain-UW           442M        64

Comparison with Geo-based Methods
A popular way to run A/B testing in online services is to use geographic regions as randomization units. It serves as a practical baseline for comparison and is easy to use, since it only requires the locations of user queries. We compare our method with geo-based partitioning, and the results show that we achieve much lower variance than this popular approach.
The online campaign ran for 7 days, and our LinkLouvain ($\gamma = 0.5$) method was deployed to estimate the ATE of different campaign strategies, such as giving discount coupons or cash coupons. The ATE is measured as the average payment made by users who receive coupons, and the ATE estimate of the best strategy is 1.05 times better than the baseline (giving everyone a small amount of cash) without increasing the campaign budget. The A/B test was run on 2% of the users in the campaign, and after monitoring the strategies for a day, the best coupon-distributing strategy was applied to 100% of the users. With the help of LinkLouvain, the performance of the campaign exceeded expectations.
In this paper, we discuss network A/B testing motivated by interaction-promoting services. We analyze this problem on a real social graph and our label graph and develop LinkLouvain to address network A/B testing. With the help of link prediction, the proposed approach is computationally efficient and achieves a favorable balance between estimator bias and estimator variance. It is deployed in a real marketing campaign and gives accurate and sound estimates of ATEs.
References

1. Cox, D. R. Planning of experiments, vol. 20. Wiley New York, 1958.
2. Deng, A., Knoblich, U., and Lu, J. Applying the delta method in metric analytics: A practical guide with novel ideas. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), pp. 233–242.
3. Fan, W., Yu, W., Xu, J., Zhou, J., Luo, X., Yin, Q., Lu, P., Cao, Y., and Xu, R. Parallelizing sequential graph computations. ACM Transactions on Database Systems (TODS) 43, 4 (2018), 1–39.
4. Gui, H., Xu, Y., Bhasin, A., and Han, J. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web (2015), pp. 399–409.
5. Karypis, G., and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.
6. Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181.
7. Li, H., Yuan, H., Huang, J., Cui, J., and Yoo, J. Dynamic graph repartitioning: From single vertex to vertex group. In International Conference on Database Systems for Advanced Applications (2020), Springer, pp. 482–497.
8. Nicoara, D., Kamali, S., Daudjee, K., and Chen, L. Hermes: Dynamic partitioning for distributed social network graph databases. In EDBT (2015), pp. 25–36.
9. Stanton, I., and Kliot, G. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2012), pp. 1222–1230.
10. Tsourakakis, C., Gkantsidis, C., Radunovic, B., and Vojnovic, M. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (2014), pp. 333–342.
11. Ugander, J., Karrer, B., Backstrom, L., and Kleinberg, J. Graph cluster randomization: Network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013), pp. 329–337.
12. Zhang, M., and Chen, Y. Link prediction based on graph neural networks. In