Causal Meta-Mediation Analysis: Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments

Abstract

It is common in the internet industry to use offline-developed algorithms to power online products that contribute to the success of a business. Offline-developed algorithms are guided by offline evaluation metrics, which are often different from online business key performance indicators (KPIs). To maximize business KPIs, it is important to pick a north star among all available offline evaluation metrics. By noting that online products can be measured by online evaluation metrics, the online counterparts of offline evaluation metrics, we decompose the problem into two parts. As the offline A/B test literature works out the first part: counterfactual estimators of offline evaluation metrics that move the same way as their online counterparts, we focus on the second part: causal effects of online evaluation metrics on business KPIs. The north star of offline evaluation metrics should be the one whose online counterpart causes the most significant lift in the business KPI. We model the online evaluation metric as a mediator and formalize its causality with the business KPI as dose-response function (DRF). Our novel approach, causal meta-mediation analysis, leverages summary statistics of many existing randomized experiments to identify, estimate, and test the mediator DRF. It is easy to implement and to scale up, and has many advantages over the literature of mediation analysis and meta-analysis. We demonstrate its effectiveness by simulation and implementation on real data.


Zenan Wang
UC Berkeley, Berkeley

Xuan Yin
Etsy, Inc., Brooklyn, New York

Tianbo Li
Etsy, Inc., Brooklyn, New York

Liangjie Hong
LinkedIn, Inc., Sunnyvale


KEYWORDS

causal inference; meta-analysis; mediation analysis; experiment; dose-response function; A/B test; evaluation metric; business KPI

ACM Reference Format:

Zenan Wang, Xuan Yin, Tianbo Li, and Liangjie Hong. 2020. Causal Meta-Mediation Analysis: Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), August 23–27, 2020, Virtual Event, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3394486.3403313

KDD ’20, August 23–27, 2020, Virtual Event, CA, USA. © 2020 Association for Computing Machinery. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), August 23–27, 2020, Virtual Event, CA, USA, https://doi.org/10.1145/3394486.3403313.

Nowadays it is common in the internet industry to develop algorithms that power online products using historical data. An algorithm that improves evaluation metrics computed from historical data is then tested against the one in production to assess the lift in key performance indicators (KPIs) of the business in online A/B tests. Here we refer to metrics calculated from historical data as offline metrics and metrics calculated in online A/B tests as online metrics. In many cases, offline evaluation metrics are different from online business KPIs. For instance, a ranking algorithm, which powers search pages in e-commerce platforms, typically optimizes for relevance by predicting purchase or click probabilities of items. It could be tested offline (offline A/B tests) for rank-aware evaluation metrics, for example, normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), or mean average precision (MAP), which are calculated from the test set of historical purchase or click-through feedback of users. Most e-commerce platforms, however, deem sitewide gross merchandise value (GMV) as their business KPI and test for it online. There could be various reasons not to directly optimize for business KPIs offline or use business KPIs as offline evaluation metrics, such as technical difficulty, business reputation, or user loyalty. Nonetheless, the discrepancy between offline evaluation metrics and online business KPIs poses a challenge to product owners, because it is not clear which offline evaluation metric should be adopted to guide the offline development of algorithms in order to maximize online business KPIs.

The challenge essentially asks for the causal effects of increasing offline evaluation metrics on business KPIs, e.g., how business KPIs would change for a 10% increase in an offline evaluation metric. The offline evaluation metric in which a 10% increase could result in the most significant lift in business KPIs should be the north star to guide algorithm development. Algorithms developed offline power online products, and online products contribute to the success of the business (see Figure 1). By noting that online products can be measured by online evaluation metrics, the online counterparts of offline evaluation metrics, we decompose the problem into two parts. The offline A/B test literature (see, e.g., Gilotte et al. [6]) works out the first part (the black arrow): counterfactual estimators of offline evaluation metrics to bridge the inconsistency between changes of offline and online evaluation metrics. We focus on the second part (the red arrow): the causality between online products (assessed by online evaluation metrics) and the business (assessed by online business KPIs). The offline evaluation metric whose online counterpart causes the most significant lift in online business KPIs should be the north star. Hence, the question for us becomes how business KPIs would change for a 10% increase in an online evaluation metric.

Figure 1: The Causal Path from Algorithms to Business. [Diagram: Algorithms (guided by offline evaluation metrics: NDCG, MRR, MAP, ...) → Online Products (measured by online evaluation metrics: NDCG, MRR, MAP, ...) → Business (measured by online business KPIs: sitewide GMV, ...), with a "?" label on the path.]

Randomized controlled trials, or online A/B tests, are a popular way to measure the causal effects of online product change on business KPIs. Unfortunately, they cannot answer our question directly. In online A/B tests, in order to compare the business KPIs caused by different values of an online evaluation metric, we need to fix the metric at its different values for treatment and control groups. Take the ranking algorithm as an example. If we could fix online NDCG of the search page at 0.22 and 0.2 for treatment and control groups respectively, then we would know how sitewide GMV would change for a 10% increase in online NDCG at 0.2. However, this experimental design is impossible, because most online evaluation metrics depend on users’ feedback and thus cannot be directly controlled.

We address the question by developing a novel approach of causal inference. We model the causality between online evaluation metrics and business KPIs by dose-response function (DRF) in the potential outcome framework [13, 14].

DRF originates from medicine and describes the magnitude of the response of an organism given different doses of a stimulus. Here we use it to depict the value of a business KPI given different values of an online evaluation metric. Different from doses of stimuli, values of online evaluation metrics cannot be directly manipulated. However, they could differ between treatment and control groups in experiments of treatments other than algorithms: user interface/user experience (UI/UX) design, marketing, etc. This could be due to the “fat hand” [19, 29] nature of online A/B tests: a single intervention can change many causal variables at once. A change of the tested feature, which is not an algorithm, could induce users to change their engagement with algorithm-powered online products, so that values of online evaluation metrics would change. For instance, in an experiment of UI design, users might change their search behaviors because of the new UI design, so that values of online NDCG, which depends on search interaction, would change, even though the ranking algorithm does not change. The evidence suggests that online evaluation metrics could be mediators that (partially) transmit causal effects of treatments on business KPIs in experiments where treatments are not necessarily algorithm-related. Hence, we formalize the problem as the identification, estimation, and test of mediator DRF.

In mediation analysis literature, there are two popular identification techniques: sequential ignorability (SI) and instrumental variable (IV). SI assumes each potential mediator is independent of all potential outcomes conditional on the assigned treatment, whereas IV permits dependence between unknown factors and mediators but forbids the existence of direct effects of the treatment. Rather than making these stringent assumptions, we leverage trial characteristics to explain average direct effect (ADE) in each experiment so that we can tease it out from average treatment effect (ATE) to identify the causal mediation. The utilization of trial characteristics means we have to use data from many trials because we need variations in trial characteristics. Hence, we develop our framework as a meta-analysis and propose an algorithm that only uses summarized results from many existing experiments and gains the advantage of easy implementation at scale.

Most meta-analyses rely on summarized results from different studies with different raw data sources. Therefore, it is almost impossible to learn more beyond the distribution of ATEs. Fortunately, the internet industry produces plentiful randomized trials with consistently defined metrics, and thus presents an opportunity for performing a more complicated meta-analysis. Literature is lacking in this area, and we create the framework of causal meta-mediation analysis (CMMA) to fill the gap.

Another prominent strength of our approach in real applications is that, for a new product that has been shipped online but has few A/B tests, it is plausible to explore the causality between its online metrics and business KPIs from many A/B tests of other products. The values of online metrics of the new product can differ between treatment and control groups in experiments of other products (“fat hand” [19, 29]), which makes it possible to solve for mediator DRF of the new product without its own A/B tests.

Note that our approach can be applied to any evaluation metric that is defined at experimental-unit level, like metrics discussed in offline A/B test literature. The experimental unit means the unit for randomization in online A/B tests. For example, in search page experiments, the experimental unit is typically the user. Also, the evaluation metric can be any combination of existing experimental-unit-level metrics.

To summarize, our contributions in this paper include:

(1) This is the first study that offers a framework to choose the north star among all available offline evaluation metrics for algorithm development to maximize business KPIs when offline evaluation metrics and business KPIs are different. We decompose the problem into two parts. Since the offline A/B test literature works out the first part (counterfactual estimators of offline evaluation metrics to bridge the inconsistency between changes of offline and online metrics), we work out the second part: inferring causal effects of online evaluation metrics on business KPIs. The offline evaluation metric whose online counterpart causes the most significant lift in business KPIs should be the north star. We show the implementation of our framework on data from Etsy.com.

(2) Our novel approach CMMA combines mediation analysis and meta-analysis to identify, estimate, and test mediator DRF. It relaxes the standard SI assumption and overcomes the limitation of IV, both of which are popular in causal mediation literature. It extends meta-analysis to solve causal mediation, while the meta-analysis literature only learns the distribution of ATE. We demonstrate its effectiveness by simulation and show its performance is superior to other methods.

(3) Our novel approach CMMA uses only trial-level summary statistics (i.e., meta-data) of many existing trials, which makes it easy to implement and to scale up. It can be applied to all experimental-unit-level evaluation metrics or any combination of them. Because it solves the causality problem of a product by leveraging trials of all products, it could be particularly useful in real applications for a new product that has been shipped online but has few A/B tests.

We draw on two strands of literature: mediation analysis and meta-analysis. We briefly discuss them in turn.

Our framework expands on causal mediation analysis. Mediation analysis is actively conducted in various disciplines, such as psychology [15, 24], political science [7, 12], economics [9], and computer science [16]. A recent application in the internet industry reveals that the performance of a recommendation system could be cannibalized by search on an e-commerce website [29]. Mediation analysis originates from the seminal paper of Baron and Kenny [3], where they proposed a parametric estimator based on the linear structural equation model (LSEM). LSEM is still widely used by applied researchers because of its simplicity. Since then, Robins and Greenland [21], Pearl [16], and other causal inference researchers have formalized the definition of causal mediation and pinpointed assumptions for its identification [17, 20, 22] in various complicated scenarios. The progress features extensive usage of structural equation models and causal diagrams (e.g., NPSEM-IE of Pearl [16] and FRCISTG of Robins [20]).

As researchers extend the potential outcome framework of Rubin [23] to causal mediation, alternative identification and more general estimation strategies have been developed. Imai et al. [12] achieved the non-parametric identification and estimation of mediation effects of a single mediator under the assumption of SI. After analyzing other well-known models such as LSEM [3] and FRCISTG [20], they concluded that assumptions of most models can be either boiled down to or replaced by SI. However, SI is stringent, which ignites many in-depth discussions around it (see, e.g., the discussion between Pearl [17, 18] and Imai et al. [11]).

Another popular identification strategy of causal mediation is IV, which is a signature technique in economics [1, 2]. Sobel [26] used treatment as IV to identify mediation effects without SI. However, as Imai et al. [12] pointed out, IV assumptions may be undesirable because they require that all causal effects of the treatment pass through the mediator (i.e., complete mediation [3]). Small [25] proposed a new method to construct IV that allows direct effects of the treatment (i.e., partial mediation [3]) but assumes that ADE of the treatment is the same for different segments of the population.

Our method only uses summary statistics of many past experiments. Analyzing summarized results from many experiments is termed meta-analysis and is common in analytical practice [5, 27]. In the literature, meta-analysis is used for mitigating the problem of external validity in a single experiment and learning knowledge that is hard to recover when analyzing data in isolation, such as heterogeneous treatment effects (see, e.g., Browne and Jones [4], Higgins and Thompson [10]). Besides, a significant advantage of meta-analysis is that it scales easily, because it only takes summarized results from many different experiments.

Peysakhovich and Eckles [19] took one step toward performing mediation analysis using data from many experiments. They used treatment assignments as IVs to identify causal mediation, which is similar to Sobel [26], but they did not justify why more than one experiment is needed and did not address the limitations of IV discussed above. Our framework shows that having access to many experiments enables identifying causal mediation without SI and overcoming the limitation of IV, both of which are hard to achieve with only one experiment.

We follow the literature of potential outcomes [12, 23, 25] to set up our framework.

Figure 2: Directed Acyclic Graph of Conceptual Framework. [Diagram with nodes: treatments $T_{s_1}, \dots, T_{s_K}$, mediator $M$, outcome $Y$, and confounder $U$.]

As illustrated in Figure 2, we suppose there are many experiments with different treatments, and there is a mediator $M$ that can be affected by any treatment $T_s$ and will in turn influence outcome $Y$. Each treatment may also affect $Y$ directly. But we are particularly interested in recovering their shared causal channel, the link between $M$ and $Y$, marked in red in the figure. In the ranking algorithm example, $M$ is the online evaluation metric of the search page, and $Y$ is the online business KPI. A challenge to identify the red link emerges if a confounder $U$ exists. $U$ could be (user engagement of) any other web page/module or user preference in the ranking algorithm example. In the literature, there are two approaches to solve this challenge, SI and IV. SI requires that $U$ is observed and measured, and that there are no other unmeasured/unobserved confounders after controlling for $U$. The standard IV approach allows unmeasured/unobserved $U$, but assumes no direct links between the $T_s$'s and $Y$. An IV method proposed by Small [25] relaxes requirements of the standard approach, and assumes all $T_s$'s share a single direct link to $Y$. Our method allows unmeasured/unobserved $U$ and direct links from each $T_s$ to $Y$.

Suppose there are $K$ randomized trials in total. For each trial, there exists one treatment group and one control group. To simplify the discussion, here we assume experimental units are first randomly assigned to different trials, and then randomly assigned into the treatment or control group of that trial. However, our approach CMMA allows the same unit to participate in multiple trials. We will go back to this point in Section 5.1.

We consider the following model for potential outcomes.

Definition 1.

$$M_i^{(t,s)} = \tau_i^\top s \times t + \phi^\top s + M_i^* \tag{1}$$

$$Y_i^{(t,s,m)} = \mu_i(m) + \gamma_i^\top s \times t + \theta^\top s + Y_i^*, \tag{2}$$

where the $(\tau_i, \gamma_i, M_i^*, Y_i^*)$ are i.i.d. random vectors, for all $t = 0, 1$, $s \in \mathcal{S}$ and $m \in \mathcal{M}$.

We use a one-hot vector $s = (s_1, \cdots, s_K)$ to encode the trial assignment, where $(0, \cdots, s_k = 1, \cdots, 0)$ indicates the assignment to trial $k$, and a binary variable $t \in \{0, 1\}$ to encode treatment assignment, with $t = 1$ indicating the treatment group. For individual $i$ with trial assignment $s$ and treatment assignment $t$, the random variable $M_i^{(t,s)}$ represents the potential mediator, and the random variable $Y_i^{(t,s,m)}$ represents the potential outcome that would be observed if $i$ were to receive or exhibit level $m$ of the mediator through some hypothetical mechanism.

We are interested in mediator DRF, which represents the value of $Y_i^{(t,s,m)}$ given different values ($m$) of $M_i^{(t,s)}$ for the same $t$ and $s$. This effectively is the red arrow in Figure 2 for individuals. Let $\mu_i(m)$ be the mediator DRF for individual $i$. Our goal is to estimate the average mediator DRF, $E[\mu_i(m)]$, with which we can compute the percentage change of $E[\mu_i(m)]$ for a 10% increase in $m$ ceteris paribus. The expectation here is taken over the population of $i$, and so are other expectations in this paper if not specified. Here we consider polynomial mediator DRF: $\mu_i(m) = \sum_{p=1}^{P} \beta_{i,p} m^p$, which can capture the nonlinearity of the causality. Estimating $E[\mu_i(m)]$ means estimating $\beta_p = E[\beta_{i,p}]$ for $p = 1, 2, \cdots, P$. Vectors $\tau_i$ and $\gamma_i$ are in $\mathbb{R}^K$. Each element of the vector, $\tau_{i,k}$ or $\gamma_{i,k}$, represents the direct effect of the treatment on the mediator or outcome in trial $k$, and is assumed not to depend on $m$.

Vectors $\phi$ and $\theta$ are also in $\mathbb{R}^K$, representing trial fixed effects. Random variables $M_i^*$ and $Y_i^*$ represent idiosyncratic individual characteristics in the values of potential mediator and potential outcome. Assume $E[M_i^*] = E[Y_i^*] = 0$.
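The percentage-change reading of the average mediator DRF can be sketched in a few lines. This is a minimal illustration, assuming a polynomial DRF with no intercept term and coefficients $\beta_1, \cdots, \beta_P$ that are already known; the coefficient values and the helper names `average_drf` and `pct_change_for_lift` are hypothetical and not part of the paper.

```python
import numpy as np

def average_drf(beta, m):
    """Evaluate sum_{p=1}^{P} beta_p * m**p, with beta = [beta_1, ..., beta_P]."""
    beta = np.asarray(beta, dtype=float)
    powers = np.arange(1, beta.size + 1)      # exponents 1, 2, ..., P (no intercept)
    return float(np.sum(beta * m ** powers))

def pct_change_for_lift(beta, m, lift=0.10):
    """Relative change of the average DRF when m increases by `lift` (10% by default)."""
    base = average_drf(beta, m)
    return (average_drf(beta, m * (1.0 + lift)) - base) / base

# Hypothetical quadratic DRF evaluated at a mediator level of 0.2
# (the same level used in the NDCG example of the introduction).
beta_example = [2.0, -1.5]
print(pct_change_for_lift(beta_example, m=0.2))
```

Under this no-intercept form, a purely linear DRF always returns a relative change equal to the lift itself, so comparing candidate metrics is informative mainly through the outcome scale or through nonlinear terms.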
Let $\mathcal{M}$ be the support of the distribution of the mediator, and $\mathcal{S}$ be the set containing all possible trial assignments.

We only observe realized data $\{M_i, Y_i, T_i, S_i\}$. The observed mediator is $M_i := M_i^{(T_i, S_i)}$, and the observed outcome is $Y_i := Y_i^{(T_i, S_i, M_i)}$.

The specification has two implications. First, it implies that being in a particular trial will not affect mediator DRF. If mediator DRF is not trial independent, then having many trials only adds noise rather than providing additional information, and thus defeats the purpose of conducting a meta-analysis.

Footnote (on each trial having one treatment group and one control group): Experiments with multiple treatment arms can be considered as multiple trials with one treatment group and one control group.

Footnote (on the zero-mean assumption for $M_i^*$ and $Y_i^*$): This assumption can always be satisfied by reparameterizing $M_i^*$, $Y_i^*$ and $\phi$, $\theta$.

Figure 3: Mediator DRF and Observed Data in Control Group

Remark (Trial-Irrelevant Mediator DRF).

$$Y_i^{(t,s,m')} - Y_i^{(t,s,m)} = \mu_i(m') - \mu_i(m), \tag{3}$$

for all $t = 0, 1$, $m', m \in \mathcal{M}$ and $s \in \mathcal{S}$, where $\mu_i(m') - \mu_i(m)$ does not depend on $s$.

Second, the specification implies that there are no interaction effects between treatment and mediator on the outcome. It means the individual direct effect in each trial does not depend on the value of the mediator. This is common in the literature of causal mediation (see, e.g., NPSEM-IE of Pearl [16] and FRCISTG of Robins [20]).

Remark (No-Interaction, Pearl [16], Robins [20]).

$$Y_i^{(1,s,m)} - Y_i^{(0,s,m)} = \gamma_i^\top s, \tag{4}$$

for all $m \in \mathcal{M}$ and $s \in \mathcal{S}$, where $\gamma_i$ does not depend on $m$.

When mediator $M_i$ is not independent of $Y_i$ (which is the case here), it is not trivial to recover $E[\mu_i(m)]$ through observed data. Figure 3 illustrates the challenge using simulated data. The grey lines in Figure 3 are simulated individual mediator DRFs following Definition 1, which represent the true causality between the $M_i$'s and $Y_i$'s. The blue line is the average mediator DRF. After randomly assigning individuals to a trial, we can compute their observed mediator values $M_i$ and outcome values $Y_i$ when in the control group, which are depicted by black scattered points. The black line shows the result from fitting the observed points by a widely used non-parametric machine learning algorithm, locally estimated scatterplot smoothing (LOESS). Although the black line fits the data almost perfectly, it significantly deviates from the true underlying causality: the average mediator DRF.

We first formally define the assumption that trial assignment and treatment assignment are random.

Assumption 1 (Random Assignment).

$$\left\{ Y_i^{(t',s,m)}, M_i^{(t,s)} \right\} \perp\!\!\!\perp S_i \tag{5}$$

$$\left\{ Y_i^{(t',s,m)}, M_i^{(t,s)} \right\} \perp\!\!\!\perp T_i \mid S_i = s \tag{6}$$

for $t, t' = 0, 1$, $m \in \mathcal{M}$ and $s \in \mathcal{S}$, and it is also assumed that $0 < P(T_i = t \mid S_i = s) < 1$ and $0 < P(M_i(t) = m \mid T_i = t, S_i = s) < 1$ for $t = 0, 1$, $m \in \mathcal{M}$ and $s \in \mathcal{S}$.

Equation 6 can be guaranteed by random assignment of treatments in online A/B tests. Equation 5 means that an individual's potential outcomes and potential mediators are independent of the trial assignment. In practice, users are randomly selected into A/B tests, thus this assumption is trivially satisfied. (Footnote: This is true regardless of whether the same unit can participate in multiple trials.)

Assumption 2 (Relaxed Sequential Ignorability).

$$Y_i^{(t,s,m')} - Y_i^{(t,s,m)} \perp\!\!\!\perp M_i \mid T_i = t, S_i = s \tag{7}$$

for all $m, m' \in \mathcal{M}$, all $s \in \mathcal{S}$, $t = 0, 1$. When $\mu_i(m)$ is a polynomial function, this assumption is equivalent to $\beta_{i,1}, \cdots, \beta_{i,P} \perp\!\!\!\perp M_i \mid T_i = t, S_i = s$ for all $s \in \mathcal{S}$, $t = 0, 1$.

This assumption says that the effect of changing $m$ on the outcome of $i$ is independent of the idiosyncratic individual unobservable that affects $M_i$. A similar assumption is proposed by Small [25, (IV-A3)]. It means that the underlying causality between the online product and the business is invariant even if we observe some users producing higher online metrics than others for unknown reasons (i.e., the idiosyncratic unobservable). It is weaker than SI. Whenever SI is satisfied, this assumption is naturally satisfied. (Footnote: If $Y_i^{(t,s,m)} \perp\!\!\!\perp M_i \mid T_i = t, S_i = s$ for all $m \in \mathcal{M}$, then naturally $Y_i^{(t,s,m')} - Y_i^{(t,s,m)}$ is also independent of $M_i$ given $T_i = t$, $S_i = s$.) But the converse is not true. It is possible to break SI and still fulfill this assumption. For example, when the mediator DRF is the same for all individuals and potential outcomes only depend on unobserved confounders, SI could be violated, while this assumption is still satisfied.

In order to allow the presence of many direct treatment effects, we put some structure on the direct treatment effects. Let $H_k$ represent a vector of trial characteristics for trial $k$.

Assumption 3 (Trial-Level Conditional Independence). We assume the $\gamma_{i,k} \mid H_k$ are independently and identically distributed with $E[\gamma_{i,k} \mid H_k] = H_k^\top \pi$ and $\gamma_{i,k} \perp\!\!\!\perp \tau_{i,k} \mid H_k$ for all $i$.

This assumption allows correlation between individual direct treatment effects on outcome and on mediator in a trial, while assuming such correlation disappears once conditioned on the characteristics of the trial. In the example of the ranking algorithm, we may believe that experiments that test new algorithms generally have high impacts on both online NDCG and GMV, whereas experiments that test new UI designs generally have only modest impacts on both metrics. But, within the same type of experiments, how much a treatment affects online NDCG does not correlate with its effect on GMV. This assumption is unverifiable. However, it is weaker than the standard IV assumption in the literature. If there are no direct effects (the IV assumption), then this assumption is trivially satisfied.

We can stack vectors of trial characteristics of all the $K$ trials into a matrix $H = [H_1 \; H_2 \; \cdots \; H_K]^\top$. Assumption 3 implies

$$E[\gamma_i \mid H] = \left[ E[\gamma_{i,1} \mid H_1] \;\; E[\gamma_{i,2} \mid H_2] \;\; \cdots \;\; E[\gamma_{i,K} \mid H_K] \right]^\top = H\pi \tag{8}$$

for all $i$.

Assumption 4 (Relevance Condition).

$$\mathrm{Var}(\tau_k \mid H_k) > 0 \tag{9}$$

This assumption means that, for the same type of experiments, the direct treatment effect on $M$ ($\tau_k$) varies between experiments. It implies that treatment in each trial is still helpful for predicting $M$ after conditioning on trial-level covariates. This assumption is similar to the standard rank condition of IV identification (see Wooldridge [28, Chapter 5]). Because $\tau_k$ can be calculated easily from summary statistics, this assumption can be empirically verified. In practice, since we can decide which trials to include in the analysis, we can make sure this assumption always holds.
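One rough way to inspect the relevance condition from summary statistics is to project the estimated per-trial effects on the mediator onto the trial characteristics and look at what variation is left. The sketch below is an assumption of ours rather than a procedure from the paper: the function name `residual_variance_of_tau`, the example numbers, and the choice of a constant plus one binary trial type for $H_k$ are all hypothetical. Estimated effects also carry sampling noise, so a positive residual variance is a sanity check, not a formal test.

```python
import numpy as np

def residual_variance_of_tau(ate_m, H):
    """Residual variance of trial-level mediator ATEs after regressing them on H.

    ate_m : length-K array of estimated treatment effects on the mediator.
    H     : (K, d) array of trial characteristics (assumed here to include a constant column).
    """
    ate_m = np.asarray(ate_m, dtype=float)
    H = np.asarray(H, dtype=float)
    coef, *_ = np.linalg.lstsq(H, ate_m, rcond=None)
    resid = ate_m - H @ coef
    dof = ate_m.size - np.linalg.matrix_rank(H)   # degrees of freedom left after the fit
    return float(resid @ resid / dof)

# Hypothetical summary statistics for six trials of two types.
H_example = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)
ate_m_example = np.array([0.1, 0.2, 0.3, 1.0, 1.0, 1.0])
print(residual_variance_of_tau(ate_m_example, H_example))
```

A value near zero would suggest that the trial characteristics already explain almost all variation in the mediator effects, in which case the second-stage regression has little left to identify the slope from.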

To simplify the proof, let's assume $\mu_i(m) = \beta_i m$. The same proof works for the more general polynomial DRF. Based on Definition 1, random variables of observed mediator $M_i$ and observed outcome $Y_i$ can be written as

$$M_i = (S_i T_i)^\top \tau + S_i^\top \phi + \eta_i, \tag{10}$$

$$Y_i = M_i \beta + S_i^\top \theta + (S_i T_i)^\top \gamma + \epsilon_i, \tag{11}$$

where $\eta_i = (S_i T_i)^\top (\tau_i - \tau) + M_i^*$, $\epsilon_i = (S_i T_i)^\top (\gamma_i - \gamma) + M_i (\beta_i - \beta) + Y_i^*$, $\beta = E[\beta_i]$, $\tau = E[\tau_i]$, and $\gamma = E[\gamma_i]$. By plugging Equation 10 into Equation 11 for $M_i$, we obtain

$$Y_i = (S_i T_i)^\top \tau \beta + S_i^\top (\theta + \phi \beta) + (S_i T_i)^\top H \pi + \epsilon'_i \tag{12}$$

where $\epsilon'_i = (S_i T_i)^\top (\gamma_i - H\pi) + \eta_i \beta + M_i (\beta_i - \beta) + Y_i^*$.

With all the specifications and assumptions, we are ready to present the important result about the identification.

Theorem 1. Consider the model specified in Definition 1. Under Assumptions 1-4, the average mediator effect $\beta = E[\beta_i]$ can be identified. A consistent estimator of $\beta$ can be derived through a procedure of two-stage least squares (2SLS):

(1) Following Equation 10, regress $M_i$ on $S_i T_i$ and $S_i$ via least squares to obtain a consistent estimator $\hat{\tau}$;

(2) Plug the consistent estimator $\hat{\tau}$ into Equation 12, and regress $Y_i$ on $(S_i T_i)^\top \hat{\tau}$, $S_i$, and $(S_i T_i)^\top H$ via least squares to obtain the consistent estimator $\hat{\beta}$.

The proof is in Appendix A. Since $\tau$ can be identified from Equation 10, we can use $\hat{\tau}$ in place of $\tau$. The proof shows that under Assumptions 1-4, the covariances between $\epsilon'_i$ and all the covariates in Equation 12, namely $(S_i T_i)^\top \hat{\tau}$, $S_i$, and $(S_i T_i)^\top H$, are zeros. Therefore, the structural parameter $\beta$ can be identified (see, e.g., Wooldridge [28, Chapter 4] for more details of coefficient identification in linear regression).
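The two regressions of Theorem 1 can be sketched on simulated pooled data. This is a minimal illustration under assumptions of ours, not the paper's simulation design: a linear DRF with a constant slope, $H_k$ made of a constant and one binary trial type, a single unobserved confounder `U`, and arbitrary parameter values. The helper `ols` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, beta = 40, 1000, 0.5   # trials, units per trial, true slope (all hypothetical)

# Trial-level quantities (illustrative values).
H = np.column_stack([np.ones(K), rng.integers(0, 2, K)])     # H_k: constant + binary trial type
tau = H @ np.array([0.5, 1.0]) + rng.normal(0.0, 2.0, K)     # effect on M; varies given H_k
gamma = H @ np.array([0.1, 0.5]) + rng.normal(0.0, 0.05, K)  # direct effect on Y; mean H_k' pi
phi, theta = rng.normal(size=K), rng.normal(size=K)          # trial fixed effects

# Unit-level data. U is an unobserved confounder entering both M and Y.
trial = np.repeat(np.arange(K), n)
T = rng.integers(0, 2, K * n).astype(float)
U = rng.normal(size=K * n)
M = tau[trial] * T + phi[trial] + U + rng.normal(size=K * n)
Y = beta * M + gamma[trial] * T + theta[trial] + 2.0 * U + rng.normal(size=K * n)

S = np.eye(K)[trial]      # one-hot trial assignment S_i, shape (K*n, K)
ST = S * T[:, None]       # S_i T_i

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Stage 1 (Equation 10): regress M_i on S_i T_i and S_i; the first K coefficients estimate tau.
tau_hat = ols(np.hstack([ST, S]), M)[:K]

# Stage 2 (Equation 12): regress Y_i on (S_i T_i)' tau_hat, S_i, and (S_i T_i)' H.
X2 = np.column_stack([ST @ tau_hat, S, ST @ H])
beta_hat = ols(X2, Y)[0]

print(f"true beta = {beta}, 2SLS estimate = {beta_hat:.3f}")
```

The dense one-hot matrices keep the sketch close to the notation of Equations 10 and 12, but they grow with both the number of units and the number of trials, which is the pooling cost that motivates working from trial-level summaries instead.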

The estimator (cid:98) β from 2SLS of Theorem 1 is equivalent to an IV -2SLS estimator (see, e.g., Wooldridge [28, Chapter 5] for moredetails of IV -2SLS). To see it, let’s rewrite Equation 11 as Y i = M i β + S ⊤ i ( θ + ϕ β ) + ( S i T i ) ⊤ Hπ + ϵ ′′ i (13)where ϵ ′′ i = ( S i T i ) ⊤ ( γ i − Hπ ) + M i ( β i − β ) + Y ∗ i . We could get thesame estimator of β as in Theorem 1 through applying IV -2SLS onEquation 13 and using S i T i as instruments for M i . Theorem 1 implies that average mediator effect can be estimated byrunning two regressions with pooled data. Such estimation couldbe very costly when the sample size in each trial is huge so thatpooling data from all trials becomes infeasible. We propose a simplertwo-stage procedure:

CMMA.

Algorithm 1 CMMA
Input: $Y_i$, $T_i$, $M_i$, $S_i$, $H_k$
Output: $\hat{\beta}$
1: for trial $k = 1, \dots, K$ do
2:   Estimate the ATE of trial $k$ treatment on $Y$ using data from trial $k$ and denote it as $\mathrm{ATE}^{Y}_{k}$.
3:   Estimate the ATE of trial $k$ treatment on $M$ using data from trial $k$ and denote it as $\mathrm{ATE}^{M}_{k}$.
4: end for
5: Regress $\mathrm{ATE}^{Y}_{k}$ on $\mathrm{ATE}^{M}_{k}$ and additional trial-level covariates $H_k$. Save the coefficient for $\mathrm{ATE}^{M}_{k}$ as $\hat{\beta}$.
6: return $\hat{\beta}$

Steps 1-4 of CMMA calculate the ATEs for each trial. Estimating the ATEs on $Y$ and $M$ can follow the standard procedure of online A/B tests. Steps 5 and 6 use only trial-level summary statistics and covariates, which makes the algorithm very easy to implement. This estimator has the same identification strategy as in Theorem 1 and is equivalent to a weight-adjusted 2SLS estimator. The proof is in Appendix B.

Note that CMMA allows the same unit to participate in multiple trials. We can always use regression/ANOVA with treatment interaction terms to estimate the ATEs of each trial for units in multiple trials, and then implement Steps 5 and 6 of CMMA on the estimated ATEs to get $\hat{\beta}$.

The most challenging part of applying CMMA is finding valid trial-level characteristics $H_k$ to satisfy Assumption 3. A good $H_k$ should have explanatory power for treatment effects on outcome and mediator across trials. However, similar to finding a valid instrument, there is no systematic way to produce $H_k$. Practitioners have to rely on available data and domain knowledge to argue for the validity of $H_k$. In Section 6 Table 2, we simulate the consequences of violating Assumption 3. In Section 7.2 Figure 4, we discuss the choice of $H_k$ in our real data application. Future work is required on the sensitivity of CMMA to $H$.

In general, the standard errors reported by the second-stage regression are slightly different from the theoretical values, because the second stage has no access to the residuals of the first stage. This becomes less of an issue as the sample size increases. Since the sample size is usually enormous in online A/B tests, we recommend using the reported standard errors in the second-stage regression for convenience.

Although we have assumed that $\mu_i(m) = \beta_i m$ for discussion to this point, the same proof is still valid for a polynomial DRF, $\mu_i(m) = \sum_{p=1}^{P} \beta_{i,p} m^p$. Let $\mathbf{M}_i = [M_i, \cdots, M_i^p, \cdots, M_i^P]^\top$ and $\boldsymbol{\beta} = [\beta_1, \cdots, \beta_p, \cdots, \beta_P]^\top$, where $\beta_p = E[\beta_{i,p}]$. Then we can use $\mathbf{M}_i^\top \boldsymbol{\beta}$ in place of $M_i \beta$ in the proof. For our algorithm, in addition to $\mathrm{ATE}^{Y}_{k}$ and $\mathrm{ATE}^{M}_{k}$, we also need to estimate the ATE on each higher-order term of $M_i$, $\mathrm{ATE}^{(M^p)}_{k}$, up to order $P$.

To decide the highest order $p$ to include in the second-stage model, we can use a common model selection tool, the Wald test. The standard way is to run a regression with higher-order terms and then perform a series of tests to check whether the coefficients of those terms are zero. See Greene [8, Chapter 5] for more technical discussions on the Wald test.

We conduct Monte Carlo simulations to study the finite sample performance of our estimator. The details of our simulation setup are described in Appendix C. The R code is available in our GitHub repository: https://github.com/znwang25/cmma.
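As a rough companion to Algorithm 1 and to the simulation idea, the following Python sketch simulates a handful of A/B tests, reduces each one to two summary statistics, and runs the trial-level regression. It is not the authors' R code. The function names `trial_ates` and `cmma_second_stage`, the two trial types, and all parameter values are hypothetical. It also assumes a linear DRF and a one-hot $H_k$, in which case the coefficient on $\mathrm{ATE}^{M}_{k}$ equals the slope on within-type demeaned ATEs by the Frisch–Waugh–Lovell theorem.

```python
import random
from statistics import mean

def trial_ates(rng, n, tau_k, gamma_k, beta, rho=0.5):
    """Simulate one A/B test and return (ATE on M, ATE on Y) as differences in means."""
    m_by_arm, y_by_arm = {0: [], 1: []}, {0: [], 1: []}
    for _ in range(n):
        t = rng.randint(0, 1)                            # randomized treatment
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        m_star = z1
        y_star = rho * z1 + (1 - rho ** 2) ** 0.5 * z2   # unobserved confounding
        m = tau_k * t + m_star
        y = beta * m + gamma_k * t + y_star
        m_by_arm[t].append(m)
        y_by_arm[t].append(y)
    return (mean(m_by_arm[1]) - mean(m_by_arm[0]),
            mean(y_by_arm[1]) - mean(y_by_arm[0]))

def cmma_second_stage(ate_m, ate_y, trial_type):
    """Coefficient of ATE^M when regressing ATE^Y on ATE^M and one-hot type dummies."""
    num = den = 0.0
    for g in set(trial_type):
        idx = [k for k, h in enumerate(trial_type) if h == g]
        mean_m = mean(ate_m[k] for k in idx)
        mean_y = mean(ate_y[k] for k in idx)
        for k in idx:
            num += (ate_m[k] - mean_m) * (ate_y[k] - mean_y)
            den += (ate_m[k] - mean_m) ** 2
    return num / den

rng = random.Random(0)
beta, K, n = 2.0, 30, 2000                    # hypothetical values
direct = {"ranking": 1.0, "ui": -0.5}         # E[gamma_k | H_k], hypothetical
base_tau = {"ranking": 1.0, "ui": 0.2}

ate_m, ate_y, trial_type = [], [], []
for k in range(K):
    h = "ranking" if k % 2 == 0 else "ui"
    tau_k = base_tau[h] + rng.uniform(-0.5, 0.5)   # varies within type (Assumption 4)
    gamma_k = direct[h] + rng.gauss(0, 0.05)       # mean depends only on type (Assumption 3)
    a_m, a_y = trial_ates(rng, n, tau_k, gamma_k, beta)
    ate_m.append(a_m)
    ate_y.append(a_y)
    trial_type.append(h)

beta_hat = cmma_second_stage(ate_m, ate_y, trial_type)
print(round(beta_hat, 3))
```

Only the per-trial pairs of ATEs and the trial type enter the second stage, which is the property that lets the method run on summary statistics alone.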

Table 1: Finite-Sample Performance Comparison

We use the Limited Information Maximum Likelihood (LIML) estimator as a benchmark to evaluate our CMMA estimator. An LIML estimator that is specified according to the simulation setup should have the best performance theoretically. We also include in the comparison two common estimators from the mediation analysis literature: Sobel [26] and LSEM [3], which are derived under the identification approaches discussed in Section 2. Sobel [26] assumes complete mediation, and LSEM [3] relies on the SI assumption [12]. Both assumptions are false under our setting. We also implement the Full Sample 2SLS estimator prescribed in Theorem 1, which should produce similar results to CMMA.

We set the sample size per trial ($N_{per}$) to 200, 500, and 1000 and perform 100 simulations for each setting. Table 1 reports average biases and 95% confidence interval coverage of the estimators. The performance of CMMA is quite good and largely comparable to LIML's result. When the sample size per trial is small, our estimator is slightly biased, but the bias is much smaller than those of the Sobel and LSEM estimators. As $N_{per}$ increases, the bias of our estimator decreases to zero, whereas the biases of the Sobel and LSEM estimators remain roughly the same. As the sample size is usually enormous in A/B tests (on average, one trial has millions of observations in the real data we obtained from an internet company), the bias of CMMA will be negligible in practice. Table 1 also shows that, as $N_{per}$ increases, the 95% confidence interval coverage of our estimator converges to the nominal coverage. This means that the OLS variance estimated in the trial-level regression is valid for hypothesis testing when $N_{per}$ is sufficiently large. In addition, the point estimate of the Full Sample 2SLS estimator is numerically equal to that of CMMA, which empirically validates the equivalence claim made in Section 5.1.

Note: We need $\mathrm{ATE}^{(M^p)}_{k}$, not the $p$-th power of $\mathrm{ATE}^{M}_{k}$: $\mathrm{ATE}^{(M^p)}_{k} \neq (\mathrm{ATE}^{M}_{k})^p$.

Table 2: Assumption Violation

In Table 2, we examine the performance of CMMA when each assumption fails. The results show that the failure of Assumption 2 does not seem to affect the estimator's unbiasedness, whereas with failures of Assumption 3 or 4, CMMA is no longer unbiased. Comparing results across rows suggests that violating Assumption 4 has worse consequences in terms of bias. Fortunately, Assumption 4 is testable, as discussed in Section 4.

We also test the performance of the Wald test with a higher order of $\mu(m)$. Table 3 shows that Wald tests can successfully select the correct model. For example, in the second row of Table 3, the true mediator DRF is a quadratic function of $m$: $4m + 2m^2$. Wald tests in all simulations reject the null hypothesis that the coefficients for the second-degree term and the third-degree term are zero, whereas 97% of simulations fail to reject the hypothesis that the coefficient for the third-degree term is zero. The result suggests that the highest order of $\mathrm{ATE}^{(M^p)}_{k}$ should be 2. The last column of Table 3 shows that, with a correctly specified model, we can accurately estimate all the underlying parameters.

We apply the approach to the three most popular rank-aware evaluation metrics, NDCG, MRR, and MAP, to show which one could lead to the most significant lift in sitewide GMV for the ranking algorithms that power the search page of Etsy.com. Since the offline A/B test literature [6] bridges the inconsistency between changes of offline and online evaluation metrics, we focus only on how sitewide GMV would change for 10% lifts in online NDCG, MRR, and MAP of the search page respectively. All metrics in the application, unless otherwise noted, are online metrics. Please note the approach has not been deployed in Etsy. This work is not intended to apply to, nor is it a prediction of, actual live performance metrics or performance changes on Etsy or any other property.

We follow the offline A/B test literature [6] and define the three online rank-aware evaluation metrics at the user level. Although the three metrics are originally defined at the query level in the test collection evaluation of the information retrieval (IR) literature, the search page in the industry is an online product for users, and thus the computation can be adapted to the user level. More specifically, the three metrics are constructed as follows: 1) query-level metrics are computed using rank positions on the search page and user conversion status as binary relevance, and queries not associated with a conversion have zero values; 2) user-level metrics are the average of query-level metrics across all queries the user issues (including queries not associated with a conversion), and users who do not search or convert have zero values. Also, all three metrics are defined at rank position 48, the lowest position of the first page of search results on Etsy.com.

We have access to summarized results of 190 randomly selected experiments from the online A/B test platform of Etsy.com. All the experiments in the data have the user as the experimental unit. The data include descriptive information about each experiment, such as the tested product change and the product team that initiated the experiment, and summary statistics of each experiment, such as the average (user-level) NDCG per user in the treatment and control groups. Note that the difference between the average metric per user in the treatment and control groups is the ATE on the metric (Steps 1-4 in Algorithm 1).

We use the taxonomy of product teams as our trial-level covariates $H$ because team taxonomy is quite informative of experiments, as Figure 4 suggests. Figure 4 shows the density of ATE on NDCG for selected teams, including UI/UX design, marketing, and search ranking. First, it is evident that the distributions of ATE on NDCG vary by team, which implies that Assumption 3 is likely to be satisfied. In particular, most experiments of the search ranking team post positive gains on NDCG, whereas experiments of the UI/UX team barely affect NDCG. Second, within each team there is significant variance of ATE on NDCG, which suggests that Assumption 4 holds. Due to the page limit, only results for NDCG are presented here, but the distributions of ATE on MAP and MRR exhibit a similar pattern.
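The user-level metric construction described at the start of this section can be sketched in Python as follows. This is a hedged illustration, not the production computation: it uses common textbook forms of reciprocal rank, average precision, and NDCG with binary relevance, and the paper does not spell out which normalization variants it uses. The function names `query_metrics` and `user_metrics` and the input format (a list of converted rank positions per query) are assumptions made for the example.

```python
import math

CUTOFF = 48  # rank cutoff stated in the text (last position of the first results page)

def query_metrics(converted_ranks, cutoff=CUTOFF):
    """Query-level metrics from the 1-based ranks of items with relevance 1."""
    ranks = sorted(r for r in converted_ranks if r <= cutoff)
    if not ranks:  # query not associated with a conversion
        return {"NDCG": 0.0, "MRR": 0.0, "MAP": 0.0}
    reciprocal_rank = 1.0 / ranks[0]
    # Precision at each relevant rank, averaged over the relevant items.
    avg_precision = sum((j + 1) / r for j, r in enumerate(ranks)) / len(ranks)
    dcg = sum(1.0 / math.log2(r + 1) for r in ranks)
    ideal_dcg = sum(1.0 / math.log2(j + 2) for j in range(len(ranks)))
    return {"NDCG": dcg / ideal_dcg, "MRR": reciprocal_rank, "MAP": avg_precision}

def user_metrics(queries):
    """Average query-level metrics over all queries a user issued."""
    if not queries:  # user who never searched
        return {"NDCG": 0.0, "MRR": 0.0, "MAP": 0.0}
    per_query = [query_metrics(q) for q in queries]
    return {name: sum(q[name] for q in per_query) / len(per_query)
            for name in per_query[0]}

# One query converting on the item at rank 2, plus one query without a conversion.
example_user = user_metrics([[2], []])
print(example_user)
```

Because most queries and most users contribute zeros, user-level averages of this kind are far smaller than the query-level values usually reported in the IR literature.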

Table 3: Model Selection and Wald Tests, with $\mu(m) = \beta_1 m + \beta_2 m^2 + \beta_3 m^3$

| True $\mu(m)$ | % of Wald tests rejecting $H_0: \beta_3 = 0$ | % rejecting $H_0: \beta_2 = \beta_3 = 0$ | % rejecting $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ | CMMA estimation: highest order of $\mathrm{ATE}^{(M^p)}_{k}$ | CMMA estimation: estimated $\beta$'s of $\mu(m)$ |
|---|---|---|---|---|---|
| linear in $m$ | 3% | 5% | 100% | 1 | 4.0164 |
| quadratic in $m$ | 3% | 100% | 100% | 2 | 4.019, 1.9994 |
| cubic in $m$ | | | | | |

Note: $H_0$ is rejected if the p-value is below the significance threshold.

Figure 4: Density of ATE on NDCG

Table 4: Wald Test Results (p-values of the null hypotheses $H_0: \beta_3 = 0$, $H_0: \beta_2 = \beta_3 = 0$, and $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ for $M$: NDCG, $M$: MAP, and $M$: MRR)

Note on binary relevance: if the user purchases the item that she has clicked on the search page, the relevance is 1; otherwise 0.

To decide the polynomial terms in the model, we perform Wald tests. The results of the Wald tests are in Table 4. Since all the null hypotheses are rejected, the results suggest including ATE on $M$, ATE on $M^2$, and ATE on $M^3$ in the model of each $M$ (NDCG, MAP, MRR).

Figure 5 shows the estimation results. On the first row, blue lines depict the estimated mediator DRFs. Scattered points represent the summary statistics (the data for CMMA) of all experiments, namely the ATE on ranking evaluation metrics and the ATE on GMV. Black curves show results from fitting the data by LOESS. Note that the range of the three evaluation metrics could be much smaller than those in the IR literature since they are defined at the user level (see Section 7.1). The second-stage regression results for the mediator DRF are in Table D2 in the Appendix. The estimated mediator DRFs show that all three online metrics have positive causal effects on GMV. Note that the causal relationships are different from the pattern of the data (scatter points and black curves). The differences between the blue lines and the black curves show the bias from fitting the data by machine learning methods without addressing omitted variables.

The second row of Figure 5 shows the elasticity of GMV: the percentage change of GMV for a 10% increase in each evaluation metric at its different values, which is derived from the estimated mediator DRF. The downward slopes imply that, for all three evaluation metrics, the benefit to GMV of continuing to improve them decreases as they increase. For example, when average NDCG per user equals 0.0021, its 10% increase leads to a 9.88% increase in average GMV per user. Yet, when it equals 0.006, its 10% increase only leads to a 9.65% increase in average GMV per user.

Now it is easy for product owners to pick the evaluation metric that could guide algorithm development to achieve the most significant lift in online GMV. Suppose the current average values (per user) of NDCG, MAP, and MRR from live data are 0.00210, 0.00156, and 0.00153 respectively. From the estimated mediator DRFs, we can calculate their corresponding elasticities of GMV: 9.88%, 9.87%, and 9.90%, which are marked by red lines in Figure 5. Because online MRR has the highest elasticity of GMV, we should choose offline MRR, which is estimated based on the offline A/B test literature [6] and thus moves the same way as online MRR, to guide the development of ranking algorithms.
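One way to read the elasticity calculation above is as the relative change of a polynomial DRF when the metric rises by 10%. The Python sketch below works under that reading only. The exact formula behind Figure 5 is not given in this section, the function names `drf` and `lift_for_relative_increase` are hypothetical, and the coefficients are invented for illustration rather than taken from Table D2.

```python
def drf(m, betas):
    """Polynomial DRF mu(m) = beta_1*m + beta_2*m**2 + ... with no intercept."""
    return sum(b * m ** (p + 1) for p, b in enumerate(betas))

def lift_for_relative_increase(m, betas, rel=0.10):
    """Relative change in mu when the metric moves from m to (1 + rel) * m."""
    return drf((1 + rel) * m, betas) / drf(m, betas) - 1.0

# A linear DRF passes a 10% metric lift through one-for-one.
linear = lift_for_relative_increase(0.002, [5.0])

# A concave DRF (hypothetical coefficients) gives diminishing returns.
concave_low = lift_for_relative_increase(0.002, [5.0, -100.0])
concave_high = lift_for_relative_increase(0.004, [5.0, -100.0])
print(linear, concave_low, concave_high)
```

Under this simplified reading, values slightly below 10% that shrink as the metric grows are what a mildly concave DRF would produce, which matches the qualitative pattern of downward-sloping elasticities described in the text.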

In the internet industry, algorithms developed offline power online products, and online products contribute to the success of a business. In many cases, offline evaluation metrics, which guide algorithm development, are different from online business KPIs. It is important for product owners to pick the offline evaluation metric under whose guidance the algorithm could maximize online business KPIs. By noticing that online products could be assessed by online counterparts of offline evaluation metrics, we decompose the problem into two parts. Since the offline A/B test literature works out the first part (counterfactual estimators of offline evaluation metrics that move the same way as their online counterparts), we focus on the second part: inferring causal effects of online evaluation metrics on business KPIs. The offline evaluation metric whose online counterpart causes the most significant lift in online business KPIs should be the north star. We model online evaluation metrics as mediators and formalize the problem as identifying, estimating, and testing the mediator DRF. Our novel approach, CMMA, combines mediation analysis and meta-analysis and has many advantages over the two strands of literature. In particular, it takes as inputs only summary statistics from multiple past A/B tests, and thus it is easy to implement at scale. We apply the approach to Etsy's real data to uncover the causality between the three most popular rank-aware online evaluation metrics and GMV, and show how we successfully identify MRR as the offline evaluation metric for GMV maximization.

Figure 5: Estimated Mediator Dose Response Function

REFERENCES
[1] Joshua Angrist, Guido Imbens, and Donald Rubin. 1996. Identification of Causal Effects Using Instrumental Variables. J. Amer. Statist. Assoc. 91, 434 (6 1996), 444.
[2] Joshua Angrist and Alan Krueger. 2001. Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Journal of Economic Perspectives 15, 4 (11 2001), 69–85.
[3] Reuben Baron and David Kenny. 1986. The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51, 6 (1986), 1173–1182.
[4] Will Browne and Mike Jones. 2017. What works in e-commerce - a meta-analysis of 6700 online experiments. Qubit Digital Ltd (2017), 1–21.
[5] Harris Cooper, Larry Hedges, and Jeffrey Valentine. 2009. The handbook of research synthesis and meta-analysis. Russell Sage Foundation.
[6] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B Testing for Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). Association for Computing Machinery, New York, NY, USA, 198–206.
[7] Donald Green, Shang Ha, and John Bullock. 2010. Enough already about "Black Box" experiments: Studying mediation is more difficult than most scholars suppose. Annals of the American Academy of Political and Social Science.
[8] Greene. Econometric analysis (7 ed.). Pearson Education Inc. 1232 pages.
[9] James Heckman and Rodrigo Pinto. 2015. Econometric Mediation Analyses: Identifying the Sources of Treatment Effects from Experimentally Estimated Production Technologies with Unmeasured and Mismeasured Inputs. Econometric Reviews 34 (2015), 6–31.
[10] Julian Higgins and Simon Thompson. 2002. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 21, 11 (6 2002), 1539–1558.
[11] Kosuke Imai, Luke Keele, Dustin Tingley, and Teppei Yamamoto. 2014. Comment on Pearl: Practical implications of theoretical results for causal mediation analysis. Psychological Methods 19, 4 (2014), 482–487.
[12] Kosuke Imai, Luke Keele, and Teppei Yamamoto. 2010. Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statist. Sci. (2010).
[13] Guido Imbens. 2000. The Role of the Propensity Score in Estimating Dose-Response Functions. Biometrika 87, 3 (2000), 706–710.
[14] Guido Imbens and Keisuke Hirano. 2004. The Propensity Score with Continuous Treatments. (2004).
[15] David MacKinnon, Amanda Fairchild, and Matthew Fritz. 2006. Mediation Analysis. Annual Review of Psychology 58, 1 (12 2006), 593–614.
[16] Judea Pearl. 2001. Direct and indirect effects. In Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 411–420.
[17] Judea Pearl. 2014. Interpretation and identification of causal mediation. Psychological Methods 19, 4 (2014), 459–481.
[18] Judea Pearl. 2014. Reply to Commentary by Imai, Keele, Tingley, and Yamamoto Concerning Causal Mediation Analysis. Psychological Methods 19, 4 (2014), 488–492.
[19] Alexander Peysakhovich and Dean Eckles. 2018. Learning causal effects from many randomized experiments using regularized instrumental variables. In The Web Conference 2018 (WWW 2018). ACM, New York, NY.
[20] James Robins. 2003. Semantics of causal DAG models and the identification of direct and indirect effects. Highly Structured Stochastic Systems (1 2003), 70–82.
[21] James Robins and Sander Greenland. 1992. Identifiability and exchangeability for direct and indirect effects. Epidemiology 3, 2 (1992), 143–155.
[22] James Robins and Thomas Richardson. 2010. Alternative graphical causal models and the identification of direct effects. Causality and psychopathology: finding the determinants of disorders and their cures (2010).
[23] Donald Rubin. 2003. Basic concepts of statistical inference for causal effects in experiments and observational studies. (2003).
[24] Derek Rucker, Kristopher Preacher, Zakary Tormala, and Richard Petty. 2011. Mediation Analysis in Social Psychology: Current Practices and New Recommendations. Social and Personality Psychology Compass 5, 6 (2011), 359–371.
[25] Dylan Small. 2012. Mediation analysis without sequential ignorability: Using baseline covariates interacted with random assignment as instrumental variables. Journal of Statistical Research 46, 2 (2012), 91–103.
[26] Michael Sobel. 2008. Identification of Causal Parameters in Randomized Studies With Mediating Variables. Journal of Educational and Behavioral Statistics 33, 2 (2008), 230–251.
[27] Tom Stanley and Hristos Doucouliagos. 2012. Meta-regression analysis in economics and business. Routledge.
[28] Jeffrey Wooldridge. 2010. Econometric analysis of cross section and panel data. MIT Press, Cambridge, MA. 1096 pages.
[29] Xuan Yin and Liangjie Hong. 2019. The Identification and Estimation of Direct and Indirect Effects in A/B Tests Through Causal Mediation Analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 2989–2999.


A PROOF OF THEOREM 1

Proof. Let $X_i = \left[ (S_i T_i)^\top \tau \quad S_i^\top \quad (S_i T_i)^\top H \right]$ and $B = \left[ \beta \quad \theta + \phi\beta \quad \pi \right]^\top$, so Equation 12 can be written as

$$Y_i = X_i B + \epsilon'_i .$$

Because of Assumption 4, $E[X_i^\top X_i]$ is guaranteed to have full rank and thus is invertible. If we are able to show that $E[X_i^\top \epsilon'_i] = 0$, then $B$ can be estimated by the linear projection $B^* = E[X_i^\top X_i]^{-1} E[X_i^\top Y_i]$. While this is infeasible, we can first replace $\tau$ with its estimate $\hat{\tau}$ and use $\hat{X}_i = \left[ (S_i T_i)^\top \hat{\tau} \quad S_i^\top \quad (S_i T_i)^\top H \right]$ in place of $X_i$. Let $\hat{X}$ be the N-component data vector with $i$th element $\hat{X}_i$. The resulting estimator of $B$ is the two-stage least squares (2SLS) estimator, $\hat{B}_{2SLS} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top Y$.

Note that, because of Assumptions 1, 2, and 3, the following equations are true:

$$E[\eta_i] = E[S_i T_i]^\top E[(\tau_i - \tau)] + E[M^*_i] = 0$$
$$E[M_i(\beta_i - \beta)] = E\left[ E[M_i(\beta_i - \beta) \mid S_i, T_i] \right] = E\left[ M_i E[(\beta_i - \beta) \mid S_i, T_i] \right] = 0$$
$$E[\gamma_i - H\pi] = E\left[ E[\gamma_i \mid H] - H\pi \right] = 0$$

With the first component in $X_i$,

$$
\begin{aligned}
E\left[ (S_i T_i)^\top \tau \, \epsilon'_i \right]
&= E\left[ (S_i T_i)^\top \tau \left[ (S_i T_i)^\top (\gamma_i - H\pi) + \eta_i \beta + M_i(\beta_i - \beta) + Y^*_i \right] \right] \\
&= E\left[ (S_i T_i)^\top \tau \, (S_i T_i)^\top (\gamma_i - H\pi) \right] + E\left[ (S_i T_i)^\top \tau \right] E\left[ \eta_i \beta + M_i(\beta_i - \beta) + Y^*_i \right] \\
&= E\left[ (S_i T_i)^\top \tau \, (S_i T_i)^\top (\gamma_i - H\pi) \right] \\
&= E\left[ \sum_{k=1}^{K} E\left[ S_{i,k} T_i \tau_{i,k} (\gamma_{i,k} - H_k \pi) \right] \right] \\
&= E\left[ \sum_{k=1}^{K} E\left[ E\left[ S_{i,k} T_i \tau_{i,k} (\gamma_{i,k} - H_k \pi) \mid H_k \right] \right] \right] \\
&= E\left[ \sum_{k=1}^{K} E\left[ S_{i,k} T_i \tau_{i,k} \left( E[\gamma_{i,k} \mid H_k] - H_k \pi \right) \right] \right] \\
&= 0,
\end{aligned}
$$

where the fourth equality expands the matrix multiplication and the sixth equality follows from Assumption 3.

For the second component,

$$
\begin{aligned}
E\left[ S_i \epsilon'_i \right]
&= E\left[ S_i (S_i T_i)^\top (\gamma_i - H\pi) \right] + E[S_i] \, E\left[ \eta_i \beta + M_i(\beta_i - \beta) + Y^*_i \right] \\
&= E\left[ S_i (S_i T_i)^\top \right] E\left[ \gamma_i - H\pi \right] \\
&= 0,
\end{aligned}
$$

where both the second and third equalities follow from Assumption 1.

With respect to the third component,

$$
\begin{aligned}
E\left[ (S_i T_i)^\top H \epsilon'_i \right]
&= E\left[ (S_i T_i)^\top H (S_i T_i)^\top (\gamma_i - H\pi) \right] + E\left[ (S_i T_i)^\top H \right] E\left[ \eta_i \beta + M_i(\beta_i - \beta) + Y^*_i \right] \\
&= E\left[ (S_i T_i)^\top H (S_i T_i)^\top E\left[ (\gamma_i - H\pi) \mid H \right] \right] \\
&= 0,
\end{aligned}
$$

where both the second and third equalities follow from Assumption 1. Taken together, we have shown that $E[X_i^\top \epsilon'_i] = 0$; therefore $B$ can be estimated by $\hat{B}_{2SLS} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top Y$. □

B EQUIVALENCE OF CMMA AND WEIGHT-ADJUSTED 2SLS ESTIMATOR

Let $Z$ be an N-component data vector with $i$th element $(S_i T_i)^\top$, $S$ be an N-component data vector with $i$th element $S_i^\top$, and $P_S = S(S^\top S)^{-1} S^\top$, $M_S = I - P_S$. The following proposition shows the link between CMMA and the 2SLS estimator.

Proposition 1. The CMMA method defined in Algorithm 1 is equivalent to a weight-adjusted 2SLS IV estimator with weight $W^\top W$, where $W = (Z^\top M_S Z)^{-1} Z^\top$.

Proof. Let $\boldsymbol{\beta} = \begin{pmatrix} \beta \\ \pi \end{pmatrix}$ and $\mathbf{Z} = \left[ Z\hat{\tau} \quad ZH \right]$. Since we are not particularly interested in the coefficients in front of $S_i$, using the Frisch–Waugh–Lovell theorem, we can get the 2SLS estimator for $\boldsymbol{\beta}$,

$$\hat{\boldsymbol{\beta}}_{2SLS} = \left( \mathbf{Z}^\top M_S M_S \mathbf{Z} \right)^{-1} \mathbf{Z}^\top M_S M_S Y .$$

We could use a weighting matrix $W$ and still get a consistent estimator for $\boldsymbol{\beta}$,

$$\tilde{\boldsymbol{\beta}} = \left( \mathbf{Z}^\top M_S W^\top W M_S \mathbf{Z} \right)^{-1} \mathbf{Z}^\top M_S W^\top W M_S Y .$$

Use $W = (Z^\top M_S Z)^{-1} Z^\top$:

$$\mathbf{Z}^\top M_S W^\top = \begin{pmatrix} \hat{\tau}^\top Z^\top M_S Z (Z^\top M_S Z)^{-1} \\ H^\top Z^\top M_S Z (Z^\top M_S Z)^{-1} \end{pmatrix} = \begin{pmatrix} \hat{\tau}^\top \\ H^\top \end{pmatrix}$$

$$\hat{\boldsymbol{\beta}}_{CMMA} = \left( \begin{pmatrix} \hat{\tau}^\top \\ H^\top \end{pmatrix} \begin{pmatrix} \hat{\tau} & H \end{pmatrix} \right)^{-1} \begin{pmatrix} \hat{\tau}^\top \\ H^\top \end{pmatrix} W M_S Y$$

Here $W M_S Y = (Z^\top M_S Z)^{-1} Z^\top M_S Y$ is equivalent to regressing $Y_i$ on $S_i T_i$ and $S_i$ and taking the coefficients of $S_i T_i$, which in turn is equivalent to regressing on $T_i$ within each trial. □

C SIMULATION SETUP

We follow the specification described in Equation 10 and Equation 12 and let $M^*_i$ and $Y^*_i$ be jointly normally distributed:

$$\begin{pmatrix} M^*_i \\ Y^*_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} \sigma^2_{M^*} & \rho \sigma_{M^*} \sigma_{Y^*} \\ \rho \sigma_{M^*} \sigma_{Y^*} & \sigma^2_{Y^*} \end{bmatrix} \right)$$

with $(\rho, \sigma_{M^*}, \sigma_{Y^*})$ held fixed across simulations. We fix the number of trials to be 50, and use independent uniform distributions to specify the parameter value for each element of $\theta$ and $\phi$. To satisfy Assumptions 3 and 4, $\tau$ is set to be the sum of an $H$-dependent term and a random vector drawn from a uniform distribution. All the other parameter values are listed in Table C1. The innovations in the error terms, such as $\beta_i - \beta$, the elements of $\tau_i - \tau$, and $\gamma_i - H\pi$, are all drawn from independent normal distributions.

Table C1: Parameter Values

| Parameter | Value |
|---|---|
| $\beta$ | scalar |
| $\pi$ | 3-element vector |
| $\theta$ | $(\theta_1, \cdots, \theta_{50})^\top$, with each $\theta_k$ drawn from a uniform distribution |
| $\phi$ | $(\phi_1, \cdots, \phi_{50})^\top$, with each $\phi_k$ drawn from a uniform distribution |
| $\tau$ | $T^\top H^\top + (\tau_1, \cdots, \tau_{50})^\top$, with each $\tau_k$ drawn from a uniform distribution and $T$ a 3-element vector |

The one-hot group assignment variable $S_i$ and treatment indicator $T_i$ are randomly generated. We assume the 50 trials can be grouped into 3 experiment types and use experiment types as trial-level covariates. Trials are randomly assigned into three types, and $H$ is a 3 × 50 matrix representing such assignment. Under this setup, all the assumptions are satisfied, and thus we are ready to estimate $\beta$.

D TABLES

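Table D2: Second Stage Regression Results

Dependent variable: GMS

| | (1) $M$: NDCG | (2) $M$: MAP | (3) $M$: MRR |
|---|---|---|---|
| ATE $M$ | | | |
| ATE $M^2$ | negative, *** (SE 6,606.5) | negative, * (SE 9,718.6) | negative, ** (SE 8,882.6) |
| ATE $M^3$ | *** (SE 5,162.6) | ** (SE 7,213.5) | *** (SE 6,576.5) |
| Observations | 190 | 169 | 190 |

Note: *, **, and *** mark increasing levels of statistical significance.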