Algorithms and Complexity for Variants of Covariates Fine Balance
DORIT S. HOCHBAUM∗, ASAF LEVIN†, AND XU RAO‡

Abstract.
We study here several variants of the covariates fine balance problem, where we generalize some of these problems and introduce a number of others. We present a comprehensive complexity study of the covariates problems, providing polynomial time algorithms or a proof of NP-hardness. The polynomial time algorithms described are mostly combinatorial and rely on network flow techniques. In addition, we present several fixed-parameter tractable results for problems where the number of covariates and the number of levels of each covariate are seen as a parameter.
Key words.
Algorithms; Complexity; Covariate balance; Observational studies.
1. Introduction.
The problem of balancing covariates arises in observational studies in various contexts such as statistics [20], [22], epidemiology [3], sociology [15], economics [9] and political science [7]. In an observational study there are two disjoint groups of samples, one of treatment samples and the other of control samples. Each of the samples in the two groups is characterized by several observed covariates, or features.

When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treatment and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treatment and control groups, thereby reducing bias in the estimated treatment effects due to the observed covariates. The matching is to assign each treatment sample to one unique control sample, or, in other setups, to assign each treatment sample to a unique set of κ control samples, for κ a pre-specified integer, where every control sample is assigned to at most one treatment sample. A detailed review of matching-related methods used for covariates balancing problems is given by [23].

In this paper we address various problems of balancing covariates. The covariates here are nominal, in that they take on discrete values or categories. The set of values of each nominal covariate partitions the treatment and control samples into a number of subsets referred to as levels, where the samples at every level share the same covariate value. In an ideal situation the samples of the treatment and the control in each matched pair or matched set belong to the same levels over all covariates. However, satisfying the requirement that matched samples in each pair or set belong to the same levels over all covariates typically results in a very small selection from the treatment and control group, which is not desirable. To address this, Rosenbaum et al.
[21] introduced a weaker requirement to match all treatment samples to a subset of the control samples, called a selection, so that the proportion (or the number, if κ = 1) of control and treatment samples in each level of each covariate are the same. This requirement is known in the literature as fine balance.

To formalize the discussion we introduce essential notation. Let the number of treatment samples be n and the number of control samples be n′. Let the set of all treatment samples be denoted by T, |T| = n. Let P be the number of covariates to be balanced. For p = 1, ..., P, covariate p partitions both treatment and control groups into k_p levels each. Let the partition of the treatment group under covariate p be L_{p,1}, L_{p,2}, ..., L_{p,k_p}, of sizes ℓ_{p,1}, ℓ_{p,2}, ..., ℓ_{p,k_p}. Similarly, let the partition of the control group under covariate p be L′_{p,1}, L′_{p,2}, ..., L′_{p,k_p}, of sizes ℓ′_{p,1}, ℓ′_{p,2}, ..., ℓ′_{p,k_p}. Let κ be an integer specifying the ratio of the number of matched control samples to the number of matched treatment samples. We define the κ-fine-balance constraints for a selection of treatment and a selection of control samples as follows:

∗ Department of IEOR, Etcheverry Hall, Berkeley, CA, supported in part by NSF award No. CMMI-1760102 ([email protected]).
† Faculty of Industrial Engineering and Management, The Technion, Haifa, Israel, supported in part by ISF - Israeli Science Foundation grant number 308/18 ([email protected]).
‡ Department of IEOR, Etcheverry Hall, Berkeley, CA, supported in part by NSF award No. CMMI-1760102 ([email protected]).
Definition (κ-fine-balance). For an integer κ, a selection S ⊆ T of the treatment group and a selection S′ of the control group, we say that (S, S′)-κ-fine-balance is satisfied if κ · |S ∩ L_{p,i}| = |S′ ∩ L′_{p,i}| for p = 1, ..., P and i = 1, ..., k_p.

Obviously, for S, S′ satisfying (S, S′)-κ-fine-balance, the cardinality of S′ is κ times as large as the cardinality of S, |S′| = κ|S|. We are now ready to define the three families of problems investigated here, with complexity that varies according to the number of covariates and the value of κ. The maximum κ-fine-balance selection (κ-FBS) problem is to select a subset S ⊆ T and a subset S′ of the control group so as to maximize the size of the selection S (equivalent to maximizing the size of S′ since |S′| = κ|S|) where the (S, S′)-κ-fine-balance constraints are satisfied. This problem is introduced here for the first time.

A second problem studied here is the κ-fine-balance matching (κ-BM) problem, first introduced by Rosenbaum et al. [21] for one covariate. Here we are given a distance, or cost, measure between each treatment and each control sample. The κ-BM problem is to minimize the total cost of the assignment of each treatment sample in T to κ control samples such that the selection of matched control samples S′ satisfies (T, S′)-κ-fine-balance.

Another problem family newly introduced here is an optimization where the feasible sets are optimal for another problem. Formally, in the first stage the goal is to find the optimal selections to the κ-FBS problem.
In the second stage, among all maximum sized selections, find the selection that minimizes the total distance of an assignment of each selected treatment sample to exactly κ selected control samples. We refer to this problem as the maximum selection κ-fine-balance matching problem (κ-MSBM). For the case of κ = 1, we omit the prefix κ, so (S, S′)-κ-fine-balance is called (S, S′)-fine-balance, the κ-FBS problem is called the FBS problem, the κ-BM problem is called the BM problem, and the κ-MSBM problem is called the MSBM problem. A summary of the problems investigated here is given in Table 1.

Table 1: Summary of problems studied here.
  max fine-balance selection (FBS): objective max |S|; constraints (S, S′)-fine-balance.
  max κ-fine-balance selection (κ-FBS): objective max |S|; constraints (S, S′)-κ-fine-balance.
  fine-balance matching (BM): objective min assignment cost; constraints (T, S′)-fine-balance.
  κ-fine-balance matching (κ-BM): objective min assignment cost; constraints (T, S′)-κ-fine-balance.
  max selection fine-balance matching (MSBM): objective min assignment cost; constraints (S, S′) optimal for FBS.
  max selection κ-fine-balance matching (κ-MSBM): objective min assignment cost; constraints (S, S′) optimal for κ-FBS.

The concept of fine balance was first introduced by Rosenbaum et al. [21], who studied the κ-BM problem for the 1-covariate problem and proposed a network flow algorithm. No polynomial running time algorithm has been known for the κ-BM problem with two or more covariates.

It is not always feasible to find a selection S′ of the control samples that satisfies the (T, S′)-κ-fine-balance constraints in the κ-BM problem. To that end, several papers considered the goal of minimizing the violation of this requirement, which we refer to as imbalance: [25], [26], [19], [2], [8]. The studies in all these papers require the entire treatment group to be selected or matched. Bennett et al.
[2], and Hochbaum and Rao [8], considered finding the selection of the control group that minimizes an imbalance objective, defined as Σ_{p=1}^{P} Σ_{i=1}^{k_p} | |S′ ∩ L′_{p,i}| − κ·ℓ_{p,i} |. This problem is called the minimum κ-imbalance problem. The problem is trivial to solve for the 1-covariate problem (see section 2 for details); the 2-covariate problem was proved to be polynomial time solvable using linear programming in [2] and using network flow algorithms in [8]; for three or more covariates, the problem is NP-hard [2, 8]. Yang et al. [25] and Pimentel et al. [19] considered a more complicated problem that minimizes the total assignment cost of the matched sets, each consisting of a single treatment sample and κ control samples, subject to the requirement that the selection of matched control samples is optimal for the minimum κ-imbalance problem. Yang et al. [25] proposed two network flow algorithms for the case of the 1-covariate problem; Pimentel et al. [19] proposed a network flow algorithm for the case in which the covariates form a nested sequence. Zubizarreta [26] considered a different variant which minimizes the total assignment cost of the matched sets with a penalty on the imbalance, and presented a mixed integer programming formulation for an arbitrary number of covariates.

We show here, for the first time, that the 2-covariate BM problem is in fact NP-hard, and therefore there is no polynomial time algorithm for the κ-BM problem with two or more covariates unless P = NP. This NP-hardness result for the κ-BM problem with two or more covariates holds for any value of κ.

The κ-MSBM problem is newly introduced here. It relaxes the requirement in the κ-BM problem of selecting all treatment samples and replaces it with a maximum size selection possible, while enforcing the κ-fine-balance constraints. This κ-MSBM problem, as shown here, is NP-hard with two or more covariates for any given value of κ.
Moreover, it is also proved here to be NP-hard for the 1-covariate problem when κ ≥ 3. We present a polynomial algorithm for the 1-covariate MSBM problem, but we leave the complexity status of the 1-covariate 2-MSBM problem open.

The κ-FBS problem is a simpler problem compared to the κ-MSBM problem. This problem is studied here for the first time. We prove that for three or more covariates, the FBS and κ-FBS problems are NP-hard for any value of κ. For the case of the 2-covariate problem we present here an efficient algorithm for the FBS problem. The algorithm is based on an integer programming formulation of the problem in which the constraint matrix, for two covariates, has the structure of network flow constraints. For the resulting minimum cost network flow problem we apply an algorithm with running time O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))). We also prove that for κ ≥
3, the 2-covariate κ-FBS problem is NP-hard. For the remaining case, in which κ = 2 and the number of covariates is two, the complexity status of the 2-FBS problem is left open.

We observe here that, for any number of covariates, if the selections of treatment and control samples are fixed, then the optimal assignment among the selected samples, and therefore the optimal solution to the κ-MSBM problem, is attained by solving a minimum cost network flow problem. See section 2 for details. A summary of the complexity results for the three problem families is given in Table 2.

Table 2: Summary of complexity and algorithmic results derived here. (Here n is the size of the treatment group and n′ is the size of the control group.)
  FBS. One covariate: O(n + n′); two covariates: O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))); three or more covariates: NP-hard.
  κ-FBS (κ ≥ 2). One covariate: O(n + n′); two covariates: NP-hard for κ ≥ 3, open for κ = 2; three or more covariates: NP-hard.
  κ-BM (any κ). One covariate: polynomial, via network flow [21]; two covariates: NP-hard; three or more covariates: NP-hard.
  MSBM. One covariate: O((n + n′)nn′); two covariates: NP-hard; three or more covariates: NP-hard.
  κ-MSBM (κ ≥ 2). One covariate: NP-hard for κ ≥ 3, open for κ = 2; two covariates: NP-hard; three or more covariates: NP-hard.

Beyond these complexity results, we also address fixed-parameter tractable (FPT) results. We prove that the κ-FBS, κ-BM and MSBM problems are solvable in fixed-parameter tractable time for constant numbers of covariate levels, yet the κ-MSBM problem is NP-hard for constant κ ≥ 3. We also study special cases of the κ-BM problem where one of the covariates has a constant number of levels (whereas the other one may have a linear number of levels), and we show that the complexity status of these special cases is tied to the complexity status of the exact matching problem, a problem that is known to have a randomized polynomial time algorithm [16] but for which the existence of a deterministic polynomial time algorithm is a long-standing open problem.
In section 2 we consider the case of the 1-covariate κ-FBS, κ-BM, and MSBM problems, and provide a compact representation of the sample selections. We then present our complexity and algorithmic results for the other cases of the three families of problems separately, i.e., the κ-FBS problem in section 3, the κ-BM problem in section 4, and the κ-MSBM problem in section 5. The fixed-parameter complexity results are provided in section 6.
2. Preliminaries.
Consider first the case of a single covariate, P = 1, that partitions the control and treatment groups into, say, k levels each. Let the sizes of the levels of the treatment group be ℓ_1, ..., ℓ_k, and the sizes of the levels of the control group be ℓ′_1, ..., ℓ′_k. It is easy to see that there exists a selection S′ of control samples that satisfies (T, S′)-κ-fine-balance if and only if ℓ′_i ≥ κℓ_i for i = 1, ..., k. If this condition is satisfied then any subset S∗ of the control group with κ·ℓ_i samples in level i, i = 1, ..., k, satisfies (T, S∗)-κ-fine-balance, and as such is a feasible selection for the κ-BM problem. With these known numbers of control samples to be selected in each level, the optimal solution to the 1-covariate κ-BM problem is found using a minimum cost network flow formulation, as shown next. Note that a standard linear programming formulation of the minimum cost network flow (MCNF) problem is given in Appendix A.

The MCNF problem whose solution is an optimal solution to κ-BM is constructed on a bipartite graph with the treatment samples each represented by a node on one side, and the control samples each represented by a node on the other side. The cost on each arc between a treatment sample and a control sample is the "distance" value between the two, and the arc capacity is 1. Each treatment sample has a supply of κ. To account for the requirement that in each level i of control samples there will be κ·ℓ_i samples matched, we add to the bipartite graph a third layer of k nodes, one for each level. The i-th node in the third layer has a demand of κ·ℓ_i, and there are arcs to this demand node from all control samples in level i with capacity 1 and cost of 0.
In an optimal solution to this MCNF problem the control sample nodes through which there is a positive flow (of one unit) are the ones selected and matched to the respective treatment sample nodes from which they have a positive flow.

If ℓ′_i < κℓ_i for some i then there is no selection S′ of control samples that satisfies (T, S′)-κ-fine-balance. Addressing this context, as mentioned earlier, [25], [26], [19], [2], [8] consider the problem of minimizing the κ-imbalance, which is the sum of violations over all levels, Σ_{i=1}^{k} | |S′ ∩ L′_i| − κ·ℓ_i |. The solution to this 1-covariate minimum κ-imbalance problem is straightforward: in step 1, select min{κ·ℓ_i, ℓ′_i} control samples in level i; if the number of control samples selected in step 1 is less than κ·n, then we select arbitrary additional control samples so that the selection is of size κ·n. Another way to address this context is to seek a solution for the (S, S′)-κ-fine-balance where, rather than forcing all samples of T to be included, we find a solution in which the size of the selection S, and equivalently |S′|, is maximized; this is the κ-FBS problem. The solution to the 1-covariate κ-FBS problem is also straightforward: select ℓ̄_i = min{ℓ_i, ⌊ℓ′_i/κ⌋} treatment samples of level i and κ·ℓ̄_i control samples of level i.

Again, for the 1-covariate problem of finding an optimal matching, or assignment, among all optimal selections for either the minimum κ-imbalance or the FBS problem, we solve an MCNF problem for the known number of samples to select from each level, similar to the one defined above with the following modifications. For the minimum κ-imbalance, we first need to change the demand of the demand nodes in the above MCNF problem from κ·ℓ_i to min{κ·ℓ_i, ℓ′_i} for each level i. We also add a dummy demand node in the third layer with demand κ·n − Σ_{i=1}^{k} min{κ·ℓ_i, ℓ′_i}, which connects to all control nodes, each arc with capacity 1 and cost of 0.
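The two-step rule just described for the 1-covariate minimum κ-imbalance selection can be sketched as follows (a minimal sketch; the function name and list-based inputs are illustrative, and the deficit count stands in for the arbitrary padding samples):

```python
def min_kappa_imbalance_counts(ell, ell_prime, kappa):
    """Per-level control-selection counts for the 1-covariate minimum kappa-imbalance problem.

    ell[i]       -- size of treatment level i
    ell_prime[i] -- size of control level i
    Step 1 selects min(kappa*ell[i], ell_prime[i]) control samples in level i;
    the remaining deficit is filled with arbitrary additional control samples
    so that the selection reaches size kappa * n, where n = sum(ell).
    Returns (counts, deficit).
    """
    counts = [min(kappa * l, lp) for l, lp in zip(ell, ell_prime)]
    deficit = kappa * sum(ell) - sum(counts)
    return counts, deficit
```

For example, with treatment level sizes [2, 1], control level sizes [1, 5] and κ = 2, step 1 selects 1 and 2 control samples in the two levels, leaving a deficit of 3 samples to be padded arbitrarily.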
For the optimal selections of FBS, in addition to changing the demand from ℓ_i to ℓ̄_i for each level i, we also remove the supply on each treatment sample, add for every level i a supply node with supply ℓ̄_i, and add arcs from this supply node to all treatment samples in level i with capacity 1 and cost of 0. The best assignment found with a selection that is optimal for the FBS problem is an optimal solution for the MSBM problem. However, this method does not apply to the κ-MSBM problem with κ ≥ 2. We further show that even the 1-covariate κ-MSBM problem is NP-hard for κ ≥ 3, so not all variants of the κ-MSBM problem are polynomial time solvable for the 1-covariate case. In section 5 we show that the 1-covariate κ-MSBM does not admit a polynomial time algorithm for κ ≥ 3 unless P = NP.

For the κ-FBS problem, we observe that the selections from the treatment and control groups can be represented compactly in terms of level-intersections. For P covariates, the intersections of the level sets L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}, i_p = 1, ..., k_p, p = 1, ..., P, form a partition of the treatment group. Similarly, the intersections of the level sets L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}, i_p = 1, ..., k_p, p = 1, ..., P, form a partition of the control group. Therefore, instead of specifying which sample belongs to the selection, it is sufficient to determine the number of selected samples in each level intersection for the two groups, since the identity of the specific selected samples has no effect on the fine balance requirement. With this discussion we have a theorem on the representation of the solution to the κ-FBS problems in terms of the level-intersection sizes.

Theorem 2.1.
The level-intersection sizes s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} are an optimal solution to the κ-FBS problem if there exists an optimal selection S of treatment samples and S′ of control samples such that s_{i_1,i_2,...,i_P} = |S ∩ L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}| and s′_{i_1,i_2,...,i_P} = |S′ ∩ L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}|, for i_p = 1, ..., k_p, p = 1, ..., P.

We will say that the optimal selection for the covariates problems here is unique if for any optimal selection S and S′, the numbers s_{i_1,...,i_P} = |S ∩ L_{1,i_1} ∩ ... ∩ L_{P,i_P}| and s′_{i_1,...,i_P} = |S′ ∩ L′_{1,i_1} ∩ ... ∩ L′_{P,i_P}| are unique. In order to derive an optimal selection given the optimal level-intersection sizes, one selects any s_{i_1,...,i_P} treatment samples from the intersection L_{1,i_1} ∩ ... ∩ L_{P,i_P} and any s′_{i_1,...,i_P} control samples from the intersection L′_{1,i_1} ∩ ... ∩ L′_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P.

We observe here that, for any number of covariates, if the optimal selection of treatment and control samples in terms of level-intersections is known and unique, then the optimal assignment among the selected samples, and therefore the optimal solution to the κ-MSBM problem, can also be attained by solving an MCNF problem as follows. For each non-zero level intersection of treatment samples there is a source node with supply s_{i_1,...,i_P}. This source node is connected to all treatment samples in the intersection L_{1,i_1} ∩ ... ∩ L_{P,i_P} with arcs of capacity 1 and cost of 0. For each non-zero level intersection of control samples there is a demand node with demand s′_{i_1,...,i_P}. This demand node is connected from all control samples in the intersection L′_{1,i_1} ∩ ... ∩ L′_{P,i_P} with arcs of capacity 1 and cost of 0.
The treatment and control sample nodes through which there is a positive flow (of one unit) are the ones selected, and a positive flow between a treatment node and a control node indicates that the two samples are matched. This is a minimum cost network flow problem with a total demand (or supply) bounded by min{n, n′}, O(nn′) arcs, and O(n + n′) nodes. Therefore the successive shortest paths algorithm, discussed below in section 3, solves this problem in O((n + n′)nn′) steps.
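To close the preliminaries, the straightforward 1-covariate κ-FBS selection rule described above (select ℓ̄_i = min{ℓ_i, ⌊ℓ′_i/κ⌋} treatment samples and κ·ℓ̄_i control samples per level) can be sketched as follows; the function name and list-based inputs are illustrative assumptions:

```python
def one_covariate_kappa_fbs(ell, ell_prime, kappa):
    """Optimal per-level selection sizes for the 1-covariate kappa-FBS problem.

    ell[i]       -- size of treatment level i
    ell_prime[i] -- size of control level i
    Returns (treat_counts, control_counts, objective), where objective = |S|.
    """
    treat_counts = [min(l, lp // kappa) for l, lp in zip(ell, ell_prime)]  # bar-ell_i
    control_counts = [kappa * t for t in treat_counts]                     # kappa * bar-ell_i
    return treat_counts, control_counts, sum(treat_counts)
```

The returned objective equals Σ_i min{ℓ_i, ⌊ℓ′_i/κ⌋}, the closed-form value stated for this problem in section 3.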
3. The maximum κ-fine-balance selection (κ-FBS) problem. In this section we show the complexity and algorithmic results for the κ-FBS problems. We present the results separately: first for the 1-covariate problem, next for three or more covariates, then the 2-covariate FBS problem, and finally the 2-covariate κ-FBS problem for κ ≥ 3.

The solution to the 1-covariate κ-FBS problem is straightforward, as discussed in section 2: select ℓ̄_i = min{ℓ_{1,i}, ⌊ℓ′_{1,i}/κ⌋} level i treatment samples and κ·ℓ̄_i level i control samples. The union of the selections at each level is an optimal solution for the 1-covariate κ-FBS problem. The value of the objective function corresponding to this solution is Σ_{i=1}^{k_1} ℓ̄_i = Σ_{i=1}^{k_1} min{ℓ_{1,i}, ⌊ℓ′_{1,i}/κ⌋}.

The κ-FBS problem for any constant κ with P ≥ 3. We show here that even for κ being constant, the κ-FBS problem with three or more covariates is NP-hard, by a reduction from the 3-dimensional matching problem, one of Karp's 21 NP-hard problems [13][6].

3-dimensional matching: Given a finite set X and a set of triplets U ⊂ X × X × X, is there a subset M ⊆ U such that |M| = |X| and no two elements of M agree in any coordinate?

Theorem
The κ-FBS problem is NP-hard when P = 3, even for constant κ.
Proof.
Given an instance of the 3-dimensional matching problem with a finite set X and a set of triplets U ⊂ X × X × X, we construct an instance of the κ-FBS problem with P = 3 for any constant κ. Without loss of generality, we assume X = {1, ..., |X|}.

First we define the levels of the three covariates. For p = 1, 2, 3, the set of levels of covariate p is {1, ..., |X|} ∪ {0, 0′}. So all three covariates have |X| + 2 levels.

Next, we construct the samples. For each sample, we represent it by the ordered triplet (a, b, c) where a is the level of the first covariate, b is the level of the second covariate, and c is the level of the third covariate. The treatment group contains a sample (i, i, i) for i = 1, ..., |X|, as well as |X| copies of (0, 0, 0) and |X| copies of (0′, 0′, 0′). For each triplet u ∈ U, whose elements are denoted by [u_1, u_2, u_3], we create one control sample (u_1, u_2, u_3). In addition, for each element i ∈ X we have (κ − 1) copies of each of the three control samples (i, 0′, 0), (0′, 0, i), (0, i, 0′). We also create |X| copies of (0, 0, 0) and |X| copies of (0′, 0′, 0′) for the control group. That is, the control group is the union of the following three sets (we represent the different copies by a superscript as shown below):

C_1 = {(u_1, u_2, u_3) : ∀u = [u_1, u_2, u_3] ∈ U},
C_2 = {(i, 0′, 0)^(w), (0′, 0, i)^(w), (0, i, 0′)^(w) : ∀i = 1, ..., |X|, ∀w = 1, ..., κ − 1},
C_3 = {(0, 0, 0)^(w), (0′, 0′, 0′)^(w) : ∀w = 1, ..., |X|}.

The treatment group constructed is of size 3|X| and the control group constructed is of size |U| + (3κ − 1)|X|. The two sizes are both polynomially bounded in the size of the 3-dimensional matching instance, so the reduction can be computed in polynomial time.

Finally, we claim that the optimal value of the constructed 3-covariate κ-FBS problem is 3|X| if and only if there exists a subset M ⊆ U such that |M| = |X| and no two elements of M agree in any coordinate for the 3-dimensional matching instance.

Let M ⊆ U be a solution for the 3-dimensional matching instance; we derive a solution for the constructed problem as follows. We select all the treatment samples. For each triplet u ∈ M, we choose the control sample in C_1 whose covariate levels correspond to the elements of u. Additionally, we choose all control samples in C_2 and C_3. To check the feasibility of this solution, first consider the appearances of level i, for each i = 1, ..., |X|, under each covariate p = 1, 2, 3: level i appears once in the treatment group for each covariate p; it appears once among the selected samples in C_1, as i appears once in each coordinate in M; it appears κ − 1 times in C_2; it does not appear in C_3. That is, there is exactly one selected treatment sample of level i under covariate p and κ selected control samples of level i under covariate p, for each i and p. Next, consider the appearances of levels 0 and 0′ under each covariate p = 1, 2, 3: they each appear |X| times in the treatment group; they do not appear in C_1; they each appear (κ − 1)|X| times in C_2 and |X| times in C_3. That is, 0 and 0′ each appear κ|X| times in the selected control samples under each covariate. Therefore, this selection is feasible and the objective value, the number of selected treatment samples, is 3|X|.

On the other hand, if the constructed 3-covariate κ-FBS problem has an optimal solution S, S′ of objective value 3|X|, we say that u = [u_1, u_2, u_3] ∈ U is selected for M if the control sample (u_1, u_2, u_3) is selected in S′ for the constructed problem. We will show that M is a feasible solution of the 3-dimensional matching instance. Since the size of the treatment group is 3|X|, all treatment samples must be selected in the optimal solution, and 3κ|X| control samples must be selected. For each covariate p = 1, 2, 3, levels 0 and 0′ each appear |X| times in the treatment group, so each of these two levels must appear κ|X| times in the selection S′ of the control samples. So all samples in C_2 and C_3 must be selected, as otherwise there are not enough level 0 or level 0′ samples in S′. Therefore, |M| = 3κ|X| − |C_2| − |C_3| = |X|. Furthermore, for each covariate p and for i = 1, ..., |X|, level i appears exactly once in the treatment group, so there are κ selected control samples in level i. For each i under each covariate p, since there are (κ − 1) samples of level i in C_2 ∪ C_3, only one sample in C_1 in that same level is selected. So there is no overlap in any coordinate between any two triplets in M.

With the above arguments, any 3-dimensional matching problem can be reduced to a 3-covariate κ-FBS problem for any constant integer κ, and hence, the κ-FBS problem is NP-hard for any such κ when P = 3.

Corollary
The κ-FBS problem is NP-hard for any integer P ≥ 3, even for constant κ.

Proof. For any constant integer κ, any 3-covariate κ-FBS problem, and any P > 3, we can construct an equivalent P-covariate κ-FBS problem as follows: for each sample of the given 3-covariate κ-FBS problem, we create a sample for the constructed κ-FBS problem such that they have the same level value for covariates p = 1, 2, 3. For p = 4, ..., P, we set covariate p to have only one level, so all samples in the constructed κ-FBS problem have the same value. Therefore, the NP-hardness of the 3-covariate κ-FBS problem implies that the P-covariate κ-FBS problem is NP-hard for every value of P when P ≥ 3. Since the κ-FBS problem is NP-hard for P ≥
3, there is no polynomial time algorithm unless P = NP. In the following subsections, we discuss the remaining case of the 2-covariate problems.

The 2-covariate FBS problem, P = 2. In this subsection, we present an integer programming formulation with network flow constraints for the 2-covariate FBS problem. We then show how to solve the problem efficiently with a network flow algorithm.

It was noted, in Theorem 2.1, that there is no differentiation between the individual samples selected in each level intersection; only the number of those selected counts. We thus define the decision variables as follows:

x_{i_1,i_2}: the number of treatment samples selected from the (i_1, i_2) level intersection L_{1,i_1} ∩ L_{2,i_2}, for i_1 = 1, ..., k_1 and i_2 = 1, ..., k_2;
x′_{i_1,i_2}: the number of control samples selected from the (i_1, i_2) level intersection L′_{1,i_1} ∩ L′_{2,i_2}, for i_1 = 1, ..., k_1 and i_2 = 1, ..., k_2.

Let u_{i_1,i_2} = |L_{1,i_1} ∩ L_{2,i_2}| and u′_{i_1,i_2} = |L′_{1,i_1} ∩ L′_{2,i_2}| for i_1 = 1, ..., k_1, i_2 = 1, ..., k_2. Clearly, x_{i_1,i_2} must be an integer between 0 and u_{i_1,i_2}, and x′_{i_1,i_2} must be an integer between 0 and u′_{i_1,i_2}. With these decision variables, the following is an integer programming formulation for the 2-covariate FBS problem:

(IP-FBS)  max Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} x_{i_1,i_2}   (3.1a)
s.t.  Σ_{i_2=1}^{k_2} x_{i_1,i_2} − Σ_{i_2=1}^{k_2} x′_{i_1,i_2} = 0,  i_1 = 1, ..., k_1   (3.1b)
      Σ_{i_1=1}^{k_1} x_{i_1,i_2} − Σ_{i_1=1}^{k_1} x′_{i_1,i_2} = 0,  i_2 = 1, ..., k_2   (3.1c)
      0 ≤ x_{i_1,i_2} ≤ u_{i_1,i_2},  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2   (3.1d)
      0 ≤ x′_{i_1,i_2} ≤ u′_{i_1,i_2},  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2   (3.1e)
      x_{i_1,i_2}, x′_{i_1,i_2} integers,  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2.   (3.1f)

The objective (3.1a) is the total number of selected treatment samples. Constraints (3.1b) are the fine balance requirement under covariate 1, as Σ_{i_2=1}^{k_2} x_{i_1,i_2} equals the number of selected treatment samples in level i_1 under covariate 1 and Σ_{i_2=1}^{k_2} x′_{i_1,i_2} equals the number of selected control samples in the same level.
Similarly, constraints (3.1c) are the fine balance requirement under covariate 2.

Formulation (IP-FBS) is in fact also a network flow formulation. In a minimum cost network flow formulation, each column of the constraint matrix, corresponding to a variable that is a flow along an arc, has exactly one 1 and one −1. The corresponding MCNF network is shown in Figure 1, where all capacity lower bounds are 0, and each arc has a cost per unit flow and an upper bound associated with it. The flow on the arc from node (1, i_1) to node (2, i_2) represents variable x_{i_1,i_2}, which is bounded between 0 and u_{i_1,i_2} as stated in constraints (3.1d); the arc from node (2, i_2) to node (1, i_1) represents variable x′_{i_1,i_2}, which is bounded between 0 and u′_{i_1,i_2} as stated in constraints (3.1e). To get a "minimize" type objective, we take the negative value of |S| = Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} x_{i_1,i_2} as the objective, so the per unit arc cost is −1 on the arcs from any node in {(1,1), (1,2), ..., (1,k_1)} to any node in {(2,1), (2,2), ..., (2,k_2)}. All other arcs have cost 0. It is easy to verify that constraints (3.1b) correspond to the flow balance at nodes (1, i_1) for all i_1, and constraints (3.1c) correspond to the flow balance at nodes (2, i_2) for all i_2.

Fig. 1: Min-cost network flow graph corresponding to formulation (IP-FBS). Arc legend: (cost, upper bound).
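Solving (IP-FBS) requires nothing beyond a standard MCNF code. The following is a minimal self-contained sketch in the spirit of the successive-shortest-paths approach analyzed below, run on the arc-reversed network; Bellman-Ford is used for the shortest paths for simplicity, so the Dijkstra-based running time stated below is not matched, and all names and the matrix-based input layout are illustrative assumptions:

```python
def max_fbs_selection_size(u, u_prime):
    """Maximum |S| for the 2-covariate FBS problem via min-cost flow.

    u[i1][i2]       -- |L_{1,i1} ∩ L_{2,i2}|   (treatment intersection sizes)
    u_prime[i1][i2] -- |L'_{1,i1} ∩ L'_{2,i2}| (control intersection sizes)

    Works on the arc-reversed network: covariate-2 level nodes supply l_{2,i2},
    covariate-1 level nodes demand l_{1,i1}; each unit on a reversed treatment
    arc (cost 1) is an unselected treatment sample, so min cost = n - max |S|.
    """
    k1, k2 = len(u), len(u[0])
    n = sum(map(sum, u))
    S, T = k1 + k2, k1 + k2 + 1             # super source / super sink
    graph = [[] for _ in range(k1 + k2 + 2)]

    def add_edge(a, b, cap, cost):          # arc plus residual reverse arc
        graph[a].append([b, cap, cost, len(graph[b])])
        graph[b].append([a, 0, -cost, len(graph[a]) - 1])

    for i2 in range(k2):                    # supplies l_{2,i2}
        add_edge(S, i2, sum(u[i1][i2] for i1 in range(k1)), 0)
    for i1 in range(k1):
        add_edge(k2 + i1, T, sum(u[i1]), 0) # demands l_{1,i1}
        for i2 in range(k2):
            if u[i1][i2]:
                add_edge(i2, k2 + i1, u[i1][i2], 1)        # reversed treatment arc
            if u_prime[i1][i2]:
                add_edge(i2, k2 + i1, u_prime[i1][i2], 0)  # control arc
    total_cost = 0
    while True:                             # successive shortest paths
        INF = float("inf")
        dist = [INF] * len(graph)
        dist[S] = 0
        prev = [None] * len(graph)
        for _ in range(len(graph)):         # Bellman-Ford on the residual graph
            updated = False
            for a in range(len(graph)):
                if dist[a] == INF:
                    continue
                for j, (b, cap, cost, _) in enumerate(graph[a]):
                    if cap > 0 and dist[a] + cost < dist[b]:
                        dist[b] = dist[a] + cost
                        prev[b] = (a, j)
                        updated = True
            if not updated:
                break
        if dist[T] == INF:                  # all routable supply has been sent
            break
        delta, b = INF, T                   # bottleneck along the shortest path
        while b != S:
            a, j = prev[b]
            delta = min(delta, graph[a][j][1])
            b = a
        b = T
        while b != S:                       # augment and update residual arcs
            a, j = prev[b]
            graph[a][j][1] -= delta
            graph[b][graph[a][j][3]][1] += delta
            b = a
        total_cost += delta * dist[T]
    return n - total_cost
```

On the instance u = [[2]], u′ = [[3]], all routing uses the cost-0 control arc, the minimum cost is 0, and the routine returns |S| = 2.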
Theorem
The 2-covariate FBS problem is solved as a minimum cost network flow problem in O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))) time.

Proof. To solve the minimum cost network flow problem of the 2-covariate FBS problem, we choose the algorithm of successive shortest paths, which is particularly efficient for an MCNF with "small" total arc capacity (see [1], Section 9.7). The successive shortest paths algorithm starts with a network graph with no negative cycles, so we first modify the network shown in Figure 1 using the well-known arc reversal transformation of [1], Section 2.4. The resulting network graph is shown in Figure 3.

The successive shortest path algorithm iteratively selects a node s with excess supply (supply not yet sent to some demand node) and a node t with unfulfilled demand, and sends flow from s to t along a shortest path in the residual network [11], [10], [4]. The algorithm terminates when the flow satisfies all the flow balance constraints. Since at each iteration the number of remaining units of supply to be sent is reduced by at least one unit, the number of iterations is bounded by the total amount of supply. For the network in Figure 3 the total supply is n.

Fig. 3: Min-cost network flow graph after arc reversal. Arc legend: (cost, upper bound); node legend: (supply).

At each iteration, the shortest path can be solved with Dijkstra's algorithm of complexity O(|A| + |V| log |V|), where |V| is the number of nodes and |A| is the number of arcs [24], [5]. In our formulation, |V| is O(k_1 + k_2), which is at most O(n). Since the number of nonempty sets L_{1,i_1} ∩ L_{2,i_2} is at most min{n, k_1 k_2}, the number of unit-cost arcs is O(min{n, k_1 k_2}).
Since the number of nonempty sets L′_{1,i_1} ∩ L′_{2,i_2} is at most min{n′, k_1 k_2}, the number of zero-cost arcs is O(min{n′, k_1 k_2}). So the total number of arcs |A| is O(min{n + n′, k_1 k_2}). Hence, the total running time of applying the successive shortest path algorithm on our formulation is O(n · (min{n + n′, k_1 k_2} + (k_1 + k_2) log(k_1 + k_2))).

In contrast to the 2-covariate FBS problem, which is polynomial time solvable, we show next that the 2-covariate κ-FBS problem is NP-hard when κ ≥ 3.

The 2-covariate κ-FBS problem with κ ≥ 3. We prove here that the 2-covariate κ-FBS problem is NP-hard for all constant values of κ such that κ ≥ 3. The proof is a reduction from the Exact 3-cover problem, which is NP-hard [13], [6].
Exact 3-cover: Given a collection C of 3-element subsets (triplets) of a ground set E with |E| = 3q for some integer q, is there a subcollection C′ ⊆ C where each element e ∈ E appears in exactly one triplet of C′?

Theorem
The 2-covariate κ-FBS problem is NP-hard for any constant κ ≥ 3.

Proof. Given an instance of the Exact 3-cover problem with ground set E of size 3q and a collection of triplets C, we construct an instance of the 2-covariate κ-FBS problem for any constant integer κ ≥ 3. The levels of covariate 1 are: a level T for each triplet T ∈ C, and a level e′ for each element e ∈ E. So there are |C| + |E| levels of covariate 1. For covariate 2, we have a level e′′ for each element e ∈ E, and one additional level denoted as X. So there are |E| + 1 levels of covariate 2.

Next, we construct the samples. We represent each sample by the ordered pair (a, b) where a is the level of the first covariate and b is the level of the second covariate. The treatment group contains a sample (T, X) for each triplet T ∈ C and a sample (e′, e′′) for each element e ∈ E. Moreover, for each triplet T ∈ C, whose elements are denoted by e_{t1}, e_{t2}, e_{t3}, we create three control samples (T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}), as well as (κ − 3) copies of the control sample (T, X). In addition, for each element e ∈ E we have one control sample (e′, X) and (κ − 1) copies of the control sample (e′, e′′). The treatment group constructed is of size |C| + |E| and the control group constructed is of size κ|C| + κ|E|. The two sizes are both polynomially bounded in the size of the Exact 3-cover instance, so the reduction can be computed in polynomial time.

Finally, we claim that the constructed 2-covariate κ-FBS problem has a feasible solution with objective value of at least 4q if and only if the Exact 3-cover instance has a subcollection C′ ⊆ C such that each element e ∈ E appears in exactly one triplet of C′.

Let C′ be a subcollection such that each element e ∈ E appears in exactly one triplet of C′. We derive a solution for the constructed problem as follows. In the treatment group, we choose (T, X) for all T ∈ C′, and (e′, e′′) for all e ∈ E. In the control group, for each T ∈ C′, whose elements are denoted by e_{t1}, e_{t2}, e_{t3}, we choose (T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) and the (κ − 3) copies of (T, X). Additionally, we also choose from the control group the sample (e′, X) and the (κ − 1) copies of (e′, e′′) for each e ∈ E. That is, the selection of treatment samples is

S = {(T, X) : ∀T ∈ C′} ∪ {(e′, e′′) : ∀e ∈ E}

and the selection of control samples is

S′ = {(T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) : ∀T = (e_{t1}, e_{t2}, e_{t3}) ∈ C′} ∪ {(T, X)^{(w)} : ∀T ∈ C′, w = 1, ..., κ − 3} ∪ {(e′, X) : ∀e ∈ E} ∪ {(e′, e′′)^{(w)} : ∀e ∈ E, w = 1, ..., κ − 1}.

To check the feasibility of this solution, first consider the levels of the first covariate. For each T ∈ C, if T ∈ C′ then there is exactly one treatment sample with level T in S and we choose exactly κ such control samples for S′; and if T ∉ C′, then we have not chosen any sample of this level for either the treatment or the control group. Furthermore, for each e ∈ E, we choose exactly one sample from the treatment group with covariate 1 level e′ and exactly κ control samples with covariate 1 level e′. Next, consider the levels of the second covariate. We choose |C′| level X treatment samples and (κ − 3)|C′| + |E| level X control samples. Note that the size of the subcollection C′ must be q, as each element in E appears in exactly one triplet of C′, so (κ − 3)|C′| + |E| = κq = κ|C′|. For every ē ∈ E, we choose exactly one treatment sample with level ē′′. There are κ − 1 control samples of level ē′′ under covariate 2 in the set {(e′, e′′)^{(w)} : ∀e ∈ E, w = 1, ..., κ − 1}. Since each element ē appears in exactly one triplet of C′, ē′′ appears exactly once in the set {(T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) : ∀T = (e_{t1}, e_{t2}, e_{t3}) ∈ C′}. So there are κ control samples of level ē′′ under covariate 2 in the selection S′.
Therefore, this solution is feasible and the objective value of this solution is |S| = |C′| + |E| = 4q.

On the other hand, if the constructed 2-covariate κ-FBS problem has a feasible solution S, S′ of objective value at least 4q, we say that T ∈ C is selected for the subcollection C′ if the treatment sample (T, X) is selected in S for the constructed problem. We will show that the subcollection of selected triplets is a feasible solution of the Exact 3-cover problem. Since there are only 3q samples in the treatment group that do not correspond to a triplet, the size of C′ must be at least q. So the subcollection is feasible if we can establish that every pair of selected triplets is disjoint. Assume by contradiction that there exist two selected subsets T, T′ ∈ C′ that have a common element ē. Due to (S, S′)-κ-fine-balance, we know the number of control samples in each level under any covariate must be an integer multiple of κ. Since there is only one sample of level ē′′ in the treatment group, in the selection S′ there can be either 0 or κ samples of level ē′′ under covariate 2. Since both (T, X) and (T′, X) are in the selection S, the control samples (T, ē′′) and (T′, ē′′) must be chosen in S′, as otherwise the number of selected samples in covariate 1 levels T and T′ would not satisfy (S, S′)-κ-fine-balance. Thus, the number of samples in S′ with the second covariate being ē′′ is at least 2. So the number |S′ ∩ L′_{2,ē′′}| must be κ, the number |S ∩ L_{2,ē′′}| must be 1, and the number of copies of the sample (ē′, ē′′) in the selection S′ must be less than or equal to κ − 2. This implies that the number of control samples of level ē′ under covariate 1 cannot be more than κ − 1, which means it must be zero. So we can derive that the treatment sample (ē′, ē′′) is not selected, which means the number of treatment samples in level ē′′ under covariate 2 is 0, contradicting |S ∩ L_{2,ē′′}| = 1.
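The instance construction in this proof is mechanical; below is a sketch of the builder with ad hoc level encodings (the tags 'T', 'p' for e′, 'pp' for e′′ and 'X' are illustrative, not from the paper).

```python
from collections import Counter

def build_kfbs_instance(E, C, kappa):
    """Construct the 2-covariate kappa-FBS instance of the reduction.

    Samples are pairs (covariate-1 level, covariate-2 level); the level
    names ('T', t), ('p', e) for e', ('pp', e) for e'' and 'X' are ad hoc.
    """
    assert kappa >= 3
    treatment = [(("T", T), "X") for T in C] + [(("p", e), ("pp", e)) for e in E]
    control = []
    for T in C:
        control += [(("T", T), ("pp", e)) for e in T]      # (T, e''_t), three per triplet
        control += [(("T", T), "X")] * (kappa - 3)         # kappa-3 copies of (T, X)
    for e in E:
        control += [(("p", e), "X")]                       # one copy of (e', X)
        control += [(("p", e), ("pp", e))] * (kappa - 1)   # kappa-1 copies of (e', e'')
    return treatment, control

E = [1, 2, 3, 4, 5, 6]
C = [(1, 2, 3), (4, 5, 6), (2, 3, 4)]
treat, ctrl = build_kfbs_instance(E, C, kappa=4)
assert len(treat) == len(C) + len(E) and len(ctrl) == 4 * (len(C) + len(E))
# every covariate-1 level receives exactly kappa control samples
assert all(v == 4 for v in Counter(l1 for (l1, _) in ctrl).values())
```

The assertions check the sizes |C| + |E| and κ(|C| + |E|) stated in the proof, and that each covariate 1 level carries exactly κ control samples.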
4. The κ-balanced-matching (κ-BM) problem with P ≥ 2. The 1-covariate κ-BM problem is solvable in polynomial time [21]. However, we show here for the first time that even for any constant κ the κ-BM problem is NP-hard for two or more covariates, P ≥ 2. For the 2-covariate BM problem and the 2-covariate κ-BM problem, the complexity status when the numbers of levels of both covariates are constants is discussed in Section 6 together with the other two families of problems. Here we consider an intermediate case where only one of the covariates has a constant number of levels. We will show that the 2-covariate BM problem and the 2-covariate κ-BM problem when the second covariate has a constant number of levels can be solved efficiently if and only if the exact matching problem on bipartite graphs can be solved efficiently.

4.1. The P-covariate BM problem and the P-covariate κ-BM problem for P ≥ 2. We first present the integer programming formulation for the 2-covariate κ-BM problem. Denote the distance between the ith treatment sample and the jth control sample by δ_{ij} for i = 1, ..., n and j = 1, ..., n′. The decision variable x_{ij} is a binary variable which equals 1 if and only if the ith treatment sample is matched to the jth control sample. The integer programming formulation is given below.

min Σ_{i=1}^{n} Σ_{j=1}^{n′} δ_{ij} x_{ij}    (4.1a)
s.t. Σ_{i=1}^{n} Σ_{j∈L′_{1,q}} x_{ij} = κ ℓ_{1,q},  q = 1, ..., k_1    (4.1b)
Σ_{i=1}^{n} Σ_{j∈L′_{2,q}} x_{ij} = κ ℓ_{2,q},  q = 1, ..., k_2    (4.1c)
Σ_{j=1}^{n′} x_{ij} = κ,  i = 1, ..., n    (4.1d)
Σ_{i=1}^{n} x_{ij} ≤ 1,  j = 1, ..., n′    (4.1e)
x_{ij} ∈ {0, 1},  i = 1, ..., n, j = 1, ..., n′.    (4.1f)

The objective (4.1a) is the total distance of all matched pairs. Constraints (4.1b) ensure that the number of matched level q control samples is κ times as large as the number of matched level q treatment samples under covariate 1, for each level q. Constraints (4.1c) ensure the κ-fine-balance requirement under covariate 2.
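The constraint logic of formulation (4.1) can be checked mechanically; the sketch below (an illustration, not from the paper) verifies a 0/1 assignment against constraints (4.1b)-(4.1e) without attempting to optimize the objective (4.1a).

```python
def check_kbm_feasible(x, treat_levels, ctrl_levels, kappa):
    """Check a 0/1 matrix x[i][j] against constraints (4.1b)-(4.1e).

    treat_levels[i] = (covariate-1 level, covariate-2 level) of treatment i,
    ctrl_levels[j] likewise for control j. A sketch of the constraint
    logic only; the objective (4.1a) is not optimized here.
    """
    n, n_prime = len(treat_levels), len(ctrl_levels)
    # (4.1d): every treatment sample is matched to exactly kappa controls
    if any(sum(x[i]) != kappa for i in range(n)):
        return False
    # (4.1e): every control sample is used at most once
    if any(sum(x[i][j] for i in range(n)) > 1 for j in range(n_prime)):
        return False
    # (4.1b)/(4.1c): matched controls per level = kappa * treatment count
    for cov in (0, 1):
        levels = {t[cov] for t in treat_levels} | {c[cov] for c in ctrl_levels}
        for q in levels:
            treat_q = sum(1 for t in treat_levels if t[cov] == q)
            matched_q = sum(x[i][j] for i in range(n) for j in range(n_prime)
                            if ctrl_levels[j][cov] == q)
            if matched_q != kappa * treat_q:
                return False
    return True

assert check_kbm_feasible([[1, 0]], [(0, 0)], [(0, 0), (1, 0)], kappa=1)
assert not check_kbm_feasible([[0, 1]], [(0, 0)], [(0, 0), (1, 0)], kappa=1)
```

Since every treatment sample is matched exactly κ times by (4.1d), checking against κ times the full level count ℓ_{1,q} is equivalent to the statement of (4.1b).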
Constraints (4.1d) assign each treatment sample to κ control samples, and constraints (4.1e) specify that each control sample cannot be matched to more than one treatment sample.

Next we prove that the 2-covariate BM problem is NP-hard by reducing from the 3-SAT problem, one of Karp's 21 NP-hard problems [13], [6].

3-SAT: Given clauses C_1, C_2, ..., C_m, each consisting of 3 literals from the set {v_1, v_2, ..., v_n} ∪ {v̄_1, v̄_2, ..., v̄_n}, is the conjunction of the given clauses satisfiable?

Theorem
4.1. The 2-covariate BM problem is NP-hard.

Proof. Given an instance of the 3-SAT problem with clauses C_1, C_2, ..., C_m and variables v_1, v_2, ..., v_n, we construct an instance of the 2-covariate BM problem. We define two types of gadgets: the "variable" gadgets, one for each variable, and the "clause" gadgets, one for each clause. The distances between the treatment samples and the control samples are set to 0 or 1. We construct a 2-covariate BM problem such that the 3-SAT problem is satisfiable if and only if the optimal objective value is 0. Moreover, the zero-distance matching implies the values of the 3-SAT variables in a truth assignment.

First, we define the variable gadgets. Consider the jth variable v_j that appears p_j times as v_j and q_j times as v̄_j. We have the following 2p_j + 2q_j − 2 control samples in the jth variable gadget: a_i^{(j)} for i = 0, 1, ..., q_j − 1, b_i^{(j)} for i = 1, ..., q_j − 1, c_i^{(j)} for i = 0, 1, ..., p_j − 1, and d_i^{(j)} for i = 1, ..., p_j − 1. The p_j + q_j − 1 treatment samples in the v_j gadget are: e_i^{(j)} for i = 0, 1, ..., q_j − 1 and f_i^{(j)} for i = 1, ..., p_j − 1. (We also call the e_0^{(j)} sample f_0^{(j)} for simplicity.) There are p_j + q_j − 1 covariate 1 levels in the v_j gadget, each one consisting of three samples: a pair of control samples out of the 2p_j + 2q_j − 2 control samples and one treatment sample out of the p_j + q_j − 1 treatment samples. The pairs of control samples are {a_i^{(j)}, b_{i+1}^{(j)}} for i = 0, 1, ..., q_j − 2, {c_i^{(j)}, d_{i+1}^{(j)}} for i = 0, 1, ..., p_j − 2, and {a_{q_j−1}^{(j)}, c_{p_j−1}^{(j)}}. Since each of these levels has one treatment sample, exactly one control sample out of each of those pairs needs to be matched.

The zero-distance pairs for the v_j gadget are [e_i^{(j)}, a_i^{(j)}] for i = 0, 1, ..., q_j − 1, [e_i^{(j)}, b_i^{(j)}] for i = 1, ..., q_j − 1, [f_i^{(j)}, c_i^{(j)}] for i = 0, 1, ..., p_j − 1, and [f_i^{(j)}, d_i^{(j)}] for i = 1, ..., p_j − 1. Each treatment sample has two potential matches with zero distance, and each control sample has one potential match with zero distance. Note that sample e_0^{(j)} = f_0^{(j)} must be matched to either a_0^{(j)} or c_0^{(j)} in a zero-distance matching. So there are two possible zero-distance matchings:
• Case 1 (e_0^{(j)} is matched to a_0^{(j)}): e_i^{(j)} is matched to a_i^{(j)} for i = 0, 1, ..., q_j − 1; f_i^{(j)} is matched to d_i^{(j)} for i = 1, ..., p_j − 1; samples {b_i^{(j)} : i = 1, ..., q_j − 1} and {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are unmatched;
• Case 2 (e_0^{(j)} = f_0^{(j)} is matched to c_0^{(j)}): f_i^{(j)} is matched to c_i^{(j)} for i = 0, 1, ..., p_j − 1; e_i^{(j)} is matched to b_i^{(j)} for i = 1, ..., q_j − 1; samples {a_i^{(j)} : i = 0, 1, ..., q_j − 1} and {d_i^{(j)} : i = 1, ..., p_j − 1} are unmatched.

Consider the first case, in which e_0^{(j)} is matched to a_0^{(j)}. For i = 0, ..., q_j − 2, if treatment sample e_i^{(j)} is matched to a_i^{(j)}, then, as only one control sample from the same level can be matched, b_{i+1}^{(j)} must not be matched. Then, as the only two zero-distance matches for e_{i+1}^{(j)} are a_{i+1}^{(j)} and b_{i+1}^{(j)}, we infer that e_{i+1}^{(j)} must be matched to a_{i+1}^{(j)}. By such induction, we know all samples in {a_i^{(j)} : i = 0, 1, ..., q_j − 1} are matched, and samples in {b_i^{(j)} : i = 1, ..., q_j − 1} are unmatched. Since the only zero-distance match of c_0^{(j)} is e_0^{(j)}, c_0^{(j)} is unmatched in Case 1. For i = 0, ..., p_j − 2, if sample c_i^{(j)} is unmatched, then, as one control sample needs to be matched in each level, we can infer that sample d_{i+1}^{(j)} is matched to its only zero-distance pair f_{i+1}^{(j)}. Therefore, c_{i+1}^{(j)} cannot be matched as its only zero-distance pair is taken. By such induction, we know samples in {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are unmatched, and all samples in {d_i^{(j)} : i = 1, ..., p_j − 1} are matched. With similar arguments, in the second case, in which e_0^{(j)} = f_0^{(j)} is matched to c_0^{(j)}, all samples in {b_i^{(j)} : i = 1, ..., q_j − 1} ∪ {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are matched, and samples in {a_i^{(j)} : i = 0, 1, ..., q_j − 1} ∪ {d_i^{(j)} : i = 1, ..., p_j − 1} are unmatched. In Case 1, we say that we assign variable v_j the value FALSE, and in Case 2 we say that we assign variable v_j the value TRUE.

Next, consider the clause gadgets. For clause C_w that consists of variables v_{w1}, v_{w2}, v_{w3}, we pick three samples without replacement from the variable gadgets that correspond to v_{w1}, v_{w2}, v_{w3} as follows. If the variable appears as literal v_{j(w)} in the clause, we pick one of the samples of the type c_i^{(j(w))} from the v_{j(w)} gadget; if the variable appears as literal v̄_{j(w)}, we pick one of the samples of the type a_i^{(j(w))} from the v_{j(w)} gadget. Since for each occurrence of literal v_j there is a type c_i^{(j)} sample, and for each occurrence of literal v̄_j there is a type a_i^{(j)} sample in the v_j gadget, we can ensure that there are enough samples to be selected from without replacement. The clause gadget is then augmented by eight new samples used as "garbage collectors": three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} and five control samples h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)}.
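The two-case behavior of the variable gadget can be confirmed by brute force for small p_j, q_j. The sketch below (illustrative code, not from the paper) enumerates, for each covariate 1 level, which control sample is matched, and counts the choices that extend to a zero-distance matching of all treatment samples.

```python
from itertools import product, permutations

def count_zero_matchings(p, q):
    """Count, by brute force, the zero-distance matchings of the variable
    gadget for a variable with p positive and q negated occurrences."""
    treats = [("e", i) for i in range(q)] + [("f", i) for i in range(1, p)]
    # covariate 1 levels pair up the control samples
    levels = ([[("a", i), ("b", i + 1)] for i in range(q - 1)]
              + [[("c", i), ("d", i + 1)] for i in range(p - 1)]
              + [[("a", q - 1), ("c", p - 1)]])

    def zero(t, c):
        kind, i = t
        if kind == "e":  # e_0 doubles as f_0, hence it may also take c_0
            return c in {("a", i), ("b", i)} or (i == 0 and c == ("c", 0))
        return c in {("c", i), ("d", i)}

    count = 0
    for choice in product(*levels):  # one matched control per level
        # can the treatments be perfectly matched to the chosen controls?
        if any(all(zero(t, c) for t, c in zip(treats, perm))
               for perm in permutations(choice)):
            count += 1
    return count

assert count_zero_matchings(2, 2) == 2
assert count_zero_matchings(3, 2) == 2
```

For the small parameter values tested, the brute force confirms the claim of the proof: exactly two zero-distance matchings, corresponding to Case 1 and Case 2.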
The fifteen distances between the three treatment samples and the five control samples are set to zero.

We introduce a new covariate 1 level that includes all of the eight new samples, so that it is disjoint from the other levels created for the variable gadgets. Since there are three treatment samples in this level, three of the control samples h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)} have to be matched.

We introduce a new covariate 2 level which consists of five control samples: h′_1^{(w)}, h′_2^{(w)} and the three previously selected samples of type a_i or c_i in the v_{w1}, v_{w2}, v_{w3} gadgets. We also set the three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} to be in this covariate 2 level. Therefore, three out of the five control samples of this level must be matched. Observe that any three control samples in this level must contain at least one sample of type a_i or c_i from a variable gadget. That means the clause is satisfied by our variable assignment rule above: suppose the matched literal appears as v_{j(w)} and the sample selected for this clause gadget is c_i^{(j(w))}; sample c_i^{(j(w))} is matched, so variable v_{j(w)} must be assigned the value TRUE by our previous discussion. Similarly, suppose the matched literal appears as v̄_{j(w)} and the sample selected for this clause gadget is a_i^{(j(w))}; sample a_i^{(j(w))} is matched, so variable v_{j(w)} must be assigned the value FALSE.

For all remaining samples that have not been assigned to a covariate 2 level in the above discussion, i.e., the treatment samples in the variable gadgets and the control samples of type h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, we create a "dump" covariate 2 level that includes all of them.

With the construction above, a zero value matching solution implies a truth assignment of the 3-SAT problem. The values of the 3-SAT variables are determined by which of the two zero-distance matchings is used in each variable gadget. For each clause gadget, at least one of the type a_i or c_i samples is matched, which implies the clause is satisfied by our variable assignment rule.

On the other hand, given a truth assignment of the 3-SAT problem, there is a zero-distance matching for the constructed problem. First match the samples in the variable gadgets as follows: if the variable takes value TRUE, then match the samples as described above in Case 2; if it takes value FALSE, then match the samples as described in Case 1. By doing so, the number of matched control samples in the covariate 1 level associated with each variable gadget equals the number of treatment samples in that level. Next, for each clause gadget, if there is one, respectively two, respectively three satisfied literals, then match two, respectively one, respectively zero samples out of {h′_1^{(w)}, h′_2^{(w)}} to the treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)}. That will match exactly three samples of the corresponding covariate 2 level as required. Finally, for w = 1, ..., m, the remaining unmatched treatment samples in clause C_w are matched to control samples out of {h_1^{(w)}, h_2^{(w)}, h_3^{(w)}}. Since the three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} are matched to control samples in {h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)}}, the numbers of matched samples in the associated covariate 1 level are three for both the treatment and control sides. For either the treatment or control side, the number of matched samples in the dump level under covariate 2 is the total number of matched samples minus the number of matched samples in the other covariate 2 levels. As we have shown that the numbers of matched samples in the covariate 2 level associated with each clause gadget are three for both sides, the numbers of matched samples in the dump level also equal each other for the two groups.
So this is a zero-distance matching for the constructed 2-covariate BM problem.

The proof of Theorem 4.1 can be extended to the 2-covariate κ-BM problem for any constant κ.

Corollary. The 2-covariate κ-BM problem is NP-hard for any constant κ.

Proof. We prove this corollary also by a reduction from the 3-SAT problem with clauses C_1, C_2, ..., C_m and variables v_1, v_2, ..., v_n.

We first construct the same covariate levels, treatment samples and control samples as in the 2-covariate BM instance in the proof of Theorem 4.1. Let t denote the number of treatment samples constructed. The distance between each treatment sample and each control sample is modified: we change all distances 0 to 1 and all distances 1 to M, for M a large constant which is greater than t. According to the proof of Theorem 4.1, the 3-SAT problem is satisfiable if and only if the optimal objective value of the 2-covariate BM problem with this modified distance is t.

Next, we add more samples to the control group. For each treatment sample constructed, we add κ − 1 new control samples with the same pair of levels, each at distance 0 from this treatment sample and at distance M from every other treatment sample. We claim that the 3-SAT problem is satisfiable if and only if the optimal objective value of the 2-covariate κ-BM problem on this new instance is t.

If the 3-SAT problem is satisfiable, we can find an assignment for the BM problem on the constructed instance with distance t as in the proof of Theorem 4.1. By assigning additionally each treatment sample to its corresponding κ − 1 new control samples, we obtain a solution of total distance t for the κ-BM problem on the constructed instance. Furthermore, this is also an optimal solution for the κ-BM problem, as there are only (κ − 1) · t zero-distance pairs.

On the other hand, if the optimal solution to the 2-covariate κ-BM problem has a total distance of t, then all the (κ − 1) · t zero-distance pairs must be matched. In addition, there must be t matched pairs with distance 1. From the arguments in the proof of Theorem 4.1, we can derive that there is a truth assignment for the 3-SAT problem.

We can further derive that the κ-BM problem is also NP-hard for more than two covariates.

Corollary
The P-covariate κ-BM problem is NP-hard for every value of P ≥ 2 and any constant κ.

Proof. Given any 2-covariate κ-BM problem instance and any P ≥ 3, we can construct an equivalent P-covariate κ-BM problem by adding P − 2 covariates to the 2-covariate κ-BM problem instance and setting the value of the pth covariate to be the same for all samples, for each p = 3, ..., P. Therefore, the NP-hardness of the 2-covariate κ-BM problem implies that the P-covariate κ-BM problem is NP-hard as long as P ≥ 2.

4.2. The 2-covariate BM and 2-covariate κ-BM problems where one covariate has a constant number of levels. Let BM′ be the special case of the 2-covariate BM problem where the second covariate has a constant number of levels while the first covariate has no restriction on the number of levels. In Section 6 we will establish that if both covariates have a constant number of levels then the 2-covariate BM problem is polynomial time solvable. We show here that the complexity status of the 2-covariate problem in which only one covariate has a constant number of levels is linked to the complexity status of the exact matching problem and its weighted version, denoted as weighted exact matching. In order to present this connection we assume that the distance matrix is integral and all distances are given in unary, that is, there is a polynomial π of the input encoding length such that δ_{ij} ≤ π for all i, j.

The exact matching in bipartite graphs problem is defined as follows. The input is an integer number k together with a bipartite graph G = (V_1 ∪ V_2, E) with |V_1| = |V_2| = q, and its edge set E is partitioned into E_b ∪ E_r, where E_b is the set of blue edges and E_r is the set of red edges. The exact matching problem is to find a perfect matching that has exactly k blue edges (and all other q − k edges are red). The complexity status of the exact matching problem is as follows. While [16] showed that there is a randomized polynomial time algorithm for the problem, the existence of a deterministic polynomial time algorithm is still an important open problem.

The weighted exact matching problem is defined as follows.
The input is a bipartite graph G = (V, E) together with non-negative integral distances δ_e for all e ∈ E, where there is a polynomial π of |V| + |E| such that δ_e ≤ π for all e ∈ E. We are also given a target value K. The goal is to find a perfect matching of total distance exactly K. Note that the weighted exact matching problem is a generalization of the exact matching problem, since the latter problem can be interpreted as the weighted exact matching problem where the weight of a blue edge is 1 and the weight of a red edge is 0. Thus, a polynomial time algorithm for the weighted exact matching problem gives a polynomial time algorithm for the exact matching problem. On the other hand, it is known that a polynomial time algorithm for the exact matching problem gives a polynomial time algorithm for the weighted exact matching problem (see Proposition 1 in [17]). If the algorithm for the exact matching is deterministic (randomized), then the algorithm for the weighted exact matching is deterministic (randomized, respectively) as well [17]. Therefore, the complexity of the weighted problem has the same status as that of the exact matching problem. Namely, the result of [16] gives a randomized polynomial time algorithm for the weighted exact matching problem, while the existence of a deterministic polynomial time algorithm for this problem would result in a deterministic polynomial time algorithm for the exact matching in bipartite graphs problem.

We show the following connections between the exact matching problem (or the weighted exact matching) and problem BM′.

Theorem
If there is a deterministic (or randomized) polynomial time algorithm for BM′ then there is a deterministic (or randomized, respectively) polynomial time algorithm for exact matching in bipartite graphs.

Proof.
Assume that there is a polynomial time algorithm ALG for BM′; we will establish the existence of a polynomial time algorithm for the exact matching problem. Given an input to the exact matching problem with 2q nodes (q nodes on each side of the bipartite graph), we denote the bipartition of the graph by V_1 ∪ V_2 and the partition of the edge set by E_b ∪ E_r, and we let k be the number of required blue edges in the matching. We define the following input for BM′. We will associate samples with nodes, so the control group consists of 2q nodes and the treatment group consists of q nodes. For every node v ∈ V_1, we have two nodes in the control group corresponding to v: a red node v_r and a blue node v_b. All blue edges in E_b that were incident to v in the input to the exact matching problem are now incident to v_b, and all red edges that were incident to v are now incident to v_r. These edges corresponding to original edges in the input for the exact matching instance have zero distance, while all other distances are set to 1. The nodes in V_2 (of the original input graph to the exact matching) are the treatment nodes, so the distances we defined represent the distances between a treatment node and a control node.

The levels of the first covariate are defined such that every pair [v_r, v_b] for v ∈ V_1 defines one level of the first covariate, and we have one treatment node in each such level. Observe that the number of levels of the first covariate is q, and we have q nodes in the treatment group, so this assignment of levels of the first covariate is feasible. Next, consider the second covariate. We will have two levels of the second covariate, corresponding to blue and red. The red level of the control is the set of all red nodes, and the blue level of the control is the set of all blue nodes. The second covariate levels of the treatment nodes are defined so that there are exactly k treatment nodes in the blue level and the remaining q − k treatment nodes in the red level.

We apply algorithm ALG on the BM′ instance and check if the output cost is zero or strictly positive. In any feasible solution of the BM′ instance defined, exactly one of the two control nodes [v_r, v_b] is matched for each v ∈ V_1, as there is exactly one treatment node in each level of the first covariate. And since we need to match k blue control nodes and q − k red control nodes, a zero distance matching of the BM′ instance represents a set of edges in the exact matching instance that is a perfect matching consisting of k blue edges and q − k red edges.

Observe that this construction takes deterministic polynomial time. Thus the algorithm that constructs the input to BM′ and applies ALG on that input is a deterministic (randomized) polynomial time algorithm for the exact matching problem if ALG is a deterministic (randomized, respectively) polynomial time algorithm for BM′.

We next consider the other direction.
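The node-splitting construction of the proof above can be sketched as follows. This is an illustrative builder (names and level encodings are ad hoc); the assignment of treatment nodes to covariate 1 levels, one per level, is left arbitrary here, as any bijection works.

```python
def exact_matching_to_bm(V1, V2, blue, red, k):
    """Build the BM' instance from an exact-matching instance (V1, V2,
    blue/red edge sets as pairs (v, u) with v in V1, u in V2, target k):
    each v in V1 yields control samples v_r, v_b; original edges become
    zero-distance pairs and everything else gets distance 1."""
    controls = [(v, "r") for v in V1] + [(v, "b") for v in V1]
    treatments = list(V2)
    zero_pairs = ({(u, (v, "b")) for (v, u) in blue}
                  | {(u, (v, "r")) for (v, u) in red})
    dist = {(t, c): 0 if (t, c) in zero_pairs else 1
            for t in treatments for c in controls}
    cov1 = {c: c[0] for c in controls}   # one level {v_r, v_b} per v in V1
    cov2 = {c: c[1] for c in controls}   # control side of the blue/red levels
    # k treatment nodes go to the blue level, the remaining q - k to red
    treat_cov2 = {t: ("b" if i < k else "r") for i, t in enumerate(treatments)}
    return treatments, controls, dist, cov1, cov2, treat_cov2

V1, V2 = [1, 2], ["x", "y"]
blue, red = {(1, "x")}, {(1, "y"), (2, "y")}
T, Ctl, dist, cov1, cov2, tc2 = exact_matching_to_bm(V1, V2, blue, red, k=1)
assert len(Ctl) == 2 * len(V1) and len(dist) == len(T) * len(Ctl)
assert sum(1 for d in dist.values() if d == 0) == len(blue) + len(red)
```

A zero-cost BM′ solution on this instance then selects, per level, either v_r or v_b, which encodes the red/blue choice of the perfect matching.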
Theorem
If there is a deterministic (or randomized) polynomial time algorithm for weighted exact matching in bipartite graphs then there is a deterministic (or randomized, respectively) polynomial time algorithm for BM′.

Proof.
Assume that there is a polynomial time algorithm for the weighted exact matching problem in bipartite graphs; we will establish the existence of a polynomial time algorithm for BM′. Here we are going to use the fact that the maximum distance is at most π, and without loss of generality we assume that n′ ≤ π. We set ε = 1/(2nπ² + 1), and we define a multi-objective optimization problem. As a step in the algorithm for solving BM′, we will find a (1 + ε)-approximate Pareto set of perfect matchings in the following graph with the following multi-criteria objective.

We consider a complete bipartite graph where one side consists of the control group, namely one node for each control sample; the other side of the graph corresponds to the treatment group together with some additional nodes. If we have in the instance of BM′ a control group of size n′ and a treatment group of size n, then we will have exactly n′ − n additional nodes. The graph that we consider is the complete bipartite graph with n′ nodes on each side. In this graph a feasible solution is a perfect matching of all nodes.

Next, we define the 1 + k objectives, for constant k being the number of levels of the second covariate. The first objective corresponds to the total distance, and there is one of the remaining objectives for each level of the second covariate. These objectives are sums of costs of the edges in the perfect matching with different coefficients. We denote these cost coefficients as a vector for each edge, so the first coefficient of this cost vector is the coefficient of the cost function of the first objective. In order to define the first cost coefficient, we assign a covariate 1 level to the additional nodes as follows. For each level p of the first covariate, we assign exactly ℓ′_{1,p} − ℓ_{1,p} additional nodes to have level p.
Note that without loss of generality we have ℓ′_{1,p} ≥ ℓ_{1,p} for all p (as otherwise BM′ is infeasible), and thus our definition of levels of the first covariate for the additional nodes indeed defines such a level for every additional node. Consider an edge [u, v] between an additional node u and a control sample node v. The first cost coefficient of edge [u, v] is 0 if the two nodes share the same level of covariate 1; otherwise it is set to 2nπ. The other k cost coefficients of such an edge [u, v] are set to 0. Consider next an edge [u′, v] where u′ is a treatment node (and not an additional node), and assume that the second covariate level of v is p. The first cost coefficient is the distance between samples u′ and v in the BM′ instance (that is, δ_{u′,v}). The (p + 1)th cost coefficient is set to 1 and the other k − 1 cost coefficients are set to 0. So the (p + 1)th objective is to minimize the number of matched edges adjacent to level-p control nodes for covariate 2.

Observe that every feasible solution, namely, every perfect matching in this bipartite graph, has the property that the sum of the costs according to all objectives excluding the first one is exactly n. Furthermore, observe that in a Pareto set of this multi-criteria optimization problem, there is exactly one point where, for every p = 1, 2, ..., k, the total cost of the matched edges according to the (p + 1)th objective is exactly the number of treatment samples of the pth level of the second covariate. We will refer to this point as the candidate feasible point of the Pareto set.

Now, we use the result of Papadimitriou and Yannakakis [18] to conclude that the algorithm for weighted exact matching gives the required algorithm for approximating the Pareto set and finding the (1 + ε)-approximate Pareto set (see Theorem 4 and Corollary 5 in [18]).
We can use the results of [18] since the maximum cost coefficient of an edge in our instance is 2nπ, which is polynomially bounded in the input encoding length, and the number of objectives is a constant.

We next argue that the (1 + ε)-approximate Pareto set is actually the Pareto set of the multi-criteria problem. The cost vectors of the edges are always integral and have a maximum coefficient of at most 2nπ, and thus for every objective the cost of the matching is at most 2nπ². Therefore, if we approximate this objective with an approximation ratio of 1 + ε, then by our choice of ε we get the optimal value of this objective. By the last claim we conclude that the candidate feasible point of the Pareto set is one of the solutions that appear in this Pareto set, and we consider this specific solution.

We delete from the candidate solution the edges (in the matching) that are adjacent to the additional nodes. By the notion of the candidate solution, the fine balance constraints of the second covariate are satisfied. So if the resulting matching is not a feasible solution to the BM′ instance, it means that there is at least one additional node that used to be matched (in the candidate solution) to a control sample of a different covariate 1 level. This is so as otherwise, for every level of the first covariate, the number of selected control samples of this level is the same as the number of treatment samples of this level. Consequently, if the resulting matching is infeasible for BM′, then the first objective value of the candidate solution is at least 2nπ, which implies that the BM′ problem is infeasible, as we show next. If BM′ has a feasible solution, then we can create an alternative candidate solution by adding to this solution for BM′ a zero distance matching of the additional nodes. This alternative solution has a first objective value that is less than or equal to nπ, and all other objective values are the same as the candidate solution.
So according to the (1+ε) Pareto optimality, the resulting matching must be feasible for the BM' problem.

In summary, we consider the candidate solution for BM' obtained from the candidate feasible point of the Pareto set after deleting the edges adjacent to the additional nodes. We check if the candidate solution for BM' satisfies the fine balance constraints. If it does, then this is the output of the algorithm for BM', and if it does not, then the BM' instance is infeasible. Furthermore, if we use the existing algorithm of [16] to solve the weighted exact matching, then the resulting algorithm is a randomized polynomial time algorithm, whereas if we use a deterministic polynomial time algorithm, then the resulting algorithm for BM' is also a deterministic polynomial time algorithm.

Next we consider κ-BM', the special case of 2-covariate κ-BM where the second covariate has a constant number of levels, and once again we assume that the distance matrix is integral and the maximum distance is upper bounded by a polynomial π of the input encoding length. We show that κ-BM' has the same complexity status as BM' (for all κ ≥ 2).

Theorem.
There is a polynomial time algorithm for BM' if and only if there is a polynomial time algorithm for κ-BM'.

Proof. Assume that there is a polynomial time algorithm ALG for BM'. Consider an instance of the κ-BM' problem, and replace every treatment sample by κ copies of it with the same pair of levels as the original element of the treatment group. We define the distance matrix as follows. The distance between a treatment sample x that is a copy of the original treatment sample x′ (of the instance for the κ-BM') and a control sample y is now defined as the distance between x′ and y. The resulting treatment group and control group form the instance for BM', and we apply ALG on that instance. A feasible solution for this BM' instance gives a feasible solution for the κ-BM' instance (simply by matching a treatment sample to a control sample if one of the copies of the treatment sample was matched to that control sample) of the same total distance. Similarly, a feasible solution to the original κ-BM' instance gives a feasible solution for the BM' instance of the same cost, by matching the set of κ control samples matched to a treatment sample to the κ copies of this treatment sample in the BM' instance (with one control sample matched to each copy). Thus, we get a polynomial time algorithm for the κ-BM' problem.

Consider the other direction. Assume that we are given a polynomial time algorithm ALG for κ-BM', and we establish the existence of a polynomial time algorithm for BM'. Consider an instance of the BM' problem. First, we add (n+1)π to every component of the distance matrix. This modification of the distance matrix ensures that every feasible solution for BM' has a cost that is at least (n²+n)π and at most (n²+2n)π. Next, for every treatment sample x we add another κ−1 dummy control samples whose distance from x is 0 and whose distance from every other treatment sample is (n²+2n)·π+1. These κ−1 dummy control samples per treatment sample give an instance of κ-BM' on which we apply ALG. The output has cost B ≤ (n²+2n)·π
if and only if the solution we obtain by deleting all the dummy control samples is a feasible solution for BM' of cost B. To prove the last claim, note that the solution for the κ-BM' instance cannot match dummy control samples to treatment samples if their distance is not zero, because the distance of such a match is larger than B and all distances are non-negative. Furthermore, every treatment sample is matched to exactly κ−1 dummy control samples and one original control sample, since otherwise the solution would contain at least n+1 edges of modified distance at least (n+1)·π, of total cost at least (n+1)²·π > (n²+2n)·π. Thus, by deleting the dummy control samples we get a feasible solution for the BM' problem (after the modification of the distance matrix) of the same cost. Similarly, if we take an optimal solution for the BM' problem (before or after the modification of the distances) and add to it the dummy control samples that are matched using the zero distances, then we get an optimal solution for the κ-BM'. Therefore, by applying ALG on the last κ-BM' problem we get a polynomial time algorithm for solving the original BM' instance.
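The first direction of this reduction is mechanical enough to sketch in code. This is a sketch under our own naming: `dist` is the κ-BM' distance matrix, indexed first by treatment sample and then by control sample.

```python
def expand_to_bm(dist, kappa):
    """Replace each treatment sample by kappa copies that inherit its
    distances, turning a kappa-BM' distance matrix into a BM' one.
    Returns the expanded matrix and a map from copy index to original sample."""
    new_dist, origin = [], []
    for t, row in enumerate(dist):
        for _ in range(kappa):
            new_dist.append(list(row))  # each copy keeps t's distances
            origin.append(t)
    return new_dist, origin
```

Matching a control sample in the BM' solution back to `origin[i]` for copy i recovers the κ-BM' matching of the same total distance, as in the proof.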
5. The maximum selection κ-fine-balance matching (κ-MSBM) problem. Since any κ-BM problem can be solved as a κ-MSBM problem with the same κ and the same number of covariates, Corollary 4.3 implies that the P-covariate κ-MSBM problem is also NP-hard for P ≥ 2 and every κ. In section 2 we show that the 1-covariate MSBM problem can be solved as an MCNF problem in polynomial time. We show next that the 1-covariate κ-MSBM problem is NP-hard for any constant κ ≥ 3.

Theorem 5.1.
For any constant value of κ such that κ ≥ 3, the 1-covariate κ-MSBM problem is NP-hard even with only one level.

Proof. Given is an instance of Exact-3-cover, namely a collection C of 3-element subsets (triplets) of a ground set E with |E| = 3q for some integer q. We will define an instance of 1-covariate κ-MSBM with only one level, for any constant κ such that κ ≥ 3. In the constructed instance, we have one treatment sample for every triplet in C, and we have one control sample for every element e ∈ E. In addition we have (κ−3)·q dummy control samples. For each triplet T ∈ C, the distance of the corresponding treatment sample to a control sample is defined as follows: it is zero if the control sample is one of the dummy samples or if the control sample is an element of the triplet T. All other distances (that are still undefined) are set to one. Observe that a feasible solution for the κ-MSBM instance selects the entire control group and selects exactly q treatment samples. We claim that in this instance, an optimal solution for κ-MSBM has total distance zero if and only if the Exact-3-cover instance is a YES instance.

To see the last claim, assume first that there is a subcollection of triplets C′ ⊆ C such that every element of E appears in exactly one triplet in C′. Then we construct a zero-distance solution for the κ-MSBM instance as follows. We select the samples C′ of the treatment group, and each such selected sample T ∈ C′ is matched to the control samples consisting of the three element samples in T together with κ−3 of the dummy control samples. Since C′ has q triplets, we have a sufficient number of additional control samples. Furthermore, since every element in E appears only once in triplets of C′, we conclude that its control sample is matched to exactly one selected treatment sample.

On the other hand, assume that there is a zero-distance solution for the κ-MSBM instance. Then this solution selects exactly q treatment samples corresponding to q triplets. Denote by C′ ⊆ C this subcollection of q triplets. Note that if there is a selected treatment sample that is matched to at least κ−2 dummy control samples, then it is matched to at most two element samples of E; since there are only (κ−3)·q dummy samples, some other selected treatment sample is then matched to at least four control samples corresponding to elements of E, hence to an element outside its triplet, so at least one of those matches has distance one. This contradicts the assumption that the cost of the κ-MSBM solution is zero.
Therefore, every selected treatment sample is matched to exactly κ−3 dummy control samples and to the three element samples of its triplet. Hence every element of E appears in exactly one triplet of C′, so the Exact-3-cover instance is a YES instance.

For the 1-covariate κ-MSBM problem where κ = 2, the complexity status remains open.
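The distance matrix of this reduction can be sketched directly. The helper name is ours; controls are listed as the elements of E followed by the (κ−3)·q dummies.

```python
def build_kmsbm_instance(triplets, elements, kappa):
    """Distance matrix of the 1-covariate kappa-MSBM instance built from an
    Exact-3-cover instance (kappa >= 3), following the reduction above:
    distance 0 to dummies and to the elements of the triplet, 1 otherwise."""
    q = len(elements) // 3          # |E| = 3q
    controls = list(elements) + ["dummy"] * ((kappa - 3) * q)
    return [[0 if (c == "dummy" or c in T) else 1 for c in controls]
            for T in triplets]
```

A zero-cost κ-MSBM solution over this matrix then corresponds exactly to an exact cover, as argued in the proof.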
6. Fixed-parameter tractable algorithms.
In this section, we consider the special cases of the κ-FBS, κ-BM, and κ-MSBM problems where all covariates have a small number of levels. Let K = ∏_{p=1}^{P} k_p be the number of level-intersections. Observe that if the number of covariates is constant and all covariates have a constant number of levels, then K is a constant. We note that the problems κ-FBS, κ-BM, and MSBM can be solved in fixed-parameter tractable (FPT) time with parameter K. In order to state these results, we say that a problem is fixed-parameter tractable with parameter K, and denote it by FPT(K), if it has an algorithm whose time complexity is upper bounded by a function of the form f(K) · poly, where f(K) is some computable function of the parameter K and poly is some polynomial of the input binary encoding length; similarly, we say that an algorithm runs in FPT(K) time if its time complexity can be upper bounded by a function of this form. Here we show that these problems, namely the κ-FBS and κ-BM problems for all κ, and the MSBM problem, are FPT(K). Similar results for κ-MSBM where κ ≥ 3 are impossible unless P = NP, as shown in Theorem 5.1. The complexity status of the 2-MSBM problem with constant K is open.

Our proof for the FPT(K) results uses the existence of fast algorithms for solving integer programming in fixed dimension and for solving mixed-integer linear programs if the number of integral variables is fixed. Lenstra [14] (see also [12] for an improved time complexity of these algorithms) showed that the integer linear programming problem with a fixed number of variables is polynomially solvable, and he also showed that a mixed-integer linear program with a fixed number of integer variables can be solved in polynomial time. In fact, these algorithms run in FPT time with the parameter being the number of integral variables.
Therefore, to prove our results we show either an integer programming (IP) formulation with O(K) decision variables, or a mixed-integer linear program (MILP) with O(K) integer variables such that solving this MILP to optimality ensures that the resulting solution is integral and solves the corresponding problem.

κ-FBS problem. First consider the κ-FBS problem. For this problem we use an integer program with dimension O(K) that is based on (IP-FBS). Let u_{i_1,i_2,...,i_P} = |L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}| and u′_{i_1,i_2,...,i_P} = |L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}| for i_p = 1, ..., k_p, p = 1, ..., P. The decision variables are:

x_{i_1,i_2,...,i_P}: the number of treatment samples selected from the (i_1, i_2, ..., i_P) level intersection L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P;

x′_{i_1,i_2,...,i_P}: the number of control samples selected from the (i_1, i_2, ..., i_P) level intersection L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P.

The integer programming formulation is:

max Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} ··· Σ_{i_P=1}^{k_P} x_{i_1,i_2,...,i_P}    (6.1a)

s.t. κ · Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} x_{i_1,i_2,...,i_P} = Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} x′_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1b)

0 ≤ x_{i_1,i_2,...,i_P} ≤ u_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1c)

0 ≤ x′_{i_1,i_2,...,i_P} ≤ u′_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1d)

x_{i_1,i_2,...,i_P}, x′_{i_1,i_2,...,i_P} integers,   p = 1, ..., P, i_p = 1, ..., k_p    (6.1e)

Note that this integer programming formulation has 2K decision variables and O(K) constraints, and thus the algorithm that constructs it and solves it to optimality runs in FPT(K) time. The optimal solution for this integer program encodes the optimal solution for κ-FBS, similarly to the corresponding proof in section 2.
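For intuition, formulation (6.1) can be checked by brute force on a toy instance. This is our own sketch, feasible only for tiny K and small capacities; it is not the Lenstra-based FPT algorithm, which solves the IP in f(K)·poly time.

```python
from itertools import product

def kfbs_brute_force(u, u_prime, kappa, P):
    """Maximize the number of selected treatment samples subject to the
    kappa-fine-balance constraints (6.1b) and the bounds (6.1c)-(6.1d).
    u, u_prime: dicts mapping level-intersection tuples to capacities."""
    keys = sorted(u)
    best = 0
    for xs in product(*(range(u[t] + 1) for t in keys)):
        for xps in product(*(range(u_prime[t] + 1) for t in keys)):
            # constraint (6.1b): for each covariate p and level v, the selected
            # controls at level v must be exactly kappa times the treatments
            if all(kappa * sum(x for t, x in zip(keys, xs) if t[p] == v)
                   == sum(x for t, x in zip(keys, xps) if t[p] == v)
                   for p in range(P)
                   for v in {t[p] for t in keys}):
                best = max(best, sum(xs))
    return best
```

For one covariate with two levels, u = {(0,): 2, (1,): 1}, u′ = {(0,): 4, (1,): 2} and κ = 2, the brute force selects all three treatment samples.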
κ-BM problem. Next consider the κ-BM problem. In section 2, we describe an MCNF formulation when the level intersection sizes s′_{i_1,i_2,...,i_P} for i_p = 1, ..., k_p, p = 1, ..., P are given. Observe that if we treat the sizes s′_{i_1,i_2,...,i_P} for all p and i_p as decision variables, then by enforcing the integrality of these K variables and adding the constraints

Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s′_{i_1,i_2,...,i_P} = κ · ℓ_{p,i_p},   i_p = 1, ..., k_p, p = 1, ..., P,

forcing the κ-fine balance constraints, to the MCNF formulation, we get an MILP formulation of κ-BM with K integral variables. In fact, if we restrict ourselves to common integral values of these K variables, then the other decision variables are integral, as we argue next. By considering the values of these K integral variables as constants, the resulting linear programming formulation is in fact an MCNF LP formulation whose supply/demand vector depends on the values of these K integral variables. Thus, the optimal solution for the MILP is without loss of generality integral, and even if it does not satisfy this integrality requirement, it can be transformed in polynomial time to another optimal solution that is integral.

Since the number of variables of the resulting mixed-integer program is at most n · n′ + K, the number of integer variables is K, and the number of constraints is O(n · n′), we conclude that the algorithm that formulates this MILP and solves it to optimality, guaranteeing that the optimal solution is integral, runs in FPT(K) time.

κ-MSBM problems. We know from Theorem 5.1 that the 1-covariate κ-MSBM problem for κ ≥ 3 is NP-hard, so we consider here the MSBM problem. In section 2, we describe an MCNF formulation when the level intersection sizes s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} are given. Observe that if we treat s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} as decision variables, then by enforcing the integrality of these O(K) variables and adding the constraints

Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s_{i_1,i_2,...,i_P} = Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s′_{i_1,i_2,...,i_P},   i_p = 1, ..., k_p, p = 1, ..., P,

that is, the fine balance constraints, in addition to the constraint saying that the sum over all s_{i_1,i_2,...,i_P} equals the objective function value of κ-FBS, to the MCNF formulation, we get an MILP formulation of κ-MSBM with 2K integral variables. In fact, if we restrict ourselves to common integral values of these 2K variables, then the other decision variables are without loss of generality integral as well, as we argue next. By considering the values of these 2K integral variables as constants, the resulting linear programming formulation is in fact an MCNF LP formulation whose supply/demand vector depends on the values of these 2K integral variables. Thus, the optimal solution for the MILP is without loss of generality integral, and even if it does not satisfy this integrality requirement, it can be transformed in polynomial time to another optimal solution that is integral.

Since the number of variables of the resulting mixed-integer program is at most n · n′ + 2K, the number of integer variables is 2K, and the number of constraints is O(n · n′), we conclude that the algorithm that formulates this MILP and solves it to optimality, guaranteeing that the optimal solution is integral, runs in FPT(K) time.

Appendix A. The minimum cost network flow.
We formulate here the minimum cost network flow problem (MCNF). The input to the problem is a graph G = (V, A) with a set of nodes V and a set of arcs A, where each arc (i, j) ∈ A is associated with a cost c_ij, a capacity upper bound u_ij, and a capacity lower bound l_ij. Each node i ∈ V has a supply b_i, which is interpreted as a demand if negative, and can be 0. Let x_ij be the amount of flow on arc (i, j) ∈ A. The flow vector x is said to be feasible if it satisfies:

(1) Flow balance constraints:
For every node k ∈ V, Outflow(k) − Inflow(k) = b_k.

(2) Capacity constraints:
For each arc (i, j) ∈ A, l_ij ≤ x_ij ≤ u_ij.

The linear programming formulation of the problem is:

(MCNF)  min Σ_{(i,j)∈A} c_ij x_ij
        subject to  Σ_{j:(k,j)∈A} x_kj − Σ_{i:(i,k)∈A} x_ik = b_k,   ∀ k ∈ V
                    l_ij ≤ x_ij ≤ u_ij,   ∀ (i, j) ∈ A.

The flow balance constraint coefficients form a {0, 1, −1}-matrix where in each column there is exactly one 1 and one −1. Such a matrix is a special case of matrices where each column (or row) has at most one 1 and at most one −1, which are known to be totally unimodular.
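The total unimodularity claim can be sanity-checked on a small instance by enumerating all square submatrices of the node-arc incidence matrix. This is an exhaustive check we wrote for illustration; it is exponential in the matrix size and only meant for tiny graphs.

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion (fine for tiny integer matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def incidence(n_nodes, arcs):
    """Node-arc incidence matrix: +1 at the tail of each arc, -1 at its head."""
    M = [[0] * len(arcs) for _ in range(n_nodes)]
    for a, (i, j) in enumerate(arcs):
        M[i][a], M[j][a] = 1, -1
    return M

def is_totally_unimodular(M):
    """Check that every square submatrix has determinant in {-1, 0, 1}."""
    for k in range(1, min(len(M), len(M[0])) + 1):
        for R in combinations(range(len(M)), k):
            for C in combinations(range(len(M[0])), k):
                if det([[M[r][c] for c in C] for r in R]) not in (-1, 0, 1):
                    return False
    return True
```

By total unimodularity, the MCNF linear program with integral supplies and capacities has integral basic optimal solutions, which is the property used repeatedly in the MILP arguments above.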
REFERENCES

[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall, 1993.
[2] M. Bennett, J. P. Vielma, and J. R. Zubizarreta, Building representative matched samples with multi-valued treatments in large observational studies, Journal of Computational and Graphical Statistics, (2020), pp. 1–29, https://doi.org/10.1080/10618600.2020.1753532.
[3] M. A. Brookhart, S. Schneeweiss, K. J. Rothman, R. J. Glynn, J. Avorn, and T. Stürmer, Variable selection for propensity score models, American Journal of Epidemiology, 163 (2006), pp. 1149–1156.
[4] R. Busacker and P. J. Gowen, A procedure for determining minimal-cost network flow patterns, ORO Technical Report 15, Operational Research Office, Johns Hopkins University, 1961.
[5] J. Edmonds and R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, Journal of the ACM, 19 (1972), pp. 248–264, https://doi.org/10.1145/321694.321699.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, NY, USA, 1979.
[7] D. E. Ho, K. Imai, G. King, and E. A. Stuart, Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference, Political Analysis, 15 (2007), pp. 199–236.
[8] D. S. Hochbaum and X. Rao, Network flow methods for the minimum covariates imbalance problem, arXiv preprint arXiv:2007.06828, (2020).
[9] G. W. Imbens, Nonparametric estimation of average treatment effects under exogeneity: A review, Review of Economics and Statistics, 86 (2004), pp. 4–29.
[10] M. Iri, A new method of solving transportation-network problems, Journal of the Operations Research Society of Japan, 3 (1960), p. 2.
[11] W. S. Jewell, Optimal flow through networks, in Operations Research, vol. 6, 1958, pp. 633–633.
[12] R. Kannan, Improved algorithms for integer programming and related lattice problems, in Proceedings of the 15th Annual ACM Symposium on Theory of Computing, ACM, 1983, pp. 193–206, https://doi.org/10.1145/800061.808749.
[13] R. M. Karp, Reducibility among combinatorial problems, in Complexity of Computer Computations, Springer, 1972, pp. 85–103.
[14] H. W. Lenstra, Jr., Integer programming with a fixed number of variables, Mathematics of Operations Research, 8 (1983), pp. 538–548.
[15] S. L. Morgan and D. J. Harding, Matching estimators of causal effects: Prospects and pitfalls in theory and practice, Sociological Methods & Research, 35 (2006), pp. 3–60.
[16] K. Mulmuley, U. Vazirani, and V. Vazirani, Matching is as easy as matrix inversion, Combinatorica, 7 (1987), pp. 105–113.
[17] C. H. Papadimitriou and M. Yannakakis, The complexity of restricted spanning tree problems, Journal of the ACM, 29 (1982), pp. 285–309.
[18] C. H. Papadimitriou and M. Yannakakis, On the approximability of trade-offs and optimal access of web sources, in Proceedings of the 41st Annual Symposium on Foundations of Computer Science, IEEE, 2000, pp. 86–92.
[19] S. D. Pimentel, R. R. Kelz, J. H. Silber, and P. R. Rosenbaum, Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons, Journal of the American Statistical Association, 110 (2015), pp. 515–527, https://doi.org/10.1080/01621459.2014.997879.
[20] P. R. Rosenbaum, Overt bias in observational studies, in Observational Studies, Springer, 2002, pp. 71–104.
[21] P. R. Rosenbaum, R. N. Ross, and J. H. Silber, Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer, Journal of the American Statistical Association, 102 (2007), pp. 75–83, https://doi.org/10.1198/016214506000001059.
[22] D. B. Rubin, E. A. Stuart, et al., Affinely invariant matching methods with discriminant mixtures of proportional ellipsoidally symmetric distributions, The Annals of Statistics, 34 (2006), pp. 1814–1826.
[23] E. A. Stuart, Matching methods for causal inference: A review and a look forward, Statistical Science, 25 (2010), p. 1, https://doi.org/10.1214/09-STS313.
[24] N. Tomizawa, On some techniques useful for solution of transportation network problems, Networks, 1 (1971), pp. 173–194, https://doi.org/10.1002/net.3230010206.
[25] D. Yang, D. S. Small, J. H. Silber, and P. R. Rosenbaum, Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes, Biometrics, 68 (2012), pp. 628–636, https://doi.org/10.1111/j.1541-0420.2011.01691.x.
[26]