Algorithms and Complexity for Variants of Covariates Fine Balance
DORIT S. HOCHBAUM∗, ASAF LEVIN†, AND XU RAO‡

Abstract.
We study here several variants of the covariates fine balance problem, where we generalize some of these problems and introduce a number of others. We present a comprehensive complexity study of the covariates problems, providing polynomial time algorithms or a proof of NP-hardness. The polynomial time algorithms described are mostly combinatorial and rely on network flow techniques. In addition, we present several fixed-parameter tractable results for problems where the number of covariates and the number of levels of each covariate are seen as a parameter.
Key words.
Algorithms; Complexity; Covariate balance; Observational studies.
1. Introduction.
The problem of balancing covariates arises in observational studies in various contexts such as statistics [20], [22], epidemiology [3], sociology [15], economics [9] and political science [7]. In an observational study there are two disjoint groups of samples, one of treatment samples and the other of control samples. Each of the samples in the two groups is characterized by several observed covariates, or features.

When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treatment and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treatment and control groups, thereby reducing bias in the estimated treatment effects due to the observed covariates. The matching is to assign each treatment sample to one unique control sample, or, in other setups, to assign each treatment sample to a unique set of κ control samples, for κ a pre-specified integer, where every control sample is assigned to at most one treatment sample. A detailed review of matching-related methods used for covariates balancing problems is given by [23].

In this paper we address various problems of balancing covariates. The covariates here are nominal, in that they take on discrete values or categories. The set of values of each nominal covariate partitions the treatment and control samples into a number of subsets referred to as levels, where the samples at every level share the same covariate value. In an ideal situation the samples of the treatment and the control in each matched pair or matched set belong to the same levels over all covariates. However, satisfying the requirement that matched samples in each pair or set belong to the same levels over all covariates typically results in a very small selection from the treatment and control group, which is not desirable. To address this, Rosenbaum et al.
[21] introduced a weaker requirement to match all treatment samples to a subset of the control samples, called a selection, so that the proportion (or the number, if κ = 1) of control and treatment samples in each level of each covariate are the same. This requirement is known in the literature as fine balance.

To formalize the discussion we introduce essential notation. Let the number of treatment samples be n and the number of control samples be n′. Let the set of all treatment samples be denoted by T, |T| = n. Let P be the number of covariates to be balanced. For p = 1, ..., P, covariate p partitions both treatment and control groups into k_p levels each. Let the partition of the treatment group under covariate p be L_{p,1}, L_{p,2}, ..., L_{p,k_p}, of sizes ℓ_{p,1}, ℓ_{p,2}, ..., ℓ_{p,k_p}. Similarly, let the partition of the control group under covariate p be L′_{p,1}, L′_{p,2}, ..., L′_{p,k_p}, of sizes ℓ′_{p,1}, ℓ′_{p,2}, ..., ℓ′_{p,k_p}. Let κ be an integer specifying the ratio of the number of matched control samples to the number of matched treatment samples. We define the κ-fine-balance constraints for a selection of treatment and a selection of control samples as follows:

∗ Department of IEOR, Etcheverry Hall, Berkeley, CA, supported in part by NSF award No. CMMI-1760102 ([email protected]).
† Faculty of Industrial Engineering and Management, The Technion, Haifa, Israel, supported in part by ISF - Israeli Science Foundation grant number 308/18 ([email protected]).
‡ Department of IEOR, Etcheverry Hall, Berkeley, CA, supported in part by NSF award No. CMMI-1760102 ([email protected]).
Definition (κ-fine-balance). For an integer κ, a selection S ⊆ T of the treatment group and a selection S′ of the control group, we say that (S, S′)-κ-fine-balance is satisfied if κ · |S ∩ L_{p,i}| = |S′ ∩ L′_{p,i}| for p = 1, ..., P and i = 1, ..., k_p.

Obviously, for S, S′ satisfying (S, S′)-κ-fine-balance, the cardinality of S′ is κ times as large as the cardinality of S, |S′| = κ|S|. We are now ready to define the three families of problems investigated here, with complexity that varies according to the number of covariates and the value of κ. The maximum κ-fine-balance selection (κ-FBS) problem is to select a subset S ⊆ T and a subset S′ of the control group so as to maximize the size of the selection S (equivalent to maximizing the size of S′ since |S′| = κ|S|) where the (S, S′)-κ-fine-balance constraints are satisfied. This problem is introduced here for the first time.

A second problem studied here is the κ-fine-balance matching (κ-BM) problem, first introduced by Rosenbaum et al. [21] for one covariate. Here we are given a distance, or cost, measure between each treatment and each control sample. The κ-BM problem is to minimize the total cost of the assignment of each treatment sample in T to κ control samples such that the selection of matched control samples S′ satisfies (T, S′)-κ-fine-balance.

Another problem family newly introduced here is an optimization where the feasible sets are optimal for another problem. Formally, in the first stage the goal is to find the optimal selections to the κ-FBS problem.
In the second stage, among all maximum sized selections, find the selection that minimizes the total distance of an assignment of each selected treatment sample to exactly κ selected control samples. We refer to this problem as the maximum selection κ-fine-balance matching problem (κ-MSBM). For the case of κ = 1, we omit the prefix κ, so (S, S′)-κ-fine-balance is called (S, S′)-fine-balance, the κ-FBS problem is called the FBS problem, the κ-BM problem is called the BM problem, and the κ-MSBM problem is called the MSBM problem. A summary of the problems investigated here is given in Table 1.

Table 1: Summary of problems studied here.
  max fine-balance selection (FBS): objective max |S|; constraints (S, S′)-fine-balance.
  max κ-fine-balance selection (κ-FBS): objective max |S|; constraints (S, S′)-κ-fine-balance.
  fine-balance matching (BM): objective min assignment cost; constraints (T, S′)-fine-balance.
  κ-fine-balance matching (κ-BM): objective min assignment cost; constraints (T, S′)-κ-fine-balance.
  max selection fine-balance matching (MSBM): objective min assignment cost; constraints (S, S′) optimal for FBS.
  max selection κ-fine-balance matching (κ-MSBM): objective min assignment cost; constraints (S, S′) optimal for κ-FBS.

The concept of fine balance was first introduced by Rosenbaum et al. [21], who studied the κ-BM problem for the 1-covariate problem and proposed a network flow algorithm. No polynomial running time algorithm has been known for the κ-BM problem with two or more covariates.

It is not always feasible to find a selection S′ of the control samples that satisfies the (T, S′)-κ-fine-balance constraints in the κ-BM problem. To that end, several papers considered the goal of minimizing the violation of this requirement, which we refer to as imbalance: [25], [26], [19], [2], [8]. The studies in all these papers require the entire treatment group to be selected or matched. Bennett et al.
[2], and Hochbaum and Rao [8], considered finding the selection of the control group that minimizes an imbalance objective, defined as Σ_{p=1}^{P} Σ_{i=1}^{k_p} | |S′ ∩ L′_{p,i}| − κ·ℓ_{p,i} |. This problem is called the minimum κ-imbalance problem. The problem is trivial to solve for the 1-covariate problem (see section 2 for details); the 2-covariate problem was proved to be polynomial time solvable using linear programming in [2] and using network flow algorithms in [8]; for three or more covariates, the problem is NP-hard [2, 8]. Yang et al. [25] and Pimentel et al. [19] considered a more complicated problem that minimizes the total assignment cost of the matched sets, each consisting of a single treatment sample and κ control samples, subject to the requirement that the selection of matched control samples is optimal for the minimum κ-imbalance problem. Yang et al. [25] proposed two network flow algorithms for the case of the 1-covariate problem; Pimentel et al. [19] proposed a network flow algorithm for the case in which the covariates form a nested sequence. Zubizarreta [26] considered a different variant which minimizes the total assignment cost of the matched sets with a penalty on the imbalance, and presented a mixed integer programming formulation for an arbitrary number of covariates.

We show here, for the first time, that the 2-covariate BM problem is in fact NP-hard, and therefore there is no polynomial time algorithm for the κ-BM problem with two or more covariates unless P = NP. This NP-hardness result for the κ-BM problem with two or more covariates holds for any value of κ.

The κ-MSBM problem is newly introduced here. It relaxes the requirement in the κ-BM problem of selecting all treatment samples and replaces it with a maximum size selection possible, while enforcing the κ-fine-balance constraints. This κ-MSBM problem, as shown here, is NP-hard with two or more covariates for any given value of κ.
Moreover, it is also proved here to be NP-hard for the 1-covariate problem when κ ≥ 3. We present a polynomial algorithm for the 1-covariate MSBM problem, but we leave the complexity status of the 1-covariate 2-MSBM problem open.

The κ-FBS problem is a simpler problem compared to the κ-MSBM problem. This problem is studied here for the first time. We prove that for three or more covariates, the FBS and κ-FBS problems are NP-hard for any value of κ. For the case of the 2-covariate problem we present here an efficient algorithm for the FBS problem. The algorithm is based on an integer programming formulation of the problem in which the constraint matrix, for two covariates, has the structure of network flow constraints. For the resulting minimum cost network flow problem we apply an algorithm with running time O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))). We also prove that for κ ≥
3, the 2-covariate κ-FBS problem is NP-hard. For the remaining case, in which κ = 2 and the number of covariates is two, the complexity status of the 2-FBS problem is left open.

We observe here that, for any number of covariates, if the selections of treatment and control samples are fixed, then the optimal assignment among the selected samples, and therefore the optimal solution to the κ-MSBM problem, is attained by solving a minimum cost network flow problem. See section 2 for details. A summary of the complexity results for the three problem families is given in Table 2.

Table 2: Summary of complexity and algorithmic results derived here. (Here n is the size of the treatment group and n′ is the size of the control group.)
  FBS. One covariate: O(n + n′); two covariates: O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))); three or more covariates: NP-hard.
  κ-FBS (κ ≥ 2). One covariate: O(n + n′); two covariates: NP-hard for κ ≥ 3, open for κ = 2; three or more covariates: NP-hard.
  κ-BM (any κ). One covariate: polynomial, via network flow [21]; two covariates: NP-hard; three or more covariates: NP-hard.
  MSBM. One covariate: O((n + n′)nn′); two covariates: NP-hard; three or more covariates: NP-hard.
  κ-MSBM (κ ≥ 2). One covariate: NP-hard for κ ≥ 3, open for κ = 2; two covariates: NP-hard; three or more covariates: NP-hard.

Beyond these complexity results, we also address fixed-parameter tractable (FPT) results. We prove that the κ-FBS, κ-BM and MSBM problems are solvable in fixed-parameter tractable time for constant numbers of covariate levels, yet the κ-MSBM problem is NP-hard for constant κ ≥ 3. We also study special cases of the κ-BM problem where one of the covariates has a constant number of levels (whereas the other one may have a linear number of levels), and we show that the complexity status of these special cases is tied to the complexity status of the exact matching problem, a problem that is known to have a randomized polynomial time algorithm [16] but for which the existence of a deterministic polynomial time algorithm is a long-standing open problem.
In section 2 we consider the case of the 1-covariate κ-FBS, κ-BM, and MSBM problems, and provide a compact representation of the sample selections. We then present our complexity and algorithmic results for the other cases of the three families of problems separately, i.e., the κ-FBS problem in section 3, the κ-BM problem in section 4, and the κ-MSBM problem in section 5. The fixed-parameter complexity results are provided in section 6.
2. Preliminaries.
Consider first the case of a single covariate, P = 1, that partitions the control and treatment groups into, say, k levels each. Let the sizes of the levels of the treatment group be ℓ_1, ..., ℓ_k, and the sizes of the levels of the control group be ℓ′_1, ..., ℓ′_k. It is easy to see that there exists a selection S′ of control samples that satisfies (T, S′)-κ-fine-balance if and only if ℓ′_i ≥ κℓ_i for i = 1, ..., k. If this condition is satisfied then any subset S∗ of the control group with κ·ℓ_i samples in level i, i = 1, ..., k, satisfies (T, S∗)-κ-fine-balance, and as such is a feasible selection for the κ-BM problem. With these known numbers of control samples to be selected in each level, the optimal solution to the 1-covariate κ-BM problem is found using a minimum cost network flow formulation, as shown next. Note that a standard linear programming formulation of the minimum cost network flow (MCNF) problem is given in Appendix A.

The MCNF problem whose solution is an optimal solution to κ-BM is constructed on a bipartite graph with the treatment samples each represented by a node on one side, and the control samples each represented by a node on the other side. The cost on each arc between a treatment sample and a control sample is the "distance" value between the two, and the arc capacity is 1. Each treatment sample has a supply of κ. To account for the requirement that in each level i of control samples there will be κ·ℓ_i samples matched, we add to the bipartite graph a third layer of k nodes, one for each level. The i-th node in the third layer has a demand of κ·ℓ_i, and there are arcs to this demand node from all control samples in level i with capacity 1 and cost of 0.
In an optimal solution to this MCNF problem the control sample nodes through which there is a positive flow (of one unit) are the ones selected and matched to the respective treatment sample nodes from which they have a positive flow.

If ℓ′_i < κℓ_i for some i then there is no selection S′ of control samples that satisfies (T, S′)-κ-fine-balance. Addressing this context, as mentioned earlier, [25], [26], [19], [2], [8] consider the problem of minimizing the κ-imbalance, which is the sum of violations over all levels, Σ_{i=1}^{k} | |S′ ∩ L′_i| − κ·ℓ_i |. The solution to this 1-covariate minimum κ-imbalance problem is straightforward: in step 1, select min{κ·ℓ_i, ℓ′_i} control samples in level i; if the number of control samples selected in step 1 is less than κ·n, then we select arbitrary additional control samples so that the selection is of size κ·n. Another way to address this context is to seek a solution for the (S, S′)-κ-fine-balance where, rather than forcing all samples of T to be included, we find a solution in which the size of the selection S, and equivalently |S′|, is maximized; this is the κ-FBS problem. The solution to the 1-covariate κ-FBS problem is also straightforward: select ℓ̄_i = min{ℓ_i, ⌊ℓ′_i/κ⌋} treatment samples of level i and κ·ℓ̄_i control samples of level i.

Again, for the 1-covariate problem of finding an optimal matching, or assignment, among all optimal selections for either the minimum κ-imbalance or the FBS problem, we solve an MCNF problem for the known number of samples to select from each level, similar to the one defined above with the following modifications. For the minimum κ-imbalance, we first need to change the demand of the demand nodes in the above MCNF problem from κ·ℓ_i to min{κ·ℓ_i, ℓ′_i} for each level i. We also add a dummy demand node in the third layer with demand κ·n − Σ_{i=1}^{k} min{κ·ℓ_i, ℓ′_i}, which connects to all control nodes, each arc with capacity 1 and cost of 0.
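The two-step rule just described for the 1-covariate minimum κ-imbalance selection can be sketched as follows (a minimal sketch; the function name and list-based inputs are illustrative, and the deficit count stands in for the arbitrary padding samples):

```python
def min_kappa_imbalance_counts(ell, ell_prime, kappa):
    """Per-level control-selection counts for the 1-covariate minimum kappa-imbalance problem.

    ell[i]       -- size of treatment level i
    ell_prime[i] -- size of control level i
    Step 1 selects min(kappa*ell[i], ell_prime[i]) control samples in level i;
    the remaining deficit is filled with arbitrary additional control samples
    so that the selection reaches size kappa * n, where n = sum(ell).
    Returns (counts, deficit).
    """
    counts = [min(kappa * l, lp) for l, lp in zip(ell, ell_prime)]
    deficit = kappa * sum(ell) - sum(counts)
    return counts, deficit
```

For example, with treatment level sizes [2, 1], control level sizes [1, 5] and κ = 2, step 1 selects 1 and 2 control samples in the two levels, leaving a deficit of 3 samples to be padded arbitrarily.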
For the optimal selections of FBS, in addition to changing the demand from ℓ_i to ℓ̄_i for each level i, we also remove the supply on each treatment sample, add for every level i a supply node with supply ℓ̄_i, and add arcs from this supply node to all treatment samples in level i with capacity 1 and cost of 0. The best assignment found with a selection that is optimal for the FBS problem is an optimal solution for the MSBM problem. However, this method does not apply to the κ-MSBM problem with κ ≥ 2. We further show that even the 1-covariate κ-MSBM problem is NP-hard for κ ≥ 3, so not all variants of the κ-MSBM problem are polynomial time solvable for the 1-covariate case. In section 5 we show that the 1-covariate κ-MSBM does not admit a polynomial time algorithm for κ ≥ 3 unless P = NP.

For the κ-FBS problem, we observe that the selections from the treatment and control groups can be represented compactly in terms of level-intersections. For P covariates, the intersections of the level sets L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}, i_p = 1, ..., k_p, p = 1, ..., P, form a partition of the treatment group. Similarly, the intersections of the level sets L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}, i_p = 1, ..., k_p, p = 1, ..., P, form a partition of the control group. Therefore, instead of specifying which sample belongs to the selection, it is sufficient to determine the number of selected samples in each level intersection for the two groups, since the identity of the specific selected samples has no effect on the fine balance requirement. With this discussion we have a theorem on the representation of the solution to the κ-FBS problems in terms of the level-intersection sizes.

Theorem 2.1.
The level-intersection sizes s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} are an optimal solution to the κ-FBS problem if there exists an optimal selection S of treatment samples and S′ of control samples such that s_{i_1,i_2,...,i_P} = |S ∩ L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}| and s′_{i_1,i_2,...,i_P} = |S′ ∩ L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}|, for i_p = 1, ..., k_p, p = 1, ..., P.

We will say that the optimal selection for the covariates problems here is unique if for any optimal selection S and S′, the numbers s_{i_1,...,i_P} = |S ∩ L_{1,i_1} ∩ ... ∩ L_{P,i_P}| and s′_{i_1,...,i_P} = |S′ ∩ L′_{1,i_1} ∩ ... ∩ L′_{P,i_P}| are unique. In order to derive an optimal selection given the optimal level-intersection sizes, one selects any s_{i_1,...,i_P} treatment samples from the intersection L_{1,i_1} ∩ ... ∩ L_{P,i_P} and any s′_{i_1,...,i_P} control samples from the intersection L′_{1,i_1} ∩ ... ∩ L′_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P.

We observe here that, for any number of covariates, if the optimal selection of treatment and control samples in terms of level-intersections is known and unique, then the optimal assignment among the selected samples, and therefore the optimal solution to the κ-MSBM problem, can also be attained by solving an MCNF problem as follows. For each non-zero level intersection of treatment samples there is a source node with supply s_{i_1,...,i_P}. This source node is connected to all treatment samples in the intersection L_{1,i_1} ∩ ... ∩ L_{P,i_P} with arcs of capacity 1 and cost of 0. For each non-zero level intersection of control samples there is a demand node with demand s′_{i_1,...,i_P}. This demand node is connected from all control samples in the intersection L′_{1,i_1} ∩ ... ∩ L′_{P,i_P} with arcs of capacity 1 and cost of 0.
The treatment and control sample nodes through which there is a positive flow (of one unit) are the ones selected, and a positive flow between a treatment node and a control node indicates that the two samples are matched. This is a minimum cost network flow problem with a total demand (or supply) bounded by min{n, n′}, O(nn′) arcs, and O(n + n′) nodes. Therefore the successive shortest paths algorithm, discussed below in section 3, solves this problem in O((n + n′)nn′) steps.
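To close the preliminaries, the straightforward 1-covariate κ-FBS selection rule described above (select ℓ̄_i = min{ℓ_i, ⌊ℓ′_i/κ⌋} treatment samples and κ·ℓ̄_i control samples per level) can be sketched as follows; the function name and list-based inputs are illustrative assumptions:

```python
def one_covariate_kappa_fbs(ell, ell_prime, kappa):
    """Optimal per-level selection sizes for the 1-covariate kappa-FBS problem.

    ell[i]       -- size of treatment level i
    ell_prime[i] -- size of control level i
    Returns (treat_counts, control_counts, objective), where objective = |S|.
    """
    treat_counts = [min(l, lp // kappa) for l, lp in zip(ell, ell_prime)]  # bar-ell_i
    control_counts = [kappa * t for t in treat_counts]                     # kappa * bar-ell_i
    return treat_counts, control_counts, sum(treat_counts)
```

The returned objective equals Σ_i min{ℓ_i, ⌊ℓ′_i/κ⌋}, the closed-form value stated for this problem in section 3.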
3. The maximum κ-fine-balance selection (κ-FBS) problem. In this section we show the complexity and algorithmic results for the κ-FBS problems. We present the results separately: first for the 1-covariate problem, next for three or more covariates, then the 2-covariate FBS problem, and finally the 2-covariate κ-FBS problem for κ ≥ 3.

The solution to the 1-covariate κ-FBS problem is straightforward, as discussed in section 2: select ℓ̄_i = min{ℓ_{1,i}, ⌊ℓ′_{1,i}/κ⌋} level i treatment samples and κ·ℓ̄_i level i control samples. The union of the selections at each level is an optimal solution for the 1-covariate κ-FBS problem. The value of the objective function corresponding to this solution is Σ_{i=1}^{k_1} ℓ̄_i = Σ_{i=1}^{k_1} min{ℓ_{1,i}, ⌊ℓ′_{1,i}/κ⌋}.

The κ-FBS problem for any constant κ with P ≥ 3. We show here that even for κ being constant, the κ-FBS problem with three or more covariates is NP-hard, by a reduction from the 3-dimensional matching problem, one of Karp's 21 NP-hard problems [13][6].

3-dimensional matching: Given a finite set X and a set of triplets U ⊂ X × X × X, is there a subset M ⊆ U such that |M| = |X| and no two elements of M agree in any coordinate?

Theorem
The κ-FBS problem is NP-hard when P = 3, even for constant κ.
Proof.
Given an instance of the 3-dimensional matching problem with a finite set X and a set of triplets U ⊂ X × X × X, we construct an instance of the κ-FBS problem with P = 3 for any constant κ. Without loss of generality, we assume X = {1, ..., |X|}.

First we define the levels of the three covariates. For p = 1, 2, 3, the set of levels of covariate p is {1, ..., |X|} ∪ {0, 0′}. So all three covariates have |X| + 2 levels.

Next, we construct the samples. For each sample, we represent it by the ordered triplet (a, b, c) where a is the level of the first covariate, b is the level of the second covariate, and c is the level of the third covariate. The treatment group contains a sample (i, i, i) for i = 1, ..., |X|, as well as |X| copies of (0, 0, 0) and |X| copies of (0′, 0′, 0′). For each triplet u ∈ U, whose elements are denoted by [u_1, u_2, u_3], we create one control sample (u_1, u_2, u_3). In addition, for each element i ∈ X we have (κ − 1) copies of each of the three control samples (i, 0′, 0), (0′, 0, i), (0, i, 0′). We also create |X| copies of (0, 0, 0) and |X| copies of (0′, 0′, 0′) for the control group. That is, the control group is the union of the following three sets (we represent the different copies by a superscript as shown below):

C_1 = {(u_1, u_2, u_3) : ∀u = [u_1, u_2, u_3] ∈ U},
C_2 = {(i, 0′, 0)^(w), (0′, 0, i)^(w), (0, i, 0′)^(w) : ∀i = 1, ..., |X|, ∀w = 1, ..., κ − 1},
C_3 = {(0, 0, 0)^(w), (0′, 0′, 0′)^(w) : ∀w = 1, ..., |X|}.

The treatment group constructed is of size 3|X| and the control group constructed is of size |U| + (3κ − 1)|X|. The two sizes are both polynomially bounded in the size of the 3-dimensional matching instance, so the reduction can be computed in polynomial time.

Finally, we claim that the optimal value of the constructed 3-covariate κ-FBS problem is 3|X| if and only if there exists a subset M ⊆ U such that |M| = |X| and no two elements of M agree in any coordinate for the 3-dimensional matching instance.

Let M ⊆ U be a solution for the 3-dimensional matching instance; we derive a solution for the constructed problem as follows. We select all the treatment samples. For each triplet u ∈ M, we choose the control sample in C_1 whose covariate levels correspond to the elements of u. Additionally, we choose all control samples in C_2 and C_3. To check the feasibility of this solution, first consider the appearances of level i, for each i = 1, ..., |X|, under each covariate p = 1, 2, 3: level i appears once in the treatment group for each covariate p; it appears once among the selected samples in C_1, as i appears once in each coordinate in M; it appears κ − 1 times in C_2; it does not appear in C_3. That is, there is exactly one selected treatment sample of level i under covariate p and κ selected control samples of level i under covariate p, for each i and p. Next, consider the appearances of levels 0 and 0′ under each covariate p = 1, 2, 3: they each appear |X| times in the treatment group; they do not appear in C_1; they each appear (κ − 1)|X| times in C_2 and |X| times in C_3. That is, 0 and 0′ each appear κ|X| times in the selected control samples under each covariate. Therefore, this selection is feasible and the objective value, the number of selected treatment samples, is 3|X|.

On the other hand, if the constructed 3-covariate κ-FBS problem has an optimal solution S, S′ of objective value 3|X|, we say that u = [u_1, u_2, u_3] ∈ U is selected for M if the control sample (u_1, u_2, u_3) is selected in S′ for the constructed problem. We will show that M is a feasible solution of the 3-dimensional matching instance. Since the size of the treatment group is 3|X|, all treatment samples must be selected in the optimal solution, and 3κ|X| control samples must be selected. For each covariate p = 1, 2, 3, levels 0 and 0′ each appear |X| times in the treatment group, so each of these two levels must appear κ|X| times in the selection S′ of the control samples. So all samples in C_2 and C_3 must be selected, as otherwise there are not enough level 0 or level 0′ samples in S′. Therefore, |M| = 3κ|X| − |C_2| − |C_3| = |X|. Furthermore, for each covariate p and for i = 1, ..., |X|, level i appears exactly once in the treatment group, so there are κ selected control samples in level i. For each i under each covariate p, since there are (κ − 1) samples of level i in C_2 ∪ C_3, only one sample in C_1 in that same level is selected. So there is no overlap in any coordinate between any two triplets in M.

With the above arguments, any 3-dimensional matching problem can be reduced to a 3-covariate κ-FBS problem for any constant integer κ, and hence, the κ-FBS problem is NP-hard for any such κ when P = 3.

Corollary
The κ-FBS problem is NP-hard for any integer P ≥ 3, even for constant κ.

Proof. For any constant integer κ, any 3-covariate κ-FBS problem, and any P > 3, we can construct an equivalent P-covariate κ-FBS problem as follows: for each sample of the given 3-covariate κ-FBS problem, we create a sample for the constructed κ-FBS problem such that they have the same level value for covariates p = 1, 2, 3. For p = 4, ..., P, we set covariate p to have only one level, so all samples in the constructed κ-FBS problem have the same value. Therefore, the NP-hardness of the 3-covariate κ-FBS problem implies that the P-covariate κ-FBS problem is NP-hard for every value of P when P ≥ 3. Since the κ-FBS problem is NP-hard for P ≥
3, there is no polynomial time algorithm unless P = NP. In the following subsections, we discuss the remaining case of the 2-covariate problems.

The 2-covariate FBS problem, P = 2. In this subsection, we present an integer programming formulation with network flow constraints for the 2-covariate FBS problem. We then show how to solve the problem efficiently with a network flow algorithm.

It was noted, in Theorem 2.1, that there is no differentiation between the individual samples selected in each level intersection; only the number of those selected counts. We thus define the decision variables as follows:

x_{i_1,i_2}: the number of treatment samples selected from the (i_1, i_2) level intersection L_{1,i_1} ∩ L_{2,i_2}, for i_1 = 1, ..., k_1 and i_2 = 1, ..., k_2;
x′_{i_1,i_2}: the number of control samples selected from the (i_1, i_2) level intersection L′_{1,i_1} ∩ L′_{2,i_2}, for i_1 = 1, ..., k_1 and i_2 = 1, ..., k_2.

Let u_{i_1,i_2} = |L_{1,i_1} ∩ L_{2,i_2}| and u′_{i_1,i_2} = |L′_{1,i_1} ∩ L′_{2,i_2}| for i_1 = 1, ..., k_1, i_2 = 1, ..., k_2. Clearly, x_{i_1,i_2} must be an integer between 0 and u_{i_1,i_2}, and x′_{i_1,i_2} must be an integer between 0 and u′_{i_1,i_2}. With these decision variables, the following is an integer programming formulation for the 2-covariate FBS problem:

(IP-FBS)  max Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} x_{i_1,i_2}   (3.1a)
s.t.  Σ_{i_2=1}^{k_2} x_{i_1,i_2} − Σ_{i_2=1}^{k_2} x′_{i_1,i_2} = 0,  i_1 = 1, ..., k_1   (3.1b)
      Σ_{i_1=1}^{k_1} x_{i_1,i_2} − Σ_{i_1=1}^{k_1} x′_{i_1,i_2} = 0,  i_2 = 1, ..., k_2   (3.1c)
      0 ≤ x_{i_1,i_2} ≤ u_{i_1,i_2},  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2   (3.1d)
      0 ≤ x′_{i_1,i_2} ≤ u′_{i_1,i_2},  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2   (3.1e)
      x_{i_1,i_2}, x′_{i_1,i_2} integers,  i_1 = 1, ..., k_1, i_2 = 1, ..., k_2.   (3.1f)

The objective (3.1a) is the total number of selected treatment samples. Constraints (3.1b) are the fine balance requirement under covariate 1, as Σ_{i_2=1}^{k_2} x_{i_1,i_2} equals the number of selected treatment samples in level i_1 under covariate 1 and Σ_{i_2=1}^{k_2} x′_{i_1,i_2} equals the number of selected control samples in the same level.
Similarly, constraints (3.1c) are the fine balance requirement under covariate 2.

Formulation (IP-FBS) is in fact also a network flow formulation. In a minimum cost network flow formulation, each column of the constraint matrix, corresponding to a variable that is a flow along an arc, has exactly one 1 and one −1. The corresponding MCNF network is shown in Figure 1, where all capacity lower bounds are 0, and each arc has a cost per unit flow and an upper bound associated with it. The flow on the arc from node (1, i_1) to node (2, i_2) represents variable x_{i_1,i_2}, which is bounded between 0 and u_{i_1,i_2} as stated in constraints (3.1d); the arc from node (2, i_2) to node (1, i_1) represents variable x′_{i_1,i_2}, which is bounded between 0 and u′_{i_1,i_2} as stated in constraints (3.1e). To get a "minimize" type objective, we take the negative value of |S| = Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} x_{i_1,i_2} as the objective, so the per unit arc cost is −1 on the arcs from any node in {(1,1), (1,2), ..., (1,k_1)} to any node in {(2,1), (2,2), ..., (2,k_2)}. All other arcs have cost 0. It is easy to verify that constraints (3.1b) correspond to the flow balance at nodes (1, i_1) for all i_1, and constraints (3.1c) correspond to the flow balance at nodes (2, i_2) for all i_2.

Fig. 1: Min-cost network flow graph corresponding to formulation (IP-FBS). Arc legend: (cost, upper bound).
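Solving (IP-FBS) requires nothing beyond a standard MCNF code. The following is a minimal self-contained sketch in the spirit of the successive-shortest-paths approach analyzed below, run on the arc-reversed network; Bellman-Ford is used for the shortest paths for simplicity, so the Dijkstra-based running time stated below is not matched, and all names and the matrix-based input layout are illustrative assumptions:

```python
def max_fbs_selection_size(u, u_prime):
    """Maximum |S| for the 2-covariate FBS problem via min-cost flow.

    u[i1][i2]       -- |L_{1,i1} ∩ L_{2,i2}|   (treatment intersection sizes)
    u_prime[i1][i2] -- |L'_{1,i1} ∩ L'_{2,i2}| (control intersection sizes)

    Works on the arc-reversed network: covariate-2 level nodes supply l_{2,i2},
    covariate-1 level nodes demand l_{1,i1}; each unit on a reversed treatment
    arc (cost 1) is an unselected treatment sample, so min cost = n - max |S|.
    """
    k1, k2 = len(u), len(u[0])
    n = sum(map(sum, u))
    S, T = k1 + k2, k1 + k2 + 1             # super source / super sink
    graph = [[] for _ in range(k1 + k2 + 2)]

    def add_edge(a, b, cap, cost):          # arc plus residual reverse arc
        graph[a].append([b, cap, cost, len(graph[b])])
        graph[b].append([a, 0, -cost, len(graph[a]) - 1])

    for i2 in range(k2):                    # supplies l_{2,i2}
        add_edge(S, i2, sum(u[i1][i2] for i1 in range(k1)), 0)
    for i1 in range(k1):
        add_edge(k2 + i1, T, sum(u[i1]), 0) # demands l_{1,i1}
        for i2 in range(k2):
            if u[i1][i2]:
                add_edge(i2, k2 + i1, u[i1][i2], 1)        # reversed treatment arc
            if u_prime[i1][i2]:
                add_edge(i2, k2 + i1, u_prime[i1][i2], 0)  # control arc
    total_cost = 0
    while True:                             # successive shortest paths
        INF = float("inf")
        dist = [INF] * len(graph)
        dist[S] = 0
        prev = [None] * len(graph)
        for _ in range(len(graph)):         # Bellman-Ford on the residual graph
            updated = False
            for a in range(len(graph)):
                if dist[a] == INF:
                    continue
                for j, (b, cap, cost, _) in enumerate(graph[a]):
                    if cap > 0 and dist[a] + cost < dist[b]:
                        dist[b] = dist[a] + cost
                        prev[b] = (a, j)
                        updated = True
            if not updated:
                break
        if dist[T] == INF:                  # all routable supply has been sent
            break
        delta, b = INF, T                   # bottleneck along the shortest path
        while b != S:
            a, j = prev[b]
            delta = min(delta, graph[a][j][1])
            b = a
        b = T
        while b != S:                       # augment and update residual arcs
            a, j = prev[b]
            graph[a][j][1] -= delta
            graph[b][graph[a][j][3]][1] += delta
            b = a
        total_cost += delta * dist[T]
    return n - total_cost
```

On the instance u = [[2]], u′ = [[3]], all routing uses the cost-0 control arc, the minimum cost is 0, and the routine returns |S| = 2.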
Theorem
The 2-covariate FBS problem is solved as a minimum cost network flow problem in O(n·(min{n+n′, k_1 k_2} + (k_1+k_2) log(k_1+k_2))) time.

Proof. To solve the minimum cost network flow problem of the 2-covariate FBS problem, we choose the algorithm of successive shortest paths, which is particularly efficient for an MCNF with "small" total arc capacity (see [1], Section 9.7). The successive shortest paths algorithm starts with a network graph with no negative cycles, so we first modify the network shown in Figure 1 using the well-known arc reversal transformation of [1], Section 2.4. The resulting network graph is shown in Figure 3.

The successive shortest path algorithm iteratively selects a node s with excess supply (supply not yet sent to some demand node) and a node t with unfulfilled demand, and sends flow from s to t along a shortest path in the residual network [11], [10], [4]. The algorithm terminates when the flow satisfies all the flow balance constraints. Since at each iteration the number of remaining units of supply to be sent is reduced by at least one unit, the number of iterations is bounded by the total amount of supply. For the network in Figure 3 the total supply is n.

Fig. 3: Min-cost network flow graph after arc reversal. Arc legend: (cost, upper bound); node legend: (supply).

At each iteration, the shortest path can be solved with Dijkstra's algorithm of complexity O(|A| + |V| log |V|), where |V| is the number of nodes and |A| is the number of arcs [24], [5]. In our formulation, |V| is O(k_1 + k_2), which is at most O(n). Since the number of nonempty sets L_{1,i_1} ∩ L_{2,i_2} is at most min{n, k_1 k_2}, the number of unit-cost arcs is O(min{n, k_1 k_2}).
Since the number of nonempty sets L′_{1,i_1} ∩ L′_{2,i_2} is at most min{n′, k_1 k_2}, the number of zero-cost arcs is O(min{n′, k_1 k_2}). So the total number of arcs |A| is O(min{n + n′, k_1 k_2}). Hence, the total running time of applying the successive shortest path algorithm on our formulation is O(n · (min{n + n′, k_1 k_2} + (k_1 + k_2) log(k_1 + k_2))).

In contrast to the 2-covariate FBS problem, which is polynomial time solvable, we show next that the 2-covariate κ-FBS problem is NP-hard when κ ≥ 3.

The 2-covariate κ-FBS problem with κ ≥ 3. We prove here that the 2-covariate κ-FBS problem is NP-hard for all constant values of κ such that κ ≥ 3. The proof is a reduction from the Exact 3-cover problem, which is NP-hard [13], [6].
Exact 3-cover: Given a collection C of 3-element subsets (triplets) of a ground set E with |E| = 3q for some integer q, is there a subcollection C′ ⊆ C where each element e ∈ E appears in exactly one triplet of C′?

Theorem
The 2-covariate κ-FBS problem is NP-hard for any constant κ ≥ 3.

Proof. Given an instance of the Exact 3-cover problem with ground set E of size 3q and a collection of triplets C, we construct an instance of the 2-covariate κ-FBS problem for any constant integer κ ≥ 3. The levels of covariate 1 are: a level T for each triplet T ∈ C, and a level e′ for each element e ∈ E. So there are |C| + |E| levels of covariate 1. For covariate 2, we have a level e′′ for each element e ∈ E, and one additional level denoted as X. So there are |E| + 1 levels of covariate 2.

Next, we construct the samples. We represent each sample by the ordered pair (a, b) where a is the level of the first covariate and b is the level of the second covariate. The treatment group contains a sample (T, X) for each triplet T ∈ C and a sample (e′, e′′) for each element e ∈ E. Moreover, for each triplet T ∈ C, whose elements are denoted by e_{t1}, e_{t2}, e_{t3}, we create three control samples (T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}), as well as (κ − 3) copies of the control sample (T, X). In addition, for each element e ∈ E we have one control sample (e′, X) and (κ − 1) copies of the control sample (e′, e′′). The treatment group constructed is of size |C| + |E| and the control group constructed is of size κ|C| + κ|E|. The two sizes are both polynomially bounded in the size of the Exact 3-cover instance, so the reduction can be computed in polynomial time.

Finally, we claim that the constructed 2-covariate κ-FBS problem has a feasible solution with objective value of at least 4q if and only if the Exact 3-cover instance has a subcollection C′ ⊆ C such that each element e ∈ E appears in exactly one triplet of C′.

Let C′ be a subcollection such that each element e ∈ E appears in exactly one triplet of C′. We derive a solution for the constructed problem as follows. In the treatment group, we choose (T, X) for all T ∈ C′, and (e′, e′′) for all e ∈ E. In the control group, for each T ∈ C′, whose elements are denoted by e_{t1}, e_{t2}, e_{t3}, we choose (T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) and the (κ − 3) copies of (T, X). Additionally, we also choose from the control group the sample (e′, X) and the (κ − 1) copies of (e′, e′′) for each e ∈ E. That is, the selection of treatment samples is

S = {(T, X) : ∀T ∈ C′} ∪ {(e′, e′′) : ∀e ∈ E}

and the selection of control samples is

S′ = {(T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) : ∀T = (e_{t1}, e_{t2}, e_{t3}) ∈ C′} ∪ {(T, X)^{(w)} : ∀T ∈ C′, w = 1, ..., κ − 3} ∪ {(e′, X) : ∀e ∈ E} ∪ {(e′, e′′)^{(w)} : ∀e ∈ E, w = 1, ..., κ − 1}.

To check the feasibility of this solution, first consider the levels of the first covariate. For each T ∈ C, if T ∈ C′ then there is exactly one treatment sample with level T in S and we choose exactly κ such control samples for S′; and if T ∉ C′, then we have not chosen any sample of this level for either the treatment or the control group. Furthermore, for each e ∈ E, we choose exactly one sample from the treatment group with covariate 1 level e′ and exactly κ control samples with covariate 1 level e′. Next, consider the levels of the second covariate. We choose |C′| level X treatment samples and (κ − 3)|C′| + |E| level X control samples. Note that the size of the subcollection C′ must be q, as each element in E appears in exactly one triplet of C′, so (κ − 3)|C′| + |E| = κq = κ|C′|. For every ē ∈ E, we choose exactly one treatment sample with level ē′′. There are κ − 1 control samples of level ē′′ under covariate 2 in the set {(e′, e′′)^{(w)} : ∀e ∈ E, w = 1, ..., κ − 1}. Since each element ē appears in exactly one triplet of C′, ē′′ appears exactly once in the set {(T, e′′_{t1}), (T, e′′_{t2}), (T, e′′_{t3}) : ∀T = (e_{t1}, e_{t2}, e_{t3}) ∈ C′}. So there are κ control samples of level ē′′ under covariate 2 in the selection S′.
Therefore, this solution is feasible and the objective value of this solution is |S| = |C′| + |E| = 4q.

On the other hand, if the constructed 2-covariate κ-FBS problem has a feasible solution S, S′ of objective value at least 4q, we say that T ∈ C is selected for the subcollection C′ if the treatment sample (T, X) is selected in S for the constructed problem. We will show that the subcollection of selected triplets is a feasible solution of the Exact 3-cover problem. Since there are only 3q samples in the treatment group that do not correspond to a triplet, the size of C′ must be at least q. So the subcollection is feasible if we can establish that every pair of selected triplets is disjoint. Assume by contradiction that there exist two selected subsets T, T′ ∈ C′ that have a common element ē. Due to (S, S′)-κ-fine-balance, we know the number of control samples in each level under any covariate must be an integer multiple of κ. Since there is only one sample of level ē′′ in the treatment group, in the selection S′ there can be either 0 or κ samples of level ē′′ under covariate 2. Since both (T, X) and (T′, X) are in the selection S, the control samples (T, ē′′) and (T′, ē′′) must be chosen in S′, as otherwise the number of selected samples in covariate 1 levels T and T′ would not satisfy (S, S′)-κ-fine-balance. Thus, the number of samples in S′ with the second covariate being ē′′ is at least 2. So the number |S′ ∩ L′_{2,ē′′}| must be κ, the number |S ∩ L_{2,ē′′}| must be 1, and the number of copies of the sample (ē′, ē′′) in the selection S′ must be less than or equal to κ − 2. This implies that the number of control samples of level ē′ under covariate 1 cannot be more than κ − 1, which means it must be zero. So we can derive that the treatment sample (ē′, ē′′) is not selected, which means the number of treatment samples in level ē′′ under covariate 2 is 0, contradicting |S ∩ L_{2,ē′′}| = 1.
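The instance construction in this proof is mechanical; below is a sketch of the builder with ad hoc level encodings (the tags 'T', 'p' for e′, 'pp' for e′′ and 'X' are illustrative, not from the paper).

```python
from collections import Counter

def build_kfbs_instance(E, C, kappa):
    """Construct the 2-covariate kappa-FBS instance of the reduction.

    Samples are pairs (covariate-1 level, covariate-2 level); the level
    names ('T', t), ('p', e) for e', ('pp', e) for e'' and 'X' are ad hoc.
    """
    assert kappa >= 3
    treatment = [(("T", T), "X") for T in C] + [(("p", e), ("pp", e)) for e in E]
    control = []
    for T in C:
        control += [(("T", T), ("pp", e)) for e in T]      # (T, e''_t), three per triplet
        control += [(("T", T), "X")] * (kappa - 3)         # kappa-3 copies of (T, X)
    for e in E:
        control += [(("p", e), "X")]                       # one copy of (e', X)
        control += [(("p", e), ("pp", e))] * (kappa - 1)   # kappa-1 copies of (e', e'')
    return treatment, control

E = [1, 2, 3, 4, 5, 6]
C = [(1, 2, 3), (4, 5, 6), (2, 3, 4)]
treat, ctrl = build_kfbs_instance(E, C, kappa=4)
assert len(treat) == len(C) + len(E) and len(ctrl) == 4 * (len(C) + len(E))
# every covariate-1 level receives exactly kappa control samples
assert all(v == 4 for v in Counter(l1 for (l1, _) in ctrl).values())
```

The assertions check the sizes |C| + |E| and κ(|C| + |E|) stated in the proof, and that each covariate 1 level carries exactly κ control samples.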
4. The κ-balanced-matching (κ-BM) problem with P ≥ 2. The 1-covariate κ-BM problem is solvable in polynomial time [21]. However, we show here for the first time that even for any constant κ the κ-BM problem is NP-hard for two or more covariates, P ≥ 2. For the 2-covariate BM problem and the 2-covariate κ-BM problem, the complexity status when the numbers of levels of both covariates are constants is discussed in Section 6 together with the other two families of problems. Here we consider an intermediate case where only one of the covariates has a constant number of levels. We will show that the 2-covariate BM problem and the 2-covariate κ-BM problem when the second covariate has a constant number of levels can be solved efficiently if and only if the exact matching problem on bipartite graphs can be solved efficiently.

4.1. The P-covariate BM problem and the P-covariate κ-BM problem for P ≥ 2. We first present the integer programming formulation for the 2-covariate κ-BM problem. Denote the distance between the ith treatment sample and the jth control sample by δ_{ij} for i = 1, ..., n and j = 1, ..., n′. The decision variable x_{ij} is a binary variable which equals 1 if and only if the ith treatment sample is matched to the jth control sample. The integer programming formulation is given below.

min Σ_{i=1}^{n} Σ_{j=1}^{n′} δ_{ij} x_{ij}    (4.1a)
s.t. Σ_{i=1}^{n} Σ_{j∈L′_{1,q}} x_{ij} = κ ℓ_{1,q},  q = 1, ..., k_1    (4.1b)
Σ_{i=1}^{n} Σ_{j∈L′_{2,q}} x_{ij} = κ ℓ_{2,q},  q = 1, ..., k_2    (4.1c)
Σ_{j=1}^{n′} x_{ij} = κ,  i = 1, ..., n    (4.1d)
Σ_{i=1}^{n} x_{ij} ≤ 1,  j = 1, ..., n′    (4.1e)
x_{ij} ∈ {0, 1},  i = 1, ..., n, j = 1, ..., n′.    (4.1f)

The objective (4.1a) is the total distance of all matched pairs. Constraints (4.1b) ensure that the number of matched level q control samples is κ times as large as the number of matched level q treatment samples under covariate 1, for each level q. Constraints (4.1c) ensure the κ-fine-balance requirement under covariate 2.
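The constraint logic of formulation (4.1) can be checked mechanically; the sketch below (an illustration, not from the paper) verifies a 0/1 assignment against constraints (4.1b)-(4.1e) without attempting to optimize the objective (4.1a).

```python
def check_kbm_feasible(x, treat_levels, ctrl_levels, kappa):
    """Check a 0/1 matrix x[i][j] against constraints (4.1b)-(4.1e).

    treat_levels[i] = (covariate-1 level, covariate-2 level) of treatment i,
    ctrl_levels[j] likewise for control j. A sketch of the constraint
    logic only; the objective (4.1a) is not optimized here.
    """
    n, n_prime = len(treat_levels), len(ctrl_levels)
    # (4.1d): every treatment sample is matched to exactly kappa controls
    if any(sum(x[i]) != kappa for i in range(n)):
        return False
    # (4.1e): every control sample is used at most once
    if any(sum(x[i][j] for i in range(n)) > 1 for j in range(n_prime)):
        return False
    # (4.1b)/(4.1c): matched controls per level = kappa * treatment count
    for cov in (0, 1):
        levels = {t[cov] for t in treat_levels} | {c[cov] for c in ctrl_levels}
        for q in levels:
            treat_q = sum(1 for t in treat_levels if t[cov] == q)
            matched_q = sum(x[i][j] for i in range(n) for j in range(n_prime)
                            if ctrl_levels[j][cov] == q)
            if matched_q != kappa * treat_q:
                return False
    return True

assert check_kbm_feasible([[1, 0]], [(0, 0)], [(0, 0), (1, 0)], kappa=1)
assert not check_kbm_feasible([[0, 1]], [(0, 0)], [(0, 0), (1, 0)], kappa=1)
```

Since every treatment sample is matched exactly κ times by (4.1d), checking against κ times the full level count ℓ_{1,q} is equivalent to the statement of (4.1b).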
Constraints (4.1d) assign each treatment sample to κ control samples, and constraints (4.1e) specify that each control sample cannot be matched to more than one treatment sample.

Next we prove that the 2-covariate BM problem is NP-hard by reducing from the 3-SAT problem, one of Karp's 21 NP-hard problems [13], [6].

3-SAT: Given clauses C_1, C_2, ..., C_m, each consisting of 3 literals from the set {v_1, v_2, ..., v_n} ∪ {v̄_1, v̄_2, ..., v̄_n}, is the conjunction of the given clauses satisfiable?

Theorem
4.1. The 2-covariate BM problem is NP-hard.

Proof. Given an instance of the 3-SAT problem with clauses C_1, C_2, ..., C_m and variables v_1, v_2, ..., v_n, we construct an instance of the 2-covariate BM problem. We define two types of gadgets: the "variable" gadgets, one for each variable, and the "clause" gadgets, one for each clause. The distances between the treatment samples and the control samples are set to 0 or 1. We construct a 2-covariate BM problem such that the 3-SAT problem is satisfiable if and only if the optimal objective value is 0. Moreover, the zero-distance matching implies the values of the 3-SAT variables in a truth assignment.

First, we define the variable gadgets. Consider the jth variable v_j that appears p_j times as v_j and q_j times as v̄_j. We have the following 2p_j + 2q_j − 2 control samples in the jth variable gadget: a_i^{(j)} for i = 0, 1, ..., q_j − 1, b_i^{(j)} for i = 1, ..., q_j − 1, c_i^{(j)} for i = 0, 1, ..., p_j − 1, and d_i^{(j)} for i = 1, ..., p_j − 1. The p_j + q_j − 1 treatment samples in the v_j gadget are: e_i^{(j)} for i = 0, 1, ..., q_j − 1 and f_i^{(j)} for i = 1, ..., p_j − 1. (We also call the e_0^{(j)} sample f_0^{(j)} for simplicity.) There are p_j + q_j − 1 covariate 1 levels in the v_j gadget, each one consisting of three samples: a pair of control samples out of the 2p_j + 2q_j − 2 control samples and one treatment sample out of the p_j + q_j − 1 treatment samples. The pairs of control samples are {a_i^{(j)}, b_{i+1}^{(j)}} for i = 0, 1, ..., q_j − 2, {c_i^{(j)}, d_{i+1}^{(j)}} for i = 0, 1, ..., p_j − 2, and {a_{q_j−1}^{(j)}, c_{p_j−1}^{(j)}}. Since each of these levels has one treatment sample, exactly one control sample out of each of those pairs needs to be matched.

The zero-distance pairs for the v_j gadget are [e_i^{(j)}, a_i^{(j)}] for i = 0, 1, ..., q_j − 1, [e_i^{(j)}, b_i^{(j)}] for i = 1, ..., q_j − 1, [f_i^{(j)}, c_i^{(j)}] for i = 0, 1, ..., p_j − 1, and [f_i^{(j)}, d_i^{(j)}] for i = 1, ..., p_j − 1. Each treatment sample has two potential matches with zero distance, and each control sample has one potential match with zero distance. Note that sample e_0^{(j)} = f_0^{(j)} must be matched to either a_0^{(j)} or c_0^{(j)} in a zero-distance matching. So there are two possible zero-distance matchings:
• Case 1 (e_0^{(j)} is matched to a_0^{(j)}): e_i^{(j)} is matched to a_i^{(j)} for i = 0, 1, ..., q_j − 1; f_i^{(j)} is matched to d_i^{(j)} for i = 1, ..., p_j − 1; samples {b_i^{(j)} : i = 1, ..., q_j − 1} and {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are unmatched;
• Case 2 (e_0^{(j)} = f_0^{(j)} is matched to c_0^{(j)}): f_i^{(j)} is matched to c_i^{(j)} for i = 0, 1, ..., p_j − 1; e_i^{(j)} is matched to b_i^{(j)} for i = 1, ..., q_j − 1; samples {a_i^{(j)} : i = 0, 1, ..., q_j − 1} and {d_i^{(j)} : i = 1, ..., p_j − 1} are unmatched.

Consider the first case, in which e_0^{(j)} is matched to a_0^{(j)}. For i = 0, ..., q_j − 2, if treatment sample e_i^{(j)} is matched to a_i^{(j)}, then, as only one control sample from the same level can be matched, b_{i+1}^{(j)} must not be matched. Then, as the only two zero-distance matches for e_{i+1}^{(j)} are a_{i+1}^{(j)} and b_{i+1}^{(j)}, we infer that e_{i+1}^{(j)} must be matched to a_{i+1}^{(j)}. By such induction, we know all samples in {a_i^{(j)} : i = 0, 1, ..., q_j − 1} are matched, and samples in {b_i^{(j)} : i = 1, ..., q_j − 1} are unmatched. Since the only zero-distance match of c_0^{(j)} is e_0^{(j)}, c_0^{(j)} is unmatched in Case 1. For i = 0, ..., p_j − 2, if sample c_i^{(j)} is unmatched, then, as one control sample needs to be matched in each level, we can infer that sample d_{i+1}^{(j)} is matched to its only zero-distance pair f_{i+1}^{(j)}. Therefore, c_{i+1}^{(j)} cannot be matched as its only zero-distance pair is taken. By such induction, we know samples in {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are unmatched, and all samples in {d_i^{(j)} : i = 1, ..., p_j − 1} are matched. With similar arguments, in the second case, in which e_0^{(j)} = f_0^{(j)} is matched to c_0^{(j)}, all samples in {b_i^{(j)} : i = 1, ..., q_j − 1} ∪ {c_i^{(j)} : i = 0, 1, ..., p_j − 1} are matched, and samples in {a_i^{(j)} : i = 0, 1, ..., q_j − 1} ∪ {d_i^{(j)} : i = 1, ..., p_j − 1} are unmatched. In Case 1, we say that we assign variable v_j the value FALSE, and in Case 2 we say that we assign variable v_j the value TRUE.

Next, consider the clause gadgets. For clause C_w that consists of variables v_{w1}, v_{w2}, v_{w3}, we pick three samples without replacement from the variable gadgets that correspond to v_{w1}, v_{w2}, v_{w3} as follows. If the variable appears as literal v_{j(w)} in the clause, we pick one of the samples of the type c_i^{(j(w))} from the v_{j(w)} gadget; if the variable appears as literal v̄_{j(w)}, we pick one of the samples of the type a_i^{(j(w))} from the v_{j(w)} gadget. Since for each occurrence of literal v_j there is a type c_i^{(j)} sample, and for each occurrence of literal v̄_j there is a type a_i^{(j)} sample in the v_j gadget, we can ensure that there are enough samples to be selected from without replacement. The clause gadget is then augmented by eight new samples used as "garbage collectors": three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} and five control samples h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)}.
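The two-case behavior of the variable gadget can be confirmed by brute force for small p_j, q_j. The sketch below (illustrative code, not from the paper) enumerates, for each covariate 1 level, which control sample is matched, and counts the choices that extend to a zero-distance matching of all treatment samples.

```python
from itertools import product, permutations

def count_zero_matchings(p, q):
    """Count, by brute force, the zero-distance matchings of the variable
    gadget for a variable with p positive and q negated occurrences."""
    treats = [("e", i) for i in range(q)] + [("f", i) for i in range(1, p)]
    # covariate 1 levels pair up the control samples
    levels = ([[("a", i), ("b", i + 1)] for i in range(q - 1)]
              + [[("c", i), ("d", i + 1)] for i in range(p - 1)]
              + [[("a", q - 1), ("c", p - 1)]])

    def zero(t, c):
        kind, i = t
        if kind == "e":  # e_0 doubles as f_0, hence it may also take c_0
            return c in {("a", i), ("b", i)} or (i == 0 and c == ("c", 0))
        return c in {("c", i), ("d", i)}

    count = 0
    for choice in product(*levels):  # one matched control per level
        # can the treatments be perfectly matched to the chosen controls?
        if any(all(zero(t, c) for t, c in zip(treats, perm))
               for perm in permutations(choice)):
            count += 1
    return count

assert count_zero_matchings(2, 2) == 2
assert count_zero_matchings(3, 2) == 2
```

For the small parameter values tested, the brute force confirms the claim of the proof: exactly two zero-distance matchings, corresponding to Case 1 and Case 2.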
The fifteen distances between the three treatment samples and the five control samples are set to zero.

We introduce a new covariate 1 level that includes all of the eight new samples, so that it is disjoint from the other levels created for the variable gadgets. Since there are three treatment samples in this level, three of the control samples h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)} have to be matched.

We introduce a new covariate 2 level which consists of five control samples: h′_1^{(w)}, h′_2^{(w)} and the three previously selected samples of type a_i or c_i in the v_{w1}, v_{w2}, v_{w3} gadgets. We also set the three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} to be in this covariate 2 level. Therefore, three out of the five control samples of this level must be matched. Observe that any three control samples in this level must contain at least one sample of type a_i or c_i from a variable gadget. That means the clause is satisfied by our variable assignment rule above: suppose the matched literal appears as v_{j(w)} and the sample selected for this clause gadget is c_i^{(j(w))}; sample c_i^{(j(w))} is matched, so variable v_{j(w)} must be assigned the value TRUE by our previous discussion. Similarly, suppose the matched literal appears as v̄_{j(w)} and the sample selected for this clause gadget is a_i^{(j(w))}; sample a_i^{(j(w))} is matched, so variable v_{j(w)} must be assigned the value FALSE.

For all remaining samples that have not been assigned to a covariate 2 level in the above discussion, i.e., the treatment samples in the variable gadgets and the control samples of type h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, we create a "dump" covariate 2 level that includes all of them.

With the construction above, a zero value matching solution implies a truth assignment of the 3-SAT problem. The values of the 3-SAT variables are determined by which of the two zero-distance matchings is used in each variable gadget. For each clause gadget, at least one of the type a_i or c_i samples is matched, which implies the clause is satisfied by our variable assignment rule.

On the other hand, given a truth assignment of the 3-SAT problem, there is a zero-distance matching for the constructed problem. First match the samples in the variable gadgets as follows: if the variable takes value TRUE, then match the samples as described above in Case 2; if it takes value FALSE, then match the samples as described in Case 1. By doing so, the number of matched control samples in the covariate 1 level associated with each variable gadget equals the number of treatment samples in that level. Next, for each clause gadget, if there is one, respectively two, respectively three satisfied literals, then match two, respectively one, respectively zero samples out of {h′_1^{(w)}, h′_2^{(w)}} to the treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)}. That will match exactly three samples of the corresponding covariate 2 level as required. Finally, for w = 1, ..., m, the remaining unmatched treatment samples in clause C_w are matched to control samples out of {h_1^{(w)}, h_2^{(w)}, h_3^{(w)}}. Since the three treatment samples g_1^{(w)}, g_2^{(w)}, g_3^{(w)} are matched to control samples in {h_1^{(w)}, h_2^{(w)}, h_3^{(w)}, h′_1^{(w)}, h′_2^{(w)}}, the numbers of matched samples in the associated covariate 1 level are three for both the treatment and control sides. For either the treatment or control side, the number of matched samples in the dump level under covariate 2 is the total number of matched samples minus the number of matched samples in the other covariate 2 levels. As we have shown that the numbers of matched samples in the covariate 2 level associated with each clause gadget are three for both sides, the numbers of matched samples in the dump level also equal each other for the two groups.
So this is a zero-distance matching for the constructed 2-covariate BM problem.

The proof of Theorem 4.1 can be extended to the 2-covariate κ-BM problem for any constant κ.

Corollary. The 2-covariate κ-BM problem is NP-hard for any constant κ.

Proof. We prove this corollary also by a reduction from the 3-SAT problem with clauses C_1, C_2, ..., C_m and variables v_1, v_2, ..., v_n.

We first construct the same covariate levels, treatment samples and control samples as in the 2-covariate BM instance in the proof of Theorem 4.1. Let t denote the number of treatment samples constructed. The distance between each treatment sample and each control sample is modified: we change all distances 0 to 1 and all distances 1 to M, for M a large constant which is greater than t. According to the proof of Theorem 4.1, the 3-SAT problem is satisfiable if and only if the optimal objective value of the 2-covariate BM problem with this modified distance is t.

Next, we add more samples to the control group. For each treatment sample constructed, we add κ − 1 new control samples with the same pair of levels, each at distance 0 from this treatment sample and at distance M from every other treatment sample. We claim that the 3-SAT problem is satisfiable if and only if the optimal objective value of the 2-covariate κ-BM problem on this new instance is t.

If the 3-SAT problem is satisfiable, we can find an assignment for the BM problem on the constructed instance with distance t as in the proof of Theorem 4.1. By assigning additionally each treatment sample to its corresponding κ − 1 new control samples, we obtain a solution of total distance t for the κ-BM problem on the constructed instance. Furthermore, this is also an optimal solution for the κ-BM problem, as there are only (κ − 1) · t zero-distance pairs.

On the other hand, if the optimal solution to the 2-covariate κ-BM problem has a total distance of t, then all the (κ − 1) · t zero-distance pairs must be matched. In addition, there must be t matched pairs with distance 1. From the arguments in the proof of Theorem 4.1, we can derive that there is a truth assignment for the 3-SAT problem.

We can further derive that the κ-BM problem is also NP-hard for more than two covariates.

Corollary
The P-covariate κ-BM problem is NP-hard for every value of P ≥ 2 and any constant κ.

Proof. Given any 2-covariate κ-BM problem instance and any P ≥ 3, we can construct an equivalent P-covariate κ-BM problem by adding P − 2 covariates to the 2-covariate κ-BM problem instance and setting the value of the pth covariate to be the same for all samples, for each p = 3, ..., P. Therefore, the NP-hardness of the 2-covariate κ-BM problem implies that the P-covariate κ-BM problem is NP-hard as long as P ≥ 2.

4.2. The 2-covariate BM and 2-covariate κ-BM problems where one covariate has a constant number of levels. Let BM′ be the special case of the 2-covariate BM problem where the second covariate has a constant number of levels while the first covariate has no restriction on the number of levels. In Section 6 we will establish that if both covariates have a constant number of levels then the 2-covariate BM problem is polynomial time solvable. We show here that the complexity status of the 2-covariate problem in which only one covariate has a constant number of levels is linked to the complexity status of the exact matching problem and its weighted version, denoted as weighted exact matching. In order to present this connection we assume that the distance matrix is integral and all distances are given in unary, that is, there is a polynomial π of the input encoding length such that δ_{ij} ≤ π for all i, j.

The exact matching in bipartite graphs problem is defined as follows. The input is an integer number k together with a bipartite graph G = (V_1 ∪ V_2, E) with |V_1| = |V_2| = q, and its edge set E is partitioned into E_b ∪ E_r, where E_b is the set of blue edges and E_r is the set of red edges. The exact matching problem is to find a perfect matching that has exactly k blue edges (and all other q − k edges are red). The complexity status of the exact matching problem is as follows. While [16] showed that there is a randomized polynomial time algorithm for the problem, the existence of a deterministic polynomial time algorithm is still an important open problem.

The weighted exact matching problem is defined as follows.
The input is a bipartite graph G = (V, E) together with non-negative integral distances δ_e for all e ∈ E, where there is a polynomial π of |V| + |E| such that δ_e ≤ π for all e ∈ E. We are also given a target value K. The goal is to find a perfect matching of total distance exactly K. Note that the weighted exact matching problem is a generalization of the exact matching problem, since the latter problem can be interpreted as the weighted exact matching problem where the weight of a blue edge is 1 and the weight of a red edge is 0. Thus, a polynomial time algorithm for the weighted exact matching problem gives a polynomial time algorithm for the exact matching problem. On the other hand, it is known that a polynomial time algorithm for the exact matching problem gives a polynomial time algorithm for the weighted exact matching problem (see Proposition 1 in [17]). If the algorithm for the exact matching is deterministic (randomized), then the algorithm for the weighted exact matching is deterministic (randomized, respectively) as well [17]. Therefore, the complexity of the weighted problem has the same status as that of the exact matching problem. Namely, the result of [16] gives a randomized polynomial time algorithm for the weighted exact matching problem, while the existence of a deterministic polynomial time algorithm for this problem would result in a deterministic polynomial time algorithm for the exact matching in bipartite graphs problem.

We show the following connections between the exact matching problem (or the weighted exact matching) and problem BM′.

Theorem
If there is a deterministic (or randomized) polynomial time algorithm for BM′ then there is a deterministic (or randomized, respectively) polynomial time algorithm for exact matching in bipartite graphs.

Proof.
Assume that there is a polynomial time algorithm ALG for BM′; we will establish the existence of a polynomial time algorithm for the exact matching problem. Given an input to the exact matching problem with 2q nodes (q nodes on each side of the bipartite graph), we denote the bipartition of the graph by V_1 ∪ V_2 and the partition of the edge set by E_b ∪ E_r, and we let k be the number of required blue edges in the matching. We define the following input for BM′. We will associate samples with nodes, so the control group consists of 2q nodes and the treatment group consists of q nodes. For every node v ∈ V_1, we have two nodes in the control group corresponding to v: a red node v_r and a blue node v_b. All blue edges in E_b that were incident to v in the input to the exact matching problem are now incident to v_b, and all red edges that were incident to v are now incident to v_r. These edges corresponding to original edges in the input for the exact matching instance have zero distance, while all other distances are set to 1. The nodes in V_2 (of the original input graph to the exact matching) are the treatment nodes, so the distances we defined represent the distances between a treatment node and a control node.

The levels of the first covariate are defined such that every pair [v_r, v_b] for v ∈ V_1 defines one level of the first covariate, and we have one treatment node in each such level. Observe that the number of levels of the first covariate is q, and we have q nodes in the treatment group, so this assignment of levels of the first covariate is feasible. Next, consider the second covariate. We will have two levels of the second covariate, corresponding to blue and red. The red level of the control is the set of all red nodes, and the blue level of the control is the set of all blue nodes. The second covariate levels of the treatment nodes are defined so that there are exactly k treatment nodes in the blue level and the remaining q − k treatment nodes in the red level.

We apply algorithm ALG on the BM′ instance and check if the output cost is zero or strictly positive. In any feasible solution of the BM′ instance defined, exactly one of the two control nodes [v_r, v_b] is matched for each v ∈ V_1, as there is exactly one treatment node in each level of the first covariate. And since we need to match k blue control nodes and q − k red control nodes, a zero distance matching of the BM′ instance represents a set of edges in the exact matching instance that is a perfect matching consisting of k blue edges and q − k red edges.

Observe that this construction takes deterministic polynomial time. Thus the algorithm that constructs the input to BM′ and applies ALG on that input is a deterministic (randomized) polynomial time algorithm for the exact matching problem if ALG is a deterministic (randomized, respectively) polynomial time algorithm for BM′.

We next consider the other direction.
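The node-splitting construction of the proof above can be sketched as follows. This is an illustrative builder (names and level encodings are ad hoc); the assignment of treatment nodes to covariate 1 levels, one per level, is left arbitrary here, as any bijection works.

```python
def exact_matching_to_bm(V1, V2, blue, red, k):
    """Build the BM' instance from an exact-matching instance (V1, V2,
    blue/red edge sets as pairs (v, u) with v in V1, u in V2, target k):
    each v in V1 yields control samples v_r, v_b; original edges become
    zero-distance pairs and everything else gets distance 1."""
    controls = [(v, "r") for v in V1] + [(v, "b") for v in V1]
    treatments = list(V2)
    zero_pairs = ({(u, (v, "b")) for (v, u) in blue}
                  | {(u, (v, "r")) for (v, u) in red})
    dist = {(t, c): 0 if (t, c) in zero_pairs else 1
            for t in treatments for c in controls}
    cov1 = {c: c[0] for c in controls}   # one level {v_r, v_b} per v in V1
    cov2 = {c: c[1] for c in controls}   # control side of the blue/red levels
    # k treatment nodes go to the blue level, the remaining q - k to red
    treat_cov2 = {t: ("b" if i < k else "r") for i, t in enumerate(treatments)}
    return treatments, controls, dist, cov1, cov2, treat_cov2

V1, V2 = [1, 2], ["x", "y"]
blue, red = {(1, "x")}, {(1, "y"), (2, "y")}
T, Ctl, dist, cov1, cov2, tc2 = exact_matching_to_bm(V1, V2, blue, red, k=1)
assert len(Ctl) == 2 * len(V1) and len(dist) == len(T) * len(Ctl)
assert sum(1 for d in dist.values() if d == 0) == len(blue) + len(red)
```

A zero-cost BM′ solution on this instance then selects, per level, either v_r or v_b, which encodes the red/blue choice of the perfect matching.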
Theorem
If there is a deterministic (or randomized) polynomial time algorithm for weighted exact matching in bipartite graphs then there is a deterministic (or randomized, respectively) polynomial time algorithm for BM′.

Proof.
Assume that there is a polynomial time algorithm for the weighted exact matching problem in bipartite graphs; we will establish the existence of a polynomial time algorithm for BM′. Here we are going to use the fact that the maximum distance is at most π, and without loss of generality we assume that n′ ≤ π. We set ε = 1/(2nπ² + 1), and we define a multi-objective optimization problem. As a step in the algorithm for solving BM′, we will find a (1 + ε)-approximate Pareto set of perfect matchings in the following graph with the following multi-criteria objective.

We consider a complete bipartite graph where one side consists of the control group, namely one node for each control sample; the other side of the graph corresponds to the treatment group together with some additional nodes. If we have in the instance of BM′ a control group of size n′ and a treatment group of size n, then we will have exactly n′ − n additional nodes. The graph that we consider is the complete bipartite graph with n′ nodes on each side. In this graph a feasible solution is a perfect matching of all nodes.

Next, we define the 1 + k objectives, for constant k being the number of levels of the second covariate. The first objective corresponds to the total distance, and there is one of the remaining objectives for each level of the second covariate. These objectives are sums of costs of the edges in the perfect matching with different coefficients. We denote these cost coefficients as a vector for each edge, so the first coefficient of this cost vector is the coefficient of the cost function of the first objective. In order to define the first cost coefficient, we assign a covariate 1 level to the additional nodes as follows. For each level p of the first covariate, we assign exactly ℓ′_{1,p} − ℓ_{1,p} additional nodes to have level p.
Note that without loss of generality we have ℓ′_{1,p} ≥ ℓ_{1,p} for all p (as otherwise BM′ is infeasible), and thus our definition of levels of the first covariate for the additional nodes indeed defines such a level for every additional node. Consider an edge [u, v] between an additional node u and a control sample node v. The first cost coefficient of edge [u, v] is 0 if the two nodes share the same level of covariate 1; otherwise it is set to 2nπ. The other k cost coefficients of such an edge [u, v] are set to 0. Consider next an edge [u′, v] where u′ is a treatment node (and not an additional node), and assume that the second covariate level of v is p. The first cost coefficient is the distance between samples u′ and v in the BM′ instance (that is, δ_{u′,v}). The (p + 1)th cost coefficient is set to 1 and the other k − 1 cost coefficients are set to 0. So the (p + 1)th objective is to minimize the number of matched edges adjacent to level-p control nodes for covariate 2.

Observe that every feasible solution, namely, every perfect matching in this bipartite graph, has the property that the sum of the costs according to all objectives excluding the first one is exactly n. Furthermore, observe that in a Pareto set of this multi-criteria optimization problem, there is exactly one point where, for every p = 1, 2, ..., k, the total cost of the matched edges according to the (p + 1)th objective is exactly the number of treatment samples of the pth level of the second covariate. We will refer to this point as the candidate feasible point of the Pareto set.

Now, we use the result of Papadimitriou and Yannakakis [18] to conclude that the algorithm for weighted exact matching gives the required algorithm for approximating the Pareto set and finding the (1 + ε)-approximate Pareto set (see Theorem 4 and Corollary 5 in [18]).
We can use the results of [18] since the maximum cost coefficient of an edge in our instance is 2nπ, which is polynomially bounded in the input encoding length, and the number of objectives is a constant.

We next argue that the (1 + ε)-approximate Pareto set is actually the Pareto set of the multi-criteria problem. The cost vectors of the edges are always integral and have a maximum coefficient of at most 2nπ, and thus for every objective the cost of the matching is at most 2nπ². Therefore, if we approximate this objective with an approximation ratio of 1 + ε, then by our choice of ε we get the optimal value of this objective. By the last claim we conclude that the candidate feasible point of the Pareto set is one of the solutions that appear in this Pareto set, and we consider this specific solution.

We delete from the candidate solution the edges (in the matching) that are adjacent to the additional nodes. By the notion of the candidate solution, the fine balance constraints of the second covariate are satisfied. So if the resulting matching is not a feasible solution to the BM′ instance, it means that there is at least one additional node that used to be matched (in the candidate solution) to a control sample of a different covariate 1 level. This is so as otherwise, for every level of the first covariate, the number of selected control samples of this level is the same as the number of treatment samples of this level. Consequently, if the resulting matching is infeasible for BM′, then the first objective value of the candidate solution is at least 2nπ, which implies that the BM′ problem is infeasible, as we show next. If BM′ has a feasible solution, then we can create an alternative candidate solution by adding to this solution for BM′ a zero distance matching of the additional nodes. This alternative solution has a first objective value that is less than or equal to nπ, and all other objective values are the same as the candidate solution.
So according to the (1+ε) Pareto optimality, the resulting matching must be feasible for the BM' problem.

In summary, we consider the candidate solution for BM' obtained from the candidate feasible point of the Pareto set after deleting the edges adjacent to the additional nodes. We check if the candidate solution for BM' satisfies the fine balance constraints. If it does, then this is the output of the algorithm for BM', and if it does not, then the BM' instance is infeasible. Furthermore, if we use the existing algorithm of [16] to solve the weighted exact matching, then the resulting algorithm is a randomized polynomial time algorithm, whereas if we use a deterministic polynomial time algorithm, then the resulting algorithm for BM' is also a deterministic polynomial time algorithm.

Next we consider κ-BM', the special case of 2-covariate κ-BM where the second covariate has a constant number of levels, and once again we assume that the distance matrix is integral and the maximum distance is upper bounded by a polynomial π of the input encoding length. We show that κ-BM' has the same complexity status as BM' (for all κ ≥ 2).

Theorem.
There is a polynomial time algorithm for BM' if and only if there is a polynomial time algorithm for κ-BM'.

Proof. Assume that there is a polynomial time algorithm ALG for BM'. Consider an instance of the κ-BM' problem, and replace every treatment sample by κ copies of it with the same pair of levels as the original element of the treatment group. We define the distance matrix as follows. The distance between a treatment sample x that is a copy of the original treatment sample x′ (of the instance for the κ-BM') and a control sample y is now defined as the distance between x′ and y. The resulting treatment group and control group form the instance for BM', and we apply ALG on that instance. A feasible solution for this BM' instance gives a feasible solution for the κ-BM' instance (simply by matching a treatment sample to a control sample if one of the copies of the treatment sample was matched to that control sample) of the same total distance. Similarly, a feasible solution to the original κ-BM' instance gives a feasible solution for the BM' instance of the same cost, by matching the set of κ control samples matched to a treatment sample to the κ copies of this treatment sample in the BM' instance (with one control sample matched to each copy). Thus, we get a polynomial time algorithm for the κ-BM' problem.

Consider the other direction. Assume that we are given a polynomial time algorithm ALG for κ-BM', and we establish the existence of a polynomial time algorithm for BM'. Consider an instance of the BM' problem. First, we add (n+1)π to every component of the distance matrix. This modification of the distance matrix ensures that every feasible solution for BM' has a cost that is at least (n²+n)π and at most (n²+2n)π. Next, for every treatment sample x we add another κ−1 dummy control samples whose distance from x is 0 and whose distance from every other treatment sample is (n²+2n)·π+1. These κ−1 dummy control samples per treatment sample give an instance of κ-BM' on which we apply ALG. The output has cost B ≤ (n²+2n)·π
if and only if the solution we obtain by deleting all the dummy control samples is a feasible solution for BM' of cost B. To prove the last claim, note that the solution for the κ-BM' instance cannot match dummy control samples to treatment samples if their distance is not zero, because the distance of such a match is larger than B and all distances are non-negative. Furthermore, every treatment sample is matched to exactly κ−1 dummy control samples and one original control sample, since otherwise the solution would contain at least n+1 edges of modified distance at least (n+1)·π, of total cost at least (n+1)²·π > (n²+2n)·π. Thus, by deleting the dummy control samples we get a feasible solution for the BM' problem (after the modification of the distance matrix) of the same cost. Similarly, if we take an optimal solution for the BM' problem (before or after the modification of the distances) and add to it the dummy control samples that are matched using the zero distances, then we get an optimal solution for the κ-BM'. Therefore, by applying ALG on the last κ-BM' problem we get a polynomial time algorithm for solving the original BM' instance.
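The first direction of this reduction is mechanical enough to sketch in code. This is a sketch under our own naming: `dist` is the κ-BM' distance matrix, indexed first by treatment sample and then by control sample.

```python
def expand_to_bm(dist, kappa):
    """Replace each treatment sample by kappa copies that inherit its
    distances, turning a kappa-BM' distance matrix into a BM' one.
    Returns the expanded matrix and a map from copy index to original sample."""
    new_dist, origin = [], []
    for t, row in enumerate(dist):
        for _ in range(kappa):
            new_dist.append(list(row))  # each copy keeps t's distances
            origin.append(t)
    return new_dist, origin
```

Matching a control sample in the BM' solution back to `origin[i]` for copy i recovers the κ-BM' matching of the same total distance, as in the proof.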
5. The maximum selection κ-fine-balance matching (κ-MSBM) problem. Since any κ-BM problem can be solved as a κ-MSBM problem with the same κ and the same number of covariates, Corollary 4.3 implies that the P-covariate κ-MSBM problem is also NP-hard for P ≥ 2 and every κ. In section 2 we show that the 1-covariate MSBM problem can be solved as an MCNF problem in polynomial time. We show next that the 1-covariate κ-MSBM problem is NP-hard for any constant κ ≥ 3.

Theorem 5.1.
For any constant value of κ such that κ ≥ 3, the 1-covariate κ-MSBM problem is NP-hard even with only one level.

Proof. Given is an instance of Exact-3-cover, namely a collection C of 3-element subsets (triplets) of a ground set E with |E| = 3q for some integer q. We will define an instance of 1-covariate κ-MSBM with only one level, for any constant κ such that κ ≥ 3. In the constructed instance, we have one treatment sample for every triplet in C, and we have one control sample for every element e ∈ E. In addition we have (κ−3)·q dummy control samples. For each triplet T ∈ C, the distance of the corresponding treatment sample to a control sample is defined as follows: it is zero if the control sample is one of the dummy samples or if the control sample is an element of the triplet T. All other distances (that are still undefined) are set to one. Observe that a feasible solution for the κ-MSBM instance selects the entire control group and selects exactly q treatment samples. We claim that in this instance, an optimal solution for κ-MSBM has total distance zero if and only if the Exact-3-cover instance is a YES instance.

To see the last claim, assume first that there is a subcollection of triplets C′ ⊆ C such that every element of E appears in exactly one triplet in C′. Then we construct a zero-distance solution for the κ-MSBM instance as follows. We select the samples C′ of the treatment group, and each such selected sample T ∈ C′ is matched to the control samples consisting of the three element samples in T together with κ−3 of the dummy control samples. Since C′ has q triplets, we have a sufficient number of additional control samples. Furthermore, since every element in E appears only once in triplets of C′, we conclude that its control sample is matched to exactly one selected treatment sample.

On the other hand, assume that there is a zero-distance solution for the κ-MSBM instance. Then this solution selects exactly q treatment samples corresponding to q triplets. Denote by C′ ⊆ C this subcollection of q triplets. Note that if there is a selected treatment sample that is matched to at least κ−2 dummy control samples, then it is matched to at most two element samples of E; since there are only (κ−3)·q dummy samples, some other selected treatment sample is then matched to at least four control samples corresponding to elements of E, hence to an element outside its triplet, so at least one of those matches has distance one. This contradicts the assumption that the cost of the κ-MSBM solution is zero.
Therefore, every selected treatment sample is matched to exactly κ−3 dummy control samples and to the three element samples of its triplet. Hence every element of E appears in exactly one triplet of C′, so the Exact-3-cover instance is a YES instance.

For the 1-covariate κ-MSBM problem where κ = 2, the complexity status remains open.
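The distance matrix of this reduction can be sketched directly. The helper name is ours; controls are listed as the elements of E followed by the (κ−3)·q dummies.

```python
def build_kmsbm_instance(triplets, elements, kappa):
    """Distance matrix of the 1-covariate kappa-MSBM instance built from an
    Exact-3-cover instance (kappa >= 3), following the reduction above:
    distance 0 to dummies and to the elements of the triplet, 1 otherwise."""
    q = len(elements) // 3          # |E| = 3q
    controls = list(elements) + ["dummy"] * ((kappa - 3) * q)
    return [[0 if (c == "dummy" or c in T) else 1 for c in controls]
            for T in triplets]
```

A zero-cost κ-MSBM solution over this matrix then corresponds exactly to an exact cover, as argued in the proof.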
6. Fixed-parameter tractable algorithms.
In this section, we consider the special cases of the κ-FBS, κ-BM, and κ-MSBM problems where all covariates have a small number of levels. Let K = ∏_{p=1}^{P} k_p be the number of level-intersections. Observe that if the number of covariates is constant and all covariates have a constant number of levels, then K is a constant. We note that the problems κ-FBS, κ-BM, and MSBM can be solved in fixed-parameter tractable (FPT) time with parameter K. In order to state these results, we say that a problem is fixed-parameter tractable with parameter K, and denote it by FPT(K), if it has an algorithm whose time complexity is upper bounded by a function of the form f(K) · poly, where f(K) is some computable function of the parameter K and poly is some polynomial of the input binary encoding length; similarly, we say that an algorithm runs in FPT(K) time if its time complexity can be upper bounded by a function of this form. Here we show that these problems, namely the κ-FBS and κ-BM problems for all κ, and the MSBM problem, are FPT(K). Similar results for κ-MSBM where κ ≥ 3 are impossible unless P = NP, as shown in Theorem 5.1. The complexity status of the 2-MSBM problem with constant K is open.

Our proof for the FPT(K) results uses the existence of fast algorithms for solving integer programming in fixed dimension and for solving mixed-integer linear programs if the number of integral variables is fixed. Lenstra [14] (see also [12] for an improved time complexity of these algorithms) showed that the integer linear programming problem with a fixed number of variables is polynomially solvable, and he also showed that a mixed-integer linear program with a fixed number of integer variables can be solved in polynomial time. In fact, these algorithms run in FPT time with the parameter being the number of integral variables.
Therefore, to prove our results we show either an integer programming (IP) formulation with O(K) decision variables, or a mixed-integer linear program (MILP) with O(K) integer variables such that solving this MILP to optimality ensures that the resulting solution is integral and solves the corresponding problem.

κ-FBS problem. First consider the κ-FBS problem. For this problem we use an integer program with dimension O(K) that is based on (IP-FBS). Let u_{i_1,i_2,...,i_P} = |L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}| and u′_{i_1,i_2,...,i_P} = |L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}| for i_p = 1, ..., k_p, p = 1, ..., P. The decision variables are:

x_{i_1,i_2,...,i_P}: the number of treatment samples selected from the (i_1, i_2, ..., i_P) level intersection L_{1,i_1} ∩ L_{2,i_2} ∩ ... ∩ L_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P;

x′_{i_1,i_2,...,i_P}: the number of control samples selected from the (i_1, i_2, ..., i_P) level intersection L′_{1,i_1} ∩ L′_{2,i_2} ∩ ... ∩ L′_{P,i_P}, for i_p = 1, ..., k_p, p = 1, ..., P.

The integer programming formulation is:

max Σ_{i_1=1}^{k_1} Σ_{i_2=1}^{k_2} ··· Σ_{i_P=1}^{k_P} x_{i_1,i_2,...,i_P}    (6.1a)

s.t. κ · Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} x_{i_1,i_2,...,i_P} = Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} x′_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1b)

0 ≤ x_{i_1,i_2,...,i_P} ≤ u_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1c)

0 ≤ x′_{i_1,i_2,...,i_P} ≤ u′_{i_1,i_2,...,i_P},   p = 1, ..., P, i_p = 1, ..., k_p    (6.1d)

x_{i_1,i_2,...,i_P}, x′_{i_1,i_2,...,i_P} integers,   p = 1, ..., P, i_p = 1, ..., k_p    (6.1e)

Note that this integer programming formulation has 2K decision variables and O(K) constraints, and thus the algorithm that constructs it and solves it to optimality runs in FPT(K) time. The optimal solution for this integer program encodes the optimal solution for κ-FBS, similarly to the corresponding proof in section 2.
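For intuition, formulation (6.1) can be checked by brute force on a toy instance. This is our own sketch, feasible only for tiny K and small capacities; it is not the Lenstra-based FPT algorithm, which solves the IP in f(K)·poly time.

```python
from itertools import product

def kfbs_brute_force(u, u_prime, kappa, P):
    """Maximize the number of selected treatment samples subject to the
    kappa-fine-balance constraints (6.1b) and the bounds (6.1c)-(6.1d).
    u, u_prime: dicts mapping level-intersection tuples to capacities."""
    keys = sorted(u)
    best = 0
    for xs in product(*(range(u[t] + 1) for t in keys)):
        for xps in product(*(range(u_prime[t] + 1) for t in keys)):
            # constraint (6.1b): for each covariate p and level v, the selected
            # controls at level v must be exactly kappa times the treatments
            if all(kappa * sum(x for t, x in zip(keys, xs) if t[p] == v)
                   == sum(x for t, x in zip(keys, xps) if t[p] == v)
                   for p in range(P)
                   for v in {t[p] for t in keys}):
                best = max(best, sum(xs))
    return best
```

For one covariate with two levels, u = {(0,): 2, (1,): 1}, u′ = {(0,): 4, (1,): 2} and κ = 2, the brute force selects all three treatment samples.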
κ-BM problem. Next consider the κ-BM problem. In section 2, we describe an MCNF formulation when the level intersection sizes s′_{i_1,i_2,...,i_P} for i_p = 1, ..., k_p, p = 1, ..., P are given. Observe that if we treat the sizes s′_{i_1,i_2,...,i_P} for all p and i_p as decision variables, then by enforcing the integrality of these K variables and adding the constraints

Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s′_{i_1,i_2,...,i_P} = κ · ℓ_{p,i_p},   i_p = 1, ..., k_p, p = 1, ..., P,

forcing the κ-fine balance constraints, to the MCNF formulation, we get an MILP formulation of κ-BM with K integral variables. In fact, if we restrict ourselves to common integral values of these K variables, then the other decision variables are integral, as we argue next. By considering the values of these K integral variables as constants, the resulting linear programming formulation is in fact an MCNF LP formulation whose supply/demand vector depends on the values of these K integral variables. Thus, the optimal solution for the MILP is without loss of generality integral, and even if it does not satisfy this integrality requirement, it can be transformed in polynomial time to another optimal solution that is integral.

Since the number of variables of the resulting mixed-integer program is at most n · n′ + K, the number of integer variables is K, and the number of constraints is O(n · n′), we conclude that the algorithm that formulates this MILP and solves it to optimality, guaranteeing that the optimal solution is integral, runs in FPT(K) time.

κ-MSBM problems. We know from Theorem 5.1 that the 1-covariate κ-MSBM problem for κ ≥ 3 is NP-hard, so we consider here the MSBM problem. In section 2, we describe an MCNF formulation when the level intersection sizes s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} are given. Observe that if we treat s_{i_1,i_2,...,i_P} and s′_{i_1,i_2,...,i_P} as decision variables, then by enforcing the integrality of these O(K) variables and adding the constraints

Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s_{i_1,i_2,...,i_P} = Σ_{i_1=1}^{k_1} ... Σ_{i_{p−1}=1}^{k_{p−1}} Σ_{i_{p+1}=1}^{k_{p+1}} ... Σ_{i_P=1}^{k_P} s′_{i_1,i_2,...,i_P},   i_p = 1, ..., k_p, p = 1, ..., P,

that is, the fine balance constraints, in addition to the constraint saying that the sum over all s_{i_1,i_2,...,i_P} equals the objective function value of κ-FBS, to the MCNF formulation, we get an MILP formulation of κ-MSBM with 2K integral variables. In fact, if we restrict ourselves to common integral values of these 2K variables, then the other decision variables are without loss of generality integral as well, as we argue next. By considering the values of these 2K integral variables as constants, the resulting linear programming formulation is in fact an MCNF LP formulation whose supply/demand vector depends on the values of these 2K integral variables. Thus, the optimal solution for the MILP is without loss of generality integral, and even if it does not satisfy this integrality requirement, it can be transformed in polynomial time to another optimal solution that is integral.

Since the number of variables of the resulting mixed-integer program is at most n · n′ + 2K, the number of integer variables is 2K, and the number of constraints is O(n · n′), we conclude that the algorithm that formulates this MILP and solves it to optimality, guaranteeing that the optimal solution is integral, runs in FPT(K) time.

Appendix A. The minimum cost network flow.
We formulate here the minimum cost network flow problem (MCNF). The input to the problem is a graph G = (V, A) with a set of nodes V and a set of arcs A, where each arc (i, j) ∈ A is associated with a cost c_ij, a capacity upper bound u_ij, and a capacity lower bound l_ij. Each node i ∈ V has a supply b_i, which is interpreted as a demand if negative, and can be 0. Let x_ij be the amount of flow on arc (i, j) ∈ A. The flow vector x is said to be feasible if it satisfies:

(1) Flow balance constraints:
For every node k ∈ V, Outflow(k) − Inflow(k) = b_k.

(2) Capacity constraints:
For each arc (i, j) ∈ A, l_ij ≤ x_ij ≤ u_ij.

The linear programming formulation of the problem is:

(MCNF)  min Σ_{(i,j)∈A} c_ij x_ij
        subject to  Σ_{j:(k,j)∈A} x_kj − Σ_{i:(i,k)∈A} x_ik = b_k,   ∀ k ∈ V
                    l_ij ≤ x_ij ≤ u_ij,   ∀ (i, j) ∈ A.

The flow balance constraint coefficients form a {0, 1, −1}-matrix where in each column there is exactly one 1 and one −1. Such a matrix is a special case of matrices where each column (or row) has at most one 1 and at most one −1, which are known to be totally unimodular.
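The total unimodularity claim can be sanity-checked on a small instance by enumerating all square submatrices of the node-arc incidence matrix. This is an exhaustive check we wrote for illustration; it is exponential in the matrix size and only meant for tiny graphs.

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion (fine for tiny integer matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def incidence(n_nodes, arcs):
    """Node-arc incidence matrix: +1 at the tail of each arc, -1 at its head."""
    M = [[0] * len(arcs) for _ in range(n_nodes)]
    for a, (i, j) in enumerate(arcs):
        M[i][a], M[j][a] = 1, -1
    return M

def is_totally_unimodular(M):
    """Check that every square submatrix has determinant in {-1, 0, 1}."""
    for k in range(1, min(len(M), len(M[0])) + 1):
        for R in combinations(range(len(M)), k):
            for C in combinations(range(len(M[0])), k):
                if det([[M[r][c] for c in C] for r in R]) not in (-1, 0, 1):
                    return False
    return True
```

By total unimodularity, the MCNF linear program with integral supplies and capacities has integral basic optimal solutions, which is the property used repeatedly in the MILP arguments above.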
REFERENCES

[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall, 1993.
[2] M. Bennett, J. P. Vielma, and J. R. Zubizarreta, Building representative matched samples with multi-valued treatments in large observational studies, Journal of Computational and Graphical Statistics, (2020), pp. 1–29, https://doi.org/10.1080/10618600.2020.1753532.
[3] M. A. Brookhart, S. Schneeweiss, K. J. Rothman, R. J. Glynn, J. Avorn, and T. Stürmer, Variable selection for propensity score models, American Journal of Epidemiology, 163 (2006), pp. 1149–1156.
[4] R. Busacker and P. J. Gowen, A procedure for determining minimal-cost network flow patterns, ORO Technical Report 15, Operational Research Office, Johns Hopkins University, 1961.
[5] J. Edmonds and R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, Journal of the ACM, 19 (1972), pp. 248–264, https://doi.org/10.1145/321694.321699.
[6] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, NY, USA, 1979.
[7] D. E. Ho, K. Imai, G. King, and E. A. Stuart, Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference, Political Analysis, 15 (2007), pp. 199–236.
[8] D. S. Hochbaum and X. Rao, Network flow methods for the minimum covariates imbalance problem, arXiv preprint arXiv:2007.06828, (2020).
[9] G. W. Imbens, Nonparametric estimation of average treatment effects under exogeneity: A review, Review of Economics and Statistics, 86 (2004), pp. 4–29.
[10] M. Iri, A new method of solving transportation-network problems, Journal of the Operations Research Society of Japan, 3 (1960), p. 2.
[11] W. S. Jewell, Optimal flow through networks, in Operations Research, vol. 6, 1958, pp. 633–633.
[12] R. Kannan, Improved algorithms for integer programming and related lattice problems, in Proceedings of the 15th Annual ACM Symposium on Theory of Computing, ACM, 1983, pp. 193–206, https://doi.org/10.1145/800061.808749.
[13] R. M. Karp, Reducibility among combinatorial problems, in Complexity of Computer Computations, Springer, 1972, pp. 85–103.
[14] H. W. Lenstra, Jr., Integer programming with a fixed number of variables, Mathematics of Operations Research, 8 (1983), pp. 538–548.
[15] S. L. Morgan and D. J. Harding, Matching estimators of causal effects: Prospects and pitfalls in theory and practice, Sociological Methods & Research, 35 (2006), pp. 3–60.
[16] K. Mulmuley, U. Vazirani, and V. Vazirani, Matching is as easy as matrix inversion, Combinatorica, 7 (1987), pp. 105–113.
[17] C. H. Papadimitriou and M. Yannakakis, The complexity of restricted spanning tree problems, Journal of the ACM, 29 (1982), pp. 285–309.
[18] C. H. Papadimitriou and M. Yannakakis, On the approximability of trade-offs and optimal access of web sources, in Proceedings of the 41st Annual Symposium on Foundations of Computer Science, IEEE, 2000, pp. 86–92.
[19] S. D. Pimentel, R. R. Kelz, J. H. Silber, and P. R. Rosenbaum, Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons, Journal of the American Statistical Association, 110 (2015), pp. 515–527, https://doi.org/10.1080/01621459.2014.997879.
[20] P. R. Rosenbaum, Overt bias in observational studies, in Observational Studies, Springer, 2002, pp. 71–104.
[21] P. R. Rosenbaum, R. N. Ross, and J. H. Silber, Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer, Journal of the American Statistical Association, 102 (2007), pp. 75–83, https://doi.org/10.1198/016214506000001059.
[22] D. B. Rubin, E. A. Stuart, et al., Affinely invariant matching methods with discriminant mixtures of proportional ellipsoidally symmetric distributions, The Annals of Statistics, 34 (2006), pp. 1814–1826.
[23] E. A. Stuart, Matching methods for causal inference: A review and a look forward, Statistical Science, 25 (2010), p. 1, https://doi.org/10.1214/09-STS313.
[24] N. Tomizawa, On some techniques useful for solution of transportation network problems, Networks, 1 (1971), pp. 173–194, https://doi.org/10.1002/net.3230010206.
[25] D. Yang, D. S. Small, J. H. Silber, and P. R. Rosenbaum, Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes, Biometrics, 68 (2012), pp. 628–636, https://doi.org/10.1111/j.1541-0420.2011.01691.x.
[26]