Exact Computation of Maximum Rank Correlation Estimator∗

Youngki Shin† and Zvezdomir Todorov‡

September 9, 2020
Abstract
In this paper we provide an exact computation algorithm for the maximum rank correlation estimator using the mixed integer programming (MIP) approach. We construct a new constrained optimization problem by transforming all indicator functions into binary parameters to be estimated and show that it is equivalent to the original problem. Using a modern MIP solver, we apply the proposed method to an empirical example and Monte Carlo simulations. We also consider an application of the best subset rank prediction and show that the original optimization problem can be reformulated as MIP. We derive the non-asymptotic bound for the tail probability of the predictive performance measure.
Keywords: mixed integer programming, semiparametric estimation, maximum rank correlation, U-process, finite sample property.
JEL classification: C14, C61.

∗ We thank Toru Kitagawa and Mike Veall for helpful comments. Shin gratefully acknowledges support from the Social Sciences and Humanities Research Council of Canada (SSHRC-435-2018-0275).
† Department of Economics, McMaster University, ON, Canada. Email: [email protected].
‡ Department of Economics, McMaster University, ON, Canada. Email: [email protected].

1 Introduction
In this paper we provide an exact computation algorithm for the maximum rank correlation (MRC) estimator using the mixed integer programming (MIP) approach. The MRC estimator was first proposed by Han (1987) to estimate the generalized regression model:

  y_i = D ∘ F(x_i'β, ε_i),   (1)

where D: R → R is non-degenerate monotonic and F: R² → R is strictly monotonic in each argument. The object of interest is the linear index parameter β. The model is general enough to include a binary choice model, a censored regression model, and a proportional hazards model as its examples. Han (1987) proposed to estimate β by maximizing Kendall's rank correlation coefficient:

  β̂ = argmax_{β∈B} 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} 1{x_i'β > x_j'β}·1{y_i > y_j},   (2)

where 1{·} is an indicator function. He showed the consistency of the MRC estimator, and Sherman (1993) later proved its √n-consistency and asymptotic normality. The flexible model structure has led to various extensions of the MRC estimator: for example, a quantile index model (Khan, 2001), a generalized panel model (Abrevaya, 2000), a rank estimation of a nonparametric function (Chen, 2002), a random censoring model (Khan and Tamer, 2007), and a partial linear model (Abrevaya and Shin, 2011).

Implementing the MRC estimator poses some computational challenges in practice, where the grid search method is not feasible. First, the objective function in (2) is not differentiable in β, so we cannot apply a gradient-based optimization algorithm. Second, the objective function is not concave. Therefore, any solution found by a numerical algorithm may be a local rather than a global solution.
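For concreteness, the objective in (2) can be evaluated directly by a naive O(n²) double loop; the sketch below (function names are ours) also shows an O(n log n) evaluation via a mergesort-based inversion count, which attains the same complexity as Abrevaya's (1999) binary-search-tree method when neither the index values nor y contain ties, though it is a different data structure.

```python
import itertools

def mrc_objective(beta, X, y):
    """Naive evaluation of Han's (1987) objective in eq. (2): the share of
    ordered pairs (i, j) concordant in the index x'beta and in y.
    Costs O(n^2) per evaluation of a candidate beta."""
    n = len(y)
    index = [sum(b * v for b, v in zip(beta, row)) for row in X]
    concordant = sum(
        1
        for i, j in itertools.permutations(range(n), 2)
        if index[i] > index[j] and y[i] > y[j]
    )
    return concordant / (n * (n - 1))

def count_inversions(a):
    """Mergesort-based inversion count, O(n log n)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, linv = count_inversions(a[:mid])
    right, rinv = count_inversions(a[mid:])
    merged, inv, i, j = [], linv + rinv, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i  # right[j] precedes all remaining left items
    merged += left[i:] + right[j:]
    return merged, inv

def mrc_objective_fast(beta, X, y):
    """Same value in O(n log n) when neither the index nor y has ties:
    sort by the index, then concordant pairs = all pairs - inversions in y."""
    n = len(y)
    index = [sum(b * v for b, v in zip(beta, row)) for row in X]
    order = sorted(range(n), key=lambda k: index[k])
    _, inv = count_inversions([y[k] for k in order])
    return (n * (n - 1) // 2 - inv) / (n * (n - 1))
```

Both functions return the same value on tie-free data; only the cost per evaluation differs.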
This difficulty is well described in Chay and Honore (1998), who apply Powell's conjugate directions method, the simplex method with multiple starting points, and the piecewise grid search method repeatedly to achieve a better solution in their empirical application. Even after these repeated searches, we cannot be sure whether the current solution is the global optimum. Finally, the objective function is a second-order U-process and requires O(n²) computations for a single evaluation. Abrevaya (1999) shows that the computation order can be reduced to O(n log n) by adopting the binary search tree structure. However, the fundamental local solution issue still remains.

The contribution of this paper is twofold. First, we propose a new computation algorithm that assures the global solution of the MRC estimator. We achieve this goal by transforming all indicator functions into binary parameters to be estimated along with additional constraints. We show that the proposed mixed integer programming (MIP) problem is equivalent to the original optimization problem. Although it is still an NP-hard problem, we use a modern MIP solver and confirm that it is feasible to get the solution within a reasonable time budget. An additional advantage of the MIP approach is that it provides us with the gap between the objective function value at the current best solution and the bound of the possible global maximum at any point of the computation procedure. By this MIP gap, we can measure the quality of the interim solution when the time limit prevents us from waiting for the convergence of the procedure. Second, we consider an application of the best subset rank prediction and analyze the prediction performance. Building on Chen and Lee (2018a), we derive a non-asymptotic bound of the tail probability of the predictive performance measure.
Since the objective function is defined as a second-order U-process, we develop a new technique to derive the exact tail probability bound for higher-order U-processes.

We next review some related literature. The MIP procedure has recently been adopted in various applications in econometrics and statistics. Florios and Skouras (2008) show that the maximum score estimator of Manski (1975) can be reformulated as an MIP structure. Bertsimas, King, and Mazumder (2016) consider the best subset selection problem and show that the MIP algorithm outperforms other penalty-based methods in terms of achieving sparse solutions with good predictive power. Chen and Lee (2018a,b) investigate the binary prediction problem with variable selection and the instrumental variable quantile regression in the MIP formulation. Kitagawa and Tetenov (2018) apply the MIP procedure when they estimate the personalized optimal welfare policy. Finally, Lee et al. (2018) develop an MIP computation algorithm to estimate a two-regime regression model when the regime is determined by multi-dimensional factors. To the best of our knowledge, however, this is the first paper in the literature to apply the MIP approach when the objective function is defined as a higher-order U-process.

The remainder of the paper is organized as follows. In section 2, we propose the MIP computation algorithm for the maximum rank correlation estimator. We show that the proposed algorithm is equivalent to the original optimization problem and illustrate how it can achieve a feasible solution. In section 3, we show the better performance of the proposed MIP algorithm by applying it to the female labor participation data of Mroz (1987). Additional numerical evidence is provided through Monte Carlo simulation studies in section 4. In section 5, we provide an application on the best subset rank prediction and derive the non-asymptotic tail probability bound of the performance measure. Section 6 concludes.
2 Exact Computation of the MRC Estimator

In this section we describe the computational challenges of the maximum rank correlation (MRC) estimator and propose a new algorithm to compute its exact solution. We illustrate the advantage of the new algorithm by investigating a simple numerical example.

We first discuss the computational difficulties of the MRC estimator. Recall that the MRC estimator is defined as follows:

  β̂ = argmax_{β∈B} 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} 1{x_i'β > x_j'β}·1{y_i > y_j},   (3)

where B is the parameter space of β and 1{·} is an indicator function. Note that the objective function is neither differentiable nor concave in β. Furthermore, it is defined as a second-order U-process, which requires O(n²) computations for each evaluation at a candidate parameter value. As a result, we cannot apply any gradient-based optimization algorithm. Researchers usually adopt a simplex-based algorithm such as the Nelder-Mead method in MRC applications. However, it is difficult to get the exact solution even with multiple starting points since the objective function is not globally concave. A grid search algorithm would give more robust solutions, but the curse of dimensionality makes it infeasible in most cases when the dimension of x is larger than 2. (Abrevaya (1999) proposes a nice algorithm that reduces the computation order to O(n log n) by using the binary search tree; however, it still does not guarantee an exact solution.)

In this paper we propose an alternative computational algorithm that is based on the mixed integer programming (MIP) procedure. Let x_ij := x_i − x_j be the pairwise difference of x_i and x_j, and let ε be a small positive number denoting an effective zero. Consider the following mixed integer programming problem: for i, j = 1, ..., n with i ≠ j,

  (β̂, {d̂_ij}) = argmax_{β, {d_ij}} 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} d_ij·1{y_i > y_j}   (4)

  subject to  β ∈ B   (5)

              (d_ij − 1)·M_ij < x_ij'β ≤ d_ij·M_ij   (6)

              d_ij ∈ {0, 1},   (7)

where M_ij = max_{β∈B} |x_ij'β| + ε. Since the objective function in (4) is a linear function of the binary variables {d_ij}, the formulation becomes a mixed integer linear programming problem.

We now check the equivalence between the original problem in (3) and the MIP problem in (4)–(7). Suppose that the MIP problem chooses d̂_ij = 1 for some i, j. Then, the constraint (6) implies that the estimate β̂ should satisfy 0 < x_ij'β̂ ≤ M_ij, which is equivalent to x_i'β̂ > x_j'β̂ for a large enough M_ij. Similarly, d̂_ij = 0 is equivalent to x_i'β̂ ≤ x_j'β̂. In sum, the constraint forces d_ij = 1{x_ij'β > 0} = 1{x_i'β > x_j'β} for any given β ∈ B. Therefore, we can compute the exact solution β̂ of (3) by solving the equivalent MIP problem in (4)–(7). The two optimization problems give us the same numerical results, but the MIP formulation has a clear computational advantage over the original problem, as illustrated below.

Modern numerical solvers such as CPLEX and Gurobi make it possible to solve a large-scale MIP problem by adopting branch-and-bound type approaches. We provide a heuristic explanation of how a vanilla branch-and-bound algorithm reduces the computational burden, followed by a numerical example. Consider a binary tree representation of all possible values of {d_ij} (for example, see Figure 1).
Each bottom node of the tree represents a different possible solution for {d_ij}, at which β can easily be solved for by linear programming since every d_ij is fixed there. However, we have 2^{n(n−1)} bottom nodes in total, so the brute force approach is still infeasible with a standard sample size. The branch-and-bound approach helps eliminate a partition of the final nodes systematically. Suppose that we are at a node located in the middle of the tree, where only a part of {d_ij} is fixed. Let U* be the current best objective function value. Now we solve a subproblem after relaxing all {d_ij} that are not fixed by parent nodes into continuous variables on the interval [0, 1], and let U*_R denote its optimal value. Two cases allow pruning. First, suppose that U*_R is less than or equal to U*. Since the objective function value of the original subproblem is always weakly worse than that of the relaxed subproblem, we cannot achieve a better result than U* by solving any bottom nodes below the current node. Thus, we can drop all of them from the computation list. Second, suppose that U*_R > U* and the solution of the relaxed problem satisfies the binary restriction for {d_ij}. This solution then coincides with that of the original subproblem. We update U* = U*_R and can again drop all bottom nodes below the current node from the computation list. Moving from node to node, we solve a relaxed subproblem repeatedly and drop a partition of the bottom nodes from our computation list. An initial solution can be obtained from the linear programming problem at any bottom node of the tree.
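The way constraint (6) pins each d_ij down to the indicator 1{x_i'β > x_j'β} can be verified mechanically for any fixed β. A minimal sketch (function names are ours; M stands for the bound M_ij):

```python
def implied_d(beta, x_i, x_j):
    """The only d_ij feasible under constraint (6) for a given beta:
    d_ij = 1{x_ij' beta > 0} = 1{x_i' beta > x_j' beta}."""
    x_ij_beta = sum(b * (xi - xj) for b, xi, xj in zip(beta, x_i, x_j))
    return 1 if x_ij_beta > 0 else 0

def feasible(d, x_ij_beta, M):
    """Checks constraint (6): (d_ij - 1)*M_ij < x_ij' beta <= d_ij*M_ij."""
    return (d - 1) * M < x_ij_beta <= d * M
```

For instance, with x_ij'β = −0.5 and a valid M, only d = 0 satisfies (6); flipping d to 1 violates the right-hand inequality.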
We provide a simple numerical example to illustrate how the branch-and-bound algorithm works.
Example 1.
Consider a sample of four observations {(y_i, x_{1i}, x_{2i})}_{i=1}^4 with (y_1, y_2, y_3, y_4) = (1, 0, 0, 0). We normalize β_1 = 1 and set a bounded parameter space for β_2. There are only three paired observations that satisfy the condition 1{y_i > y_j}, and the MIP problem becomes

  max_{β_2, d_12, d_13, d_14} (1/12)·(d_12 + d_13 + d_14)

subject to the parameter space restriction on β_2, the big-M constraints of the form (6) for d_12, d_13, and d_14, and d_12, d_13, d_14 ∈ {0, 1}. Figure 1 shows the binary tree representation; the brute force approach requires solving 8 linear programming problems at the bottom nodes. We set U* = −∞ and solve the first relaxed subproblem at the child node of d_12 = 1 (the first right branch in Figure 1). The solution of this relaxed subproblem satisfies the binary restriction for (d_13, d_14), so we update U* to its objective function value and drop all the nodes below d_12 = 1. We next look at the relaxed subproblem at d_12 = 0 (the first left branch in Figure 1). Its objective function value satisfies U*_R < U*, so we can drop all the nodes below d_12 = 0: recall that no objective function value from the bottom nodes under d_12 = 0 can be larger than this relaxed value. Therefore, we achieve the solution by solving only two linear programming problems out of the total eight problems.

Finally, we have some remarks on the implementation of the MIP procedure in (4)–(7). First, each M_ij can be computed by solving a separate linear programming problem and saving those values. Alternatively, we can set a single big number M_max for all i, j that is large enough to cover the absolute bound of the index x_ij'β. The second-order U-process property requires O(n²) computations for getting all the M_ij, and in our simulation studies it is usually faster to impose a constant number M_max for all M_ij than to solve the linear programming problem for each i, j. Second, it is well known in the numerical optimization literature that any strict inequality should be switched into a weak inequality with some numerical precision bound. Thus, we change the constraint in (6) into

  (d_ij − 1)·M_ij + ε ≤ x_ij'β ≤ d_ij·M_ij.

Third, when there are many tied observations in the dependent variable, we can reduce the computation cost substantially by vectorizing paired observations and dropping the tied pairs, as we observed in Example 1.

Figure 1: Binary Tree Representation of {d_ij}

[Figure: a binary decision tree branching successively on d_12, d_13, and d_14, with bottom nodes labeled 000 through 111.]

Note: The triplet at the bottom of the decision tree denotes a possible choice for (d_12, d_13, d_14). For example, 010 means (d_12, d_13, d_14) = (0, 1, 0).
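On the first implementation remark: when the parameter space B is a box, the bound M_ij = max_{β∈B} |x_ij'β| + ε needs no linear program at all, because the maximum of a linear function over a box is attained coordinate by coordinate. A minimal sketch (the function name and box representation are ours):

```python
def big_M(x_ij, lower, upper, eps=1e-6):
    """M_ij = max_{beta in B} |x_ij' beta| + eps for a box parameter space
    B = prod_k [lower[k], upper[k]].  Each coordinate contributes the larger
    of its two endpoint products, so the box maximum is separable."""
    hi = sum(max(x * l, x * u) for x, l, u in zip(x_ij, lower, upper))
    lo = sum(min(x * l, x * u) for x, l, u in zip(x_ij, lower, upper))
    return max(abs(hi), abs(lo)) + eps
```

A single M_max covering all pairs is then just the maximum of big_M over i, j, as the remark suggests.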
3 Empirical Application

In this section we illustrate the advantage of the MIP procedure in an empirical application. We revisit the female labor force participation application in Mroz (1987) and estimate the binary choice model using the generalized regression model in (1). Specifically, the composite functions are defined as F(x'β, ε) := x'β + ε and D(A) := 1{A > 0}, so that it becomes a semiparametric binary choice model:

  y_i = 1{x_i'β + ε_i > 0},

where the distribution of ε_i is not specified. The parameter of interest is β and we estimate it using the maximum rank correlation estimator defined in (3). The outcome variable, y_i, is 1 if she participated in the labor force and 0 otherwise. We choose the following seven variables from the data for the covariate x_i: the number of kids less than 6 years old (kidslt6), the number of kids 6 years old or older (kidsge6), years of education (educ), family income minus her income (nwifeinc) in $1,000, years of experience (exper), experience squared (expersq), and age (age). We randomly draw 100 observations out of 753 for this computation exercise. Table 1 reports summary statistics of both samples, and we confirm that there is not much difference in terms of the mean and the standard deviation of each variable.

Table 1: Summary Statistics

                         Subsample             Original Sample
Variable Names         Mean    Std. Dev.     Mean     Std. Dev.
Labor Participation    0.55    0.50          0.57     0.50
kidslt6                0.22    0.52          0.24     0.52
kidsge6                1.54    1.31          1.35     1.32
educ                   11.74   2.12          12.29    2.28
nwifeinc               19.66   10.64         20.13    11.63
exper                  10.83   8.30          10.63    8.07
age                    42.92   8.15          42.54    8.07
Sample Size            100                   753

Note: The data set is from Mroz (1987). The original sample was collected from the Panel Studies of Income Dynamics in 1975. The variable names are explained in the main text.

We normalize the coefficient of kidslt6 to be −1 and set a bounded parameter space B for the remaining coefficients. The random starting points of the Nelder-Mead method were generated from the uniform distribution on B. We use 2,001 equi-spaced grid points for each parameter for Iter-Grid. The Nelder-Mead method has been adopted in applications of the MRC estimator where the grid search is infeasible (for example, see Cavanagh and Sherman (1998), Abrevaya (2003), and Khan and Tamer (2009)). A more sophisticated version of the iterative grid search method is introduced by Wang (2007) and is adopted in Fan, Han, Li, and Zhou (2020) for their simulation studies with multi-dimensional regressors.

Table 2: Female Labor Participation

Method         Obj.    Time (sec.)  kidslt6  kidsge6  educ    nwifeinc  exper   expersq  age
MIP            0.2134  601.3750     -1       -0.1294  0.1245  -0.0076   0.0464  0.0016   -0.0694
Nelder-Mead 1  0.2087  0.3290       -1       0.0385   0.2812  -0.0147   0.2061  -0.0028  -0.0533
Nelder-Mead 2  0.2029  609.9390     -1       -1.0499  9.6507  -1.3712   9.7150  -0.1652  -1.3847
Iter-Grid      0.1989  3.0860       -1       -0.3800  2.9300  -0.0400   1.5500  -0.0100  -0.2200

Note: MIP denotes the mixed integer programming method. Nelder-Mead 1 and 2 denote the Nelder-Mead simplex methods with an initial value from OLS and with multiple random initial values given the time budget of 600 seconds, respectively. Iter-Grid denotes the iterative grid search method with an initial value from OLS. The unit of computation time is seconds.

The estimation result in Table 2 reveals several advantages of MIP over the existing alternative algorithms. First, MIP achieves the best objective function value among the candidate estimation methods within a reasonable time budget. Second, some estimates of β̂ by the alternative algorithms are qualitatively different from the solution of MIP; for example, the coefficient estimates of kidsge6 and educ by the alternative algorithms differ markedly from those of MIP. Third, the Nelder-Mead algorithm with multiple starting points run for 600 seconds does not improve the result. In fact, the objective function value of Nelder-Mead 2 is lower than that of Nelder-Mead 1, which uses only one starting point, the OLS estimate.
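One way to visualize the non-concavity discussed around Figure 2 is to evaluate the objective along convex combinations β_α = α·β_A + (1 − α)·β_B of two candidate estimates. A minimal sketch of the path construction (function name ours):

```python
def convex_path(beta_a, beta_b, num=101):
    """Convex combinations beta_alpha = alpha*beta_a + (1 - alpha)*beta_b
    for num equi-spaced alpha in [0, 1].  Evaluating the rank correlation
    objective at each point of the path exposes local maxima lying between
    two candidate estimates."""
    path = []
    for t in range(num):
        alpha = t / (num - 1)
        path.append(tuple(alpha * a + (1 - alpha) * b
                          for a, b in zip(beta_a, beta_b)))
    return path
```

Plotting the objective values over this path is exactly the diagnostic reported in Figure 2.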
Finally, Figure 2 shows how difficult the optimization problem is. We plot the empirical objective function values over convex combinations of the two β estimates of MIP and Nelder-Mead 1. We can confirm that the objective function is not concave and that there exist many local maxima even between these two estimates. In sum, although the theoretical properties of the MRC estimator are robust, inference based on inferior local solutions could lead researchers to imprecise or incorrect conclusions in practice.

Figure 2: Objective Function Values

[Figure: empirical objective function values plotted against the mixing weight α.]

Note:
The empirical objective function values are plotted over convex combinations of the two β estimates of MIP and Nelder-Mead 1: β̂_α = α·β̂_MIP + (1 − α)·β̂_NM for α ∈ [0, 1], where β̂_MIP and β̂_NM are the MIP and Nelder-Mead 1 estimates, respectively.

4 Monte Carlo Simulations

In this section we investigate the performance of the proposed MIP algorithm for the MRC estimator via Monte Carlo simulation studies. We focus on the achieved objective function value and the computation time. All simulations are carried out on a computer equipped with an AMD Ryzen Threadripper 1950X 16-core processor and 64 gigabytes of RAM.

We consider the following two regression models: for i = 1, ..., n,

  Binary Regression:    y_i = 1{x_i'β + ε_i > 0}   (8)
  Censored Regression:  y_i = max{x_i'β + ε_i, 0}   (9)

where x_i is a k-dimensional vector generated from N(0, I_k), ε_i is a normally distributed error term, and β is the parameter of interest. The first element of the true parameter value is set to be 1, and we estimate β by the MRC estimator. For identification, we normalize the first coefficient of β to be 1. To compare the performance of the MIP algorithm, we also estimate the model using the Nelder-Mead algorithm and the iterative grid search. For all methods, the OLS estimate is used as a starting point, and the parameter space for the remaining k − 1 coefficients is a bounded box B. The time budget is set to be 600 seconds.

Figure 3: Loss/Tie/Win Ratio (Binary)

[Figure: loss/tie/win ratios of the alternative algorithms against MIP for the binary regression designs with n = 50, 100 (k = 2, 3) and n = 200, 400 (k = 10, 20).]

The Nelder-Mead algorithm with repeated random starting points (Nelder-Mead 2 in the previous section) does not perform better (especially for large k) than
Nelder-Mead 1 and is dropped in these simulation studies.

We first consider small-scale designs and check whether the MIP algorithm achieves the global objective function value. We set the sample size and the dimension of regressors to be n = (50, 100) and k = (2, 3) in the small-scale designs and n = (200, 400) and k = (10, 20) in the large-scale designs. In the simplest design (n = 50, k = 2), both Nelder-Mead and Iter-Grid achieve the same objective function value as MIP in 40% and 100% of the replications, respectively, but these ratios go down as the design becomes more complicated. When n = 400, MIP performs better than the alternatives 100% of the time when there are 10 regressors and 90% of the time when there are 20 regressors.

We observe similar patterns for Censored Regression in Figure 4. We have two additional remarks. First, MIP performs better when the parameter size is bigger (k = 20) for n = 200 or 400. As we see in the tables below, MIP provides a better solution measured by the MIP gap when the space of β is more complicated. Second, the overall performance of MIP in Censored Regression is better than that in Binary Regression. However, when n = 400, MIP finds a worse solution than its competitors 10% of the time. This is because the implied parameter space of d_ij has a much bigger dimension in Censored Regression than in Binary Regression, as there are fewer tied pairs of (y_i, y_j). Recall that d_ij is multiplied by 0 in the objective function when y_i and y_j are tied, so we do not need to estimate such a d_ij.

Figure 4: Loss/Tie/Win Ratio (Censored)

[Figure: loss/tie/win ratios of the alternative algorithms against MIP for the censored regression designs.]

Tables 3–4 provide some summary statistics of the computation time and the MIP gap.
Table 3: Computation Time and MIP Gap (Binary)

                     n = 50, k = 2                  n = 50, k = 3
                 MIP     Nelder-Mead  Iter      MIP     Nelder-Mead  Iter
Time     Max     0.08    0.12         0.15      0.08    0.01         0.30
         Median  0.03    0.00         0.09      0.04    0.00         0.20
MIP Gap  Max     0.000   NA           NA        0.000   NA           NA
         Median  0.000   NA           NA        0.000   NA           NA

                     n = 100, k = 2                 n = 100, k = 3
Time     Max     0.53    0.01         0.31      20.69   0.01         0.85
         Median  0.29    0.00         0.17      0.69    0.00         0.50
MIP Gap  Max     0.000   NA           NA        0.000   NA           NA
         Median  0.000   NA           NA        0.000   NA           NA

                     n = 200, k = 10                n = 200, k = 20
Time     Max     32.03   0.05         13.56     0.52    0.28         75.04
         Median  0.3     0.05         13.21     0.39    0.25         37.1
MIP Gap  Max     0.010   NA           NA        0.000   NA           NA
         Median  0.000   NA           NA        0.000   NA           NA

                     n = 400, k = 10                n = 400, k = 20
Time     Max     607.22  0.33         77.15     2.05    0.95         433.74
         Median  600.65  0.18         51.44     1.80    0.82         279.33
MIP Gap  Max     0.290   NA           NA        0.000   NA           NA
         Median  0.093   NA           NA        0.000   NA           NA

Note:
The units for Time and MIP Gap are seconds and percent, respectively.

We first discuss the result of Binary Regression in Table 3. In the small-scale designs, MIP requires about the same computation time as the alternative algorithms, and it finds the global solution in less than a second except when n = 100 and k = 3. In the large-scale designs, MIP is still able to find the global solution within the allocated time budget of 600 seconds, except when n = 400 and k = 10. In that design, MIP hits the time limit of 600 seconds more often and the MIP gap does not reach 0%; i.e., we are not sure whether the solution is global or not. However, the gap size is quite small, less than 1%. It is noteworthy that MIP performs much better in terms of the MIP gap when k is bigger in the large-scale designs. The computation time is even dramatically reduced when n = 400.

We turn our attention to the result of Censored Regression in Table 4. As we discussed above, Censored Regression requires more computation time than Binary Regression, and MIP mostly reaches the time limit of 600 seconds when n = 100 and k = 3. In the large-scale designs, we observe MIP gaps larger than 1%, which could be the reason that the solutions of MIP are sometimes worse than those of the alternative algorithms. Other patterns are quite similar to those in Binary Regression, including that the performance of MIP becomes better when k is higher in the large-scale designs.

In sum, the performance of the proposed MIP algorithm for the MRC estimator is satisfactory. It always finds the global solution in the small-scale designs, where the existing methods quite often fail to do so. Furthermore, it performs better than the alternative algorithms even in the large-scale designs while spending a feasible amount of computation time. The MIP gap also provides useful guidance on the quality of the solution in hand when a researcher must stop searching for the global solution because of the time limit.
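The MIP gap reported in Tables 3–4 is the relative distance between the incumbent objective value and the best proven bound. A small sketch of the common definition (individual solvers differ slightly in edge cases, so this is an approximation of any particular solver's formula):

```python
def mip_gap(incumbent, best_bound):
    """Relative MIP gap: how far the best proven bound lies above the
    incumbent (current best) objective value.  A gap of 0 certifies that
    the incumbent is globally optimal."""
    return abs(best_bound - incumbent) / abs(incumbent)
```

For example, an incumbent of 0.20 against a proven bound of 0.21 gives a 5% gap, so the incumbent is at most 5% below the unknown global maximum.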
Table 4: Computation Time and MIP Gap (Censored)

                     n = 50, k = 2                  n = 50, k = 3
                 MIP     Nelder-Mead  Iter      MIP     Nelder-Mead  Iter
Time     Max     0.23    0.00         0.1       12.58   0.01         0.65
         Median  0.15    0.00         0.1       3.57    0.00         0.31
MIP Gap  Max     0.000   NA           NA        0.000   NA           NA
         Median  0.000   NA           NA        0.000   NA           NA

                     n = 100, k = 2                 n = 100, k = 3
Time     Max     21.62   0.01         0.2       600.38  0.01         1.06
         Median  6.85    0.00         0.19      600.19  0.01         0.65
MIP Gap  Max     0.000   NA           NA        0.540   NA           NA
         Median  0.000   NA           NA        0.126   NA           NA

                     n = 200, k = 10                n = 200, k = 20
Time     Max     600.18  0.08         32.9      602.70  0.32         121.83
         Median  600.16  0.06         20.5      600.23  0.21         87.39
MIP Gap  Max     2.219   NA           NA        1.267   NA           NA
         Median  1.302   NA           NA        0.956   NA           NA

                     n = 400, k = 10                n = 400, k = 20
Time     Max     626.26  0.46         119.94    607.92  1.75         699.43
         Median  606.24  0.38         93.29     602.77  1.49         506.32
MIP Gap  Max     1.998   NA           NA        1.398   NA           NA
         Median  1.738   NA           NA        1.172   NA           NA

Note: The units for Time and MIP Gap are seconds and percent, respectively.

5 Best Subset Rank Prediction

In this section we consider how to find the best subset for rank prediction. We propose the ℓ₀-constraint maximum rank correlation predictor and show that the MIP method in (3)–(7) can be immediately extended to the ℓ₀-constraint MRC predictor. Building on Chen and Lee (2018a), we also provide the non-asymptotic bound of the rank prediction error.

In this prediction problem we have a training set {(y_i, x_i'): i = 1, ..., n}. The goal is to learn the rank prediction rule for y as well as to select the best s predictors among the x's. Let x = (x_1, x_{−1}')' be the (p + 1) covariates, where we know that x_1 should be included in the predictor set. Let ‖·‖₀ be the ℓ₀-norm, i.e., ‖β‖₀ is the number of non-zero elements of the vector β. For any k ≠ l, we propose the following prediction rule:

  R_β(x_k, x_l) = 1{x_{1,k} + x_{−1,k}'β > x_{1,l} + x_{−1,l}'β},

where R_β(x_k, x_l) = 1 implies that y_k is predicted to be larger than y_l. When we are given the whole prediction set {x_l: l = 1, ..., n_p}, the rank of y_k is predicted by Σ_{l=1}^{n_p} R_β(x_k, x_l). Let P be the joint distribution of (Y, X) and Q := P × P be the corresponding product measure. Then, we choose the prediction rule as a sample analogue of

  S(β) := Q[1{y_k > y_l} = R_β(x_k, x_l)].

Recall that we also want to select the best s predictors out of the total p covariates in x_{−1}. Therefore, the prediction rule composed of the best s predictors can be achieved by solving the following ℓ₀-constraint optimization problem:

  max_{β∈B_s} S_n(β),   (10)

where B_s := {β ∈ R^p: ‖β‖₀ ≤ s} and

  S_n(β) = 2/(n(n−1)) Σ_{i=1}^n Σ_{j>i} 1{1{y_i > y_j} = R_β(x_i, x_j)}.   (11)

We evaluate the performance of the predictor by the following measure:

  U_n := S*_s − S(β̂),

where S*_s := sup_{β∈B_s} S(β) and β̂ is the solution of the constraint maximization problem defined in (10)–(11) above. Note that U_n ≥ 0 and that a good prediction rule results in a small value of U_n with a high probability. In the next theorem, we provide a non-asymptotic bound for U_n. Let a ∨ b := max{a, b} and r_n := s ln(p ∨ n) ∨ 1.

Theorem 1.
Suppose that s ≥ 1. For any σ > 0, there exists a universal constant D_σ such that

  Pr( U_n > √(D_σ r_n / n) ) ≤ exp(−σ r_n)   (12)

provided that

  (12s + 12) ln(D_σ r_n) ≤ r_n + (24s + 24) ln 2,   (13)

  (s + 172) ln(D_σ r_n) + (16s + 16)(9 ln 2 + 1) ≤ r_n.   (14)

Theorem 1 shows that the tail probability of U_n decreases exponentially in r_n. The probability bound in (12) is non-asymptotic and holds for every n if the two inequality conditions (13)–(14) hold. Compared to the non-asymptotic bound for the best subset selection in Chen and Lee (2018a), Theorem 1 requires the additional condition (14) to bound the second-order degenerate U-process. However, focusing on the leading terms, we confirm that both conditions hold if

  12( ln s + ln D_σ + ln ln(p ∨ n) ) ≤ 12 ln(p ∨ n).   (15)

Note that Theorem 1 implies that E(U_n) = O(n^{−1/2} √(s ln(p ∨ n))) = o(1) if s ln(p ∨ n) = o(n). Therefore, the best subset rank prediction performs well even when p grows exponentially, provided that s increases slowly, e.g., at a polynomial rate.

We finish this section by formulating the ℓ₀-constraint optimization problem as an MIP problem. Let x_{−1,ij} := x_{−1,i} − x_{−1,j} as before. For i, j = 1, ..., n with i ≠ j, and h = 1, ..., p, we consider the following constraint MIP problem:

  (β̂, {d̂_ij}, {ê_h}) = argmax_{β, {d_ij}, {e_h}} 2/(n(n−1)) Σ_{i=1}^n Σ_{j>i} [ (1 − 1{y_i > y_j}) + (2·1{y_i > y_j} − 1)·d_ij ]   (16)

  subject to  (d_ij − 1)·M_ij < (x_{1,i} − x_{1,j}) + x_{−1,ij}'β ≤ d_ij·M_ij   (17)

              e_h·β̲_h ≤ β_h ≤ e_h·β̄_h   (18)

              Σ_{h=1}^p e_h ≤ s   (19)

              d_ij ∈ {0, 1}   (20)

              e_h ∈ {0, 1},  h ∈ {1, ..., p},   (21)

where β̲_h and β̄_h are the lower bound and the upper bound of β_h, respectively. The constraint MIP problem in (16)–(21) is equivalent to the original constraint optimization problem. The objective function in (16) is numerically the same as S_n since d_ij is identical to R_β for each β. Furthermore, the constraint (18) makes β_h = 0 whenever e_h = 0. Thus, the ℓ₀-norm constraint ‖β‖₀ ≤ s is achieved by the constraints (18), (19), and (21). Note that the objective function can also be written in the familiar rank correlation form:

  2/(n(n−1)) Σ_{i=1}^n Σ_{j>i} [ 1(y_i > y_j)·d_ij + 1(y_i ≤ y_j)·(1 − d_ij) ],

which is equivalent to (16).
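As a quick consistency check on this reformulation, the sketch below evaluates the prediction accuracy both directly via (11) and through the linear-in-d_ij objective of (16), with each d_ij set to the indicator that constraint (17) enforces; the two coincide pairwise. For simplicity, the fixed coefficient on x_1 is folded into one full coefficient vector (an assumption of this sketch; function names are ours):

```python
def S_n(beta, X, y):
    """Sample rank prediction accuracy, eq. (11): the fraction of pairs
    j > i whose predicted index ordering agrees with the event {y_i > y_j}."""
    n = len(y)
    idx = [sum(b * v for b, v in zip(beta, row)) for row in X]
    agree = sum(
        int((y[i] > y[j]) == (idx[i] > idx[j]))
        for i in range(n) for j in range(i + 1, n)
    )
    return 2.0 * agree / (n * (n - 1))

def mip_objective(beta, X, y):
    """Linear-in-d form of eq. (16), with each d_ij replaced by the indicator
    1{x_i'beta > x_j'beta} that constraint (17) pins it to.  The pairwise term
    (1 - a) + (2a - 1)d equals 1{a = d}, so this matches S_n exactly."""
    n = len(y)
    idx = [sum(b * v for b, v in zip(beta, row)) for row in X]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            a = int(y[i] > y[j])      # 1{y_i > y_j}
            d = int(idx[i] > idx[j])  # d_ij implied by constraint (17)
            total += (1 - a) + (2 * a - 1) * d
    return 2.0 * total / (n * (n - 1))
```

Checking all four combinations of a, d ∈ {0, 1} shows the pairwise terms agree, which is the algebra behind the equivalence claimed above.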
6 Conclusion

In this paper we propose a feasible exact computation method for the maximum rank correlation estimator of Han (1987) using mixed integer programming (MIP). One advantage of the proposed MIP method is that it can be easily extended to many constrained rank maximization problems, as illustrated in the best subset rank prediction problem. We show that the proposed MIP method outperforms the alternative methods in the empirical example of female labor-force participation. In the rank prediction application, we prove that the non-asymptotic bound of the tail probability decays exponentially. This result sheds light on research on high-dimensional rank estimation models, which we leave for future work.

References
Abrevaya, J. (1999). Computation of the maximum rank correlation estimator. Economics Letters 62(3), 279–285.

Abrevaya, J. (2000). Rank estimation of a generalized fixed-effects regression model. Journal of Econometrics 95(1), 1–23.

Abrevaya, J. (2003). Pairwise-difference rank estimation of the transformation model. Journal of Business & Economic Statistics 21(3), 437–447.

Abrevaya, J. and Y. Shin (2011). Rank estimation of partially linear index models. The Econometrics Journal 14(3), 409–437.

Bertsimas, D., A. King, and R. Mazumder (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 813–852.

Cavanagh, C. and R. P. Sherman (1998). Rank estimators for monotonic index models. Journal of Econometrics 84(2), 351–382.

Chay, K. Y. and B. E. Honore (1998). Estimation of semiparametric censored regression models: an application to changes in black-white earnings inequality during the 1960s. Journal of Human Resources, 4–38.

Chen, L.-Y. and S. Lee (2018a). Best subset binary prediction. Journal of Econometrics 206(1), 39–56.

Chen, L.-Y. and S. Lee (2018b). Exact computation of GMM estimators for instrumental variable quantile regression models. Journal of Applied Econometrics 33(4), 553–567.

Chen, S. (2002). Rank estimation of transformation models. Econometrica 70(4), 1683–1697.

Fan, Y., F. Han, W. Li, and X.-H. Zhou (2020). On rank estimators in increasing dimensions. Journal of Econometrics 214(2), 379–412.

Florios, K. and S. Skouras (2008). Exact computation of max weighted score estimators. Journal of Econometrics 146(1), 86–91.

Han, A. K. (1987). Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. Journal of Econometrics 35(2-3), 303–316.

Khan, S. (2001). Two-stage rank estimation of quantile index models. Journal of Econometrics 100(2), 319–355.

Khan, S. and E. Tamer (2007). Partial rank estimation of duration models with general forms of censoring. Journal of Econometrics 136(1), 251–280.

Khan, S. and E. Tamer (2009). Inference on endogenously censored regression models using conditional moment inequalities. Journal of Econometrics 152(2), 104–119.

Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2), 591–616.

Kosorok, M. R. (2007). Introduction to Empirical Processes and Semiparametric Inference. Springer Science & Business Media.

Lee, S., Y. Liao, M. H. Seo, and Y. Shin (2018). Factor-driven two-regime regression. arXiv preprint arXiv:1810.11109.

Major, P. (2005). Tail behaviour of multiple random integrals and U-statistics. Probability Surveys 2, 448–505.

Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3(3), 205–228.

Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica: Journal of the Econometric Society, 765–799.

Nolan, D. and D. Pollard (1987). U-processes: rates of convergence. The Annals of Statistics, 780–799.

Sherman, R. P. (1993). The limiting distribution of the maximum rank correlation estimator. Econometrica: Journal of the Econometric Society, 123–137.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 28–76.

Wang, H. (2007). A note on iterative marginal optimization: a simple algorithm for maximum rank correlation estimation. Computational Statistics & Data Analysis 51(6), 2803–2812.

Appendix: Proof of Theorem 1
In this appendix we provide the proof of Theorem 1. We first prove some useful lemmas. We need the following notation. Let $m$ be a subset of the index set $\{1, 2, \dots, p\}$, where $|m| = s$. Let $\mathcal{M}$ be the collection of all such $m$. Thus, for any given $p$ and $s$, $|\mathcal{M}| = \binom{p}{s}$. Let $\mathcal{B}_m := \{\beta \in \mathcal{B} : \beta_j = 0 \text{ if } j \notin m\}$. Let $w := (y, x')'$ and let $\mathcal{W}$ be the support of $w$. Define $f_\beta : \mathcal{W} \times \mathcal{W} \mapsto \{0, 1\}$ as
$$f_\beta(w_i, w_j) := 1\big\{1\{y_i > y_j\} = R_\beta(x_i, x_j)\big\} = 1 - 1\{y_i > y_j\} + \big(2 \cdot 1\{y_i > y_j\} - 1\big) R_\beta(x_i, x_j).$$
Let $\mathcal{F}_m := \{f_\beta(\cdot, \cdot) : \beta \in \mathcal{B}_m\}$. From the Hoeffding decomposition, we have
$$S_n(\beta) = S(\beta) + \frac{1}{n} \sum_{i=1}^{n} g_\beta(w_i) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} h_\beta(w_i, w_j), \tag{22}$$
where
$$g_\beta(w_i) := \int_{\mathcal{W}} f_\beta(w_i, w)\, dP(w) + \int_{\mathcal{W}} f_\beta(w, w_i)\, dP(w) - 2 S(\beta), \tag{23}$$
$$h_\beta(w_i, w_j) := f_\beta(w_i, w_j) - \int_{\mathcal{W}} f_\beta(w_i, w)\, dP(w) - \int_{\mathcal{W}} f_\beta(w, w_j)\, dP(w) + S(\beta). \tag{24}$$
Note that the last term of (22) is a $P$-degenerate U-process.

Lemma 1.
For the measurable function $h_\beta(\cdot, \cdot)$ defined in (24), the following inequality holds for some universal constants $C_k > 0$, $k = 1, \dots, 4$:
$$\Pr\left( \sup_{\beta \in \mathcal{B}_m} \left| \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} 4^{-1} h_\beta(w_i, w_j) \right| > \frac{t}{n} \right) \leq C_1 (512 e C_2)^{16s+16} e^{-C_3 t}, \tag{25}$$
if $C_4 (16s + 16 + \gamma)/\log 2 \leq t \leq n$, where $\gamma := \max\big(\log(512 e C_2)/\log n,\ 0\big)$.

Proof of Lemma 1. From Lemmas 9.6, 9.9(iii), 9.9(vi), and 9.9(v) in Kosorok (2007), the VC-index of $\mathcal{F}_m$, $V(\mathcal{F}_m)$, satisfies
$$V(\mathcal{F}_m) \leq 2s + 3. \tag{26}$$
We now define a pseudometric for two measurable functions $f$ and $g$ in $\mathcal{F}_m$:
$$d_Q(f, g) := \left( \int (f - g)^2\, dQ \right)^{1/2}.$$
For any $\varepsilon > 0$, the covering number $N(\varepsilon, \mathcal{F}_m, d_Q)$ is defined as the minimal number of open $\varepsilon$-balls required to cover $\mathcal{F}_m$. Noting that $\mathcal{F}_m$ has a constant envelope $F = 1$, we apply Theorem 9.3 in Kosorok (2007) to get
$$\sup_Q N(\varepsilon, \mathcal{F}_m, d_Q) \leq C (2s+3) (4e)^{2s+3} \left( \frac{2}{\varepsilon} \right)^{4s+4} \leq \left( \frac{\big( C (2s+3) (16e)^{2s+3} \big)^{1/(4s+4)}}{\varepsilon} \right)^{4s+4} \leq \left( \frac{e C_0}{\varepsilon} \right)^{4s+4}, \tag{27}$$
where $C_0 := \max(1, C)$ is a universal constant. The last inequality comes from
$$\big( C (2s+3) (16e)^{2s+3} \big)^{1/(4s+4)} \leq e C (2s+3)^{1/(4s+4)} \leq e C\, 2^{(2s+3)/(4s+4)} \leq e C_0.$$
We now define the following classes of functions:
$$\mathcal{F}_{m,1} := \left\{ \int_{\mathcal{W}} f_\beta(\cdot, w)\, dP(w) : f_\beta \in \mathcal{F}_m \right\}, \qquad \mathcal{F}_{m,2} := \left\{ \int_{\mathcal{W}} f_\beta(w, \cdot)\, dP(w) : f_\beta \in \mathcal{F}_m \right\}.$$
Define a pseudometric for functions $f$ and $g$ in $\mathcal{F}_{m,1}$ or $\mathcal{F}_{m,2}$ as
$$d_P(f, g) := \left( \int (f - g)^2\, dP \right)^{1/2}.$$
From Lemma 20 in Nolan and Pollard (1987), we have
$$\sup_P N(\varepsilon, \mathcal{F}_{m,j}, d_P) \leq \sup_Q N(\varepsilon, \mathcal{F}_m, d_Q) \leq \left( \frac{e C_0}{\varepsilon} \right)^{4s+4} \tag{28}$$
for $j = 1, 2$. Using the same arguments, we have
$$N(\varepsilon, \{S(\beta) : \beta \in \mathcal{B}_m\}, d) \leq \sup_P N(\varepsilon, \mathcal{F}_{m,1}, d_P) \leq \left( \frac{e C_0}{\varepsilon} \right)^{4s+4}, \tag{29}$$
where $d := \left( \int (f - g)^2\, dw \right)^{1/2}$ for $f, g \in \{S(\beta) : \beta \in \mathcal{B}_m\}$.

We now consider the class of $P$-degenerate functions $h_\beta(\cdot, \cdot)$ defined in (24):
$$\mathcal{H}_m := \left\{ 4^{-1} h_\beta(\cdot, \cdot) : \beta \in \mathcal{B}_m \right\}.$$
Using the results in (27), (28), (29), and Lemma 16 in Nolan and Pollard (1987), we get
$$\sup_Q N(\varepsilon, \mathcal{H}_m, d_Q) \leq \left( \frac{e C_0}{\varepsilon} \right)^{16s+16}.$$
The desired result is established by applying Theorem 6.3 in Major (2005).

Lemma 2.
For the measurable function $g_\beta(\cdot)$ defined in (23), the following inequality holds for some universal constant $D_1 > 0$:
$$\Pr\left( \sup_{\beta \in \mathcal{B}_m} \left| \frac{1}{n} \sum_{i=1}^{n} 2^{-1} g_\beta(w_i) \right| > \frac{t}{\sqrt{n}} \right) \leq \left( \frac{D_1 t}{\sqrt{12s+12}} \right)^{24s+24} e^{-2t^2}. \tag{30}$$

Proof of Lemma 2. Let $\mathcal{G}_m := \{2^{-1} g_\beta(\cdot) : \beta \in \mathcal{B}_m\}$. From (28), (29), and Lemma 16 in Nolan and Pollard (1987), we have
$$N(\varepsilon, \mathcal{G}_m, d_P) \leq \left( \frac{e C_0}{\varepsilon} \right)^{12s+12} \tag{31}$$
for a universal constant $C_0 >$
0. Then, the desired result is established by applying Theorem 1.3 in Talagrand (1994).

We are now ready to prove the main theorem. Using $S(\beta) \geq 0$, $S_n(\beta) \geq 0$, the triangle inequality, and $S_n(\widehat{\beta}) = \sup_{\beta \in \mathcal{B}_s} S_n(\beta)$, we have
$$\begin{aligned} U_n &= \sup_{\beta \in \mathcal{B}_s} S(\beta) - S(\widehat{\beta}) \\ &= \sup_{\beta \in \mathcal{B}_s} |S(\beta) - S_n(\beta) + S_n(\beta)| - S(\widehat{\beta}) \\ &\leq \sup_{\beta \in \mathcal{B}_s} |S_n(\beta) - S(\beta)| + \sup_{\beta \in \mathcal{B}_s} S_n(\beta) - S(\widehat{\beta}) \\ &= \sup_{\beta \in \mathcal{B}_s} |S_n(\beta) - S(\beta)| + S_n(\widehat{\beta}) - S(\widehat{\beta}) \\ &\leq 2 \sup_{\beta \in \mathcal{B}_s} |S_n(\beta) - S(\beta)|. \end{aligned} \tag{32}$$
Using (32) and (22), we have
$$\begin{aligned} P\left( U_n > \sqrt{\frac{M_\sigma r_n}{n}} \right) &\leq P\left( \sup_{\beta \in \mathcal{B}_s} |S_n(\beta) - S(\beta)| > \sqrt{\frac{M_\sigma r_n}{n}} \right) \\ &\leq P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{1}{n} \sum_{i=1}^{n} g_\beta(w_i) + \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j>i} h_\beta(w_i, w_j) \right| > \sqrt{\frac{M_\sigma r_n}{n}} \right) \\ &\leq P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{1}{n} \sum_{i=1}^{n} 2^{-1} g_\beta(w_i) \right| > \sqrt{\frac{M_\sigma r_n}{n}} \right) + P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j>i} 4^{-1} h_\beta(w_i, w_j) \right| > \sqrt{\frac{M_\sigma r_n}{n}} \right). \end{aligned} \tag{33}$$
We first calculate the upper bound of the first term in (33). Let $t = \sqrt{M_\sigma r_n}$. Using the definition of $\mathcal{B}_m$, $|\mathcal{M}| = \binom{p}{s} \leq p^s$, and Lemma 2, we have
$$P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{1}{n} \sum_{i=1}^{n} 2^{-1} g_\beta(w_i) \right| > \frac{t}{\sqrt{n}} \right) \leq \sum_{m \in \mathcal{M}} P\left( \sup_{\beta \in \mathcal{B}_m} \left| \frac{1}{n} \sum_{i=1}^{n} 2^{-1} g_\beta(w_i) \right| > \frac{t}{\sqrt{n}} \right) \leq p^s \left( \frac{D_1 t}{\sqrt{12s+12}} \right)^{24s+24} e^{-2t^2} \tag{34}$$
for a universal constant $D_1$. Let $D_\sigma := D_{\sigma,1} \vee D_{\sigma,2}$, where $D_{\sigma,1} := 2^{-1}(1+\sigma) \vee D_1$ and $D_{\sigma,2}$ is a universal constant that will be defined later. Since $t \geq D_1$ and $s \geq 1$, the bound in (34) is further bounded as
$$p^s \left( \frac{D_1 t}{\sqrt{12s+12}} \right)^{24s+24} e^{-2t^2} \leq e^{\lambda_1(s,p,t)},$$
where $\lambda_1(s,p,t) := -2t^2 + (24s+24) \ln t + s \ln p - (24s+24) \ln 2$. By the definition of $t$, $r_n \geq s \ln p$, condition (13), and the definition of $D_\sigma$, we have
$$\lambda_1(s,p,t) \leq (-D_\sigma + 1) r_n + (12s+12) \ln(D_\sigma r_n) - (24s+24) \ln 2 \leq (-D_\sigma + 2) r_n \leq -\sigma r_n.$$
Therefore, it is established that
$$P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{1}{n} \sum_{i=1}^{n} 2^{-1} g_\beta(w_i) \right| > \sqrt{\frac{M_\sigma r_n}{n}} \right) \leq e^{-\sigma r_n}. \tag{35}$$
We next turn our attention to the second term in (33). Using similar arguments as in (34) with Lemma 1, we have
$$P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j>i} 4^{-1} h_\beta(w_i, w_j) \right| > \frac{t}{\sqrt{n}} \right) \leq \sum_{m \in \mathcal{M}} P\left( \sup_{\beta \in \mathcal{B}_m} \left| \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j>i} 4^{-1} h_\beta(w_i, w_j) \right| > \frac{t}{\sqrt{n}} \right) \leq p^s C_1 (512 e C_2)^{16s+16} e^{-C_3 \sqrt{n}\, t} \tag{36}$$
for some universal constants $C_k > 0$, $k = 1, 2, 3, 4$. Let $D_{\sigma,2} := C_3^{-1}(2+\sigma) \vee \big( C_4 \vee (e C_2)^2 \big)$. From $t \leq \sqrt{n}$, $C_4 < t$, and $e C_2 < t$, the bound in (36) is further bounded as
$$p^s C_1 (512 e C_2)^{16s+16} e^{-C_3 \sqrt{n}\, t} \leq e^{\lambda_2(s,p,t)},$$
where $\lambda_2(s,p,t) := -C_3 \sqrt{n}\, t + (16s+17) \ln t + s \ln p + 9(16s+16) \ln 2$. Recall that $D_\sigma = D_{\sigma,1} \vee D_{\sigma,2}$. Using similar arguments as above, we have
$$\lambda_2(s,p,t) \leq (-C_3 D_\sigma + 1) r_n + \left( 8s + \frac{17}{2} \right) \ln(D_\sigma r_n) + 9(16s+16) \ln 2 \leq (-C_3 D_\sigma + 2) r_n \leq -\sigma r_n.$$
Therefore, it is established that
$$P\left( \sup_{\beta \in \mathcal{B}_s} \left| \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j>i} 4^{-1} h_\beta(w_i, w_j) \right| > \frac{t}{\sqrt{n}} \right) \leq e^{-\sigma r_n}. \tag{37}$$
From (33), (35), and (37), we establish the desired result:
$$P\left( U_n > \sqrt{\frac{M_\sigma r_n}{n}} \right) \leq 2 e^{-\sigma r_n}.$$