Discrete Mathematics and Theoretical Computer Science
DMTCS vol. :4, 2020

A Double-Exponential Lower Bound for the Distinct Vectors Problem ∗

Marcin Pilipczuk, Manuel Sorge
University of Warsaw, Poland

received 5th Feb. 2020, revised 6th July 2020, accepted 2nd Sep. 2020.

In the (binary) DISTINCT VECTORS problem we are given a binary matrix $A$ with pairwise different rows and want to select at most $k$ columns such that, restricting the matrix to these columns, all rows are still pairwise different. A result by Froese et al. [JCSS] implies a $2^{2^{O(k)}} \cdot \mathrm{poly}(|A|)$-time brute-force algorithm for DISTINCT VECTORS. We show that this running time bound is essentially optimal by showing that there is a constant $c$ such that the existence of an algorithm solving DISTINCT VECTORS with running time $2^{O(2^{ck})} \cdot \mathrm{poly}(|A|)$ would contradict the Exponential Time Hypothesis.

Keywords: feature selection, data mining, computational complexity, parameterized algorithms
1 Introduction

For each $n \in \mathbb{N}$, let $[n] = \{1, \ldots, n\}$. Let $\Sigma$ be a set and $n, m \in \mathbb{N}$. By $\Sigma^{m \times n}$ we denote the set of $m$-row $n$-column matrices with entries in $\Sigma$. Let $A \in \Sigma^{m \times n}$. By $A[i, j]$ we denote the entry of $A$ in the $i$-th row and $j$-th column. By $A[i, *]$ and $A[*, j]$ we denote the $i$-th row and the $j$-th column of $A$, respectively. To simplify notation, we often identify rows or columns with their indices. Let $I \subseteq [m]$ and $J \subseteq [n]$. By (i) $A[I, J]$, (ii) $A[I, *]$, and (iii) $A[*, J]$ we denote the submatrix of $A$ containing (i) only the entries that are simultaneously in rows in $I$ and columns in $J$, (ii) only the entries in rows in $I$, and (iii) only the entries in columns in $J$, respectively.

We study the computational complexity of the following decision problem.

DISTINCT VECTORS
Instance: A binary matrix $A \in \{0, 1\}^{m \times n}$ and $k \in \mathbb{N}$.
Question: Is there a subset $K \subseteq [n]$ of at most $k$ columns such that the rows in $A[*, K]$ are pairwise distinct?

We also say that $K$ as above is a solution.

DISTINCT VECTORS is a fundamental problem which has arisen in several different contexts. Notably, it has applications in database theory, where it models key selection in relational databases (e.g. [BFS17]), in machine learning, where it models combinatorial feature selection [Cha+00], and in rough set theory, where it models finding some minimal structure [Paw91]. See Froese [Fro18] for an overview of the literature. We note that DISTINCT VECTORS is sometimes formulated with an alphabet of size larger than two, that is, the entries of $A$ may be drawn from more than two distinct symbols. Since we focus here on a lower bound, however, the binary formulation is sufficient for us.

Froese et al. [Fro+16, Theorem 12] gave a problem kernel with size $2^{2^{O(k)}}$ for DISTINCT VECTORS parameterized by $k$. (A problem kernel with respect to a parameter $k$ is a polynomial-time self-reduction with an upper bound, a function of $k$, on the resulting instance size.) Simple brute force on the resulting instances yields a $2^{2^{O(k)}} \cdot \mathrm{poly}(|A|)$-time algorithm for DISTINCT VECTORS. It is natural to ask whether this running time bound can be improved. Here, we answer this question negatively by proving the following.
Theorem 1. For each $\epsilon > 0$, if there is a $2^{O(2^{ck})} \cdot \mathrm{poly}(n + m)$-time algorithm solving DISTINCT VECTORS, then the Exponential Time Hypothesis is false, where $c = c(\epsilon) = 1/2 - \epsilon$.

∗ This research is part of a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement 714704).
ISSN 1365–8050
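To make the problem definition and the brute-force baseline concrete, the following minimal Python sketch checks whether a column set is a solution and searches all column sets of size at most $k$. The function names `is_solution` and `distinct_vectors`, the list-of-lists matrix encoding, the 0-based column indices, and the toy matrix `EXAMPLE` are illustrative assumptions and are not taken from Froese et al.; in particular, the sketch enumerates column subsets directly and does not implement their problem kernel.

```python
from itertools import combinations


def is_solution(A, K):
    """Return True if the rows of A restricted to the columns in K are pairwise distinct."""
    projected = {tuple(row[j] for j in K) for row in A}
    return len(projected) == len(A)


def distinct_vectors(A, k):
    """Return a smallest solution with at most k columns (0-based indices), or None."""
    n = len(A[0])
    for size in range(k + 1):
        for K in combinations(range(n), size):
            if is_solution(A, K):
                return K
    return None


# Hypothetical toy instance: columns 0 and 1 together distinguish all rows,
# column 2 is constant and therefore useless.
EXAMPLE = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
```

On `EXAMPLE`, a call with $k = 2$ finds the column set `(0, 1)`, while $k = 1$ admits no solution because a single binary column can separate at most two rows.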
Informally, the Exponential Time Hypothesis (ETH) states that 3SAT on $n$-variable formulas cannot be solved in $2^{o(n)}$ time [IP01]. Formally, we rely on the following formulation that comes from an application of the Sparsification Lemma [IPZ01].

Conjecture 2 (Exponential Time Hypothesis + Sparsification Lemma). There exist constants $\delta, C > 0$ such that there is no algorithm that, given as input a 3CNF-SAT formula $\phi$ with $n$ variables and at most $C \cdot n$ clauses, runs in time $O(2^{\delta n})$ and correctly decides the satisfiability of $\phi$.

The proof of Theorem 1 is given in Section 2. Herein, to simplify notation, we often write vectors $(v_1, \ldots, v_n) \in \Sigma^n$ as $v_1 v_2 \ldots v_n$. We also use $\cdot \circ \cdot$ to denote concatenation. That is, for each $n, m \in \mathbb{N}$ and each $(v_i)_{i \in [n]} \in \Sigma^n$ and $(w_i)_{i \in [m]} \in \Sigma^m$ we define $v_1 v_2 \ldots v_n \circ w_1 w_2 \ldots w_m = v_1 v_2 \ldots v_n w_1 w_2 \ldots w_m \in \Sigma^{n+m}$. Furthermore, for each $i \in \mathbb{N}$ and $\sigma \in \Sigma$ we define $\sigma^{(i)} = \sigma\sigma \ldots \sigma \in \Sigma^i$. By $\log$ we refer to the base-two logarithm. By poly we refer to an arbitrary fixed polynomial.

2 Proof of Theorem 1

Let $\epsilon > 0$. Let $\delta$ and $C$ be the constants of Conjecture 2. Let $\phi$ be a boolean formula in conjunctive normal form with $r$ variables and $s$ clauses such that each clause has size exactly three and such that $s \le C \cdot r$.

Below we construct an instance $(A, k)$ of DISTINCT VECTORS which has a solution if and only if $\phi$ is satisfiable and such that $A$ has $n = 2^{O(r/\log r)}$ columns and $m = O(r)$ rows, and there are $k \le c' + 2 \log r$ columns to select for some constant $c'$. The construction can be carried out in $2^{O(r/\log r)}$ time. Thus, an algorithm solving DISTINCT VECTORS with running time $2^{O(2^{ck})} \cdot \mathrm{poly}(n + m)$ for some constant $c$ can be used to check satisfiability of $\phi$ in time
$$2^{O(r/\log r)} + 2^{O(2^{c \cdot (c' + 2 \log r)})} \cdot \mathrm{poly}(2^{O(r/\log r)} + O(r)).$$
Since $c = 1/2 - \epsilon$, we have $2^{c \cdot (c' + 2 \log r)} = 2^{c c'} \cdot r^{2c} = O(r^{1 - 2\epsilon})$, so this is $2^{o(r)}$ time, implying that the ETH is false.

Construction.
Let $X = \{x_1, x_2, \ldots, x_r\}$ be the set of variables in $\phi$ and $\mathcal{C} = \{C_1, C_2, \ldots, C_s\}$ the set of clauses. Without loss of generality, assume that $r$ is a power of two and $s$ equals $2^{\ell} - 1$ for some $\ell \in \mathbb{N}$. Otherwise, introduce variables that do not occur in any clause and repeat clauses as necessary. Note that this can be done in such a way that, afterwards, still $s = O(r)$. Let $r' := \lceil r/\log r \rceil$. We partition the variables into $\log r$ bundles $B_i = \{b_i^1, b_i^2, \ldots, b_i^{r'}\} \subset X$, $i \in [\log r]$, where each bundle $B_i$ contains exactly $r'$ variables (repeat variables from the bundle if necessary to fill a bundle). (i)

(i) We note that the construction works as long as the number of bundles is $O(\log r)$ and each bundle's size is $o(r)$. We opted for $\log r$ bundles as a natural choice.

The columns of matrix $A$ are partitioned into $\log(r) + 1$ parts, one consistency part and one part for each bundle. The consistency part contains $\ell = \log(s + 1)$ columns. We will make sure that all of them can be assumed to be in the solution. In this way, these columns will serve to distinguish some rows corresponding to clause gadgets from each other. The remaining $\log r$ parts of columns correspond one-to-one to the bundles. The columns corresponding to $B_i$ are $B_i$'s columns. For each $i \in [\log r]$, there will be $\rho := 2^{r'}$ columns belonging to $B_i$ which correspond one-to-one to the possible truth assignments to the variables in $B_i$. We will ensure that exactly one of the columns of $B_i$ will be chosen in any solution, that is, the solution chooses a truth assignment to the variables in $B_i$.

We now describe the construction of $A$ by defining its rows. The rows of matrix $A$ are partitioned into two parts $I_1, I_2 \subseteq [m]$.

Recall $\rho = 2^{r'}$. The first part, $A[I_1, *]$, of the rows of $A$ consists of $\log r + 1$ rows, that is $I_1 = [\log r + 1]$. The first row, $A[1, *]$, contains only zeros. The $(i + 1)$-th row, $i \in [\log r]$, is defined by
$$A[i + 1, *] = 0^{(\log(s+1))} \circ 0^{((i-1)\rho)} \circ 1^{(\rho)} \circ 0^{((\log r - i)\rho)}.$$
That is, for each bundle $B_i$ there is a row which has 1 in the columns $\log(s + 1) + (i - 1)\rho + 1$ to $\log(s + 1) + i\rho$ and 0 otherwise. We say that the columns $\log(s + 1) + (i - 1)\rho + 1$ to $\log(s + 1) + i\rho$ are the columns of bundle $B_i$. In order to distinguish the rows in $I_1$ from the all-zero row, it is necessary, for each bundle $B_i$, to pick at least one column in the set of columns belonging to $B_i$ into the solution.

The second part, $A[I_2, *]$, of the rows of $A$ consists of $2s$ rows, that is $I_2 = \{\log r + 2, \log r + 3, \ldots, \log r + 2s + 1\}$. For each $i, j \in \mathbb{N}$ with $0 \le i \le 2^j - 1$ let $\mathrm{bin}(i, j)$ be the binary $\{0, 1\}$-encoding of $i$ with exactly $j$ bits, padded with leading zeros if necessary. For each bundle $B_i$, fix an ordering of the at most $\rho$ truth assignments to variables in $B_i$. Recall that we may have repeated variables in $B_i$. If so, then repeat truth assignments in the order fixed above so that their overall number is exactly $\rho$. For each $p \in [\rho]$ and $q \in [s]$, let $\mathrm{sat}_i(p, q) = 1$ if the $p$-th truth assignment to the variables in $B_i$ makes clause $C_q$ true and let $\mathrm{sat}_i(p, q) = 0$ otherwise. Let $\mathrm{sat}_i(*, q) = (\mathrm{sat}_i(p, q))_{p \in [\rho]}$ and $\mathrm{sat}(q) = \mathrm{sat}_1(*, q) \circ \mathrm{sat}_2(*, q) \circ \ldots \circ \mathrm{sat}_{\log r}(*, q)$. Define the $(2q - 1)$-th row in $I_2$, $q \in [s]$, by
$$A[\log r + 2q, *] = \mathrm{bin}(q, \log(s + 1)) \circ \mathrm{sat}(q).$$
We call these rows the odd rows in $I_2$. Define the $2q$-th row in $I_2$, $q \in [s]$, by
$$A[\log r + 2q + 1, *] = \mathrm{bin}(q, \log(s + 1)) \circ 0^{(n - \log(s+1))}.$$
These are the even rows in $I_2$. We say that the $(2q - 1)$-th and the $2q$-th rows correspond to clause $q$.

Finally, set $k = \log(s + 1) + \log r$. This concludes the construction of the DISTINCT VECTORS instance $(A, k)$.

Before proving the correctness, observe that all our other requirements on the construction are satisfied. For the number $k$ of columns to select, we have (recall that $s \le Cr$)
$$k = \log(s + 1) + \log r \le \log(2s) + \log r \le \log(2Cr) + \log r = \log(2C) + 2 \log r.$$
Moreover, the number $n$ of columns satisfies $n = \log(s + 1) + \rho \log r = 2^{O(r/\log r)}$, and the number $m$ of rows satisfies $m = 1 + \log r + 2s = O(r)$, each as required. Furthermore, since there are $2^{O(r/\log r)}$ truth assignments to the variables in each bundle, the reduction can be carried out in $2^{O(r/\log r)}$ time.

Correctness.
We now prove that there is a solution to the above-constructed instance $(A, k)$ of DISTINCT VECTORS if and only if $\phi$ is satisfiable.

Assume that $(A, k)$ has a solution $K$. First, note that the even rows in $A[I_2, *]$ together with the all-zero row in $I_1$ are $s + 1$ rows that pairwise differ only in the first $\log(s + 1)$ columns. Since for each $t \in \mathbb{N}$ we have that $t$ selected columns can pairwise distinguish at most $2^t$ rows, we thus have $[\log(s + 1)] \subseteq K$. Let $K' = K \setminus [\log(s + 1)]$ and observe $|K'| \le \log r$. Observe that in $A[I_1, *]$ there are $\log r$ rows that each differ from the all-zero row in $A[I_1, *]$ only in the columns corresponding to some distinct bundle. Thus, for each bundle $B_i$, there is exactly one column, say $r_i$, in $K' \cap R_i$ where $R_i$ is the set of $B_i$'s columns, and no other columns are in $K'$. Observe that each $r_i$ corresponds by construction to a truth assignment to the variables in $B_i$. Call this truth assignment $\alpha_i$. Thus, taking the union over all $i \in [\log r]$ of the truth assignments $\alpha_i$ to the variables in $B_i$ represented by $r_i$, we get a truth assignment $\alpha$ to all variables in $X$. This truth assignment $\alpha$ is well-defined since the bundles constitute a partition of the variables. We claim that $\alpha$ satisfies $\phi$.

Since $K$ is a solution, for each $q \in [s]$, the sub-row $A[\log r + 2q, K]$ is different from $A[\log r + 2q + 1, K]$. The two corresponding rows differ only in columns of bundles $B_j$ that correspond to some truth assignment to the variables in $B_j$ that satisfies clause $C_q$. Hence, some selected column $r_j$ corresponds to a truth assignment $\alpha_j$ that satisfies $C_q$. Thus, $\alpha$ satisfies $C_q$ and indeed, since this holds for all $q \in [s]$, $\alpha$ satisfies $\phi$, as required.

Now assume that there is a truth assignment $\alpha$ to the variables in $X$ that satisfies $\phi$. For each bundle $B_i$, there is a column $r_i$ in $B_i$'s columns such that the corresponding truth assignment, call it $\alpha_i$, assigns to the variables in $B_i$ the same truth values as $\alpha$. We construct a solution $K$ to $(A, k)$ as follows. First, we put $[\log(s + 1)] \subseteq K$. Then, for each bundle $B_i$ put $r_i \in K$. This concludes the construction. Observe that $|K| = \log(s + 1) + \log r$, as required. It remains to show that all rows in $A[*, K]$ are distinct.

Consider two rows $i, j \in [m]$, where $i \ne j$. We distinguish the following cases.

Case 1) $i, j \in I_1$. Then, one of the two rows, say $i$, has 1 in the columns of some bundle and row $j$ has 0 in these columns. Since by construction $K$ contains exactly one column from the columns of each bundle, we have $A[i, K] \ne A[j, K]$.

Case 2) $i \in I_1$ and $j \in I_2$. Observe that each row in $I_1$ has only zeros in the first $\log(s + 1)$ columns and each row in $I_2$ has at least one 1 in the first $\log(s + 1)$ columns. Since these columns are in $K$, we have $A[i, K] \ne A[j, K]$.

Case 3) $i, j \in I_2$. If $A[i, K]$ and $A[j, K]$ differ in the first $\log(s + 1)$ columns, then we are done. Otherwise, both $i$ and $j$ correspond to the same clause, say $C_q$, and they are not both even or both odd rows. Say $i$ is an odd and $j$ is an even row. By the definition of $K$, there is a bundle $B_\ell$ and a column $r_\ell \in K$ such that $\alpha_\ell$ satisfies $C_q$. Thus, $A[i, r_\ell] = 1 \ne 0 = A[j, r_\ell]$ by construction of the two rows.

Thus, $K$ is a solution to $(A, k)$, as required. This concludes the proof.

Acknowledgments

We thank Vincent Froese and Irene Muzi for interesting and helpful discussions. We thank three anonymous reviewers for their insightful comments that improved the presentation of the paper.
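The construction and the equivalence proved in Section 2 can be sanity-checked on toy formulas. The following Python code is a minimal sketch under stated assumptions and not the authors' implementation: the names `build_instance`, `sat_bit`, `has_solution`, and `satisfiable`, the encoding of literals as signed integers, the assignment of consecutive variables to bundles, and the cyclic repetition of truth assignments for bundles that are not full are all choices made for illustration. Both brute-force searches are exponential and only meant for very small inputs.

```python
from itertools import combinations, product
from math import ceil, log2


def sat_bit(bundle, p, clause):
    """sat_i(p, q): 1 if the p-th assignment to the bundle makes some literal of the clause true."""
    a = p % (2 ** len(bundle))  # assumption: repeat assignments cyclically if the bundle is not full
    value = {v: bool((a >> t) & 1) for t, v in enumerate(bundle)}
    return int(any(abs(lit) in value and value[abs(lit)] == (lit > 0) for lit in clause))


def build_instance(r, clauses):
    """Build (A, k) from a 3-CNF formula; literal +v / -v stands for x_v / not x_v.

    Assumes r >= 2 is a power of two and len(clauses) + 1 is a power of two.
    """
    s = len(clauses)
    ell, nb = int(log2(s + 1)), int(log2(r))  # consistency columns, number of bundles
    assert r >= 2 and 2 ** ell == s + 1 and 2 ** nb == r
    rp = ceil(r / nb)  # r' = ceil(r / log r) variables per bundle
    rho = 2 ** rp      # columns per bundle
    bundles = [list(range(i * rp + 1, min((i + 1) * rp, r) + 1)) for i in range(nb)]
    n = ell + nb * rho
    rows = [[0] * n]  # the all-zero row
    for i in range(nb):  # one row per bundle, with ones exactly in that bundle's columns
        rows.append([0] * (ell + i * rho) + [1] * rho + [0] * ((nb - 1 - i) * rho))
    for q, clause in enumerate(clauses, start=1):
        prefix = [int(b) for b in format(q, f"0{ell}b")]  # bin(q, log(s + 1))
        sat = [sat_bit(B, p, clause) for B in bundles for p in range(rho)]
        rows.append(prefix + sat)              # odd row of clause q
        rows.append(prefix + [0] * (n - ell))  # even row of clause q
    return rows, ell + nb


def has_solution(A, k):
    """Brute force: is there a set of at most k columns keeping all rows pairwise distinct?"""
    n = len(A[0])
    return any(
        len({tuple(row[j] for j in K) for row in A}) == len(A)
        for size in range(k + 1)
        for K in combinations(range(n), size)
    )


def satisfiable(r, clauses):
    """Brute force over all 2^r truth assignments."""
    return any(
        all(any(a[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses)
        for a in product((False, True), repeat=r)
    )


# Hypothetical toy formulas over r = 4 variables.
SAT_FORMULA = [(1, 2, 3), (-1, 2, 4), (1, -3, -4)]  # s = 3
_ALL_SIGNS = [tuple(sg * v for sg, v in zip(signs, (1, 2, 3)))
              for signs in product((1, -1), repeat=3)]
UNSAT_FORMULA = _ALL_SIGNS + _ALL_SIGNS[:7]  # s = 15, clauses repeated as padding
```

For `SAT_FORMULA` the sketch yields a $9 \times 10$ matrix with $k = 4$, and for `UNSAT_FORMULA` a $33 \times 12$ matrix with $k = 6$, in line with $m = 1 + \log r + 2s$ and $n = \log(s + 1) + \rho \log r$. In both cases the answer of `has_solution` on the constructed instance agrees with `satisfiable` on the formula.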
References

[BFS17] Thomas Bläsius, Tobias Friedrich, and Martin Schirneck. “The Parameterized Complexity of Dependency Detection in Relational Databases”. In: Proceedings of the 11th International Symposium on Parameterized and Exact Computation (IPEC’16). Vol. 63. LIPIcs. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017, 6:1–6:13.

[Cha+00] Moses Charikar, Venkatesan Guruswami, Ravi Kumar, Sridhar Rajagopalan, and Amit Sahai. “Combinatorial Feature Selection Problems”. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FoCS’00). 2000, pp. 631–640.

[Fro+16] Vincent Froese, René van Bevern, Rolf Niedermeier, and Manuel Sorge. “Exploiting Hidden Structure in Selecting Dimensions That Distinguish Vectors”. In: Journal of Computer and System Sciences.

[Fro18] Vincent Froese. “Fine-Grained Complexity Analysis of Some Combinatorial Data Science Problems”. PhD thesis. Technische Universität Berlin, 2018.

[IP01] Russell Impagliazzo and Ramamohan Paturi. “On the Complexity of k-SAT”. In: Journal of Computer and System Sciences.

[IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. “Which Problems Have Strongly Exponential Complexity?” In: Journal of Computer and System Sciences.

[Paw91] Zdzisław Pawlak. Rough Sets - Theoretical Aspects of Reasoning about Data. Vol. 9. Theory and Decision Library: Series D. Kluwer, 1991. DOI: 10.1007/978-94-011-3534-4