Binary Matrix Factorisation via Column Generation
Réka Á. Kovács, Oktay Günlük, Raphael A. Hauser
University of Oxford & The Alan Turing Institute; Cornell University
[email protected], [email protected], [email protected]
Abstract
Identifying discrete patterns in binary data is an important dimensionality reduction tool in machine learning and data mining. In this paper, we consider the problem of low-rank binary matrix factorisation (BMF) under Boolean arithmetic. Due to the NP-hardness of this problem, most previous attempts rely on heuristic techniques. We formulate the problem as a mixed integer linear program and use the large scale optimisation technique of column generation to solve it without the need for heuristic pattern mining. Our approach focuses on accuracy and on the provision of optimality guarantees. Experimental results on real world datasets demonstrate that our proposed method is effective at producing highly accurate factorisations and improves on the previously available best known results for a number of the problem instances considered.

Introduction

Low-rank matrix approximation is an essential tool for dimensionality reduction in machine learning. For a given n × m data matrix X whose rows correspond to n observations or items and whose columns correspond to m features, and a fixed positive integer k, computing an optimal rank-k approximation consists of approximately factorising X into two matrices A, B of dimension n × k and k × m respectively, so that the discrepancy between X and its rank-k approximation A · B is minimal. The rank-k matrix A · B describes X using only k derived features: the rows of B specify how the original features relate to the k derived features, while the rows of A provide the weights with which each observation can be (approximately) expressed as a linear combination of the k derived features.

Many practical datasets contain observations on categorical features, and while classical methods such as singular value decomposition (SVD) [9] and non-negative matrix factorisation (NMF) [19] can be used to obtain low-rank approximations for real valued datasets, for a binary input matrix X they cannot guarantee that the factor matrices A, B and their product are binary.
Binary matrix factorisation (BMF) is an approach to compute low-rank matrix approximations of binary matrices ensuring that the factor matrices are binary as well [26]. More precisely, for a given binary matrix X ∈ {0,1}^{n×m} and a fixed positive integer k, the rank-k binary matrix factorisation problem (k-BMF) asks to find two matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} such that the product of A and B is a binary matrix denoted by Z, and the distance between X and Z is minimal in the Frobenius norm. (Submitted to AAAI 2021.)

Many variants of rank-k BMF exist, depending on what arithmetic is used when the product of the matrices A and B is computed. We focus on a variant where Boolean arithmetic is used,

X = A ∘ B  ⟺  x_ij = ∨_{ℓ=1}^{k} (a_iℓ ∧ b_ℓj),

so that 1s and 0s are interpreted as True and False, addition corresponds to logical disjunction (∨) and multiplication to conjunction (∧). Apart from the arithmetic of the Boolean semiring, other choices include standard arithmetic over the integers or modulo 2 arithmetic over the binary field. We focus on the Boolean case, in which the property of Boolean non-linearity, 1 + 1 = 1, holds, because many natural processes follow this rule. For instance, when diagnosing patients with a certain condition, it is only the presence or absence of a characteristic symptom which is important, and the frequency of the symptom does not change the diagnosis. As an example, consider the following data matrix (inspired by [28]),

X = [1 1 0; 1 1 1; 0 1 1],

where rows correspond to patients and columns to symptoms, with x_ij = 1 indicating that patient i presents symptom j. Let

X = A ∘ B = [1 0; 1 1; 0 1] ∘ [1 1 0; 0 1 1]

denote the rank-2 BMF of X. Factor B reveals that 2 underlying diseases cause the observed symptoms, α causing symptoms 1 and 2, and β causing symptoms 2 and 3. Factor A reveals that patient 1 has disease α, patient 3 has β, and patient 2 has both. In contrast, the best rank-2 real approximation of X (obtained from SVD; numerical factors not reproduced here) fails to reveal a clear interpretation, and the best rank-2 NMF of X suggests that symptom 2 presents with lower intensity in both α and β, an erroneous conclusion (caused by patient 2) that could not have been learned from data X which is of "on/off" type. BMF-derived features are particularly natural to interpret in biclustering gene expression datasets [40], role based access control [21, 22] and market basket data clustering [20].

Complexity and related work

The Boolean rank of a binary matrix X ∈ {0,1}^{n×m} is defined to be the smallest integer r for which there exist matrices A ∈ {0,1}^{n×r} and B ∈ {0,1}^{r×m} such that X = A ∘ B, where ∘ denotes Boolean matrix multiplication defined as x_ij = ∨_{ℓ=1}^{r} (a_iℓ ∧ b_ℓj) for all i ∈ {1, …, n} =: [n], j ∈ [m] [12]. This is equivalent to x_ij = min{1, ∑_{ℓ=1}^{r} a_iℓ b_ℓj} using standard arithmetic. The Boolean rank of X is thus the minimum value of r for which it is possible to write X as a Boolean combination of r rank-1 Boolean matrices, X = ∨_{ℓ=1}^{r} a_ℓ b_ℓ^T for a_ℓ ∈ {0,1}^n, b_ℓ ∈ {0,1}^m. Interpreting X as the node-node incidence matrix of a bipartite graph G with n vertices on the left and m vertices on the right, the problem of computing the Boolean rank of X is in one-to-one correspondence with finding a minimum edge covering of G by complete bipartite subgraphs (bicliques) [29]. Since the biclique cover problem is NP-hard [30, Theorem 8.1], [8, Problem GT18] and hard to approximate [38], computing the Boolean rank is hard as well. Finding an optimal k-BMF of X has a graphic interpretation as minimising the number of errors in an approximate covering of G by k bicliques.
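To make the Boolean arithmetic concrete, here is a minimal numpy sketch of the Boolean product and of the patient/symptom example above; this is an illustration only, not code from the paper, with the example matrices taken from the description in the text.

```python
import numpy as np

def boolean_product(A, B):
    # Boolean matrix product: (A ∘ B)_ij = 1 iff a_iℓ = b_ℓj = 1 for some ℓ,
    # i.e. min(1, A @ B) under standard arithmetic.
    return np.minimum(1, A @ B)

def bmf_error(X, A, B):
    # Number of mismatched entries, i.e. the squared Frobenius error of the factorisation.
    return int(np.abs(X - boolean_product(A, B)).sum())

# Patient/symptom example: 3 patients (rows) x 3 symptoms (columns).
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
A = np.array([[1, 0],
              [1, 1],
              [0, 1]])     # patient-to-disease memberships (alpha, beta)
B = np.array([[1, 1, 0],
              [0, 1, 1]])  # disease-to-symptom patterns
```

Note that under standard arithmetic the (2,2) entry of A B is 2 (patient 2 has both diseases), while the Boolean product keeps it at 1, which is exactly the 1 + 1 = 1 property discussed above.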
Even the computation of 1-BMF is hard because of its relation to the maximum edge biclique problem [32], which in the matrix setting is stated as finding the largest all-1s submatrix of X. Many heuristic attempts have been made to approximately compute BMFs by focusing on recursively partitioning the given matrix X ∈ {0,1}^{n×m} and computing a rank-1 factorisation at each step. The first such recursive method, called Proximus [17], [16], is used to compute BMF under standard arithmetic over the integers. For rank-1 factorisation, Proximus uses an alternating iterative heuristic applied to a random starting point, which is based on the observation that if a ∈ {0,1}^n is given, then a vector b ∈ {0,1}^m that minimises the distance between X and a b^T can be computed in O(nm) time. Since the introduction of Proximus, much research has been focused on computing efficient and accurate rank-1 factorisations. [36] proposes an integer program (IP) for rank-1 factorisation and several linear programming (LP) relaxations of it, one of which leads to a 2-approximation scheme. [37] provides a rounding based 2-approximation for rank-1 factorisation by using an observation about the vertices of the polytope corresponding to the LP relaxation of an IP.

Rank-k BMF under Boolean arithmetic is explicitly introduced in [27], [28], along with a heuristic algorithm called ASSO. The core of ASSO is based on an association rule-mining approach to create matrix B ∈ {0,1}^{k×m} and greedily fix A ∈ {0,1}^{n×k} with respect to B. The problem of finding an optimal A with respect to a fixed B is NP-hard [25] but can be solved in O(2^k kmn) time [28]. The association rule-mining approach of [28] is further improved in [1] by a range of iterative heuristics employing this alternating fixing idea and local search. Another approach based on an alternating style heuristic is explored in [40] to solve a non-linear unconstrained formulation of k-BMF with penalty terms in the objective for non-binary entries.
[39] proposes another iterative heuristic which at every iteration permutes the rows and columns of X to create a dense submatrix in the upper right corner, which is then used as a rank-1 component in the k-BMF. In [21, 22] a series of exponential size IPs for k-BMF and exact Boolean rank computation are introduced. They use an explicit enumeration of the 2^m possible rows for factor matrix B and corresponding indicator variables. To tackle the exponential explosion of rows considered, a heuristic row generation using association rule mining and subset enumeration is developed. However, no non-heuristic method is proposed to overcome the exponential size of the model, and hence the exact solution of the IPs is feasible only for very limited sized datasets where complete enumeration of the 2^m rows is possible. An exact linear IP for k-BMF with polynomially many variables and constraints is presented in [15]. This model uses McCormick envelopes [24] to linearise quadratic terms.

In this paper, we present a novel IP formulation for k-BMF that overcomes several limitations of earlier approaches. In particular, our formulation does not suffer from permutation symmetry, it does not rely on heuristic pattern mining, and it has a stronger LP relaxation than that of [15]. On the other hand, our new formulation has an exponential number of variables, which we tackle using a column generation approach that effectively searches over this exponential space without explicit enumeration, unlike the complete enumeration used for the exponential size model of [21, 22]. Our proposed solution method is able to prove optimality for smaller datasets, while for larger datasets it provides solutions with better accuracy than the state-of-the-art heuristic methods. In addition, due to the entry-wise modelling of k-BMF in our approach, we can handle matrices with missing entries, and our solutions can be used for binary matrix completion. The rest of the paper is organised as follows.
In Section 2 we briefly discuss the model of [15] and its limitations. In Section 3 we introduce our integer linear programming formulation for k-BMF, detail a theoretical framework based on the large scale optimisation technique of column generation for its solution, and discuss heuristics for the arising subproblems. Finally, in Section 4 we demonstrate the practical applicability of our approach on several real world datasets.

Given a binary matrix X ∈ {0,1}^{n×m} and a fixed integer k ≪ min(n, m), we wish to find two binary matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} that minimise ‖X − A ∘ B‖_F, where ‖·‖_F denotes the Frobenius norm and ∘ stands for Boolean matrix multiplication as defined in Section 1.1. Since X and Z := A ∘ B are binary matrices, the squared Frobenius norm and the entry-wise ℓ1 norm coincide, and we can expand the objective function as

‖X − Z‖²_F = ∑_{(i,j) ∈ E} (1 − z_ij) + ∑_{(i,j) ∉ E} z_ij,   (1)

where E := {(i, j) : x_ij = 1} is the index set of the positive entries of X. [15] formulates the problem as an exact linear integer program (IP) by introducing variables y_iℓj for the product of a_iℓ and b_ℓj ((i, ℓ, j) ∈ F := [n] × [k] × [m]), and using McCormick envelopes [24] to avoid the appearance of a quadratic constraint arising from the product. The McCormick envelope represents the product of two binary variables a and b by a new variable y and four linear inequalities given by

MC(a, b) = { y ∈ ℝ : a + b − 1 ≤ y, y ≤ a, y ≤ b, 0 ≤ y }.

The model of [15] reads as

(IP_exact)  ζ_IP = min_{a,b,y,z} ∑_{(i,j) ∈ E} (1 − z_ij) + ∑_{(i,j) ∉ E} z_ij   (2)
s.t.
y_iℓj ≤ z_ij ≤ ∑_{ℓ=1}^{k} y_iℓj,  (i, ℓ, j) ∈ F,   (3)
y_iℓj ∈ MC(a_iℓ, b_ℓj),  (i, ℓ, j) ∈ F,   (4)
a_iℓ, b_ℓj ∈ {0,1},  z_ij ≤ 1,  (i, ℓ, j) ∈ F.   (5)

The above model is exact in the sense that its optimal solutions correspond to optimal k-BMFs of X. Most general purpose IP solvers use an enumeration framework which relies on bounds from the linear programming (LP) relaxation of the IP; consequently, the IP is easier to solve when its LP bound is tighter. For k = 1, we have y_i1j = z_ij for all i, j, and the LP relaxation of the model is simply the LP relaxation of the McCormick envelopes, which has a rich and well-studied polyhedral structure [31]. However, for k > 1, IP_exact's LP relaxation (LP_exact) only provides a trivial bound.

Lemma 1. For k > 1, LP_exact has optimal objective value 0, which is attained by multiple distinct solutions.

For the proof of Lemma 1, see Appendix 5.1. Furthermore, for k > 1 the model is highly symmetric, since A P ∘ P^{−1} B is an equivalent solution for any permutation matrix P. These properties of the model make it unlikely to be solved to optimality in a reasonable amount of time for a large matrix X, though the symmetries can be partially broken by incorporating constraints ∑_i a_iℓ ≥ ∑_i a_iℓ' for all ℓ < ℓ'.

Note that constraint (3) implies (1/k) ∑_{ℓ=1}^{k} y_iℓj ≤ z_ij ≤ ∑_{ℓ=1}^{k} y_iℓj as a lower and an upper bound on each variable z_ij. Due to these bounds, the objective function may be approximated by

ζ_IP(ρ) = ∑_{(i,j) ∈ E} (1 − z_ij) + ρ ∑_{(i,j) ∉ E} ∑_{ℓ=1}^{k} y_iℓj,   (6)

where ρ is a parameter of the formulation. By setting ρ = 1/k we underestimate the original objective, while by setting ρ = 1 we overestimate it. Using (6) as the objective function reduces the number of variables and constraints in the model: variables z_ij need only be declared for (i, j) ∈ E, and constraint (3) simplifies to z_ij ≤ ∑_{ℓ=1}^{k} y_iℓj, z_ij ≤ 1 for (i, j) ∈ E.

The exact model presented in the previous section relies on polynomially many constraints and variables and constitutes the first approach towards obtaining rank-k factorisations of binary matrices with optimality guarantees. However, such a compact IP formulation may be weak in the sense that its LP relaxation is a very coarse approximation of the convex hull of integer feasible points, and an IP formulation with exponentially many variables or constraints can have the potential to provide a tighter relaxation [23].
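As a reference point for what these exact formulations compute, k-BMF can be solved by exhaustive search on tiny instances. The following brute-force sketch is an illustration only, not the paper's method, and its cost grows as 2^{k(n+m)}.

```python
import itertools
import numpy as np

def exact_k_bmf(X, k):
    # Exhaustive search over all pairs (A, B) of binary factor matrices for a
    # tiny binary matrix X; returns (best_error, A, B). This mirrors what an
    # exact formulation such as IP_exact computes, at exponential cost.
    n, m = X.shape
    best = (int(X.sum()) + 1, None, None)
    for a_bits in itertools.product([0, 1], repeat=n * k):
        A = np.array(a_bits).reshape(n, k)
        for b_bits in itertools.product([0, 1], repeat=k * m):
            B = np.array(b_bits).reshape(k, m)
            err = int(np.abs(X - np.minimum(1, A @ B)).sum())
            if err < best[0]:
                best = (err, A, B)
    return best

X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
err2, A2, B2 = exact_k_bmf(X, 2)   # the patient/symptom example has an exact rank-2 BMF
```

On the 3 × 3 example from the introduction, the optimal rank-2 error is 0, while the optimal rank-1 error is 2 (the best single rectangle must either miss ones or cover both zeros).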
Motivated by the potential of such a tighter relaxation, we introduce a new formulation with an exponential number of variables and detail a column generation framework for its solution.

Consider enumerating all possible rank-1 binary matrices of size n × m and let R = { a b^T : a ∈ {0,1}^n, b ∈ {0,1}^m, a, b ≠ 0 }. The size of R is |R| = (2^n − 1)(2^m − 1), as any pair of nonzero vectors a, b leads to a unique rank-1 matrix Y = a b^T with y_ij = 1 exactly for {(i, j) : a_i = 1, b_j = 1}. Define a binary decision variable q_ℓ to denote whether the ℓ-th rank-1 binary matrix in R is included in a rank-k factorisation of X (q_ℓ = 1) or not (q_ℓ = 0), and let q ∈ {0,1}^{|R|} be the vector that has a component q_ℓ for each matrix in R. We form a {0,1}-matrix M of dimension nm × |R| whose rows correspond to entries of an n × m matrix and whose columns correspond to rank-1 binary matrices in R, with M_{(i,j),ℓ} = 1 if the (i, j)-th entry of the ℓ-th rank-1 binary matrix in R is 1, and M_{(i,j),ℓ} = 0 otherwise. We split M horizontally into two matrices M₀ and M₁, so that the rows of M corresponding to positive entries of the given matrix X are in M₁ and the remaining rows of M are in M₀,

M = [ M₀ ; M₁ ]  where M₀ ∈ {0,1}^{(nm−|E|) × |R|}, M₁ ∈ {0,1}^{|E| × |R|}.   (7)

The following integer program over an exponential number of variables is an exact model for k-BMF:

(MIP_exact)  ζ_MIP = min 1^T ξ + 1^T π   (8)
s.t.  M₁ q + ξ ≥ 1,   (9)
      M₀ q ≤ k π,   (10)
      1^T q ≤ k,   (11)
      ξ ≥ 0,  π ∈ {0,1}^{nm−|E|},   (12)
      q ∈ {0,1}^{|R|}.   (13)

Constraint (11) ensures that at most k rank-1 matrices are active in a factorisation. Variables ξ_{i,j} correspond to positive entries of X and are forced by constraint (9) to take value 1 and increase the objective if the (i, j)-th positive entry of X is not covered. Similarly, variables π_{i,j} correspond to zero entries of X and are forced to take value 1 by constraint (10) if the (i, j)-th zero entry of X is erroneously covered in a factorisation.
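For very small n and m, the set R and the matrix M can be materialised explicitly; the following sketch (an illustration, not part of the paper's implementation) confirms the count |R| = (2^n − 1)(2^m − 1).

```python
import itertools
import numpy as np

def enumerate_rank1(n, m):
    # All rank-1 binary matrices a b^T with nonzero a and b.
    # Distinct nonzero pairs (a, b) give distinct matrices, since a and b can be
    # recovered from a b^T as its nonzero rows/columns, so the count is
    # (2^n - 1)(2^m - 1).
    R = []
    for a_bits in itertools.product([0, 1], repeat=n):
        a = np.array(a_bits)
        if not a.any():
            continue
        for b_bits in itertools.product([0, 1], repeat=m):
            b = np.array(b_bits)
            if not b.any():
                continue
            R.append(np.outer(a, b))
    return R

def build_M(R):
    # Column ℓ of M is the ℓ-th rank-1 matrix flattened to an nm-vector.
    return np.stack([Y.reshape(-1) for Y in R], axis=1)

R = enumerate_rank1(2, 3)   # (2^2 - 1)(2^3 - 1) = 21 matrices
M = build_M(R)              # shape (6, 21)
```

Splitting the rows of M by the positions of the 1s of a given X then yields the blocks M₁ and M₀ of Equation (7).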
One of the immediate advantages of MIP_exact is that it uses indicator variables directly for rank-1 matrices instead of for the entries of the factor matrices A, B, hence no permutation symmetry arises. In addition, for all k not exceeding a certain number that depends on X, the LP relaxation of MIP_exact (MLP_exact) has strictly positive optimal objective value.

Lemma 2. Let i(X) be the isolation number of X. For all k < i(X), we have 0 < ζ_MLP.

For the definition of the isolation number and the proof of Lemma 2 see Appendix 5.2. Similarly to the polynomial size exact model IP_exact in the previous section, we consider a modification of MIP_exact with an objective that is analogous to the one in Equation (6),

(MIP(ρ))  z_MIP(ρ) = min 1^T ξ + ρ 1^T M₀ q   (14)
s.t.  (9), (11) hold and   (15)
      ξ ≥ 0,  q ∈ {0,1}^{|R|}.

The objective of MIP(ρ) simply counts the number of positive entries of X that are not covered by any of the k rank-1 matrices chosen, plus the number of times zero valued entries of X are erroneously covered, weighted by the parameter ρ. Depending on the selection of the parameter ρ, MIP(ρ) provides a lower or an upper bound on MIP_exact. We denote the LP relaxation of MIP(ρ) by MLP(ρ).

Lemma 3.
For ρ = 1/k, the optimal values of the LP relaxations MLP_exact and MLP(1/k) coincide.

For a short proof of Lemma 3 see Appendix 5.3. Combining Lemmas 1, 2 and 3 we obtain the following relations between the formulations IP_exact, MIP_exact, MIP(ρ) and their LP relaxations for k > 1:

z_MIP(1/k) ≤ ζ_IP = ζ_MIP ≤ z_MIP(1),   (16)
ζ_LP ≤ z_MLP(1/k) = ζ_MLP ≤ z_MLP(1).   (17)

Let p be the dual variable vector associated with constraints (9) and µ the dual variable of constraint (11). Then the dual of MLP(ρ) is given by

(MDP(ρ))  z_MDP(ρ) = max 1^T p − kµ   (18)
s.t.  M₁^T p − µ 1 ≤ ρ M₀^T 1,   (19)
      µ ≥ 0,  p ∈ [0,1]^{|E|}.   (20)

Due to the number of variables in the formulation, it is not practical to solve MIP(ρ) or its LP relaxation MLP(ρ) explicitly. Column generation (CG) is a technique to solve large LPs by iteratively generating only the variables which have the potential to improve the objective function [2]. The CG procedure is initialised by explicitly solving a restricted master LP which has a small subset of the variables in MLP(ρ). The next step is to identify a missing variable with a negative reduced cost to be added to this Restricted MLP(ρ). To avoid explicitly considering all missing variables, a pricing problem is formulated and solved. The solution of the pricing problem either returns a variable with a negative reduced cost, in which case the procedure is iterated, or proves that no such variable exists and hence that the solution of the Restricted MLP(ρ) is optimal for the complete formulation MLP(ρ).

We use CG to solve MLP(ρ) by considering a sequence (t = 1, 2, …) of Restricted MLP(ρ)'s with constraint matrix M(t) being a subset of the columns of M, where each column y ∈ {0,1}^{nm} of M corresponds to a flattened rank-1 binary matrix a b^T according to Equation (7). The constraint matrix of the first Restricted MLP(ρ) may be left empty or can be warm started by identifying a few rank-1 matrices in R, say from a heuristic solution.
Upon successful solution of the t-th Restricted MLP(ρ), we obtain a vector of dual variables [p*, µ*] ≥ 0 optimal for the t-th Restricted MLP(ρ). To identify a missing column of M that has a negative reduced cost, we solve the following pricing problem:

(IP_price)  ω(p*) = max_{a,b,y} ∑_{(i,j) ∈ E} p*_ij y_ij − ρ ∑_{(i,j) ∉ E} y_ij
s.t.  y_ij ∈ MC(a_i, b_j),  a_i, b_j ∈ {0,1},  i ∈ [n], j ∈ [m].

The objective of IP_price depends on the current dual solution [p*, µ*], and its optimal solution corresponds to a rank-1 binary matrix a b^T whose corresponding variable q_ℓ in MLP(ρ) has the smallest reduced cost. If ω(p*) ≤ µ*, then the dual variables [p*, µ*] of the Restricted MLP(ρ) are feasible for MDP(ρ) and hence the current solution of the Restricted MLP(ρ) is optimal for MLP(ρ). If ω(p*) > µ*, then the variable q_ℓ associated with the rank-1 binary matrix a b^T is added to the Restricted MLP(ρ) and the procedure is iterated. CG terminates with a certificate of optimality if at some iteration we have ω(p*) ≤ µ*.

To apply the CG approach above to MLP_exact, only a small modification needs to be made: the Restricted MLP_exact provides dual variables for constraints (10), which are used in the objective of IP_price as coefficients of the y_ij with (i, j) ∉ E. Note, however, that CG cannot be used to solve a modification of MLP_exact in which constraints (10) are replaced by the exponentially many constraints (M₀)_ℓ q_ℓ ≤ π for ℓ ∈ [|R|], where (M₀)_ℓ denotes the ℓ-th column of M₀; see Appendix 5.4.

If the optimal solution of MLP(ρ) is integral, then it is also optimal for MIP(ρ). However, if it is fractional, then it only provides a lower bound on the optimal value of MIP(ρ). In this case we obtain an integer feasible solution by simply adding integrality constraints on the variables of the final Restricted MLP(ρ) and solving it as an integer program.
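The termination test ω(p*) ≤ µ* can be illustrated with a tiny exact pricing oracle. For fixed a the optimal b is available in closed form, so brute force over a suffices for small n. This is a sketch for intuition (the paper solves IP_price with an IP solver); the coefficient matrix H below, with h_ij = p*_ij for (i, j) ∈ E and h_ij = −ρ otherwise, and its numbers are illustrative assumptions.

```python
import itertools
import numpy as np

def price_exact(H):
    # Exact pricing oracle omega = max over binary a, b of a^T H b, by brute
    # force over a; for fixed a the optimal b is b_j = 1 iff (a^T H)_j > 0,
    # so the value is sum_j max(0, (a^T H)_j). Only viable for small n.
    n, m = H.shape
    best_val, best_a, best_b = 0.0, np.zeros(n, int), np.zeros(m, int)
    for a_bits in itertools.product([0, 1], repeat=n):
        a = np.array(a_bits)
        col = a @ H                 # (a^T H)_j
        b = (col > 0).astype(int)
        val = float(col @ b)
        if val > best_val:
            best_val, best_a, best_b = val, a, b
    return best_val, best_a, best_b

# Toy coefficients: p*_ij on positive entries of X, -rho elsewhere.
H = np.array([[0.6, 0.5, -1.0],
              [0.4, -1.0, 0.7]])
omega, a, b = price_exact(H)
# CG may stop at the current iteration if omega <= mu*, the dual of 1^T q <= k.
```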
If ρ = 1, the integer program obtained from the final Restricted MLP(1) yields an optimal solution of MIP(1) whenever its objective agrees with the ceiling of the objective of the Restricted MLP(1). To solve MIP(ρ) to optimality in all cases, one would need to embed CG into branch-and-bound [23], which we do not do. However, note that even if the CG procedure is terminated prematurely, one can still obtain a lower bound on MLP(ρ) and MIP(ρ) as follows. Let the objective value of any of the Restricted MLP(ρ)'s be

z_RMLP = 1^T ξ* + ρ 1^T M₀* q* = 1^T p* − kµ*,   (21)

where [ξ*, q*] is the solution of the Restricted MLP(ρ), [p*, µ*] is the solution of the dual of the Restricted MLP(ρ), and ρ 1^T M₀* collects the objective coefficients of the columns present in the Restricted MLP(ρ). Assume that we solve IP_price to optimality and we obtain a column y for which the reduced cost is negative, ω(p*) > µ*. In this case, we can construct a feasible solution to MDP(ρ) by setting p := p* and µ := ω(p*), and we get the following bound on the optimal value z_MLP(ρ) of MLP(ρ):

z_MLP(ρ) ≥ 1^T p* − k ω(p*) = z_RMLP − k (ω(p*) − µ*).   (22)

If we do not have the optimal solution of IP_price but have an upper bound ω̄(p*) on it, ω(p*) can be replaced by ω̄(p*) in Equation (22) and the bound on MLP(ρ) still holds. Furthermore, this lower bound on MLP(ρ) provides a valid lower bound on MIP(ρ). Consequently, our approach always produces a bound on the optimality gap of the final solution, which heuristic methods cannot do. There is, however, no a priori (theoretical) bound on this gap.

Table 1: Summary of binary real world datasets

         zoo    tumor   hepatitis  heart  lymp   audio  apb    votes
n        101    339     155        242    148    226    105    434
m        17     24      38         22     44     94     105    32
i(X)     16     24      30         22     42     90     104    –
%1s      44.3   24.3    47.2       34.4   29.0   11.3   8.0    47.3

Table 2: % Optimality gap

        zoo                       tumor
k       MIP(1)  [15]    [21]      MIP(1)  [15]    [21]
2       0.0     0.0     100.0     0.9     43.1    *
5       0.0     66.3    100.0     9.3     96.3    *
10      3.0     100.0   100.0     28.4    100.0   *

The efficiency of the CG procedure described above greatly depends on solving IP_price efficiently. In standard form, IP_price is a bipartite binary quadratic optimisation problem and can be written as

(IP_price)  ω(p*) = max_{a ∈ {0,1}^n, b ∈ {0,1}^m} a^T H b   (23)

for H an n × m matrix with h_ij = p*_ij ∈ [0,1] for (i, j) ∈ E and h_ij = −ρ for (i, j) ∉ E. Bipartite quadratic optimisation is NP-hard in general [32], hence for large X it may take too long to solve IP_price to optimality at each iteration. To speed up computations, the formulation of IP_price may be improved by eliminating redundant constraints. The McCormick envelopes [24] set two lower and two upper bound constraints on each y_ij. Due to the objective function, it is possible to declare only the lower bounds on y_ij for (i, j) ∉ E and only the upper bounds for (i, j) ∈ E without changing the optimum. In addition, it is enough to set the integrality constraint on only a or b [33].

If a heuristic approach to IP_price provides a solution for which the reduced cost is negative, that is ω(p*) > µ*, then it is valid to add this heuristic solution as a column to the next Restricted MLP(ρ) without making the solution of MLP(ρ) heuristic. Most heuristic algorithms that are available for bipartite quadratic optimisation build on the idea that the optimal a ∈ {0,1}^n with respect to a fixed b ∈ {0,1}^m can be computed in O(nm) time, and this procedure can be iterated by alternately fixing a and b. [11] presents several local search heuristics for IP_price based on this idea along with a simple greedy algorithm, and [6] further develops these ideas.
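The fixed-b (and, symmetrically, fixed-a) best response gives the alternating local search just mentioned. The sketch below is illustrative, with the starting point a0 an assumption of this example rather than anything prescribed by the paper.

```python
import numpy as np

def alternating_heuristic(H, a0, max_iter=100):
    # Alternating ascent for max a^T H b over binary a, b: the best response
    # to a fixed b is a_i = 1 iff (H b)_i > 0, and symmetrically for b.
    # Each best-response step costs O(nm); we stop at a fixed point.
    a = a0.copy()
    b = ((a @ H) > 0).astype(int)
    for _ in range(max_iter):
        a_new = ((H @ b) > 0).astype(int)
        b_new = ((a_new @ H) > 0).astype(int)
        if np.array_equal(a_new, a) and np.array_equal(b_new, b):
            break
        a, b = a_new, b_new
    return a, b, float(a @ H @ b)
```

For H = [[1, -2], [-2, 3]] the fixed point reached from a0 = (1, 0) has value 1, while the one reached from a0 = (0, 1) has value 3: the heuristic can stall in a local optimum depending on the starting point, which is why the paper evaluates many starting orders and keeps the best.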
Below we detail the greedy algorithm of [11] and introduce some variants of it, which we use in the next section to provide a warm start to IP_price at every iteration of the CG procedure.

The greedy algorithm of [11] aims to set those entries of a and b to 1 which correspond to rows and columns of H with the largest positive weights. In the first phase of the algorithm, the row indices i of H are put in decreasing order according to their sums of positive entries, so that γ⁺_i ≥ γ⁺_{i+1}, where γ⁺_i := ∑_{j=1}^{m} max(0, h_ij). Then, sequentially according to this ordering, a_i is set to 1 if ∑_{j=1}^{m} max(0, ∑_{ℓ=1}^{i−1} a_ℓ h_ℓj) < ∑_{j=1}^{m} max(0, ∑_{ℓ=1}^{i} a_ℓ h_ℓj), and to 0 otherwise. In the second phase, b_j is set to 1 if (a^T H)_j > 0, and to 0 otherwise. The precise outline of the algorithm is given in Appendix 5.5.

There are many variants of the greedy algorithm one can explore. First, the solution greatly depends on the ordering of the i's in the first phase. If for some i ≠ i' we have γ⁺_i = γ⁺_{i'}, comparing the sums of negative entries of rows i and i' can put more "influential" rows of H ahead in the ordering. Let us call this ordering the revised ordering, and the one which only compares the positive sums the original ordering. Another option is to use a completely random order of the i's, or to apply a small perturbation to the sums γ⁺_i to get a perturbed version of the revised or original ordering. None of the above ordering strategies clearly dominates the others in all cases, but they are fast to compute, hence one can evaluate all five ordering strategies (original, revised, original perturbed, revised perturbed, random) and pick the best one. Second, the algorithm as presented above first fixes a and then b. Changing the order of fixing a and b can yield a different result, hence it is best to try both H and H^T. In general, it is recommended to start the first phase on the smaller dimension. Third, the solution from the greedy algorithm may be improved by computing the optimal a with respect to the fixed b. As previously mentioned, this idea can then be used to fix a and b in an alternating fashion, stopping when no changes occur in either.

The CG approach introduced in the previous section provides a theoretical framework for computing k-BMF with optimality guarantees. In this section we present some experimental results with CG to demonstrate the practical applicability of our approach on eight real world categorical datasets that were downloaded from online repositories [5], [18]. Table 1 shows a short summary of the eight datasets used; details on converting categorical columns into binary, missing value treatment and dataset descriptions can be found in Appendix 5.6. Table 1 also shows the value of the isolation number i(X) for each dataset, which provides a lower bound on the Boolean rank [29, Section 2.3].

Since the efficiency of CG greatly depends on the speed of generating columns, let us illustrate the speed-up gained by using heuristic pricing. At each iteration of CG, 30 variants of the greedy heuristic are computed to obtain an initial feasible solution to IP_price. The 30 variants of the greedy algorithm use the original and revised orderings, their transposed and perturbed versions, and 22 random orderings. All greedy solutions are improved by the alternating heuristic until no further improvement is found. Under exact pricing, the best heuristic solution is used as a warm start and IP_price is solved to optimality at each iteration using CPLEX [4].
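The two-phase greedy underlying these variants can be sketched as follows; this sketch implements only the original ordering and is an illustration, not the paper's implementation.

```python
import numpy as np

def greedy_rank1(H):
    # Two-phase greedy for max a^T H b (in the spirit of [11]).
    # Phase 1: scan rows in decreasing order of positive row sums and keep a
    # row only if it strictly increases sum_j max(0, partial column sums).
    # Phase 2: set b_j = 1 iff (a^T H)_j > 0.
    n, m = H.shape
    order = np.argsort(-np.maximum(H, 0).sum(axis=1))  # the "original" ordering
    a = np.zeros(n, dtype=int)
    partial = np.zeros(m)          # sum over kept rows ℓ of h_ℓj
    score = 0.0                    # sum_j max(0, partial_j)
    for i in order:
        cand = partial + H[i]
        cand_score = np.maximum(cand, 0).sum()
        if cand_score > score:     # keep row i only on strict improvement
            a[i], partial, score = 1, cand, cand_score
    b = ((a @ H) > 0).astype(int)
    return a, b, float(a @ H @ b)
```

The revised, perturbed and random variants described above differ only in how `order` is computed, and the transpose variant simply calls the same routine on H^T.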
In simple heuristic (heur) pricing, if the best heuristic solution to IP_price has negative reduced cost, ω_heur(p*) > µ*, then the heuristic column is added to the next Restricted MLP(ρ). If at some iteration the best heuristic column does not have negative reduced cost, CPLEX is used to solve IP_price to optimality for that iteration. The multiple heuristic (heur multi) pricing is a slight modification of the simple heuristic strategy, in which at each iteration all columns with negative reduced cost are added to the next Restricted MLP(ρ).

Table 3: Primal objectives of MLP(1), MLP(1/k), MIP(1) after 20 mins of CG

k             zoo       tumor    hepatitis   heart   lymp        audio   apb       votes
2  MLP(1/k)   206.5     1178.9   978.7       882.9   917.2       1256.5  709       1953
   MLP(1)     272       1409.8   1384        1185    1188.8      1499    776       2926
   MIP(1)     272(271)  1411     1384(1382)  1185    1197(1184)  1499    776       2926
5  MLP(1/k)   42.8      463.9    333.1       291.0   366.7       654.2   433.5     715.5
   MLP(1)     127       1019.3   1041.1      736     914.0       1159.3  683.0     2135.5
   MIP(1)     127(125)  1029     1228        736     997(991)    1176    684(683)  2277(2274)
10 MLP(1/k)   4.8       192.8    142.5       102.3   165.1       351.4   166.8     307.9
   MLP(1)     38.8      575.5    734.8       419     653.2       867.2   574.2     1409.5
   MIP(1)     40        579      910         419     737(732)    893     577(572)  1566(1549)

Figure 1: Comparison of pricing strategies for solving MLP(1) on the zoo dataset

Figure 1 indicates the differences between the pricing strategies when solving MLP(1) via CG for k = 5 on the zoo dataset. The objective of MLP(1) (decreasing curve) and the value of the dual bound (increasing curve), computed using the formula in Equation (22), are plotted against time. Sharp increases in the dual bound for the heuristic pricing strategies correspond to iterations in which CPLEX was used to solve IP_price, as the evaluation of the dual bound on MLP(1) needs a strong upper bound on ω(p*), which heuristic solutions do not provide.
While we observe a tailing off effect [23] on all three curves, both heuristic pricing strategies provide a significant speed-up over exact pricing, with adding multiple columns at each iteration being the fastest.

In Table 2 we present computational results comparing the optimality gap (100 × (best integer − best bound) / best integer) of MIP(1) and of the formulations in [15] and [21, 22], using a 20 mins time budget. Reading in the full exponential size model of [21, 22] using 16 GB of memory is not possible for datasets other than 'zoo'. Table 2 shows that different formulations, and the algorithms used to solve them, make a big difference in practice: our novel formulation (MIP(1)) combined with an effective computational optimisation approach (CG) produces solutions with a smaller optimality gap than the other formulations, as it scales well and has a stronger LP relaxation.

In order for CG to terminate with a certificate of optimality, at least one pricing problem has to be solved to optimality. Unfortunately, for the larger datasets this cannot be achieved in under 20 mins. Therefore, for datasets other than 'zoo', we changed the multiple heuristic pricing strategy as follows. We impose an overall time limit of 20 mins on the CG process and use the barrier method in CPLEX as the LP solver for the Restricted MLP(ρ) at each iteration. In order to maximise the diversity of the columns added at each iteration, we choose at most two columns with negative reduced cost that are closest to being mutually orthogonal. If CPLEX has to be used to improve the heuristic pricing solution, we do not solve IP_price to optimality but abort CPLEX as soon as a column with negative reduced cost has been found. While these modifications result in a speed-up, they reduce the chance of obtaining a strong dual bound.
In case a strong dual bound is desired, we may continue applying CG iterations with exact pricing after the 20 mins of heuristic pricing have run their course.

In our next experiment, we explore the differences between formulations MLP(k), MLP(1) and MIP(1). We warm start CG by identifying a few heuristic columns using the code of [1] and a new fast heuristic which works by sequentially computing k rank-1 binary matrices via the greedy algorithm, starting with the coefficient matrix H = 2X − J (where J is the all-ones matrix) and then setting entries of H to zero that correspond to entries of X that are covered. The precise outline of the algorithm can be found in Appendix 5.8.

Table 4: Comparison of eight methods for k-BMF (errors in ‖·‖_F; – marks an unavailable value)

k = 2      zoo   tumor  hepatitis  heart  lymp   audio  apb   votes
CG         271   1408   –          –      –      –      –     –
ASSO++     276   1437   1397       1187   1202   1503   776   2926
k-greedy   325   1422   1483       1204   1201   –      –     –

k = 5      zoo   tumor  hepatitis  heart  lymp   audio  apb   votes
CG         125   1029   1228       736    991    1176   683   2272
IP exact   133   1055   –          738    1029   1211   690   2293
ASSO++     133   1055   –          738    1039   1211   694   2302
k-greedy   233   1055   1306       748    1063   1211   690   2310
pymf       142   1126   1301       835    1062   1245   730   2517
ASSO       354   1092   1724       887    1352   1505   719   2503
NMF        163   1207   1337       995    1158   1565   762   2526
MEBF       173   1245   1439       929    1245   1672   730   2832

k = 10     zoo   tumor  hepatitis  heart  lymp   audio  apb   votes
CG         40    579    –          419    730    893    572   1527
IP exact   41    583    902        419    805    919    590   1573
ASSO++     55    583    910        –      812    922    591   1573
k-greedy   184   675    1088       565    819    976    611   1897
pymf       96    703    1186       581    987    1106   602   2389
ASSO       354   587    1724       694    1352   1505   661   2503
NMF        153   826    1337       995    1143   1407   689   2481
MEBF       122   990    1328       777    1004   1450   662   2460

Table 3 shows the primal objective values of MLP(1) and MLP(k) with heuristic pricing using a time limit of 20 mins, and the objective value of MIP(1) solved on the columns generated by MLP(1). If the error measured in ‖·‖_F differs from the objective of MIP(1), the former is shown in parentheses. It is interesting to observe that MLP(1) has a tendency to produce near-integral solutions and that the objective value of MIP(1) often coincides with the error measured in ‖·‖_F. We note that once a master LP formulation is used to generate columns, any of the MIP models could be used to obtain an integer feasible solution. In experiments, we have found that formulation MIP(ρ) is solved much faster than MIP_exact and that appropriately chosen values of ρ provide the best integer solutions.

We compare the CG approach against the most widely used k-BMF heuristics and the exact approach of [15]. The heuristic algorithms we evaluated include the ASSO algorithm [27], the alternating iterative local search algorithm of [1] (ASSO++) which uses ASSO [27] as a starting point, the greedy algorithm (k-greedy) detailed in Appendix 5.8, the penalty objective formulation (pymf) of [40] and the permutation-based heuristic MEBF [39]. We also evaluate IP_exact with the same 20 mins time budget and provide the heuristic solutions of ASSO++ and k-greedy as a warm start to it. In addition, we compute a rank-k NMF and binarise it with a threshold of 0.5. The exact details and parameters used in the computations can be found in Appendix 5.9. Our CG approach (CG) results are obtained by generating columns for 20 mins using formulation MLP(1) with a warm start of initial columns obtained from ASSO++ and k-greedy, then solving MIP(ρ) for selected values of ρ
over the generated columns and picking thebest. Table 4 shows the factorisation errors in (cid:107) · (cid:107) F afterevaluating the above described methods on all datasets for k = 2 , , . The best result for each instance is indicated inboldface. We observe that CG provides the smallest error for16 out of 24 cases. In this paper, we studied the rank- k binary matrix factori-sation problem under Boolean arithmetic. We introduced anew integer programming formulation and detailed a methodusing column generation for its solution. Our experimentsindicate that our method using 20 mins time budget is produc-ing significantly more accurate solutions than most heuristicsavailable in the literature and is able to prove optimalityfor smaller datasets. In certain critical applications such asmedicine, spending 20 minutes to obtain a higher accuracyfactorisation with a bound on the optimality gap can be easilyjustified. In addition, solving BMF to near optimality via ourproposed method paves the way to more robustly benchmarkheuristics such as the ones of [28], [1], [40], [39]. Future di-rections that could be explored are related to designing moreaccurate heuristics and faster exact algorithms for the pricingproblem. In addition, a full branch-and-price algorithm im-plementation would be beneficial once the pricing problemsare solved more efficiently.7 eferences [1] Barahona, F.; and Goncalves, J. 2019. Lo-cal Search Algorithms for Binary Matrix Fac-torization. https://github.com/IBM/binary-matrix-factorization/blob/master/code .[2] Barnhart, C.; Johnson, E. L.; Nemhauser, G. L.; Savelsbergh,M. W. P.; and Vance, P. H. 1998. Branch-and-Price: ColumnGeneration for Solving Huge Integer Programs. OperationsResearch
[4] Using the CPLEX Callable Library, V.12.8. CPLEX Optimization, Inc., Incline Village, NV.
[5] Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.
[6] Duarte, A.; Laguna, M.; Martí, R.; and Sánchez-Oro, J. 2014. Optimization Procedures for the Bipartite Unconstrained 0-1 Quadratic Programming Problem. Comput. Oper. Res.
[8] Garey, M. R.; and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co.
[9] Golub, G.; and Van Loan, C. 1989. Matrix Computations. Baltimore: Johns Hopkins University Press, 2nd edition.
[10] Gong, G. 1988. UCI Machine Learning Repository: Hepatitis Data Set. URL https://archive.ics.uci.edu/ml/datasets/Hepatitis.
[11] Karapetyan, D.; and Punnen, A. P. 2012. Heuristic algorithms for the bipartite unconstrained 0-1 quadratic programming problem. CoRR abs/1210.3684. URL http://arxiv.org/abs/1210.3684.
[12] Kim, K. 1982. Boolean Matrix Theory and Applications. Monographs and Textbooks in Pure and Applied Mathematics. Dekker. ISBN 9780824717889.
[13] Kononenko, I.; and Cestnik, B. 1988. UCI Machine Learning Repository: Primary Tumor Domain. URL https://archive.ics.uci.edu/ml/datasets/Primary+Tumor.
[14] Kononenko, I.; and Cestnik, B. 1988. UCI Machine Learning Repository: Lymphography Data Set. URL https://archive.ics.uci.edu/ml/datasets/Lymphography.
[15] Kovacs, R. A.; Gunluk, O.; and Hauser, R. A. 2017. Low-Rank Boolean Matrix Approximation by Integer Programming. NIPS Optimization for Machine Learning Workshop.
[16] Koyutürk, M.; and Grama, A. 2003. PROXIMUS: A Framework for Analyzing Very High Dimensional Discrete-attributed Datasets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, 147–156. New York, NY, USA: ACM.
[17] Koyutürk, M.; Grama, A.; and Ramakrishnan, N. 2002. Algebraic Techniques for Analysis of Large Discrete-Valued Datasets. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, 311–324. London, UK: Springer-Verlag.
[18] Krebs, V. 2008. Amazon Political Books. URL http://moreno.ss.uci.edu/data.html.
[19] Lee, D. D.; and Seung, H. S. 1999. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature.
[20] … In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '05. New York, NY, USA: Association for Computing Machinery.
[21] Lu, H.; Vaidya, J.; and Atluri, V. 2008. Optimal Boolean Matrix Decomposition: Application to Role Engineering. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE '08, 297–306. Washington, DC, USA: IEEE Computer Society.
[22] Lu, H.; Vaidya, J.; and Atluri, V. 2014. An optimization framework for role mining. Journal of Computer Security.
[23] … Operations Research.
[24] … Math. Program.
[25] … Inf. Process. Lett.
[27] … In Knowledge Discovery in Databases: PKDD 2006, 335–346. Berlin, Heidelberg: Springer. ISBN 978-3-540-46048-0.
[28] Miettinen, P.; Mielikäinen, T.; Gionis, A.; Das, G.; and Mannila, H. 2008. The Discrete Basis Problem. IEEE Trans. on Knowl. and Data Eng.
[29] … Bulletin – Institute of Combinatorics and its Applications 14: 17–86. ISSN 1183-1278.
[30] Orlin, J. 1977. Contentment in graph theory: Covering graphs with cliques. Indagationes Mathematicae (Proceedings).
[31] … Math. Program.
[32] … Discrete Applied Mathematics.
[36] Shen, B.-H.; Ji, S.; and Ye, J. 2009. Mining Discrete Patterns via Binary Matrix Factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, 757–766. New York, NY, USA: ACM.
[37] Shi, Z.; Wang, L.; and Shi, L. 2014. Approximation method to rank-one binary matrix factorization. In …, 800–805.
[38] Simon, H. U. 1990. On Approximate Solutions for Combinatorial Optimization Problems. SIAM J. Discrete Math.
[39] … AAAI 2020.
[40] Zhang, Z.; Li, T.; Ding, C.; and Zhang, X. 2007. Binary Matrix Factorization with Applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, 391–400. USA: IEEE Computer Society.

Appendix

The LP relaxation of IP_exact

Lemma 1.
For $k > 1$, the LP relaxation of IP_exact (LP_exact) has optimal objective value $0$, which is attained by at least $\binom{k}{2}$ solutions.

Proof. Observe that the objective function of LP_exact satisfies $0 \le \sum_{(i,j) \in E} (1 - z_{ij}) + \sum_{(i,j) \notin E} z_{ij}$, as constraints (3), the McCormick envelopes and $a_{i\ell}, b_{\ell j} \in [0,1]$ imply $z_{ij} \in [0,1]$. Let us construct a feasible solution to LP_exact which attains this bound. Let $a_{i\ell} = b_{\ell j} = \frac{1}{2}$ for all $i \in [n]$, $j \in [m]$, $\ell \in [k]$. The McCormick envelopes are then equivalent to $MC(\frac{1}{2}, \frac{1}{2}) = \{ y \in \mathbb{R} : \frac{1}{2} + \frac{1}{2} - 1 \le y,\; y \le \frac{1}{2},\; y \le \frac{1}{2},\; 0 \le y \} = [0, \frac{1}{2}]$, hence we may choose the value of $y_{i\ell j} \in MC(\frac{1}{2}, \frac{1}{2})$ depending on the objective coefficient of indices $(i,j)$. For $(i,j) \in E$ the objective function is maximising $z_{ij}$, hence we set $y_{i\ell j} = \frac{1}{2}$ so that the upper bound $\sum_{\ell=1}^{k} y_{i\ell j}$ on $z_{ij}$ becomes greater than or equal to $1$ and $z_{ij}$ can take value $1$. For $(i,j) \notin E$ the objective function is minimising $z_{ij}$, hence we set $y_{i\ell j} = 0$ so that the lower bounds $y_{i\ell j} \le z_{ij}$ evaluate to $0$ and $z_{ij}$ can take value $0$. Therefore, the following setting of the variables shows that the lower bound of $0$ on the objective function is attained:

$a_{i\ell} = \frac{1}{2}$ for $i \in [n]$, $\ell \in [k]$; $\quad b_{\ell j} = \frac{1}{2}$ for $\ell \in [k]$, $j \in [m]$; $\quad y_{i\ell j} = \frac{1}{2}$ for $(i,j) \in E$, $\ell \in [k]$; $\quad y_{i\ell j} = 0$ for $(i,j) \notin E$, $\ell \in [k]$; $\quad z_{ij} = 1$ for $(i,j) \in E$; $\quad z_{ij} = 0$ for $(i,j) \notin E$.

Furthermore, for all $(i,j) \in E$ it is enough to set $y_{i\ell_1 j} = y_{i\ell_2 j} = \frac{1}{2}$ for only two indices $\ell_1, \ell_2 \in [k]$, since this already achieves the upper bound $z_{ij} \le \sum_{\ell=1}^{k} y_{i\ell j}$. Hence, there are at least $\binom{k}{2}$ different solutions of LP_exact with objective value $0$.

A lower bound for MLP_exact

For our proof we will need a definition from the theory of binary matrices.
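The construction in the proof of Lemma 1 is easy to check numerically. The sketch below (function names are ours; the constraints follow the proof) builds the fractional point and verifies the McCormick envelopes, the z-linking inequalities and the zero objective:

```python
import numpy as np

def lemma1_solution(X, k):
    """Fractional solution from the proof of Lemma 1: a = b = 1/2 everywhere,
    y_{ilj} = 1/2 exactly on the ones of X, and z = X."""
    n, m = X.shape
    a = np.full((n, k), 0.5)
    b = np.full((k, m), 0.5)
    y = 0.5 * np.repeat(X[:, None, :].astype(float), k, axis=1)
    z = X.astype(float)
    return a, b, y, z

def check_lemma1(X, k):
    """Check feasibility of the constructed point and return the objective
    sum_{(i,j) in E}(1 - z_ij) + sum_{(i,j) not in E} z_ij."""
    a, b, y, z = lemma1_solution(X, k)
    feasible = True
    # McCormick envelope of y_{ilj} ~ a_il * b_lj on [0,1]^2, plus z >= y.
    for (i, l, j), yi in np.ndenumerate(y):
        feasible &= a[i, l] + b[l, j] - 1 <= yi <= min(a[i, l], b[l, j])
        feasible &= 0 <= yi <= z[i, j]
    # z_ij <= sum_l y_ilj (needs k >= 2 so that k/2 >= 1 on the ones of X).
    feasible &= bool(np.all(z <= y.sum(axis=1)))
    objective = np.sum(1 - z[X == 1]) + np.sum(z[X == 0])
    return feasible, objective
```

For any binary X and k ≥ 2 the check succeeds with objective 0, illustrating why LP_exact provides no useful bound.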
Definition 2 ([29, Section 2.3]). Let X be a binary matrix. A set $S \subseteq E = \{(i,j) : x_{ij} = 1\}$ is said to be an isolated set of ones if whenever $(i_1, j_1), (i_2, j_2)$ are two distinct members of S then
1. $i_1 \ne i_2$ and $j_1 \ne j_2$, and
2. $x_{i_1 j_2} \cdot x_{i_2 j_1} = 0$.
The size of the largest cardinality isolated set of ones in X is denoted by $i(X)$ and is called the isolation number of X.

Observe that requirement (1) implies that an isolated set of ones can contain the index corresponding to at most one entry in each column and row of X. Hence $i(X) \le \min(n, m)$. Requirement (2) implies that if $(i_1, j_1), (i_2, j_2)$ are members of an isolated set of ones then at least one of the entries $x_{i_1 j_2}, x_{i_2 j_1}$ is zero, hence members of an isolated set of ones cannot be contained in a same rank-1 submatrix of X. Therefore, if the largest cardinality isolated set of ones has $i(X)$ elements, then to cover all 1s of X we need at least $i(X)$ many rank-1 binary matrices in a factorisation of X, so $i(X)$ provides a lower bound on the Boolean rank of X.

Lemma 2.
Let X be a binary matrix with isolation number $i(X)$. Then for rank-k binary matrix factorisation with $k < i(X)$, the optimal objective value of MLP_exact is strictly positive, $\zeta_{MLP} > 0$.

Proof. For a contradiction, assume that MLP_exact has objective value zero, $\zeta_{MLP} = 0$.
(1) If $\zeta_{MLP} = 0$, we must have $\bar{M} q = 0 = \pi$ in constraint (10), which implies that none of the rank-1 binary matrices with $q_\ell > 0$ cover zero entries of X. In other words, all the active rank-1 binary matrices are submatrices of X.
(2) If $\zeta_{MLP} = 0$, we must also have $\xi = 0$, which implies $M q \ge \mathbf{1}$ in constraint (9) for some $q \ge 0$ which may be fractional but satisfies $\mathbf{1}^\top q \le k$. Let $S := \{(i_1, j_1), \dots, (i_r, j_r), \dots, (i_{|S|}, j_{|S|})\} \subseteq E$ be an isolated set of ones in X of cardinality $i(X) = |S|$. Since members of S cannot be contained in the same rank-1 submatrix of X, all columns $M_\ell$ corresponding to rank-1 binary submatrices of X satisfy, for entries $(i_r, j_r) \in S$,
$$M_{(i_r, j_r), \ell} = 1 \implies M_{(i,j), \ell} = 0 \quad \forall (i,j) \in S,\ (i,j) \ne (i_r, j_r).$$
Therefore, we can partition the active rank-1 binary matrices into $i(X) + 1$ groups $G_r$:
$$G_r = \{\ell : q_\ell > 0 \text{ and } M_{(i_r, j_r), \ell} = 1\} \quad \text{for } r = 1, \dots, i(X); \quad (24)$$
$$G_{i(X)+1} = \{\ell : q_\ell > 0 \text{ and } M_{(i_r, j_r), \ell} = 0\ \forall (i_r, j_r) \in S\}. \quad (25)$$
While $G_{i(X)+1}$ may be empty, the groups $G_r$ for $r = 1, \dots, i(X)$ are not empty, because for all $(i,j) \in E$ we have $\sum_\ell M_{(i,j), \ell}\, q_\ell \ge 1$ and $S \subseteq E$. Hence for all $(i_r, j_r) \in S$ we have $\sum_{\ell \in G_r} q_\ell \ge 1$, which implies the contradiction $\mathbf{1}^\top q \ge \sum_{r=1}^{i(X)} \sum_{\ell \in G_r} q_\ell \ge i(X) > k$, and therefore $\zeta_{MLP} > 0$.

Could we replace the condition $k < i(X)$ in Lemma 2 by the requirement that k be smaller than the Boolean rank of X? The following example shows that we cannot.

Example 2.
Let $X = J - I$, where $J$ is the matrix of all 1s and $I$ is the identity matrix. One can verify that the Boolean rank of X is greater than 3, while its isolation number is at most 3. For $k = 3$, the optimal objective value of MLP_exact is $0$, which is attained by a fractional solution in which several rank-1 binary submatrices of X are active with fractional weights $q_\ell$.

MIP_exact and MIP(k)

Lemma 3.
For $\rho = k$, the optimal values of the LP relaxations MLP_exact and MLP(k) coincide.

Proof. It is enough to observe that, since MLP_exact is a minimisation problem, $\pi$ takes the minimal value allowed by constraint (10), namely $\frac{1}{k} \bar{M} q$, which equals the second term in the objective (14) of MLP(k).

A strong formulation of MIP_exact

We obtain a modification of MIP_exact, which we call the "strong formulation", by replacing constraints (10) by exponentially many constraints. The following is the LP relaxation of the strong formulation of MIP_exact:

$$(\text{MLP}_{exact}^{strong}) \quad \zeta_{MLP} = \min\ \mathbf{1}^\top \xi + \mathbf{1}^\top \pi$$
$$\text{s.t.} \quad M q + \xi \ge \mathbf{1},$$
$$(\bar{M})_\ell\, q_\ell \le \pi, \quad \ell \in [(2^n - 1)(2^m - 1)],$$
$$\mathbf{1}^\top q \le k,$$
$$\xi \ge 0, \quad \pi \in [0,1]^{nm - |E|}, \quad q \in [0,1]^{|\mathcal{R}|}.$$

Lemma 4.
The CG approach applied to MLP_exact^strong cannot be used to generate sensible columns.

Proof.
Let us try applying column generation to solve MLP_exact^strong and add a column of all 1s as our first column $q_1$. Then at the 1st iteration, for $q_1 = 1$, the objective value of the Restricted MLP is $\zeta^{(1)}_{RMLP} = 0 + (nm - |E|)$ for the solution vector $[\xi^{(1)}, \pi^{(1)}, q^{(1)}] = [0, 1, 1]$. Adding the same column of all 1s at the next iteration and setting $[q_1, q_2] = [\frac{1}{2}, \frac{1}{2}]$ allows us to keep $\xi^{(2)} = 0$ but set $\pi^{(2)} = \frac{1}{2}$ to get $\zeta^{(2)}_{RMLP} = 0 + \frac{1}{2}(nm - |E|)$. Therefore, continuing to add the same column of all 1s, after $t$ iterations we have $\zeta^{(t)}_{RMLP} = 0 + \frac{1}{t}(nm - |E|)$ for the solution vector $[\xi^{(t)}, \pi^{(t)}, q^{(t)}] = [0, \frac{1}{t}, \frac{1}{t}]$. Therefore, for $t \to \infty$ we have $\zeta^{(t)}_{RMLP} \to 0$ and we have not generated any columns other than the all 1s.

For a given $n \times m$ coefficient matrix H, we aim to find $a \in \{0,1\}^n$ and $b \in \{0,1\}^m$ so that $a^\top H b$ is maximised. Let $\gamma^+_i$ be the sum of the positive entries of H for each row $i \in [n]$: $\gamma^+_i := \sum_{j=1}^m \max(0, h_{ij})$. Reorder the rows of H according to decreasing values of $\gamma^+_i$. Algorithm 1 is the greedy heuristic of [11], which provides the optimal solution to the bipartite binary quadratic problem if $\min(n, m) \le 2$ and has an approximation ratio of $1/(\min(n,m) - 1)$ otherwise.

Algorithm 1:
Greedy Algorithm

Phase I. Order $i \in [n]$ so that $\gamma^+_i \ge \gamma^+_{i+1}$. Set $a = 0_n$.
for $i \in [n]$ do
    $f_{i-1} = \sum_{j=1}^m \max(0, \sum_{\ell=1}^{i-1} a_\ell h_{\ell j})$
    $f_i = \sum_{j=1}^m \max(0, \sum_{\ell=1}^{i-1} a_\ell h_{\ell j} + h_{ij})$   // tentatively include row i
    if $f_{i-1} < f_i$ then set $a_i = 1$
end

Phase II. Set $b = 0_m$.
for $j \in [m]$ do
    if $(a^\top H)_j > 0$ then set $b_j = 1$
end

In general, if a dataset has a categorical feature C with N discrete options $v_j$ ($j \in [N]$), we convert feature C into N binary features $B_j$ ($j \in [N]$) so that if the i-th sample takes value $v_j$ for C, that is $(C)_i = v_j$, then $(B_j)_i = 1$ and $(B_\ell)_i = 0$ for all $\ell \ne j \in [N]$. This technique of binarising categorical columns has been applied in [15] and [1]. The following datasets were used in the experiments:

• The Zoo dataset (zoo) [7] describes animals with characteristic features. All but one feature is binary. The categorical column recording the number of legs is converted into two new binary columns indicating whether the number of legs is less than or equal to four, or greater than four. The size of the resulting fully binary dataset is … × ….
• The Primary Tumor dataset (tumor) [13] contains observations on tumour features detected in patients. The features are represented by binary variables and categorical variables with discrete options; the categorical variables are converted into binary variables representing each discrete option. Two missing values in the binary columns are set to 0. The final dimension of the dataset is … × ….
• The Hepatitis dataset (hepat) [10] consists of 155 samples of medical data of patients with hepatitis. The 19 features of the dataset can be used to predict whether a patient with hepatitis will live or die. 6 of the 19 features take numerical values and are converted into 12 binary features corresponding to the options: less than or equal to the median value, and greater than the median value.
The column that stores the sex of patients is converted into two binary columns corresponding to the labels male and female. The remaining 12 columns take values yes and no and are converted into 24 binary columns. The raw dataset contains 1167 missing values; under the above binarisation, if a sample has a missing entry for an original feature it will have 0s in both columns binarised from that original feature. The final binary dataset has dimension … × ….
• The SPECT Heart dataset (heart) [3] describes cardiac Single Proton Emission Computed Tomography images of patients by binary feature patterns. Patients whose images contain none of the features are dropped from the dataset, hence the final dimension of the dataset is … × ….
• The Lymphography dataset (lymp) [14] contains data about lymphography examinations of patients. The categorical features are expanded into binary features representing each categorical value. One column is numerical and we convert it into two binary columns corresponding to the options: less than or equal to the median value, and larger than the median value. The final binary dataset has dimension … × ….
• The Audiology Standardized dataset (audio) [34] contains clinical audiology records on patients. The features include patient-reported symptoms, patient history information, and the results of routine tests which are needed for the evaluation and diagnosis of hearing disorders. Features that are categorical valued are binarised into new binary variables indicating whether a discrete option is selected. The final dimension of the dataset is … × ….
• The Amazon Political Books dataset (books) [18] contains binary data about US politics books sold by Amazon.com. Columns correspond to books and rows represent frequent co-purchasing of books by the same buyers.
Thedataset has dimension × .• The 1984 United States Congressional Voting Recordsdataset ( votes )[35] includes votes for each of the U.S.House of Representatives Congressmen on the keyvotes identified by the CQA. The categorical variablestaking values of “voted for”, “voted against” or “did notvote”, are converted into binary variables. One congress-man did not vote for any of the bills and its correspondingrow of zero is dropped. The final binary dataset has dimen-sion × . In practice, the input matrix X ∈ { , } n × m may containzero rows or columns. Deleting a zero row (column) leadsto an equivalent problem whose solution A and B can eas-ily be translated to a solution of the original dimension. Inaddition, if a row (column) of X is repeated α i ( β j ) times,it is sufficient to keep only one copy of it, solve the reducedproblem and reinsert the relevant copies in the correspondingplace. To ensure that the objective function of the reducedproblem corresponds to the factorisation error of the originalproblem, the variable corresponding to the representative row(column) in the reduced problem is multiplied by α i ( β j ). k greedy heuristic For a given X ∈ { , } n × m and k ∈ Z + , accordingto Equation (1) we may write min (cid:107) X − Z (cid:107) F as | E | − max (cid:80) i ∈ [ n ] ,j ∈ [ m ] h ij z ij were h ij are entries of H := 2 X − . We propose the below heuristic to compute k -BMF bysequentially computing k rank- binary matrices using thegreedy Algorithm 1. Algorithm 2:
Greedy algorithm for rank-k binary matrix factorisation

Input: $X \in \{0,1\}^{n \times m}$, $k \in \mathbb{Z}_+$. Set $H := 2X - J$.
for $\ell \in [k]$ do
    a, b = Greedy(H)    // compute a rank-1 binary matrix via the greedy algorithm
    a, b = Alt(H, a, b)  // improve the greedy solution with the alternating heuristic
    $A_{:,\ell} = a$;  $B_{\ell,:} = b^\top$
    $H[a b^\top == 1] = 0$   // set entries of H to zero that are covered
end
Output: $A \in \{0,1\}^{n \times k}$, $B \in \{0,1\}^{k \times m}$.

We remark that this rank-k greedy algorithm can be used to obtain a heuristic solution to k-BMF under standard arithmetic as well: simply modify the last line to $H[a b^\top == 1] = -K$ for a large enough number K (say $K := \sum_{ij} x_{ij}$), so that each entry of X is covered at most once.

The following methods were evaluated for the comparison in Table 4.
• For the alternating iterative local search algorithm of [1] (ASSO++) we obtained the code from the authors' github page. The code implements two variants of the algorithm and we report the smaller error solution of the two.
• The greedy algorithm (k-greedy) detailed in Appendix 5.8 is evaluated with nine different orderings and the best result is chosen.
• For the method of [40], we used the implementation in the github pymf package by Chris Schinnerl and ran it for 10000 iterations.
• We evaluated the heuristic method ASSO [27], which depends on a parameter τ; we report the best results across nine parameter settings (τ ∈ {0.1, 0.2, …, 0.9}). The code was obtained from the webpage of the author.
• In addition, we computed a rank-k non-negative matrix factorisation (NMF) and binarised it with a threshold of 0.5: after an NMF is obtained, values greater than 0.5 are set to 1, and to 0 otherwise. For the computation of NMF we used the sklearn.decomposition module in Python.
• For the MEBF method [39] we used the code from the authors' github page.
The raw code as downloaded contained a bug: it did not produce a solution for some instances, while for others it produced factorisations whose error in ‖·‖_F increased with the factorisation rank k. We fixed the code, and the results shown in Table 4 correspond to the lowest error for each instance, selected across nine parameter settings (t ∈ {0.1, 0.2, …, 0.9}).
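For concreteness, the greedy procedures of Appendix 5.8 (Algorithm 1 and its rank-k wrapper, Algorithm 2) can be sketched in Python. This is a minimal sketch under our reading of the pseudocode: the alternating local-search improvement step Alt is omitted, ties in the row ordering are broken arbitrarily, and we assume H = 2X − J with J the all-ones matrix:

```python
import numpy as np

def greedy_rank1(H):
    """Greedy heuristic for max a^T H b over binary a, b (Algorithm 1 sketch)."""
    n, m = H.shape
    # Phase I: consider rows in decreasing order of their positive-part sums.
    order = np.argsort(-np.maximum(H, 0).sum(axis=1))
    a = np.zeros(n, dtype=int)
    acc = np.zeros(m)   # column sums of the currently selected rows
    best = 0.0
    for i in order:
        cand = acc + H[i]
        val = np.maximum(cand, 0).sum()
        if val > best:  # keep row i only if it strictly improves f
            a[i] = 1
            acc = cand
            best = val
    # Phase II: select exactly the columns with positive accumulated sum.
    b = (acc > 0).astype(int)
    return a, b

def greedy_bmf(X, k):
    """Algorithm 2 sketch: k sequential rank-1 factors; covered entries of H
    are zeroed out so they contribute nothing to later factors."""
    H = (2 * X - 1).astype(float)   # H = 2X - J (assumed)
    A = np.zeros((X.shape[0], k), dtype=int)
    B = np.zeros((k, X.shape[1]), dtype=int)
    for l in range(k):
        a, b = greedy_rank1(H)
        A[:, l], B[l] = a, b
        H[np.outer(a, b) == 1] = 0
    return A, B
```

On easy inputs the sketch recovers exact factorisations; for instance, `greedy_bmf(np.eye(2, dtype=int), 2)` returns factors whose Boolean product equals the input.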