BIDEAL: A Toolbox for Bicluster Analysis -- Generation, Visualization and Validation
Nishchal K. Verma, T. Sharma, S. Dixit, P. Agrawal, S. Sengupta, V. Singh
Abstract: This paper introduces a novel toolbox named BIDEAL for the generation of biclusters and their analysis, visualization, and validation. The objective is to give researchers access to state-of-the-art biclustering algorithms on a single platform. A single toolbox comprising various biclustering algorithms plays a vital role in extracting meaningful patterns from data for detecting diseases, biomarkers, gene-drug associations, etc. BIDEAL consists of seventeen biclustering algorithms, three bicluster visualization techniques, and six validation indices. The toolbox can analyze several types of data, including biological data, through a graphical user interface. It also provides data preprocessing techniques, i.e., binarization, discretization, normalization, and elimination of null and missing values. The effectiveness of the developed toolbox is demonstrated through testing and validation on the Saccharomyces cerevisiae cell cycle, Leukemia cancer, Mammary tissue profile, and Ligand screen in B-cells datasets. The biclusters of these datasets have been generated using BIDEAL and evaluated in terms of coherency, differential co-expression ranking, and similarity measure. The generated biclusters are also visualized through a heat map and gene plot.
Index Terms: Biclustering, Gene expression analysis, Data visualization, Preprocessing, Validation index, MTBA, Coherency.
1 INTRODUCTION
Biclustering has become a prevalent and useful data mining technique among researchers for analyzing data. It has been applied to a wide variety of applications such as bioinformatics, information retrieval, text mining, dimensionality reduction, recommender systems, electoral data analysis, disease identification, association rule discovery in databases, and many more [1]. Among these, bioinformatics [2] [3] seems to have taken particular advantage of biclustering for the analysis of gene expression data. During any biological process under different experimental conditions, genes are examined by their expression levels. The data is present in matrix form, with rows representing genes and columns representing experimental conditions. The aim is to group genes and conditions into a sub-matrix to obtain crucial biological information such as the identification of co-regulated patterns among genes. A bicluster B can be represented as

$$
B = \begin{bmatrix}
b_{11} & b_{12} & b_{13} & \cdots & b_{1|J|} \\
b_{21} & b_{22} & b_{23} & \cdots & b_{2|J|} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{|I|1} & b_{|I|2} & b_{|I|3} & \cdots & b_{|I||J|}
\end{bmatrix} \tag{1}
$$

where $b_{ij}$ refers to the expression level of instance $i$ under sample $j$, $\forall i \in \{1, 2, ..., |I|\}$ and $\forall j \in \{1, 2, ..., |J|\}$, $|I|$ is the number of instances, and $|J|$ is the number of attributes. Biclustering involves finding the maximal sub-matrices in a data matrix with maximum coherency. Since biclustering is an NP-hard problem, various heuristic and meta-heuristic approaches have been used in the literature to find better solutions [4].

The traditional clustering algorithms give equal importance to all the columns. These algorithms include K-means clustering [5], hierarchical clustering [6], self-optimal clustering [7], improved mountain clustering [8], fuzzy C-means clustering [9], unsupervised fuzzy clustering [10], etc. Each algorithm has its own advantages. Despite their usefulness, they are not very helpful in a variety of problems. For example, in gene expression analysis, not every gene takes part in every condition.
Thus, combinatorial regulation and joint patterns of gene expression biclustering are essential to realize the complex nature of genes. In [11], a plethora of solutions to perform biclustering has been presented. Among this pool of algorithms, each has its own distinctive approach, heuristic or statistical, with its own merits and demerits. A single approach cannot be expected to be well suited for all types of data. Therefore, any problem must be tackled with the respective suitable algorithms, and the best result must be noted. This generates the need for a comprehensive biclustering toolbox in which various algorithms can be tested, validated, and visualized. Toolboxes can be compared in terms of the following:
(a) Number of algorithms embedded in the toolbox.
(b) Number of validation indices present for qualitative analysis of generated biclusters.
(c) Number of visualization methods available for generated biclusters.
(d) User-friendly interface of the toolbox.
Based on the above-mentioned features, it can be summarized that a toolbox must be diverse in nature. In the past decade, the growing demand for biclustering algorithms has led to intense research on developing toolboxes for biclustering. This paper proposes a user-friendly toolbox

TABLE 1
Summary of the biclustering toolboxes
Toolboxes | Algorithms | Validation Indices | Visualization Methods | Platform
BicAT [12] | CC [23], ISA [27], OPSM [26], xMotif [30] | None | Heat Map [48] | JAVA
BiVisu [16] | Greedy version of pCluster | Mean Square Residue, Average Correlation Value | Heat Map, Parallel Coordinate Plots | MATLAB
BicOverlapper 2.0 [13] | Visualization Toolbox | None | Venn like Diagrams | R, JAVA
Expander [14] | SAMBA [46] | None | Heat Map | JAVA
BAT [17] | BiHEA [47] | Pairwise Gene Analysis | Heat Map, Numerical Matrix | JAVA
BiBench [18] | CC, OPSM, xMotif, kSpectral [28], ISA, Plaid [31], BiMax [32], Bayesian, QUBIC [38], FABIA [34], COALESCE | Jaccard Index [40], F-measure | Heat Map, Bicluster Projection, Parallel Coordinates | Python
BiClust [19] | BiMax, CC, Plaid | Jaccard Index, Constant Variance | Parallel Plot, Heat Map, Bubble Plot | R
BicNET [15] | BicNET | None | Biclustering Network Data | Java
MTBA [20] | CC, BSGP [25], ISA, OPSM, kSpectral, ITL [29], xMotif, BiMax, Plaid, FLOC [24], BiMax, LAS [33] | Jaccard Index, SB Score, Constant and Sign Variance | Heat Map, Gene Plot | MATLAB
CoClust [21] | Modularity Based, Information-Theoretic Based | None | Cluster Plot, Cluster Size, Heat Map, Cluster Graph | Python
BicPAMS [22] | BicPAM, BicNET, Bic2PAM, BiP, BiModule | None | Graphical Display, Heat Map | Java
BIDEAL (Proposed Toolbox) | CC, BSGP, OPSM, ISA, kSpectral, ITL, xMotif, Plaid, FLOC, BiMax, LAS, FABIA, BitBit [35], BiSim [36], MSVD [37], QUBIC, ROBA [39] | Jaccard Index, SB Score, Constant and Sign Variance, Hausdorff, MSE | Heat Map, Gene Plot, Cluster Plot, Numerical Matrix | MATLAB

namely "BIDEAL", which incorporates biclustering algorithms, validation indices, and visualization methods. Table 1 summarizes various biclustering toolboxes in terms of available algorithms, validity indices, and visualization methods. Considering the visualization methods or result presentation for generated biclusters, BicAT [12], BicOverlapper 2.0 [13], Expander [14], and BicNET [15] provide only a single visualization method. On the other hand, BiVisu [16], BAT [17], BiBench [18], BiClust [19], MTBA [20], CoClust [21], BicPAMS [22], and BIDEAL have multiple methods of visualization. Among these, CoClust and BIDEAL offer the maximum number of visualization methods. By default, BIDEAL provides bicluster results in a numerical matrix. Another important feature of a toolbox is the set of validation indices for checking the quality of the obtained biclusters. BiVisu, BAT, BiBench, and BiClust offer only one or two validation indices, whereas BIDEAL has six, i.e. the maximum among the listed toolboxes. A Graphical User Interface (GUI) for executing various algorithms on a single platform simplifies the process. The user-friendly interface of BIDEAL makes testing a new dataset easy without any prior knowledge of back-end programming. On the other hand, BiBench, BiVisu, BiClust, CoClust, and MTBA require some familiarity with programming. Moreover, BicAT allows algorithms to be executed only with default parameter settings, which is a constraint, whereas BIDEAL allows these parameters to be changed.

Contributions:
This paper introduces the proposed BIDEAL toolbox, its necessity, and its importance in comparison with other existing toolboxes in the literature. Table 2 summarizes the comparison of the features available in BIDEAL with respect to existing toolboxes in the literature. In summary, the features of BIDEAL are as follows:
(a) It is developed to integrate the largest number of biclustering algorithms, validation indices, and visualization methods (over existing toolboxes) on a single platform.
(b) It also accommodates preprocessing methods.
(c) It has a more user-friendly interface than other existing biclustering toolboxes.
(d) To demonstrate the usefulness of BIDEAL, it has been evaluated on four standard datasets and the resulting validation indices have been compared.
To the best of our knowledge, no existing biclustering toolbox has all these features incorporated on a single platform.

The paper is arranged as follows: Section 2 presents a brief introduction to the biclustering algorithms embedded in BIDEAL, Section 3 describes the validation indices, Section 4 illustrates the GUI of BIDEAL, and Section 5 provides the results on four standard datasets using BIDEAL. Finally, Section 6 concludes the paper.
2 READY FOR USE BICLUSTERING ALGORITHMS
This section provides a brief overview of the biclustering algorithms embedded in BIDEAL.

Cheng and Church (CC) [23] proposed an algorithm to process expression data on the basis of the Mean Squared Residue (MSR) score, defined as

$$
\mathrm{MSR} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2 \tag{2}
$$

MSR measures the coherency of genes and conditions using mean values and is used to extract δ-biclusters. Another effective algorithm, FLexible Overlapped biClustering (FLOC) [24], performs probabilistic steps to find overlapped biclusters, which are further refined using the MSR score to overcome the effect of missing values in biclusters. Missing values often create random disturbances which affect the quality and slow down the identification of biclusters. The biclusters acquired by FLOC give better results for a larger matrix with smaller MSR in comparison to CC.

Dhillon [25] used Bipartite Spectral Graph Partitioning (BSGP) to model the data matrix as a graph $G = (R, C, E)$. It is based on an exhaustive bicluster enumeration approach, which tries to find minimum-cut vertex partitions in a bipartite graph between rows and columns. It is quite expensive in terms of time and memory. The BSGP approach can be represented as

$$
\mathrm{cut}(R_1 \cup C_1, ..., R_k \cup C_k) = \min_{O_1, ..., O_k} \mathrm{cut}(O_1, O_2, ..., O_k) \tag{3}
$$

The Order Preserving Sub-Matrices (OPSM) [26] algorithm finds sub-matrices whose expression levels are in strictly increasing linear order. The algorithm uses a heuristic approach for biclustering. A sub-matrix can be said to be order

TABLE 2
Comparison of the features of various biclustering toolboxes
Features | BicAT [12] | BiVisu [16] | BicOverlapper 2.0 [13] | Expander [14] | BAT [17] | BiBench [18] | BiClust [19] | BicNET [15] | MTBA [20] | CoClust [21] | BicPAMS [22] | BIDEAL (Proposed Toolbox)
No. of Algorithms | | | | | | | | | | | |
 | Yes | No | Yes | Yes | Yes | No | No | Yes | No | No | Yes | Yes
The values shown in bold represent the best feature among all the toolboxes.

preserving if, under a permutation of the conditions, the gene expression values are linearly increasing or decreasing.

Another approach, the Iterative Search Algorithm (ISA) proposed by Bergmann et al., is based on coherently overlapped biclusters, also referred to as Transcription Modules (TM). It extracts biclusters from the gene expression data matrix by iterative search [27].

In [28], Kluger et al. proposed a spectral technique known as kSpectral to find biclusters based on the eigenvectors of the data matrix. First, the dataset is normalized, and then singular value decomposition is applied to the microarray data, where piecewise-constant eigenvectors reveal the checkerboard patterns in the sub-matrix. Finally, k-means clustering is applied to obtain the checkerboard structures from the data matrix.

In [29], the authors presented the information-theoretic (ITL) formulation for biclustering. In this formulation, an optimization approach is followed in which the numbers of row and column clusters are constraints and the task is to maximize the mutual information between the clustered random variables. It can reduce the problems of high dimensionality and sparsity.

Murali et al. [30] proposed a representation for gene expression data called conserved gene expression motifs, or xMotifs. The algorithm tries to find large conserved gene expression motifs in the given discretized data matrix, using a greedy approach that searches for conserved rows. A sub-matrix is said to be a conserved motif if the expression level of each gene is consistent across the respective sub-matrix. Comparing distinct gene motifs for distinct conditions identifies genes that are conserved across multiple conditions but are in different states in different conditions.

In Plaid [31], a bicluster is assumed to follow a statistical model, and binary least squares is used to fit the bicluster membership parameters.
In this model, the data matrix can be considered as a superposition of layers, where a layer is a subset of genes and conditions of the data matrix. The plaid model fitted to the data can be expressed as

$$
a_{ij} = \sum_{k=1}^{B_{num}} \theta_{ijk} + \rho_{ik} + \kappa_{jk} \tag{4}
$$

Binary Inclusion Maximal (BiMax) is based on a fast divide and conquer approach [32]. It tries to find all the inclusion-maximal biclusters that contain only 1s. The algorithm requires discretization of the gene expression level matrix into a binary matrix by deciding a threshold.

Large Average Sub-Matrix (LAS) [33] is a statistically advanced algorithm which uses a Gaussian null model for gene expression data. It finds the bicluster that gives the largest significance score, with the data modeled as

$$
a_{ij} = \sum_{k=1}^{B_{num}} l_r F\left( i \in I_r, j \in J_r \right) + \xi \tag{5}
$$

The elements of the data matrix are subtracted from the mean of the significance score (5) to form a residual matrix. The search is iteratively repeated until the optimal $\varphi(D)$ value falls below the predefined threshold.

Hochreiter et al. [34] presented a multiplicative-model biclustering algorithm, Factor Analysis for Bicluster Acquisition (FABIA), that takes the linear dependency between genes and conditions into account. In this model, the row and column vectors need to be multiples of each other. FABIA models the data matrix as the sum of $k$ biclusters and additive noise. Here, the linear dependency of subsets of rows and columns can be described by the outer product $u \times v^T$. The overall model is given by

$$
A = \sum_{t=1}^{B_{num}} u_t v_t^T + \xi \tag{6}
$$

In [35], bit patterns are extracted from the data matrix using a two-phase process known as the BitBit algorithm. The first phase includes a novel encoding process that divides the columns of the data matrix into segments of a certain length determined by the minimum number of columns. In the second phase, biclustering of bit patterns takes place using selective search. Each pair of rows generates a pattern.
In BitBit, the comparison between rows takes place at the bit level. To reduce excessive computation, BiSim [36] uses an iterative approach instead of the divide and conquer approach of BiMax, avoiding recursion and additional traversals of the matrix.

Wang et al. [37] proposed the Modular Singular Value Decomposition Multi-Objective Evolutionary biclustering (MSVD) algorithm. MSVD splits the gene expression data matrix into a set of non-overlapping sub-matrices with equal dimensions. Then, it projects the data obtained for the desired number of eigenvalues and applies k-means clustering to cluster them.

Another algorithm, QUalitative BIClustering (QUBIC) [38], based on a graph-theoretic approach, is also embedded in BIDEAL. In QUBIC, the expression level of genes under multiple conditions is expressed in a qualitative or semi-qualitative manner as an integer value.

Tchagang et al. proposed ROBA [39], which uses basic linear algebra techniques. There are three main steps in this algorithm. The first step involves preprocessing of the data to handle missing values and noise. The second step decomposes the given data matrix into binary matrices. The last step involves identification based on the type of bicluster.

3 ACCESSIBLE VALIDATION INDICES FOR PERFORMANCE MEASURES
Various validation indices are used as performance measures to check the quality of biclusters, as described in the following subsections.
The Jaccard index [40] compares the biclusters obtained by two biclustering algorithms by measuring how many biclusters are similar between them. It takes the value 0 if the biclusters are completely dissimilar and 1 if they are identical. The Jaccard index is defined as

$$
jac_c\left( B_1, B_2 \right) = \frac{jac\left( B_1, B_2 \right)}{\max\left( jac_c \right)} \tag{7}
$$

The differential co-expression ranking score, a.k.a. SB score, was proposed in [41]. Consider two biclusters $B_1$ and $B_2$, where $B_1$ is formed by a set of genes under the first set of conditions and $B_2$ is formed by the same genes under a second set of conditions. Chia et al. proposed an algorithm to compare the goodness of the genes w.r.t. two nonidentical sets of conditions. If $B$ is a good gene set, then there will be co-expression among the genes under the first set of conditions and differential co-expression under the second set of conditions. The differential co-expression of $B$ can be measured as

$$
SB(B) = \log \left( \frac{\max(T(B_1) + \omega),\ \max(Q(B_1) + \omega)}{\max(T(B_2) + \omega),\ \max(Q(B_2) + \omega)} \right) \tag{8}
$$

where $\omega$ is used to offset large ratios.

In [7], the corresponding variance of genes/conditions is taken into consideration, where the variance is the average of the sum of Euclidean distances between rows and columns of the bicluster. The higher the value of the variance, the lower the quality of the bicluster. The expression of the variance is given by

$$
\mathrm{var} = \sum_{i \in I, j \in J} \left( a_{ij} - a_{IJ} \right)^2 \tag{9}
$$

For a coherent bicluster, the value of the sign variance is lower [20]. It is the same as the constant variance, except that it first converts the data matrix into a sign matrix and then estimates the variance.
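As a rough illustration of the Jaccard index, the sketch below compares two biclusters by the matrix cells they cover. Representing a bicluster as a pair of row and column index lists, and the helper names `bicluster_cells` and `jaccard`, are assumptions made for this example. It computes the plain, unnormalized index rather than the normalized form of Eq. (7), and it is not taken from the BIDEAL code.

```python
def bicluster_cells(rows, cols):
    """Return the set of (row, column) cells covered by a bicluster."""
    return {(r, c) for r in rows for c in cols}


def jaccard(b1, b2):
    """Plain Jaccard index |B1 & B2| / |B1 | B2| over covered cells.

    Each bicluster is a (row_indices, column_indices) pair; this
    representation is an assumption of the sketch.
    """
    cells1 = bicluster_cells(*b1)
    cells2 = bicluster_cells(*b2)
    union = cells1 | cells2
    if not union:
        return 0.0  # two empty biclusters: treat as dissimilar
    return len(cells1 & cells2) / len(union)


# Hypothetical biclusters: rows 0-2 x cols 0-1 and rows 1-3 x cols 1-2.
b_a = ([0, 1, 2], [0, 1])
b_b = ([1, 2, 3], [1, 2])
print(jaccard(b_a, b_a))  # identical biclusters
print(jaccard(b_a, b_b))  # partially overlapping biclusters
```

Comparing covered cells rather than only gene sets keeps the measure sensitive to differences in both rows and columns, which is one reasonable reading of bicluster similarity.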
The Hausdorff distance [42] calculates the distance between a pair of sub-matrices obtained from the gene expression data matrix. It is the largest distance from an element of one bicluster to the nearest element of the other bicluster, and thus signifies dissimilarity. Mathematically, it can be written as

$$
HD\left( B_1, B_2 \right) = \max \left\{ \sup_{b_1 \in B_1} \inf_{b_2 \in B_2} d(b_1, b_2),\ \sup_{b_2 \in B_2} \inf_{b_1 \in B_1} d(b_1, b_2) \right\} \tag{10}
$$

To calculate the mean squared residue, the mean square error (MSE) of each bicluster is calculated [23]. The overall MSE is then obtained by taking the mean of the individual values.

4 KEY FEATURES AND GUI
BIDEAL integrates various biclustering algorithms into a stand-alone application with a graphical user interface (GUI) developed using MATLAB. It is executable on Windows as well as on Linux operating systems. BIDEAL includes several functions to preprocess the raw data and to validate and visualize the biclusters. The key features of BIDEAL are as follows:
BIDEAL includes four preprocessing methods, i.e. filtering, binarization, discretization, and normalization. Filtering is used to eliminate the effect of Not a Number (NaN) spots and missing values from the data. Binarization is used to convert a numerical feature vector into a Boolean one; it is mostly useful for downstream probabilistic estimators which assume that the input data is distributed according to a multivariate Bernoulli distribution. Discretization, a.k.a. quantization or binning, is used to transform continuous features into discrete values. Some datasets have continuous features that are not linearly correlated with the target and cannot be handled by feature selection methods. In such cases, obtaining an interpretable explanation of such features is not easy. However, this type of data may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes. Normalization is used for scaling the individual samples to have unit norm. In general, min-max and z-score normalization are used when data come from a normal distribution. However, biomedical data and most clinical research data do not follow the normal distribution because they are mostly skewed. For this purpose, logarithmic transformation, bistochastization, and independent rescaling of rows and columns are used. The log transformation decreases the variability of the data. Bistochastization makes all rows and columns have the same mean value, with the matrix repeatedly normalized until convergence, whereas in independent row and column normalization the rows sum to a constant and the columns sum to a distinct constant [28].
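The preprocessing steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the strict-threshold rule for binarization, the equal-width binning for discretization, and the alternating row/column rescaling used for bistochastization are choices made for this example, not a description of BIDEAL's MATLAB implementation.

```python
import math


def binarize(matrix, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


def discretize(matrix, n_bins):
    """Equal-width binning of all values into integer levels 0..n_bins-1."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant data
    return [[min(int((v - lo) / width), n_bins - 1) for v in row]
            for row in matrix]


def log_transform(matrix):
    """Natural logarithm of every (strictly positive) entry."""
    return [[math.log(v) for v in row] for row in matrix]


def bistochastize(matrix, tol=1e-9, max_iter=1000):
    """Alternately rescale rows and columns of a positive matrix until
    every row mean and every column mean is approximately 1."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(max_iter):
        # Row step: divide each row by its mean.
        m = [[v / (sum(row) / n_cols) for v in row] for row in m]
        # Column step: divide each column by its mean.
        col_means = [sum(m[i][j] for i in range(n_rows)) / n_rows
                     for j in range(n_cols)]
        m = [[m[i][j] / col_means[j] for j in range(n_cols)]
             for i in range(n_rows)]
        # Column means are now exactly 1; stop once row means are too.
        if max(abs(sum(row) / n_cols - 1.0) for row in m) < tol:
            break
    return m


# Hypothetical 3 x 3 expression matrix (genes x conditions).
data = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [3.0, 1.0, 2.0]]
print(binarize(data, 2.0))
print(discretize(data, 3))
print(bistochastize(data))
```

The alternating rescaling only makes sense for strictly positive data, which is one reason a transformation such as the logarithm or a shift is usually chosen with the data range in mind.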
For bicluster generation, seventeen biclustering algorithms have been embedded in BIDEAL, which is the maximum among all the toolboxes listed in Table 2. This provides the flexibility to select biclustering algorithms according to the nature of the data. The availability of all algorithms on a single platform allows the data to be analyzed with minimal effort. Without prior knowledge of the algorithms, setting the parameters is quite challenging for a naive user. BIDEAL initializes the parameters to the values provided in the original published work, and users can easily change them if needed.
BIDEAL offers several ways to ensure smooth and robust bicluster generation. For example, the filtering option is provided to reduce the effect of NaN and missing values present in the dataset.
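One simple way to realize such a filtering step is to drop every row (gene) that contains a NaN or missing entry. The sketch below shows this in plain Python. The helper name `drop_incomplete_rows` and the choice to drop rows rather than impute values are assumptions for illustration, since the text does not specify how BIDEAL filters the data.

```python
import math


def drop_incomplete_rows(matrix):
    """Keep only rows with no NaN or missing (None) entries.

    Dropping whole rows is an assumed strategy; imputing the missing
    values would be an equally plausible alternative.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [row for row in matrix if not any(is_missing(v) for v in row)]


# Hypothetical data: the second and third rows are incomplete.
raw = [[1.0, 2.0], [float("nan"), 3.0], [4.0, None], [5.0, 6.0]]
print(drop_incomplete_rows(raw))
```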
Fig. 1. Graphical User Interface of BIDEAL. From left to right: Homepage, Visualization page, and Validation page.
BIDEAL offers validation indices to determine the type of biclusters. For example, the constant variance can identify a constant bicluster, whereas the sign variance can identify a bicluster with coherent sign changes on rows and columns.
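To make the role of these indices concrete, the following sketch evaluates the constant variance of Eq. (9) and the mean squared residue of Eq. (2) on a small bicluster, together with a sign variance. The sign variance here is computed as the constant variance of the element-wise sign matrix, which is an assumed reading of the description in Section 3. All function names are hypothetical and the code is not the toolbox's own.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def constant_variance(b):
    """Eq. (9): sum of squared deviations from the bicluster mean a_IJ."""
    a_IJ = mean(v for row in b for v in row)
    return sum((v - a_IJ) ** 2 for row in b for v in row)


def sign_variance(b):
    """Constant variance of the sign matrix (assumed: -1, 0, or +1 per entry)."""
    signs = [[(v > 0) - (v < 0) for v in row] for row in b]
    return constant_variance(signs)


def msr(b):
    """Eq. (2): mean squared residue of a bicluster."""
    n_rows, n_cols = len(b), len(b[0])
    row_means = [mean(row) for row in b]                      # a_iJ
    col_means = [mean(b[i][j] for i in range(n_rows))         # a_Ij
                 for j in range(n_cols)]
    a_IJ = mean(row_means)                                    # overall mean
    total = sum((b[i][j] - row_means[i] - col_means[j] + a_IJ) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Hypothetical biclusters: one constant, one with additive row shifts.
constant = [[2.0, 2.0], [2.0, 2.0]]
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(constant_variance(constant), msr(constant))
print(constant_variance(additive), msr(additive), sign_variance(additive))
```

In this toy case the additive bicluster has a zero residue but a non-zero constant variance, which illustrates how different indices respond to different bicluster types.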
BIDEAL offers two validation indices, i.e. the Jaccard index and the Hausdorff distance, to measure the similarity and dissimilarity, respectively, between two biclusters. The value of the Jaccard index of a particular biclustering algorithm varies from 0 to 1 depending upon the level of similarity. The Hausdorff distance, widely used in several applications, can also measure the distance between two distinct biclusters. For example, on the Yeast dataset, Jaccard index values were calculated with respect to the CC algorithm, and the results obtained from the other algorithms were dissimilar from CC, as the Jaccard index values were very low for all other algorithms.

BIDEAL offers a user-friendly GUI which is easy to use for bicluster analysis, including generation, visualization, and validation. The unique features of this interface are:
(a) BIDEAL is a self-contained, concise toolbox with all the relevant information present in it. It provides immediate visual results and shows the effect of each action.
(b) In many cases, the installation of a toolbox depends on other components, such as a language runtime, which in general are not supplied with the toolbox package. To ease installation, the stand-alone executable files are packaged with the MATLAB run-time compiler in BIDEAL. This enables the user to just click and install the ready-to-use biclustering algorithms.
BIDEAL has been developed using MATLAB, which integrates various features into a stand-alone application. The GUI of the developed BIDEAL toolbox comprises the following steps for bicluster generation, validation, and visualization:
(i) The home page of BIDEAL is shown in Fig. 1. First, the dataset should be loaded. It can be either a sample or a user-defined dataset.
(ii) The data can be preprocessed using filtering, binarization, normalization, or discretization.
(iii) Select the required algorithm to generate biclusters. The user will be prompted to enter input parameters; otherwise BIDEAL will use the default values.
(iv) Generated results can be saved in a .mat file.
(v) Click the Bicluster Visualization button on the home page to visualize the biclusters. Any of the three options available on the visualization page, i.e. heat map, cluster plot, or gene profile, can be clicked to visualize the result.
(vi) Click the Bicluster Quality Index button to access the validation indices. The validation page displays the result for an individual bicluster or for all biclusters overall.
(vii) Press the Reset button to return to the home page.
5 TESTING AND VALIDATIONS ON BENCHMARK DATASETS
Generating patterns or biclusters for gene expression profiling on a single platform reduces user effort. Hence, BIDEAL offers a user-friendly interface to reduce the cumbersomeness faced during bicluster formation. In this section, experiments and validation on four benchmark datasets are presented using BIDEAL. The four datasets used are the Saccharomyces cerevisiae cell cycle dataset (Yeast) [23] with [...] genes and [...] conditions, the Leukemia (ALL vs. AML) dataset [43] with [...] genes and [...] conditions, the Mammary tissue profile dataset (GDS205) [44] with [...] genes and [...] conditions, and the Ligand screen in B cells dataset (GDS301): Epstein Barr virus-induced molecule-1 [45] with [...] genes and 11 conditions. The biclusters formed on these four benchmark datasets are further validated using the validation indices available in BIDEAL, as depicted in Fig. 2. Table 3 tabulates the number of biclusters obtained using the biclustering algorithms embedded in BIDEAL. Since the Yeast [23] and ALL vs. AML [43] datasets are already preprocessed, only GDS205 and GDS301 were preprocessed before execution of the biclustering algorithms. In the following subsections, the findings of BIDEAL are discussed in detail.

The Yeast dataset [23] comprises [...] genes and [...] conditions. The objective of this dataset is the identification of genes whose mRNA levels are regulated by the cell cycle. The numbers of biclusters generated on the Yeast dataset using BIDEAL are reported in Table 3. The table shows that, among all algorithms, ROBA generates the highest number of biclusters, whereas kSpectral fails to produce any bicluster, i.e. 0. This is due to the fact that kSpectral did not find any distinctive checkerboard patterns in the Yeast dataset. On the other hand, ROBA utilizes simple linear algebraic methods instead of complex optimization and

Fig. 2. Validation indices on various datasets.

TABLE 3
Number of biclusters obtained with biclustering algorithms available in BIDEAL on four datasets
Datasets | CC [23] | BSGP [25] | OPSM [26] | ISA [27] | kSpectral [28] | ITL [29] | xMotif [30] | Plaid [31] | FLOC [24] | BiMax [32] | LAS [33] | FABIA [34] | BitBit [35] | BiSim [36] | MSVD [37] | QUBIC [38] | ROBA [39]
Yeast [23] | 100 | 10 | 16 | 16 | 0 | 1 | 97 | 4 | 20 | 75 | 20 | 2 | 212 | 1547 | 13 | 10 | 10104
ALL vs. AML [43] | 1 | 0 | 37 | 500 | 0 | 100 | 89 | 4 | 20 | 100 | 52 | 5 | 0 | 0 | 100 | 0 | 32591
GDS205 [44] | 1 | 7 | 7 | 13 | 6 | 0 | 5 | 0 | 20 | 11 | 5 | 5 | 0 | 1 | 3 | 10 | 3925
GDS301 [45] | 1 | 0 | 10 | 0 | 0 | 100 | 39 | 0 | 20 | 100 | 5 | 4 | 1 | 1 | 1 | 0 | 0
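The counts in Table 3 can be summarized programmatically, for instance to list which algorithms produced the most biclusters or none at all on each dataset. The short sketch below does this with the values transcribed from the table. The names `ALGORITHMS`, `COUNTS`, and `summarize` are introduced only for this illustration.

```python
# Column order and counts transcribed from Table 3.
ALGORITHMS = ["CC", "BSGP", "OPSM", "ISA", "kSpectral", "ITL", "xMotif",
              "Plaid", "FLOC", "BiMax", "LAS", "FABIA", "BitBit", "BiSim",
              "MSVD", "QUBIC", "ROBA"]

COUNTS = {
    "Yeast": [100, 10, 16, 16, 0, 1, 97, 4, 20, 75, 20, 2, 212, 1547, 13, 10, 10104],
    "ALL vs. AML": [1, 0, 37, 500, 0, 100, 89, 4, 20, 100, 52, 5, 0, 0, 100, 0, 32591],
    "GDS205": [1, 7, 7, 13, 6, 0, 5, 0, 20, 11, 5, 5, 0, 1, 3, 10, 3925],
    "GDS301": [1, 0, 10, 0, 0, 100, 39, 0, 20, 100, 5, 4, 1, 1, 1, 0, 0],
}


def summarize(counts):
    """Return (algorithms with the maximum count, algorithms with zero)."""
    best = max(counts)
    top = [a for a, c in zip(ALGORITHMS, counts) if c == best]
    empty = [a for a, c in zip(ALGORITHMS, counts) if c == 0]
    return top, empty


for dataset, counts in COUNTS.items():
    top, empty = summarize(counts)
    print(f"{dataset}: most biclusters from {top}; none from {empty}")
```

Reporting all algorithms that tie for the maximum, rather than a single winner, matters for GDS301, where two algorithms share the largest count.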
Fig. 3. Left: Heat map plot and Right: Gene plot using CC algorithm for generated bicluster on Yeast dataset.

extracted the highest number of biclusters, i.e. 10104. Since the ranking of biclustering algorithms is application specific, their utility cannot be measured by the number of biclusters alone: for instance, BiSim formed 1547 biclusters, whereas ITL and FABIA extracted only 1 and 2 biclusters, respectively. However, all of them have their own biological significance. CC formed 100 biclusters, which cover approximately [...] genes and approximately [...] of the conditions. Fig. 3 shows a sample heat map and gene plot using the CC algorithm for a generated bicluster on the Yeast dataset. BSGP and QUBIC reported 10 biclusters each, whereas FABIA and Plaid produced very few biclusters with fewer genes and conditions. BitBit gave 212 biclusters, while kSpectral failed to produce any bicluster, which signifies that this model does not fit the given dataset. OPSM and ISA reported the same number of biclusters. Considering the quality of the obtained biclusters, it was noted that the biclusters obtained using BiSim, FABIA, and kSpectral had no similarity w.r.t. CC in the context of the Jaccard index. On the other hand, ITL, Plaid, BitBit, and ISA had very low similarity. BSGP and MSVD gave higher similarity, while ROBA had the maximum similarity among all. According to the sign variance metric, the biclusters obtained using CC, Plaid, ISA, and FABIA were less coherent, while ROBA, BSGP, and BiMax gave strongly coherent biclusters. LAS, BiSim, and MSVD gave biclusters of average coherency. When measuring the quality of biclusters using the constant variance, it was inferred that BSGP, MSVD, and BiMax formed better biclusters, while ISA and Plaid gave higher values of constant variance, indicating lower quality of biclusters. LAS, CC, BitBit, ITL, and FLOC gave biclusters of average quality.
The Leukemia dataset comprises two subtypes of leukemia cancer, i.e. Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL). It has [...] genes and [...] conditions. For the ALL vs. AML dataset, ROBA again reported the highest number of biclusters, 32,591, owing to its ability to extract more than one type of bicluster from a given dataset. As mentioned earlier, the various biclustering algorithms are able to extract specific patterns from a dataset. For example, BSGP works better when the dataset can be modelled efficiently using a bipartite graph, whereas kSpectral is well known for extracting checkerboard patterns in data. For this dataset, neither pattern was applicable; therefore 0 biclusters were reported. On the other hand, BitBit and BiSim are known to search for patterns in less time by traversing the binarized data matrix with tuned parameters. As shown in Table 3, BSGP, kSpectral, BitBit, and BiSim failed to produce any bicluster. BiMax successfully extracted 100 inclusion-maximal biclusters from this dataset. It is interesting to notice that ITL, BiMax, and MSVD produced the same number of biclusters, i.e. 100, though their objective functions are different from each other. ITL tries to preserve mutual information, whereas BiMax follows a divide and conquer strategy and MSVD is inspired by linear algebra techniques. CC formed only one bicluster, which contains all genes and conditions. LAS, OPSM, and xMotif resulted in 52, 37, and 89 biclusters, respectively. FABIA and Plaid extracted only 5 and 4 biclusters, due to the presence of fewer conditions and of few layers as per the plaid model. Considering the Jaccard index similarity, the xMotif and CC values were high. CC and xMotif had a negative SB score, which indicates differential co-expression. According to the sign variance, CC gave coherent biclusters as it had the lowest value, while the high values of FLOC and BiMax indicate less coherent biclusters. The rest of the algorithms generated biclusters with average coherency. From the constant variance values, it can be inferred that ISA gave very low quality biclusters.
GDS205 [44] comprises [...] genes and [...] conditions. For this dataset, ROBA again resulted in a high number of biclusters, i.e. 3925. This indicates that there are overlapping gene and sample sets in which genes are involved in several biological pathways. The rest of the biclustering algorithms embedded in BIDEAL extracted only approximately 12 biclusters. BiMax successfully extracted 11 subsets of genes and conditions, whereas BiSim extracted only 1 bicluster. FABIA extracted only 5 biclusters, which signifies that the GDS205 dataset is not influenced by a heavy-tailed distribution. For this dataset, the benefit of the FLOC algorithm over CC is clearly shown: FLOC resulted in 20 biclusters without being affected by random interference, whereas CC produced only 1 bicluster. BSGP and OPSM both gave 7 biclusters, indicating the presence of order-preserving sub-matrices in GDS205. kSpectral and xMotif resulted in 6 and 5 biclusters, respectively. LAS and MSVD discovered 5 and 3 biclusters, respectively. QUBIC identified checkerboard patterns present in the data. For this dataset, ITL, Plaid, and BitBit failed to provide any bicluster. Plaid did not find any shift biclusters in this dataset, whereas ITL failed to find co-entropy-based subsets of genes and conditions. Considering the validity of these biclusters, in terms of sign variance, CC and QUBIC resulted in very low values, i.e. more coherent biclusters, but the biclusters produced by LAS were not coherent and hence had a high value of sign variance. According to the constant variance, CC and QUBIC produced the best biclusters, but FLOC gave a high value of constant variance, which means that the quality of its biclusters was not good. As for the other datasets, Jaccard indices were calculated w.r.t. CC. They show that the biclusters formed by BSGP and MSVD had the lowest similarity with the biclusters formed by CC. It can also be concluded that CC and QUBIC produced better biclusters for this dataset.
The GDS301 dataset comprises , genes and 11 conditions collected by culturing B cells with ligands to perform temporal analysis. As shown in Table 3, BiMax produced the maximum number of biclusters, i.e. . This signifies that 100 biclusters were found with values of 1s by enumeration. ITL also discovered the same number of biclusters by extracting mutual information between genes and conditions. BSGP, kSpectral, and Plaid failed to produce any bicluster. Plaid discovers interesting patterns in multivariate data, whereas kSpectral identifies biclusters only if genes are co-regulated with expression levels. FABIA extracted 4 biclusters. CC, ISA, BitBit, and BiSim each reported one bicluster containing all genes and conditions, which means that these algorithms failed to extract patterns from the dataset. MSVD, however, formed one bicluster in which all conditions were present but only , genes were matched. In terms of the Jaccard index, BitBit and BiSim had maximum similarity with CC, whereas ITL and BiMax had less similarity with CC. In terms of sign variance, xMotif and CC gave coherent biclusters, but the biclusters formed by FLOC were not coherent enough. Constant variance values were mostly similar, i.e. FABIA produced the maximum constant variance among all.
The biological significance of biclustering algorithms refers to the identification of a subset of genes clustered with a similar subset of conditions to form a pattern or bicluster. The biclusters are useful for disease identification, biomarker generation, gene-drug association, etc. The reliability of these biclusters is justified using various evaluation measures. BIDEAL provides constant variance and sign variance as evaluation measures to check the coherency, significance, and reliability of biclusters obtained using various biclustering algorithms. In terms of coherency, for the Yeast dataset, biclusters generated using the FLOC, BiMax, LAS, and ITL algorithms had low sign variance and constant variance. In the ALL vs. AML dataset, most of the algorithms failed to generate significantly coherent biclusters, except the CC and xMotif algorithms. In the GDS205 dataset, the CC and BiSim algorithms produced coherent biclusters, whereas in the GDS301 dataset, the CC, ITL, and ISA algorithms produced coherent biclusters. Another evaluation measure, the SB score, is also embedded in BIDEAL. The SB score was quite low for the Yeast dataset, except for the biclusters generated using the BSGP algorithm. This shows that the obtained biclusters had a higher co-expression level for two conditions among genes. In the ALL vs. AML dataset, the generated biclusters have differential co-expression among genes and conditions because the value of the SB score was almost absent. The GDS205 dataset reported a high value of the SB score, which signifies a higher co-expression ranking among genes w.r.t. two sets of conditions. In each dataset, at least one algorithm reported a bicluster similar to that of the CC algorithm, for example ITL in the case of the ALL vs. AML dataset and BiSim in the GDS205 dataset. As presented in Table 3, for datasets, ROBA gave an exceptionally large number of biclusters, which means that overlapping biclusters were generated; FABIA and Plaid produced a small number of biclusters for all datasets; and FLOC generated a constant number of biclusters, i.e. . For GDS301, only CC, OPSM, ITL, xMotif, FLOC, BiMax, LAS, ISA, MSVD, and FABIA had some result, and BiSim and BitBit were quite similar to CC. In the case of the Yeast dataset, kSpectral failed to produce any bicluster, while ITL, Plaid, and BitBit gave no bicluster on the GDS205 dataset. Most of the biclusters formed using xMotif, BiSim, QUBIC, BSGP, and CC are of µ-type, which indicates clusters with strong instance and attribute effects. The biclusters generated by MSVD, FLOC, ISA, and BiMax are of T-type; hence these biclusters have a strong instance effect.
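The reading "lower variance means a more coherent bicluster" can be made concrete with a small sketch. The formulas used by BIDEAL for constant variance and sign variance are not given in this section, so the two scores below are simplified stand-ins chosen for illustration: the population variance of the bicluster entries, and the population variance of their signs. Both function names are hypothetical.

```python
from statistics import pvariance


def bicluster_values(data, rows, cols):
    """Flatten the entries of the sub-matrix selected by rows and cols."""
    return [float(data[r][c]) for r in rows for c in cols]


def constant_score(data, rows, cols):
    """Illustrative constant-coherency score (assumed definition).

    Population variance of all entries; 0.0 for a perfectly constant
    bicluster, larger when the entries differ more.
    """
    return pvariance(bicluster_values(data, rows, cols))


def sign_score(data, rows, cols):
    """Illustrative sign-coherency score (assumed definition).

    Population variance of the signs (-1, 0, +1) of the entries; 0.0 when
    every entry has the same sign.
    """
    signs = [float((v > 0) - (v < 0)) for v in bicluster_values(data, rows, cols)]
    return pvariance(signs)


# Toy expression matrix: 3 genes x 3 conditions (made-up values).
data = [
    [1.0, 1.0, 5.0],
    [1.0, 1.0, -2.0],
    [3.0, -4.0, 0.5],
]

coherent = constant_score(data, [0, 1], [0, 1])    # all entries equal 1.0
incoherent = constant_score(data, [0, 1], [0, 2])  # entries 1, 5, 1, -2
print(coherent, incoherent)
print(sign_score(data, [0, 1], [0, 1]), sign_score(data, [0, 1], [0, 2]))
```

Under these assumed definitions, the first bicluster scores zero on both measures, which corresponds to the "most coherent" end of the ranking discussed above, while the second scores higher on both.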
The proposed toolbox integrates various biclustering algorithms on a single platform; therefore, to measure the execution time, one needs to note the execution time of each algorithm. Since the complexity of the biclustering problem depends on the dataset and the objective function, the execution time can vary from a few seconds to hours. For example, on the Yeast dataset, CC, xMotif, and BiMax take less than 5 seconds to compute biclusters; BSGP, ISA, kSpectral, and FLOC take around 1 minute; BitBit and QUBIC extract biclusters in 30 minutes; and BiSim executes in 90 minutes. Moreover, regarding the maximum file size that can be handled, the proposed toolbox has been validated for datasets with a maximum size of 25 MB. The test has been performed on the Yeast dataset of file size 198 KB, the ALL vs. AML dataset of file size 656 KB, the GDS205 dataset of file size 120 KB, and the GDS301 dataset of file size 25 MB.
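Recording the execution time of each algorithm separately can be sketched as a small timing harness. The example is written in Python for illustration and does not use BIDEAL's interface; the two "algorithms" are hypothetical placeholders that return the whole matrix as a single bicluster, and only the timing pattern is the point.

```python
import time


def time_algorithms(algorithms, data):
    """Run each algorithm once on data and record wall-clock seconds."""
    timings = {}
    for name, run in algorithms.items():
        start = time.perf_counter()
        run(data)  # the biclusters themselves are ignored here
        timings[name] = time.perf_counter() - start
    return timings


def trivial_bicluster(data):
    """Hypothetical placeholder: the whole matrix as one bicluster."""
    return [(list(range(len(data))), list(range(len(data[0]))))]


def slow_bicluster(data):
    """Hypothetical placeholder that simulates a slower algorithm."""
    time.sleep(0.01)
    return trivial_bicluster(data)


data = [[0.1, 0.2], [0.3, 0.4]]
timings = time_algorithms(
    {"trivial": trivial_bicluster, "slow": slow_bicluster}, data
)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

Because run times depend on the dataset and the objective function, a harness of this kind would need to be repeated per dataset to reproduce a comparison such as the Yeast figures quoted above.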
CONCLUSIONS
The “BIDEAL” toolbox proposed in this paper has been developed to generate, validate, and visualize biclusters from any data on a single platform. It integrates well-known biclustering algorithms, validation indices, and visualization methods for comprehensive data interpretation. Additionally, it provides a preprocessing module to remove outliers and NaN spots from the data, which helps to rectify issues related to null values, discrete matrices, etc. The proposed toolbox has been tested and validated on four benchmark gene expression datasets, i.e., Yeast, ALL vs. AML, GDS205, and GDS301. It was inferred that each algorithm of BIDEAL can generate a distinct set of biclusters from the same data; therefore, the selection of an appropriate technique is required. The diverse nature of BIDEAL, with its various validation indices and visualization methods, has proven effective for selecting the best biclusters. Information retrieval from data mainly depends on the type of local patterns, i.e., whether the data has overlapping and constant biclusters or is noisy. We hope that the availability of BIDEAL will help the research community through widespread use of biclustering algorithms to identify coherent groups in data, which is very useful in disease subtype identification. Furthermore, the toolbox can help to cater to data analysis needs, and it is being offered free to the community.
REFERENCES
[1] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: a survey,”
IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 1, no. 1, pp. 24-45, Jan.-March 2004.
[2] V. Singh, N. K. Verma, and Y. Cui, “Type-2 fuzzy PCA approach in extracting salient features for molecular cancer diagnostics and prognostics,” IEEE Trans. on NanoBioscience, vol. 18, no. 3, pp. 482-489, July 2019.
[3] R. K. Sevakula, V. Singh, N. K. Verma, C. Kumar, and Y. Cui, “Transfer learning for molecular cancer classification using deep neural networks,” IEEE/ACM Trans. Comput. Biol. Bioinf., 2018. (Early Access)
[4] B. Pontes, R. Giráldez, and J. S. Aguilar-Ruiz, “Biclustering on expression data: A review,” Journal of Biomedical Informatics, vol. 57, pp. 163-180, 2015.
[5] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of 5th Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14, pp. 281-297, 1967.
[6] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241-254, 1967.
[7] N. K. Verma and A. Roy, “Self optimal clustering technique using optimized threshold function,” IEEE Syst. Journal, vol. 99, pp. 1-14, 2013.
[8] N. K. Verma, A. Roy, and Y. Cui, “Improved mountain clustering algorithm for gene expression data analysis,” Journal of Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 30-35, 2011.
[9] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Journal Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[10] A. B. Geva and D. H. Kerem, “Forecasting generalized epileptic seizures from the EEG signal by wavelet analysis and dynamic unsupervised fuzzy clustering,” IEEE Trans. on Biomedical Engg., vol. 45, no. 10, pp. 1205-1216, 1998.
[11] N. K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, and Y. Cui, “A comparison of biclustering algorithms,” in Int. Conf. on Systems in Medicine and Biology, pp. 90-97, 2010.
[12] S. Barkow et al., “BicAT: A biclustering analysis toolbox,” Bioinformatics, vol. 22, pp. 1282-1283, 2006.
[13] R. Santamaría, R. Therón, and L. Quintales, “BicOverlapper 2.0: Visual analysis for gene expression,” Bioinformatics, vol. 30, no. 12, pp. 1785-1786, 2014.
[14] R. Shamir et al., “EXPANDER - An integrative program suite for microarray data analysis,” Bioinformatics, vol. 6, no. 1, pp. 232, 2005.
[15] R. Henriques and S. C. Madeira, “BicNET: Flexible module discovery in large-scale biological networks using biclustering,” Algorithms for Molecular Biology, vol. 11, no. 1, pp. 1, 2011.
[16] K. O. Cheng et al., “BiVisu: Software tool for bicluster detection and visualization,” Bioinformatics, vol. 23, no. 17, pp. 2342-2344, 2007.
[17] C. A. Gallo, J. S. Dussaut, J. A. Carballido, and I. Ponzoni, “BAT: A new biclustering analysis toolbox,” LNCS in Advances in Bioinfo. and Compt. Biology, pp. 67-70, 2010.
[18] K. Eren, “Application of biclustering algorithms to biological data,” Diss. The Ohio State University, 2012.
[19] S. Kaiser and F. Leisch, “BiClust: A toolbox for biclustering analysis in R,” 2008.
[20] J. Gupta, S. Singh, and N. K. Verma, “MTBA: MATLAB toolbox for biclustering analysis,” IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions, IIT Kanpur, India, pp. 148-152, 2013.
[21] R. François, M. Stanislas, and N. Mohamed, “CoClust: A python package for co-clustering,” in Journal of Statistical Software, vol. 88, no. 7, pp. 1-29, 2018.
[22] H. Rui, F. Ferreira, and S. Madeira, “BicPAMS: Software for biological data analysis with pattern-based biclustering,” BMC Bioinformatics, vol. 18, no. 1, 2017.
[23] Y. Cheng and G. Church, “Biclustering of expression data,” Conf. on Intelligent Systems for Molecular Biology, vol. 8, pp. 93-103, 2000.
[24] J. Yang, H. Wang, W. Wang, and P. S. Yu, “An improved biclustering method for analyzing gene expression profiles,” Int. Journal on Artificial Intelligence Tools, vol. 14, no. 5, pp. 771-789, 2005.
[25] I. S. Dhillon, “Co-clustering documents and words using bipartite spectral graph partitioning,” Int. Conf. on Knowl. discovery and data mining, pp. 269-274, 2001.
[26] A. Ben-Dor et al., “Discovering local structure in gene expression data: the order-preserving submatrix problem,” Int. Conf. on Computational biology, vol. 10, pp. 49-57, 2000.
[27] S. Bergmann, J. Ihmels, and N. Barkai, “Iterative signature algorithm for the analysis of large-scale gene expression data,” Physical Review E, vol. 67, no. 3, pp. 031902, 2003.
[28] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein, “Spectral biclustering of microarray data: coclustering genes and conditions,” Genome research, vol. 13, no. 4, pp. 703-716, 2003.
[29] I. S. Dhillon, S. Mallela, and D. S. Modha, “Information-theoretic co-clustering,” Int. Conf. on Knowl. discovery and data mining, pp. 89-98, 2003.
[30] T. M. Murali and S. Kasif, “Extracting conserved gene expression motifs from gene expression data,” in Proc. of Pacific Symposium Biocomputing, vol. 3, pp. 77-88, 2003.
[31] L. Lazzeroni and A. Owen, “Plaid models for gene expression data,” Statistica Sinica, vol. 12, pp. 61-86, 2002.
[32] A. Prelić et al., “A systematic comparison and evaluation of biclustering methods for gene expression data,” Bioinformatics, vol. 22, no. 9, pp. 1122-1129, 2006.
[33] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel, “Finding large average sub-matrices in high dimensional data,” The Annals of Applied Statistics, pp. 985-1012, 2009.
[34] S. Hochreiter et al., “FABIA: Factor analysis for bicluster information acquisition,” Bioinformatics, vol. 26, no. 12, pp. 1520-1527, 2010.
[35] D. S. Rodriguez-Baena, A. J. Perez-Pulido, and J. S. Aguilar-Ruiz, “A bi-clustering algorithm for extracting bit-patterns from binary datasets,” Bioinformatics, vol. 27, no. 19, pp. 2738-45, 2011.
[36] N. Noureen and M. A. Qadir, “BiSim: A simple and efficient biclustering algorithm,” Int. Conf. on Soft Computing and Pattern Recognition, pp. 1-6, 2009.
[37] D. Wang and Zheng, “MSVD-MOEB algorithm applied to cancer gene expression data,” Int. Conf. on Awareness Science and Technology (iCAST), pp. 119-124, 2015.
[38] L. Guojun, Q. Ma, H. Tang, A. H. Paterson, and Y. Xu, “QUBIC: A qualitative biclustering algorithm for analyses of gene expression data,” Nucleic acids research, pp. gkp491, 2009.
[39] A. B. Tchagang and A. H. Tewfik, “Robust biclustering algorithm (ROBA) for DNA microarray data analysis,” Proc. IEEE/SP 13th Workshop on Statistical Signal Processing, pp. 984-989, 2005.
[40] M. Filippone, F. Masulli, and S. Rovetta, “Stability and performances in biclustering algorithms,” Int. Meeting on Comput. Intelligence Methods for Bioinformatics and Biostatistics, pp. 91-101, 2008.
[41] B. K. H. SB and R. K. M. Karuturi, “Differential co-expression framework to quantify goodness of biclusters and compare biclustering algorithms,” Algorithms for Molecular Biology, vol. 5, no. 1, pp. 23, 2010.
[42] N. K. Verma, E. Dutta, and Y. Cui, “Hausdorff distance and global silhouette index as novel measures for estimating quality of biclusters,” Int. Conf. on Bioinformatics and Biomedicine, pp. 267-272, 2015.
[43] T. R. Golub, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531-537, 1999.
[44] S. P. Suchyta et al., “Bovine mammary gene expression profiling using a cDNA microarray enhanced for mammary-specific transcripts,” Physiol Genomics
Bioinformatics, vol. 18, pp. 136-144, 2002.
[47] C. A. Gallo, J. A. Carballido, and I. Ponzoni, “Bihea: A hybrid evolutionary approach for microarray biclustering,” Symposium on Bioinformatics, Springer, pp. 36-47, 2009.
[48] L. Wilkinson and M. Friendly, “The history of the cluster heat map,” The American Statistician, vol. 63, no. 2, pp. 179-184, 2009.
Nishchal K. Verma (SM’13) is a Professor in the Department of Electrical Engineering and the Interdisciplinary Program in Cognitive Science at Indian Institute of Technology Kanpur, India. He obtained his PhD in Electrical Engineering from Indian Institute of Technology Delhi, India. He is an awardee of the Devendra Shukla Young Faculty Research Fellowship by Indian Institute of Technology Kanpur, India for the years 2013-16. His research interests include big data analysis, deep learning of neural and fuzzy networks, machine learning algorithms, computational intelligence, computer vision, brain computer/machine interface, intelligent informatics, soft-computing in modelling and control, internet of things/cyber physical systems, cognitive science and intelligent fault diagnosis systems, prognosis and health management. He has authored more than 200 research papers. Dr. Verma is an IETE Fellow. He is currently serving as a Guest Editor of the IEEE Access special section “Advance in Prognostics and System Health Management”, an Editor of the IETE Technical Review Journal, an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems, an Associate Editor of the IEEE Computational Intelligence Magazine, an Associate Editor of the Transactions of the Institute of Measurement and Control, U.K., and an Editorial Board Member for several journals and conferences.
Teena Sharma received her B.Tech. degree in Electronics and Communication Engineering in 2013 from UPTU, Lucknow, India. She completed her M.Tech. in 2014 at Banasthali University, Rajasthan, India. Currently, she is a PhD Scholar in the Dept. of Electrical Engineering at Indian Institute of Technology Kanpur, India. Her research interests fall under machine learning, computer vision, and their applications.
Sonal Dixit received her B.E. from RGPV University in 2009 and her Masters from Banasthali University in 2011. She is currently a doctoral student at IIT Kanpur. Her research interests fall mainly in the field of condition based monitoring of rotary machines, deep learning and its applications, computational intelligence, and natural language processing.
Sourya Sengupta was born on 23rd September, 1995 in Kolkata, India. He is currently pursuing his Master’s at the University of Waterloo. He received his B.E. degree in Electrical Engineering from Jadavpur University, Kolkata, India. He passed the Madhyamik and Higher Secondary Examinations from Ramakrishna Mission Boys Home Higher Secondary School, Rahara in 2012 and 2014, respectively. He also qualified WBJEE, JEE Mains, and JEE Advanced in 2014. His research interests fall mainly in the field of biomedical signal processing, bioinformatics, and cognitive neuroscience.
Vikas Singh is working toward the PhD degree in the Department of Electrical Engineering, Indian Institute of Technology Kanpur, India. His research interests include machine learning, deep learning, big data, intelligent data mining, bioinformatics, and fuzzy systems and their applications.