Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data
Dmitry I. Ignatov, Gennady V. Khvorykh, Andrey V. Khrunin, Stefan Nikolić, Makhmud Shaban, Elizaveta A. Petrova, Evgeniya A. Koltsova, Fouzi Takelait, Dmitrii Egurnov
National Research University Higher School of Economics, Russia, [email protected]
Institute of Molecular Genetics of National Research Centre “Kurchatov Institute”, Russia, [email protected], http://img.ras.ru
Pirogov Russian National Research Medical University, Russia, http://rsmu.ru
Abstract.
Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits. The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes. This can prevent the machine learning classifier from assigning the classes correctly. To tackle this issue, we used the well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in the binary relation patients × SNPs. The paper contains experimental results on applying a biclustering algorithm to a large real-world dataset collected for studying the genetic bases of ischemic stroke. The algorithm could identify large dense biclusters in the genotypic matrix for further processing, which in return significantly improved the quality of machine learning classifiers. The proposed algorithm was also able to generate biclusters for the whole dataset without size constraints, in contrast to the In-Close4 algorithm for generation of formal concepts.
Keywords:
Formal Concept Analysis, Biclustering, Single Nucleotide Polymorphism, Missing Genotypes, Data Mining, Ischemic Stroke
The recent progress in studying different aspects of human health and diversity (e.g., genetics of common diseases and traits, human population structure, and relationships) is associated with the development of high-throughput genotyping technologies, particularly with massive parallel genotyping of Single Nucleotide
Polymorphisms (SNPs) by DNA microarrays [1]. They allowed the determination of hundreds of thousands and millions of SNPs in one experiment and were the basis for conducting genome-wide association studies (GWAS). Although thousands of genetic loci have been revealed in GWAS, there are practical problems with replicating the associations identified in different studies. They seem to be due to both limitations in the methodology of the GWAS approach itself and differences between various studies in data design and analysis [2]. Machine learning (ML) approaches were found to be quite promising in this field [3].

Genotyping by microarrays is efficient and cost-effective, but missing data appear. GWAS is based on a comparison of frequencies of genetic variants among patients and healthy people. It assumes that all genotypes are provided (usually, their percentage is defined by a genotype calling threshold). In this article, we demonstrate that missing data can affect not only statistical analysis but also ML algorithms. The classifiers can fail because missing values (uncalled genotypes) are distributed non-randomly. We assume that each set of DNA microarrays can possess a specific pattern of missing values marking both the dataset of patients and that of healthy people. Therefore, the missing data need to be carefully estimated and processed without dropping too many SNPs that may contain crucial genetic information.

To overcome the problem of missing data, we aimed to apply a technique capable of discovering groupings in a dataset by looking at the similarity across all individuals and their genotypes. The raw datasets can be converted into an integer matrix, where individuals are in rows, SNPs are in columns, and cells contain genotypes. For each SNP, a person can have either the AA, AB, or BB genotype, where A and B are the alleles.
Thus, the genotypes can be coded as 0, 1, and 2, representing the counts of allele B.

The proposed method can simultaneously cluster rows and columns in a data matrix to find homogeneous submatrices [4], which can overlap. Each of these submatrices is called a bicluster [5], and the process of finding them is called biclustering [4,6,7,8,9].

Biclustering in genotype data allows identifying sets of individuals sharing SNPs with missing genotypes. A bicluster arises when there is a strong relationship between a specific set of objects and a specific set of attributes in a data table. A particular kind of bicluster is a formal concept in Formal Concept Analysis (FCA) [10]. A formal concept is a pair of the form (extent, intent), where the extent consists of all objects that share the attributes in the intent, and dually the intent consists of all attributes shared by the objects in the extent. Formal concepts have the desirable property of being homogeneous and closed in the algebraic sense, which resulted in their extensive use in Gene Expression Analysis (GEA) [11,12,13,14].

A concept-based bicluster (or object-attribute bicluster) [15] is a scalable approximation of a formal concept with the following advantages:

1. Reduced number of patterns to analyze;
2. Reduced computational cost (polynomial vs. exponential);
3. Manual (interactive) tuning of the bicluster density threshold;
4. Tolerance to missing (object, attribute) pairs.

In this paper, we propose an extended version of the biclustering algorithm of [16] that can identify large biclusters with missing genotypes for categorical data (many-valued contexts with a selected value). This algorithm generates a smaller number of dense object-attribute biclusters than existing exact algorithms for formal concepts, like the concept miner In-Close4 [17], and is, therefore, better suited for large datasets. Moreover, during experimentation with the ischemic stroke dataset, we found that the number of large dense biclusters identified by our algorithm is significantly lower than the number of formal concepts extracted by In-Close4 and Concept Explorer (ConExp) [18].

The paper is organized as follows. In Section 2, we recall basic notions from Formal Concept Analysis and Biclustering. In Section 3, we introduce a method of FCA-based biclustering and its variants along with bicluster post-processing schemes, and discuss the complexity of the proposed algorithm. In Section 4, we describe a dataset that consists of a sample of patients and their SNPs collected from various (independent) groups of patients. Then we present the results obtained during experiments on this dataset in Section 5 and mention the hardware and software configuration used. Section 6 concludes the paper.

Definition 1. A formal context in FCA [10] is a triple K = (G, M, I) consisting of two sets, G and M, and a binary relation I ⊆ G × M between G and M. The triple can be represented by a cross-table whose rows are objects from G, whose columns are attributes from M, and whose crosses represent the incidence relation I. Here, gIm or (g, m) ∈ I means that the object g has the attribute m.

Definition 2.
For A ⊆ G and B ⊆ M, let

A′ = { m ∈ M | gIm for all g ∈ A } and B′ = { g ∈ G | gIm for all m ∈ B }.

These two operators are the derivation operators for K = (G, M, I).

Proposition 1.
Let (G, M, I) be a formal context; then for subsets A, A1, A2 ⊆ G and B ⊆ M we have:

1. A1 ⊆ A2 implies A2′ ⊆ A1′,
2. A ⊆ A′′,
3. A′ = A′′′ (hence, A′′′′ = A′′),
4. (A1 ∪ A2)′ = A1′ ∩ A2′,
5. A ⊆ B′ ⇔ B ⊆ A′ ⇔ A × B ⊆ I.

Similar properties hold for subsets of attributes.

https://sourceforge.net/projects/inclose/
http://conexp.sourceforge.net

Definition 3. A closure operator on a set S is a mapping ϕ : 2^S → 2^S with the following properties for X, Y ⊆ S:

1. ϕ(ϕ(X)) = ϕ(X) (idempotency),
2. X ⊆ ϕ(X) (extensity),
3. X ⊆ Y ⇒ ϕ(X) ⊆ ϕ(Y) (monotonicity).

For a closure operator ϕ, the set ϕ(X) is called the closure of X, while a subset X ⊆ S is called closed if ϕ(X) = X.

It is evident from the properties of the derivation operators that, for a formal context (G, M, I), the operators (·)′′ : 2^G → 2^G and (·)′′ : 2^M → 2^M are closure operators.

Definition 4. (A, B) is a formal concept of a formal context K = (G, M, I) iff A ⊆ G, B ⊆ M, A′ = B, and A = B′. The sets A and B are called the extent and the intent of the formal concept (A, B), respectively.

This definition says that every formal concept has two parts, namely, its extent and intent. It follows an old tradition in philosophical concept logic, as expressed in the
Logic of Port Royal, 1662 [19].
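Definitions 2 and 4 can be made concrete with a few lines of Python on a toy context (objects, attributes, and crosses are invented for illustration):

```python
# Toy formal context (invented for illustration): objects G, attributes M,
# and the incidence relation I ⊆ G × M.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"),
     ("g2", "a"), ("g2", "b"), ("g2", "c"),
     ("g3", "b"), ("g3", "c")}

def prime_objects(A):
    """A' : attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B):
    """B' : objects having all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

A = {"g1", "g2"}
B = prime_objects(A)               # {"a", "b"}
assert prime_attributes(B) == A    # A'' = A, so (A, B) is a formal concept
```

Applying the two derivation operators in turn realizes the closure (·)′′ from the propositions above.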
Definition 5.
The set of all formal concepts B(G, M, I) is partially ordered by the relation ≤:

(A1, B1) ≤ (A2, B2) ⇐⇒ A1 ⊆ A2 (dually, B2 ⊆ B1).

B(G, M, I) with this order is called the concept lattice of the formal context K.

In case an object has properties like colour or age, the corresponding attributes should have values themselves.
Definition 6. A many-valued context (G, M, W, J) consists of sets G, M, and W and a ternary relation J ⊆ G × M × W for which (g, m, w) ∈ J and (g, m, v) ∈ J imply w = v. The elements of M are called (many-valued) attributes and those of W attribute values.

Since many-valued attributes can be considered as partial maps from G to W, it is convenient to write m(g) = w.

In [6], a bicluster is defined, in general, as a homogeneous submatrix of an input object-attribute matrix of real values. Consider a dataset as a matrix A = (X, Y) ∈ R^(n×m) with a set of rows (objects, individuals) X = {x1, . . . , xn} and a set of columns (attributes, SNPs) Y = {y1, . . . , ym}. A submatrix constructed from a subset of rows I ⊆ X and a subset of columns J ⊆ Y is denoted by (I, J) and is called a bicluster of A [6]. A bicluster should satisfy some specific homogeneity property, which varies from one method to another.

For the purpose of this research, we use the following FCA-based definition of a bicluster [20,15,16].

Definition 7.
For a formal context K = ( G, M, I ) any biset ( A, B ) ⊆ I with A (cid:54) = ∅ and B (cid:54) = ∅ is called a bicluster . If ( g, m ) ∈ I , then the bicluster ( A, B ) =( m (cid:48) , g (cid:48) ) is called an object-attribute or OA-bicluster with density ρ ( A, B ) = | I ∩ ( A × B ) || A |·| B | . The density ρ ( m (cid:48) , g (cid:48) ) of a bicluster ( m (cid:48) , g (cid:48) ) is the bicluster quality measurethat shows how many non-empty pairs the bicluster contains divided by its size.Several basic properties of OA-biclusters are below. Proposition 2.
1. For any bicluster (A, B) ⊆ G × M, it holds that 0 ≤ ρ(A, B) ≤ 1;
2. An OA-bicluster (m′, g′) is a formal concept iff ρ = 1;
3. If (A, B) is an OA-bicluster, there exists (at least one) generating pair (g, m) ∈ A × B such that (m′, g′) = (A, B);
4. If (m′, g′) is an OA-bicluster, then (g′′, g′) ≤ (m′, m′′);
5. For every (g, m) ∈ I and (h, n) ∈ [g]_M × [m]_G, it follows that (m′, g′) = (n′, h′).

In Fig. 1, you can see an example of an OA-bicluster for a particular pair (g, m) ∈ I of a certain context (G, M, I). In general, only the regions (g′′, g′) and (m′, m′′) are full of non-empty pairs, i.e. have maximal density ρ = 1, since they are object and attribute formal concepts, respectively. The black cells indicate non-empty pairs, which one may find in the less dense white regions.

Definition 8.
Let (A, B) be an OA-bicluster and ρ_min ∈ (0, 1]; then (A, B) is called dense if it satisfies the constraint ρ(A, B) ≥ ρ_min.

The number of OA-biclusters of a context can be much smaller than the number of formal concepts (which may be 2^min(|G|,|M|)), as stated by the following propositions.

Proposition 3.
For a formal context K = (G, M, I), the largest number of OA-biclusters is equal to |I|, and all OA-biclusters can be generated in time O(|I|).

Proposition 4.
For a formal context K = (G, M, I) and ρ_min > 0, the largest number of dense OA-biclusters is equal to |I|, and all dense OA-biclusters can be generated in time O(|I| · |G| · |M|).

The equivalence classes above are [g]_M = { h | h ∈ G, g′ = h′ } and [m]_G = { n | n ∈ M, n′ = m′ }.

Fig. 1.
OA-bicluster based on object and attribute primes.
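The density from Definition 7 can be computed directly from the primes of a generating pair; a small sketch on a toy context (data invented for illustration, not the study's):

```python
# Toy context: objects G, attributes M, incidence relation I (illustrative only).
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"),
     ("g2", "a"), ("g2", "b"), ("g2", "c"),
     ("g3", "b"), ("g3", "c")}

def oa_bicluster(g, m):
    """OA-bicluster (m', g') generated by a pair (g, m) ∈ I."""
    A = {h for h in G if (h, m) in I}      # m': all objects having m
    B = {n for n in M if (g, n) in I}      # g': all attributes of g
    return A, B

def density(A, B):
    """rho(A, B) = |I ∩ (A × B)| / (|A| * |B|)."""
    return sum((h, n) in I for h in A for n in B) / (len(A) * len(B))

A, B = oa_bicluster("g2", "b")            # here A = G and B = M
print(density(A, B))                      # 7/9 ≈ 0.78: not a formal concept
print(density(*oa_bicluster("g1", "a")))  # 1.0: a formal concept (Prop. 2.2)
```

The second call illustrates Proposition 2: the bicluster generated by ("g1", "a") has density 1 and is exactly a formal concept.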
Algorithm 1 is a straightforward implementation, which takes an initial many-valued formal context and a minimal density threshold as parameters and computes dense biclusters for each pair (g, m) in the relation I that indicates which objects have SNPs with missing values. However, since OA-biclusters for many-valued contexts were not formally introduced previously, we use a derived formal context with one-valued attributes denoting missing attribute values of an original genotype matrix to correctly apply the definition of a dense OA-bicluster.

Definition 9.
Let K = (G, M, W, J) be a many-valued context and v ∈ W a selected value (e.g., denoting the absence of an SNP value); then its derived context for the value v is K_v = (G, M, I), where gIm iff (g, m, v) ∈ J.

For genotype matrices with missing SNP values as many-valued contexts, a similar representation can be expressed in terms of co-domains of many-valued attributes (the absence of m(g) means that of the corresponding SNP value) or by means of nominal scaling with a single attribute for the missing value v [10].

If we compare the number of output patterns for formal concepts and dense OA-biclusters, in the worst case these values are 2^min(|G|,|M|) versus |I|. The time complexity of our algorithm is polynomial, O(|G||M||I|), versus exponential in the worst case for BiMax [21], O(|G||M||L| log |L|), or O(|G|^2·|M||L|) for the CbO algorithm family [22], where |L| is the number of generated concepts (also considered as biclusters) and is exponential in the worst case: |L| = 2^min(|G|,|M|).

For calculating biclusters that fulfil a minimum density constraint, we need to perform several steps (see Algorithm 1). Steps 5–8 consist of applying the Galois operator to all objects in G, and Steps 9–12 then to all attributes in M within the induced context. The outer for loops are parallel (the concrete implementation may differ), while the internal ones are ordinary for loops. Then all biclusters are enumerated in a parallel manner as well, and only those that fulfil the minimal density requirement are retained (Steps 13–16). Again, efficient implementation of a set data structure for storing biclusters and duplicate elimination on the fly in parallel execution mode are not addressed in the pseudo-code.

The novelties of this algorithm are the use of parallelization to generate the OA-biclusters for a medium-sized input dataset, making the program run faster, and the possibility to work with selected values, reducing a many-valued context to contexts with one-valued attributes.

Algorithm 1:
OA-bicluster generation for a many-valued context.
Data: K = (G, M, W, J) — a many-valued formal context; ρ_min — a minimal bicluster density threshold; v ∈ W — a selected value.
Result: B = { (A, B) | (A, B) is an OA-bicluster for value v }.

 1: begin
 2:   Obj.Size := |G|
 3:   Attr.Size := |M|
 4:   B ← ∅
 5:   parallel for g ∈ G do
 6:     for m ∈ M do
 7:       if m(g) = v then
 8:         Obj[g].Add(m)
 9:   parallel for m ∈ M do
10:     for g ∈ G do
11:       if m(g) = v then
12:         Attr[m].Add(g)
13:   parallel for (g, m, w) ∈ J do
14:     if w = v then
15:       if ρ(Attr[m], Obj[g]) ≥ ρ_min then
16:         B := B ∪ { (Attr[m], Obj[g]) }

Let us describe the online problem of finding the set of prime OA-biclusters, based on the online OAC-Prime triclustering algorithm [23]. Let K = (G, M, I) be a context. The user has no a priori knowledge of the elements and even the cardinalities of G, M, and I. At each iteration, we receive a set of pairs (a “batch”) from I: J ⊆ I. After that, we must process J and get the current version of the set of all biclusters. It is important in this setting to consider every pair of biclusters different if they have different generating pairs, even if their extents and intents are equal, because any new pair can change only one of them, thus making them different.
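Returning to Algorithm 1, a compact sequential Python sketch (parallel for-loops replaced by ordinary ones, with on-the-fly duplicate elimination; the genotype values are invented for illustration):

```python
from collections import defaultdict

# Illustrative many-valued context: a tiny genotype table where the
# selected value v = -1 marks missing calls (data invented, not the study's).
W = {("p1", "s1"): -1, ("p1", "s2"): -1, ("p1", "s3"): 0,  ("p1", "s4"): 2,
     ("p2", "s1"): -1, ("p2", "s2"): -1, ("p2", "s3"): -1, ("p2", "s4"): 1,
     ("p3", "s1"): 0,  ("p3", "s2"): -1, ("p3", "s3"): 2,  ("p3", "s4"): 0}

def oa_biclusters(W, v, rho_min):
    obj = defaultdict(set)   # g -> g': attributes where g takes value v
    attr = defaultdict(set)  # m -> m': objects taking value v at m
    for (g, m), w in W.items():
        if w == v:
            obj[g].add(m)
            attr[m].add(g)
    seen, result = set(), []
    for (g, m), w in W.items():          # one candidate per pair with value v
        if w != v:
            continue
        A, B = attr[m], obj[g]
        rho = sum(n in obj[h] for h in A for n in B) / (len(A) * len(B))
        key = (frozenset(A), frozenset(B))
        if rho >= rho_min and key not in seen:   # on-the-fly deduplication
            seen.add(key)
            result.append((set(A), set(B), rho))
    return result

dense = oa_biclusters(W, v=-1, rho_min=0.8)   # 5 distinct dense biclusters
```

With ρ_min = 1.0 only the biclusters that are formal concepts of the derived context survive.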
Also, the algorithm requires that the dictionaries containing the prime sets be implemented as hash tables or similar efficient key-value structures. Owing to this data structure, the algorithm can access prime sets efficiently.

The algorithm itself is also straightforward (Alg. 2). It takes a set of pairs (J) and the current versions of the bicluster set (B) and of the dictionaries containing prime sets (PrimesO and PrimesA) as input and outputs the modified versions of the bicluster set and dictionaries. The algorithm processes each pair (g, m) of J sequentially (line 1). On each iteration, the algorithm modifies the corresponding prime sets: it adds m to g′ (line 2) and g to m′ (line 3). Finally, it adds a new bicluster to the bicluster set. Note that this bicluster contains pointers to the corresponding prime sets (in the corresponding dictionaries) instead of their copies (line 4).

In effect, this algorithm is very similar to the original OA-biclustering algorithm, with some optimizations. First of all, instead of computing prime sets at the beginning, we modify them on the spot, as adding a new pair to the relation modifies only two prime sets, each by one element. Secondly, we remove the main loop by using pointers for the biclusters' extents and intents, as we can generate biclusters at the same step as we modify the prime sets. Thirdly, it uses only one pass through the pairs of the binary relation I, instead of enumerating different pairwise combinations of objects and attributes.

Algorithm 2:
Online generation of OA-biclusters
Input: J — a set of object-attribute pairs; B = { b = (∗X, ∗Y) } — the current set of OA-biclusters; PrimesO, PrimesA;
Output: B = { b = (∗X, ∗Y) }; PrimesO, PrimesA;

1: for all (g, m) ∈ J do
2:   PrimesO[g] := PrimesO[g] ∪ {m}
3:   PrimesA[m] := PrimesA[m] ∪ {g}
4:   B := B ∪ { (&PrimesA[m], &PrimesO[g]) }
5: end for

Each step requires constant time: we need to modify two sets and add one bicluster to the set of biclusters. The total number of steps is equal to |I|; the time complexity is linear, O(|I|). Besides that, the algorithm is one-pass.

The memory complexity is the same: at each of the |I| steps, the size of each dictionary containing prime sets is increased either by one element (if the required prime set is already present) or by one key-value pair (if not). Since each of these dictionaries requires O(|I|) memory, the memory complexity is also linear.

Another important step, in addition to this algorithm, is post-processing. For instance, we may want to remove biclusters with duplicate extents and intents from the output. Simple constraints like a minimal support condition can be processed during this step without increasing the original complexity. It should be done only during the post-processing step, as the addition of a pair in the main algorithm can change the set of biclusters and, consequently, the values used to check the conditions. Finally, if we need to fulfil more difficult constraints like a minimal density condition, the time complexity of the post-processing will be higher than that of the original algorithm, but it can still be implemented efficiently.

To remove duplicate biclusters, we need an efficient hashing procedure, which can be improved by implementing it in the main algorithm. For this, for all prime sets, we need to keep their hash values with them in memory.
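Lines 1–4 of Algorithm 2 can be mimicked in Python, where set objects stored in dictionaries play the role of the referenced prime sets (the pairs are illustrative):

```python
from collections import defaultdict

# Dictionaries of prime sets; each bicluster stores *references* to these
# sets (the pointer semantics of Algorithm 2), not copies.
primes_o = defaultdict(set)   # PrimesO: g -> g'
primes_a = defaultdict(set)   # PrimesA: m -> m'
biclusters = []

def add_pair(g, m):
    primes_o[g].add(m)                              # line 2
    primes_a[m].add(g)                              # line 3
    biclusters.append((primes_a[m], primes_o[g]))   # line 4: shared references

for pair in [("p1", "s1"), ("p2", "s1"), ("p1", "s2")]:
    add_pair(*pair)

# The first bicluster's extent is the *same* set object as primes_a["s1"],
# so it already reflects the later pair ("p2", "s1"): {"p1", "p2"}.
```

Because each update touches two sets and appends one tuple, the per-pair cost is O(1), matching the linear overall complexity claimed above.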
Finally, when using hash functions other than LSH (Locality-Sensitive Hashing) functions [24], we can calculate hash values of prime sets as some function of their elements (for example, exclusive disjunction or sum). Then, when we modify a prime set, we only need to combine the current hash value with that of the new element. In this case, the hash value of a bicluster can be calculated as the same function of the hash values of its extent and intent. It is then enough to implement the bicluster set as a hash-set in order to efficiently remove additional entries of the same bicluster. Pseudo-code for the basic post-processing is given in Alg. 3.
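The order-independent hashing just described can, for instance, use exclusive disjunction, so a prime set's hash is updatable in O(1) per added element (a sketch of the idea, not the authors' exact implementation):

```python
# Order-independent, incrementally updatable set hash: XOR of element hashes
# (an illustrative choice of combining function).
def set_hash(s):
    h = 0
    for x in s:
        h ^= hash(x)
    return h

h_full = set_hash({"snp1", "snp2"})
h_incr = set_hash({"snp1"}) ^ hash("snp2")   # O(1) update on adding "snp2"
assert h_full == h_incr

# A bicluster's hash can combine the hashes of its extent and intent, and a
# hash-set keyed by these values drops duplicate (extent, intent) pairs.
```

XOR is commutative and associative, which is exactly what makes the hash independent of insertion order.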
Algorithm 3:
Post-processing for the online OA-biclustering algorithm.
Input: B = { b = (∗X, ∗Y) } — a full set of biclusters;
Output: B = { b = (∗X, ∗Y) } — a processed hash-set of biclusters;

1: for all b ∈ B do
2:   Calculate hash(b)
3:   if hash(b) ∉ B then
4:     B := B ∪ {b}
5:   end if
6: end for

If the names (codes) of the objects and attributes are small enough (the time complexity of computing their hash values is O(1)), the time complexity of the post-processing is O(|I|) if we do not need to calculate densities, and O(|I||G||M|) otherwise. Also, the basic version of the post-processing does not require any additional memory, so its memory complexity is O(1).

Finally, the algorithm can be easily parallelized by splitting the set of input pairs into several subsets, processing each of them independently, and merging the resulting sets afterwards, which may lead to distributed computing schemes for larger datasets (cf. [25]).

In case the output of the post-processing step is stored in a relational database along with the computed statistics and generating pairs, it is convenient to use selection operators [26] to consider only a specific subset of biclusters. We use the following operator resulting in a specific subset of biclusters:

σ_{(α_min ≤ |A| ≤ α_max) ∧ (β_min ≤ |B| ≤ β_max) ∧ (ρ_min ≤ ρ(A,B) ≤ ρ_max)}(B),

where |A| is the extent size, |B| is the intent size, and ρ(A, B) is the density of an OA-bicluster b ∈ B, respectively. One more reason to use post-processing is the neither monotonic nor anti-monotonic character of the minimal density constraint in the sense of constraint pushing in pattern mining [11,16].

Collection of patients with ischemic stroke and their clinical characterisation were made at the Pirogov Russian National Research Medical University. The DNA extraction and genotyping of the samples were described previously [27]. The dataset contains samples corresponding to individuals, with a genetic portrait for each and a group label. The former represents the genotypes determined at many SNPs all over the genome. The latter takes values 0 or 1 depending on whether a person did not have or had a stroke.
Each SNP is a vector whose components can take values from {0, 1, 2, −1}, where 0, 1, and 2 denote the genotypes, and −1 indicates a missing value.

We represent the dataset as a many-valued formal context. In the derived context K = (G, M, I), objects from G stand for samples and attributes from M stand for SNPs, and gIm means that an individual g has a missing SNP m. The context has the parameters |G| = 1,223, |M| = 85,142, and |I| = 553,430, which represents the total number of missing values in the dataset and covers 0.491% of the whole data matrix. There are 45,075 attributes with missing values, while the number of attributes without missing values is 40,067.

The genotypic data were obtained with DNA microarrays. The dataset was compiled from several experiments where different types of microarrays were applied. Not all genotypes are equally well measured during an experiment; thus, there is a certain instrumental error. The quality of DNA can also affect the output of the experiments. Fig. 2 shows how many individuals have exactly N missing genotypes per SNP in the dataset. For instance, many individuals have about 85 missing genotypes per SNP.

The experimental results with OA-bicluster generation and processing were obtained on an Intel(R) Core(TM) i5-8265U CPU @ 1.80 GHz with 8 GB of RAM under 64-bit Windows 10 Pro. We used Python 3.7.4 and Conda 4.8.2 to perform our experiments.
Fig. 2.
The distribution of the number of missing SNP values by columns before elimination.
The following experiment was performed with the ischemic stroke data collection: first of all, 383,733 OA-biclusters (with duplicates) were generated by applying the parallel biclustering algorithm to the dataset. As we can see from the graph in Figure 3, there is a reasonable number of biclusters with a density value greater than 0.9. The distributions of biclusters by extent and intent sizes show that the majority of biclusters have about 90 samples and 2,600 SNPs, respectively.

For the selection of large dense biclusters, we set the density constraint to ρ_min = 0.9. Additional constraints of the form 3 ≤ |m′| ≤ … for the extent size and 3 ≤ |g′| ≤ …,000 for the intent size were set. In total, we selected 98,529 OA-biclusters with missing values. For this selection, the graph in Fig. 4 shows the selected peaks of large dense biclusters for different extent sizes.
Example 1.
Biclusters are given in the form (patients, SNPs). For the generating pair (g, m) = (1102, rs…A), we have that (m′, g′) ∈ σ_{(3 ≤ |A| ≤ …) ∧ (3 ≤ |B| ≤ …) ∧ (0.9 ≤ ρ(A,B) ≤ 1)}(B), where (m′, g′) = ({…, …, . . .}, {rs…G, rs…A, . . . , rs…A}) and ρ(m′, g′) ≈ 0.91. The bicluster contains |m′| = 14 individuals and |g′| = 758 SNPs; 9,657 pairs out of 10,612 correspond to missing SNP values.

We studied large dense biclusters further and chose the densest ones with possibly larger sizes of their extents and intents from each of the peaks identified in their distributions, respectively (Fig. 3). Here are some examples of these subsets with their associated graphs.

Fig. 3.
The distribution of the number of biclusters by their density (top), extent (middle) and intent sizes (bottom).
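A selection like the one in Example 1 amounts to a simple filter over the stored biclusters; a sketch with illustrative thresholds and data:

```python
# Selection operator sketch: keep biclusters (A, B, rho) whose extent size,
# intent size, and density fall within given ranges.
def select(biclusters, a_rng, b_rng, rho_rng):
    return [(A, B, rho) for A, B, rho in biclusters
            if a_rng[0] <= len(A) <= a_rng[1]
            and b_rng[0] <= len(B) <= b_rng[1]
            and rho_rng[0] <= rho <= rho_rng[1]]

bics = [({1, 2}, {10, 11, 12}, 0.95),   # large enough and dense -> kept
        ({1}, {10}, 1.0),               # too small
        ({1, 2, 3}, {10, 11}, 0.60)]    # too sparse
dense = select(bics, a_rng=(2, 100), b_rng=(2, 100), rho_rng=(0.9, 1.0))
# dense contains only the first bicluster
```

In a relational database, the same predicate maps directly to a WHERE clause over the stored statistics.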
Example 2.
We can further narrow down the number of patterns in the previous selection by looking at the distribution of biclusters by their extent size and choosing proper boundaries. Thus, in Fig. 4, there is the third largest peak of the number of biclusters near the extent size of 125.
Fig. 4.
The distribution of dense biclusters (ρ_min = 0.9) by their extent size.

For the constraints below, we used ρ_min = 94.…% ∧ ρ_max = 100% ∧ |g′| = 122 ∧ 3 ≤ |m′| ≤ ….

Example 3.
The selection around the rightmost peak (see Fig. 4) with further refining of the minimal density: ρ_min = 95.…% ∧ ρ_max = 100% ∧ … ≤ |g′| ≤ … ∧ 3 ≤ |m′| ≤ ….

After applying the proposed biclustering algorithm to the collected dataset, all large biclusters with missing genotypes were identified and eliminated. That resulted in a new data matrix ready for further analysis. We consolidate the evolution of the two datasets before and after removing missing values in Table 1.

Table 1.
Basic statistics of the datasets before and after elimination of missing values.

                     no. samples  no. SNPs  no. NaNs  NaNs fraction
Before elimination   1,223        85,142    553,430   0.49%
After elimination    1,472        82,690    388,052   0.31%
As seen from Table 1, the application of the biclustering algorithm resulted in an improvement in terms of entries corresponding to SNPs with missing genotypes: the fraction of such entries was reduced by 29.88%. The total number of biclusters generated before and after eliminating SNPs with missing genotypes is 383,733 (with duplicates) and 259,440, respectively. The total amount of time for generating these biclusters before and after deleting the missing data is about 3,433 and … seconds, respectively.

We have conducted a number of machine learning experiments on our datasets to check the impact of eliminating the missing data handled by our proposed algorithm on the quality measures of supervised learning algorithms. We chose gradient boosting on decision trees (GBDT). For this purpose, we selected two libraries where it is already implemented,
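The reported reduction follows directly from the NaN counts in Table 1:

```python
# Worked check of the reduction in entries with missing genotypes (Table 1).
before, after = 553_430, 388_052
reduction = (before - after) / before
print(f"{reduction:.2%}")  # -> 29.88%
```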
CatBoost and
LightGBM. Both implementations can handle missing values. A genome can essentially be interpreted as a sequence of SNPs, so we decided to also use the
Long Short-Term Memory network [28] as a strong approach to handling sequential data.
First dataset experiments.
Firstly, we applied the GBDT algorithm from the CatBoost library to our initial dataset (before elimination of SNPs with missing genotypes). The following parameters were taken for the classifier:

– Maximum number of trees: 3;
– Tree depth limit: 3;
– Loss function: binary cross-entropy (log-loss).

https://github.com/dimachine/OABicGWAS/

Fig. 5. Distribution of the number of SNPs with missing genotypes by columns after elimination.

We also applied the LSTM approach in the following way: the initial sequence was resized to 100 elements by a fully-connected layer, then the layer output was passed to the LSTM module element-wise. The hidden state of the LSTM after the last element was passed to a fully-connected classification layer.

The scores on this dataset were evaluated with 3-fold cross-validation with stratified splits. Basic classification metrics' scores are presented in Table 2.
Table 2.
Classification scores on the test set before elimination of missing SNPvalues accuracy F1-score precision recall
CatBoostClassifier 0.966 0.9758 0.9558 0.9967FC+LSTM 0.890 0.926 0.880 0.982
These unexpectedly high scores were unrealistic, since the GBDT model complexity had one of the lowest possible configurations, and the LSTM model, which handles the data in a different way, also achieved high accuracy. For a lot of samples, the model learned to “understand” on which chip a sample was analyzed by looking at the patterns of missing genotypes, so a data leak was present.
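The leak can be illustrated on synthetic data (invented here, not the study's): two chips with distinct missing-value patterns that coincide with the class labels let a trivial rule "classify" perfectly.

```python
import random

random.seed(0)

def sample(chip):
    # chip "A" (cases) misses SNPs 0-1; chip "B" (controls) misses SNPs 2-4
    missing = {0, 1} if chip == "A" else {2, 3, 4}
    return [-1 if j in missing else random.choice([0, 1, 2]) for j in range(6)]

data = [(sample("A"), 1) for _ in range(50)] + [(sample("B"), 0) for _ in range(50)]

# A rule that looks only at the missingness pattern, not at any genotype:
predict = lambda x: 1 if x[0] == -1 else 0
accuracy = sum(predict(x) == y for x, y in data) / len(data)
# accuracy == 1.0 although no genotype information was used
```

Any sufficiently flexible classifier can discover such a rule on its own, which is why even a depth-3 GBDT scored unrealistically high.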
Second dataset experiments.
This dataset was obtained after the identification of large dense biclusters by application of our proposed algorithm with their subsequent elimination. Table 3 recaps the experiments conducted on this dataset. For the first and second experiments, we used the
CatBoost classifier with a train/test split in the proportion of 8:2 and with 3-fold cross-validation, respectively, while maintaining the balance of classes for model validation. In the third experiment, we used the
LGBMClassifier with 3-fold cross-validation while maintaining the balance of classes for model validation. In the fourth experiment, the earlier described
LSTM classifier was used with the aforementioned cross-validation.
Table 3.
Scores of different machine learning classifiers applied to the dataset after elimination of SNPs with missing genotypes.

                     no. trees  depth  accuracy  F1-score  precision  recall
CatBoostClassifier   2          2      0.715     0.834     0.715
LGBMClassifier       5          3      0.753     0.852
FC+LSTM              –          –      0.731     0.839     0.735      0.981
From Table 3, one can see that the scores are more realistic in comparison to those of Table 2, showing that the data leak and the subsequent overfitting effects are gone. Our proposed biclustering algorithm successfully identified large submatrices with missing data, whose elimination removed the impact of the data leak and overfitting.
In-Close4 is an open-source software tool [29], which provides a highly optimised algorithm from the CbO family [22,30] to construct the set of concepts satisfying given constraints on the sizes of extents and intents. In-Close4 takes a context as input and outputs a reduced concept lattice: all concepts satisfying the constraints given by the parameter values (|A| ≥ m and |B| ≥ n, where A and B are the extent and intent of an output formal concept, and m, n ∈ N).

To deal with our large real-world dataset, we changed the maximum default values of the In-Close4 parameters used in the executable for this study. When we set the extent size constraint to 5 for the input context before the elimination of missing data, and the extent and intent size constraints to 20 and 0, respectively, after the elimination, the software crashed. Meanwhile, our proposed biclustering algorithm managed to output all OA-biclusters in both cases. As the author of In-Close suggested in private communication, the tool was optimised for "tall" contexts with a large number of objects rather than attributes, while in bioinformatics the contexts are often "wide", as in our case, where the number of SNPs is almost 57 times larger than the number of individuals. Thus, the results on the transposed context, along with properly set compilation parameters, allowed us to process the whole context for m = 0 and n = 0.

Table 4. The number of concepts generated by the In-Close4 algorithm and the elapsed time, before eliminating SNPs with missing genotypes.

Min intent size  Min extent size  Total time, s  No. of concepts
0                45               21.2           18,617
0                40               23.6           34,400
0                30               35.8           68,477
0                20               46.1           165,864
0                10               64.3           214,007
0                5                188.3          1,220,576
0                0                143.43         1,979,439
Table 5. The number of concepts generated by the In-Close4 algorithm and the elapsed time, after eliminating SNPs with missing genotypes.

Min intent size  Min extent size  Total time, s  No. of concepts
0                40               10.4           2,743
0                30               10.6           4,196
0                20               12.6           19,620
30               0                5.8            352,257
25               0                6.2            466,695
20               0                7.4            695,962
15               0                10.7           1,308,222
10               0                18.3           3,226,277
Even if we do not know the number of output concepts for the context after elimination of missing SNP values, their number is more than 10 times larger than that of OA-biclusters, which might be considered an argument in favour of using OA-biclusters for the studied problem with rather low or no size constraints. (The last line in Table 4 and the last five lines in Table 5 correspond to experiments conducted for the final version of the paper on the transposed contexts.)
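The size constraints |A| ≥ m and |B| ≥ n can be illustrated with a deliberately naive concept enumerator over a toy context. In-Close4 itself uses a far more efficient CbO-style search; this sketch only mirrors its input/output behaviour on contexts small enough for brute force:

```python
from itertools import combinations

def concepts(context, m=0, n=0):
    """Enumerate formal concepts (A, B) of a binary context with |A| >= m
    and |B| >= n, by closing every attribute subset. Exponential in the
    number of attributes; only suitable for tiny demonstration contexts."""
    objects = range(len(context))
    attrs = range(len(context[0]))
    found = set()
    for r in range(len(context[0]) + 1):
        for B in combinations(attrs, r):
            # B' : objects having every attribute of B
            A = tuple(g for g in objects if all(context[g][a] for a in B))
            # A' : attributes shared by every object of A (closure of B)
            B_closed = tuple(a for a in attrs if all(context[g][a] for g in A))
            if len(A) >= m and len(B_closed) >= n:
                found.add((A, B_closed))
    return sorted(found)

# toy patients x SNP-attributes context (1 = genotype value present)
ctx = [[1, 1, 0],
       [1, 0, 1],
       [1, 1, 1]]
for A, B in concepts(ctx, m=1, n=1):
    print(A, B)
```

Raising m or n shrinks the output exactly as the "Min extent size" and "Min intent size" columns do in Tables 4 and 5.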
A new approach to processing the missing values in datasets of SNP genotypes obtained with DNA microarrays is proposed. It is based on OA-biclustering. We applied the approach to real-world datasets representing the genotypes of patients with ischemic stroke and healthy people. It allowed us to carefully estimate and eliminate the SNPs with missing genotypes. The results of the OA-biclustering algorithm showed the possibility of detecting relatively large dense biclusters, which significantly helped in removing the effects of data leaks and overfitting while applying ML algorithms.

We compared our algorithm with In-Close4. The number of OA-biclusters generated by our algorithm is significantly lower than the number of concepts (or biclusters) generated by In-Close4. Besides, our algorithm has the advantage of using OA-biclusters without the need to experiment with finding the best minimum support, as in the case of using In-Close4 for generating formal concepts.

Since survey [31] mentioned frequent itemset mining (FIM) as a tool to identify strong associations between allelic combinations associated with diseases, the proposed algorithm needs further comparison with other approaches from FIM, like DeBi [32], and anytime discovery approaches, like Alpine [33], tested on GEA datasets as well; though their use may get complicated if we need to keep information about object names for decision-makers. It also requires further time complexity improvements to increase the scalability and quality of the extensive bicluster finding process for massive datasets.

Another avenue for related studies delves into Boolean biclustering [34] and factorisation techniques [35].

Speaking about other possible applications of biclustering, we suggest the development of a new imputation technique.
Since biclustering has been recently applied to impute the missing values in gene expression data [36], and both GED and SNP genotyping data are obtained with DNA microarrays and represented as an integer matrix, it can potentially be applied to impute the genotypes, which would facilitate statistical analyses and empower ML algorithms.
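As a rough illustration of this idea (not the method of [36]), missing calls inside a dense bicluster could be filled with the per-SNP mode over the bicluster's patients. The missing-value code and the toy bicluster below are assumptions of this sketch:

```python
import numpy as np

MISSING = -1  # uncalled-genotype code (an assumption of this sketch)

def impute_in_bicluster(G, rows, cols):
    """Illustrative only: inside a dense bicluster (rows x cols) of the
    genotype matrix G, replace missing calls by the per-SNP mode among the
    bicluster's patients, who are assumed to share a similar genotype
    pattern. Returns a copy; cells outside the bicluster are untouched."""
    G = G.copy()
    for j in cols:
        observed = [G[i, j] for i in rows if G[i, j] != MISSING]
        if not observed:
            continue                          # nothing to infer from
        vals, counts = np.unique(observed, return_counts=True)
        mode = vals[np.argmax(counts)]
        for i in rows:
            if G[i, j] == MISSING:
                G[i, j] = mode
    return G

# tiny dense bicluster: 3 patients x 3 SNPs with two uncalled genotypes
G = np.array([[0, 2, MISSING],
              [0, 2, 2],
              [0, MISSING, 2]])
G_imputed = impute_in_bicluster(G, rows=[0, 1, 2], cols=[0, 1, 2])
print(G_imputed)
```

A real imputation method would also weight candidates by bicluster density and linkage structure; the mode is only the simplest consistent choice.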
Acknowledgements.
This study was implemented within the framework of the Basic Research Program at the National Research University Higher School of Economics and the Laboratory of Models and Methods of Computational Pragmatics in 2020. The authors thank Prof. Alexei Fedorov (University of Toledo College of Medicine, Ohio, USA) and Prof. Svetlana Limborska (Institute of Molecular Genetics of National Research Centre "Kurchatov Institute", Moscow, Russia) for insightful discussions of the results obtained, and the anonymous reviewers.
Funding.
The study was funded by RFBR (Russian Foundation for Basic Research) according to the research project No 19-29-01151. The foundation had no role in study design, data collection and analysis, writing the manuscript, or the decision to publish.
References
1. Bumgarner, R.: Overview of DNA microarrays: types, applications, and their future. Curr Protoc Mol Biol, Chapter 22 (Jan 2013) Unit 22.1
2. Dehghan, A.: Genome-Wide Association Studies. Methods Mol. Biol. (2018) 37–49
3. Nicholls, H.L., John, C.R., Watson, D.S., Munroe, P.B., Barnes, M.R., Cabrera, C.P.: Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci. Front Genet (2020) 350
4. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics (suppl 1) (2002) S136–S144
5. Mirkin, B.: Mathematical Classification and Clustering. Kluwer, Dordrecht (1996)
6. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. on Comp. Biol. and Bioinform. (1) (2004) 24–45
7. Cheng, Y., Church, G.M.: Biclustering of expression data. In Bourne, P.E., et al., eds.: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000, AAAI (2000) 93–103
8. Tanay, A., Sharan, R., Shamir, R.: Biclustering algorithms: A survey. Handbook of Computational Molecular Biology (1-20) (2005) 122–124
9. Busygin, S., Prokopyev, O., Pardalos, P.M.: Biclustering in data mining. Computers & Operations Research (9) (2008) 2964–2987
10. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. 1st edn. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1999)
11. Besson, J., Robardet, C., Boulicaut, J., Rome, S.: Constraint-based concept mining and its application to microarray data analysis. Intell. Data Anal. (1) (2005) 59–82
12. Blachon, S., Pensa, R.G., Besson, J., Robardet, C., Boulicaut, J., Gandrillon, O.: Clustering formal concepts to discover biologically relevant knowledge from gene expression data. In Silico Biol. (4-5) (2007) 467–483
13. Kaytoue, M., Kuznetsov, S.O., Napoli, A., Duplessis, S.: Mining gene expression data with pattern structures in formal concept analysis. Inf. Sci. (10) (2011) 1989–2001
14. Andrews, S., McLeod, K.: Gene co-expression in mouse embryo tissues. Int. J. Intell. Inf. Technol. (4) (2013) 55–68
15. Ignatov, D.I., Kaminskaya, A.Y., Kuznetsov, S., Magizov, R.A.: Method of Biclusterization Based on Object and Attribute Closures. In: Proc. of the 8th International Conference on Intellectualization of Information Processing (IIP 2011), Cyprus, Paphos, October 17–24, MAKS Press (2010) 140–143 (in Russian)
16. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J.: Concept-based biclustering for internet advertisement. In: 2012 IEEE 12th International Conference on Data Mining Workshops, IEEE (2012) 123–130
17. Andrews, S.: In-Close2, a high performance formal concept miner. In Andrews, S., et al., eds.: Conceptual Structures for Discovering Knowledge - 19th Int. Conf. on Conceptual Structures, ICCS 2011, Proceedings. Volume 6828 of LNCS, Springer (2011) 50–62
18. Yevtushenko, S.A.: System of data analysis "Concept Explorer". In: Proc. 7th National Conference on Artificial Intelligence (KII'00) (2000) 127–134
19. Arnauld, A., Nicole, P.: La logique ou l'art de penser (Logique de Port-Royal). Archives de la linguistique française. Ch. Savreux, Guignart (1662)
20. Ignatov, D.: Models, Algorithms, and Software Tools of Biclustering Based on Closed Sets. PhD thesis, HSE University, Moscow (2010)
21. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinform. (9) (2006) 1122–1129
22. Kuznetsov, S.O.: Mathematical aspects of concept analysis. Journal of Mathematical Sciences (2) (1996) 1654–1698
23. Gnatyshak, D., Ignatov, D.I., Kuznetsov, S.O., Nourine, L.: A one-pass triclustering approach: Is there any room for big data? In Bertet, K., Rudolph, S., eds.: Proc. of the 11th Int. Conf. on Concept Lattices and Their Applications (CLA 2014). Volume 1252 of CEUR Workshop Proceedings, CEUR-WS.org (2014) 231–242
24. Leskovec, J., Rajaraman, A., Ullman, J.D.: Finding Similar Items. In: Mining of Massive Datasets, 3rd edn. Cambridge University Press (2020) 73–134
25. Ignatov, D.I., Tochilkin, D., Egurnov, D.: Multimodal clustering of boolean tensors on MapReduce: Experiments revisited. In Cristea, D., et al., eds.: Suppl. Proceedings of ICFCA 2019 Conference and Workshops. Volume 2378 of CEUR Workshop Proceedings, CEUR-WS.org (2019) 137–151
26. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM (6) (June 1970) 377–387
27. Shetova, I.M., Timofeev, D.I., Shamalov, N.A., Bondarenko, E.A., Slominski, P.A., Limborskaia, S.A., Skvortsova, V.I.: The association between the DNA marker rs1842993 and risk for cardioembolic stroke in the Slavic population. Zh Nevrol Psikhiatr Im S S Korsakova (3 Pt 2) (2012) 38–41
28. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (Dec 1997) 1735–1780
29. Andrews, S.: Making Use of Empty Intersections to Improve the Performance of CbO-Type Algorithms. In Bertet, K., Borchmann, D., Cellier, P., Ferré, S., eds.: Formal Concept Analysis - 14th International Conference, ICFCA 2017, Proceedings. Volume 10308 of LNCS, Springer (2017) 56–71
30. Janostik, R., Konecny, J., Krajca, P.: LCM is well implemented CbO: Study of LCM from FCA point of view. In Valverde-Albacete, F.J., Trnecka, M., eds.: Proc. of the Fifteenth Int. Conf. on Concept Lattices and Their Applications, 2020. Volume 2668 of CEUR Workshop Proceedings, CEUR-WS.org (2020) 47–58
31. Naulaerts, S., Meysman, P., Bittremieux, W., Vu, T., Berghe, W.V., Goethals, B., Laukens, K.: A primer to frequent itemset mining for bioinformatics. Briefings Bioinform. (2) (2015) 216–231
32. Serin, A., Vingron, M.: DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms for Molecular Biology (1) (2011) 18
33. Hu, Q., Imielinski, T.: ALPINE: progressive itemset mining with definite guarantees. In Chawla, N.V., Wang, W., eds.: Proceedings of the 2017 SIAM International Conference on Data Mining, SIAM (2017) 63–71
34. Michalak, M., Slezak, D.: On Boolean Representation of Continuous Data Biclustering. Fundam. Informaticae (3) (2019) 193–217
35. Belohlávek, R., Outrata, J., Trnecka, M.: Factorizing Boolean matrices using formal concepts and iterative usage of essential entries. Inf. Sci. 489