Relevant Attributes in Formal Contexts
Tom Hanika, Maren Koyda, and Gerd Stumme
Knowledge & Data Engineering Group, University of Kassel, Germany
Interdisciplinary Research Center for Information System Design, University of Kassel, Germany
[email protected], [email protected], [email protected]
Abstract
Computing conceptual structures, like formal concept lattices, is in the age of massive data sets a challenging task. There are various approaches to deal with this, e.g., random sampling, parallelization, or attribute extraction. A so far not investigated method in the realm of formal concept analysis is attribute selection, as done in machine learning. Building up on this we introduce a method for attribute selection in formal contexts. To this end, we propose the notion of relevant attributes which enables us to define a relative relevance function, reflecting both the order structure of the concept lattice as well as the distribution of objects on it. Finally, we overcome computational challenges for computing the relative relevance through an approximation approach based on information entropy.
Keywords:
Formal Concept Analysis, Relevant Features, Attribute Selection, Entropy, Label Function
The increasing number of features (attributes) in data sets poses a challenge for many procedures in the realm of knowledge discovery. In particular, methods employed in formal concept analysis (FCA) become infeasible for large numbers of attributes. Of peculiar interest there is the construction, visualization and interpretation of formal concept lattices, an algebraic structure usually represented through line or order diagrams. The data structure used in FCA is a formal context, roughly a data table where every row represents an object associated to attributes described through columns. Contemporary such data sets consist of thousands of rows and columns. Since the computation of all formal concepts is at best possible with polynomial delay [11], thus sensitive to the output size, it is almost unattainable even for moderately large data sets. The problem for the computation of valid (attribute) implications is even more serious, since enumerating them is not possible with polynomial delay [7] (in lexicographic order) and only few algorithms are
(Authors are given in alphabetical order. No priority in authorship is implied.)

known to compute them [18]. Furthermore, in many applications storage space is limited, e.g., mobile computing or decentralized embedded knowledge systems. To overcome both the computational infeasibility as well as the storage limitation one is required to select a sub-context resembling the original data set most accurately. This can be done by selecting attributes, objects, or both. In this work we will focus on the identification of relevant attributes. This is, due to the duality of formal contexts, similar to the problem of selecting relevant objects. There are several comparable works, e.g., [15], where the author investigated the applicability of random projections. For supervised machine learning tasks there are even more sophisticated methods utilizing filter approaches, which are based on the distribution of labels [20]. Works more related to FCA resort, e.g., to concept sampling [3] and concept selection [13]. Both approaches, however, either need to compute the whole (possibly large) concept lattice or sample from it. In this work we overcome this limitation and present a feasible approach for selecting relevant attributes from a formal context using information entropy. To this end we introduce the notion of attribute relevance to the realm of FCA, based on a seminal work by Blum and Langley [2]. There the authors address a comprehensible theory for selecting the most relevant features in supervised machine learning settings. Building up on this we formalize a relative relevance measure in formal contexts in order to identify the most relevant attributes. However, this measure is still prone to the limitation of computing the concept lattice. Finally, we tackle this disadvantage by approximating the relative relevance measure through an information entropy approach.
Choosing attributes based on this approximation leads to significantly more relevant selections than random sampling does, which we demonstrate in an empirical experiment. As for the structure of this paper, in Section 2 we give a short overview of previous works on relevant attribute selection. Subsequently we recall some basic notions from FCA, followed by our definitions of relevance and relative relevance of attribute selections and their approximations. In Section 4 we illustrate and evaluate our notions through experiments showing that the approximations are significantly superior to random sampling. We conclude our work and give an outlook in Section 5.
In the field of supervised machine learning there are numerous approaches for feature set selection. The authors of [10] introduced a beneficial categorization of those into two categories: wrapper models and filters. The wrapper models evaluate feature subsets using the underlying learning algorithm. This allows to respond to redundant or correlated features. However, these models demand many computations and are prone to reproduce the procedural bias of the underlying learning algorithm. A representative of this model type is the class of selective Bayesian classifiers by Langley and Sage [16]. There the authors extended the naive Bayesian classifier by considering only subsets of given feature sets for predictions. The other category from [10] is filter models. Those work independently from the underlying learning algorithm. Instead these methods make use of general characteristics like the attribute distribution with respect to the labels in order to weight an attribute's importance. Hence, they are more efficient but are likely to select redundant or futile features with respect to an underlying machine learning procedure. A well-known method representing this class is RELIEF [12], which denotes the relevance of all features referring to the class label using a statistical method. An entropy based approach of a filter model was introduced by Koller et al. [14]. There the authors introduced selecting features based on the Kullback-Leibler distance. All these methods incorporate an underlying notion of attribute relevance. This notion was captured and formalized in the seminal work by Blum and Langley [2], on which we will base the notion of relevant attributes in formal contexts. There are some approaches in FCA to face the attribute selection problem. In [15] a procedure based on random projection was developed.
Less related are methods employed after computing the formal concept lattice, e.g., concept sampling [3] and concept selection [13]. Those could be compared to methods from [16], as they first compute the concept lattice. More related works originate from granular computing with FCA. A basic idea there is to find information granules based on entropy. To this end the authors of [17] introduced an (object) entropy function for formal contexts, which we will utilize in this work as well. Their approach used the principles of granulation as in [21], which is based on merging attributes to reduce the data set. Since our focus is on selecting attributes, we turn away from this notion in general.
Before we start with our definition of relevant attributes of a formal context, we want to recall some basic notions from formal concept analysis. For a thorough introduction we refer the reader to [8]. A formal context is a triple K := (G, M, I), where G and M are finite sets called object set and attribute set, respectively. Those are connected through a binary relation I ⊆ G × M, called incidence. If (g, m) ∈ I for an object g ∈ G and an attribute m ∈ M, we write gIm and say "object g has attribute m". On the power set of the objects (power set of the attributes) we introduce two derivation operators ·′: P(G) → P(M), where A ↦ A′ := {m ∈ M | ∀g ∈ A: (g, m) ∈ I}, and ·′: P(M) → P(G), where B ↦ B′ := {g ∈ G | ∀m ∈ B: (g, m) ∈ I}. A pair (A, B) with A ⊆ G and B ⊆ M is called a formal concept of the context (G, M, I) iff A′ = B and B′ = A. For a formal concept c = (A, B) the set A is called the extent (ext(c)) and B the intent (int(c)). For two concepts (A₁, B₁) and (A₂, B₂) there is a natural partial order given through (A₁, B₁) ≤ (A₂, B₂) iff A₁ ⊆ A₂. The set of all formal concepts of some formal context K, denoted by B(K), together with the just introduced partial order constitutes the formal concept lattice B(K) := (B(K), ≤).

A severe computational problem in FCA is to compute the set of all formal concepts, which resembles the CLIQUE problem [11]. Furthermore, the number of formal concepts in a proper sized real-world data set tends to be very large, e.g., 238710 in the (small) mushroom data set, see Section 4.1. Hence, concept lattices for contemporary sized data sets are hard to grasp and hard to cope with through consecutive measures and metrics. Thus, a need for selecting sub-contexts from data sets or sub-lattices is self-evident. This selection can be conducted in the formal context as well as in the concept lattice.
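To make the derivation operators concrete, the following Python sketch (ours, not part of the paper) implements ·′ on both sides and a naive enumeration of all formal concepts. It is only meant for toy-sized contexts, since it closes every subset of objects.

```python
# Minimal sketch (not from the paper): the two derivation operators and a
# naive enumeration of all formal concepts of a small formal context.
# The context is stored as a dict mapping each object to its attribute set.
from itertools import combinations

def up(context, objects):
    """A -> A': attributes shared by all given objects (M for the empty set)."""
    attrs = set().union(*context.values())
    for g in objects:
        attrs &= context[g]
    return attrs

def down(context, attrs):
    """B -> B': objects having all given attributes."""
    return {g for g, row in context.items() if attrs <= row}

def concepts(context):
    """All (extent, intent) pairs; extents arise as closures A'' of object subsets."""
    G = list(context)
    extents = set()
    for r in range(len(G) + 1):
        for A in combinations(G, r):
            extents.add(frozenset(down(context, up(context, set(A)))))
    return [(set(ext), up(context, ext)) for ext in extents]
```

On the running example of Figure 1 (right) this yields six formal concepts, matching the extent label values read off in Example 3.1.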
However, the computationally feasible choice is to do this in the formal context. Considering an induced sub-context can be done in general in three different ways: one may consider only a subset Ĝ ⊆ G, a subset M̂ ⊆ M, or a combination of those. Our goal for the rest of this work is to identify relevant attributes in a formal context. The notion of (attribute) relevance shall cover two aspects: the lattice structure and the distribution of objects on it. The task at hand is to choose the most relevant attributes which both reflect a large part of the lattice structure as well as the distribution of the objects on the concepts. For this we will introduce in the next section a notion of relevant attributes in a formal context. Due to the duality in FCA this can easily be translated to object relevance.

There is a plenitude of conceptions for describing the relevance of an attribute in a data set. Apparently, the relevance should depend on the particular machine learning or knowledge discovery procedure. One very influential work in this direction was done by Blum and Langley in [2], where the authors defined the (weak/strong) relevance of an attribute in the realm of labeled data. In particular, for some data set of examples D, described using features from some feature set F, where every d ∈ D has the label (distribution) ℓ(d), the authors stated: A feature x ∈ F is relevant to a target concept-label if there exists a pair of examples a, b ∈ D such that a and b only differ in their assignment of x and ℓ(a) ≠ ℓ(b). They further expanded their notion, calling some attribute x weakly relevant iff it is possible to remove a subset of the features (from a and b) such that x becomes relevant. Since in the realm of formal concept analysis data is commonly unlabeled we may not directly adapt the above notion to formal contexts. However, we may motivate the following approach with it.
We cope with the lack of a label function in the following way. First, we identify the data set D with a formal context (G, M, I), where the elements of G are the examples and M are the features describing the examples. Secondly, a formal concept lattice exhibits essentially two almost independent properties, the order structure and the distribution of objects (attributes) on it, cf. Example 3.1. Thus, a conceptual label function then shall reflect both the order structure as well as the distribution of objects in this structure. To achieve this we propose the following.

Definition 3.1 (Extent Label Function). Let K := (G, M, I) be a formal context with concept lattice B(K). The map ℓ_K: G → N, g ↦ |{c ∈ B(K) | g ∈ ext(c)}| is called extent label function.

One may define an intent label function analogously. Utilizing the just introduced label function we may now define the notion of relevant attributes in formal contexts.

Definition 3.2 (Relevance). Let K := (G, M, I) be a formal context. We say an attribute m ∈ M is relevant to g ∈ G if and only if ℓ_{K_m}(g) < ℓ_K(g), where K_m := (G, M \ {m}, I ∩ (G × (M \ {m}))). Furthermore, m is relevant to a subset A ⊆ G iff there is a g ∈ A such that m is relevant to g. And, we say m is relevant to the context K iff m is relevant to G.

            a b c d e f g h i
Leech       × ×         ×
Bream       × ×         × ×
Frog        × × ×       × ×
Dog         ×   ×       × × ×
Spike-weed  × ×   ×   ×
Bean        ×   × × ×

            a b c d
Bream       × ×
Frog        × × ×
Dog         ×   ×
Spike-weed  × ×   ×

[line diagram of the right context's concept lattice omitted]
Figure 1.
Sub-contexts of "Living Beings and Water" [8]. The attributes are: a: needs water to live, b: lives in water, c: lives on land, d: needs chlorophyll to produce food, e: two seed leaves, f: one seed leaf, g: can move around, h: has limbs, i: suckles its offspring.
Example 3.1.
Figure 1 (right) shows a formal context and its concept lattice. The objects from there are abbreviated by their first letter in the following. The extent label function of the objects can easily be read from the lattice and is given by ℓ_K(B) = 2, ℓ_K(F) = 4, ℓ_K(D) = 2, ℓ_K(S) = 3. Additionally, one can deduce the relevant attributes. E.g., for attribute b the equality ℓ_{K_b}(D) = ℓ_K(D) holds. In contrast ℓ_{K_b}(S) < ℓ_K(S), cf. Figure 2. Hence, attribute b is not relevant to "Dog" but relevant to "Spike-weed". Thus, b is relevant to K.

There are two structural approaches in FCA to identify admissible attributes, namely attribute clarifying and reducibility. Those are based purely on the lattice structure. A formal context K := (G, M, I) is called attribute clarified iff for all attributes m, n ∈ M with m′ = n′ it follows that m = n. If there is furthermore no m ∈ M and X ⊆ M \ {m} with m′ = X′ the context is called attribute reduced. Analogously, the terms object clarified and object reduced can be determined. An attribute and object clarified (reduced) context is simply called clarified (reduced). The concept lattice of the clarified/reduced context is isomorphic to the concept lattice of the original context. If one of these properties does not hold for an attribute (or an object) the context can be clarified/reduced by eliminating all such attributes (objects). Obviously, the notion of relevant attributes is related to reducibility.

Lemma 3.3 (Irreducible). For m ∈ M in K = (G, M, I) it holds that m is relevant to K ⇐⇒ m is irreducible.

Proof.
We first show (⇒). We have to show that the following inequality holds: |{c ∈ B(K) | g ∈ ext(c)}| ≤ |{c ∈ B(K_m) | g ∈ ext(c)}|. Since g ∈ ext(c), and for any c ∈ B(K) there exists a unique concept ĉ ∈ B(K_m) with int(ĉ) ∪ {m} = int(c), cf. [8, p. 24], we have that g ∈ (int(ĉ) ∪ {m})′ ⊆ int(ĉ)′. For (⇐) we employ [8, Prop. 30], i.e., there is a join preserving order embedding (G, M \ {m}, I ∩ (G × (M \ {m}))) → (G, M, I) with (A, B) ↦ (A, A′). Hence, every extent in B(K_m) is also an extent in B(K); since m is irreducible, the extent m′ is not among them, which implies ℓ_{K_m}(g) < ℓ_K(g) for every g ∈ m′.

The last lemma implies that no clarifiable attributes would be considered as relevant, even if the removal of all attributes that have the same closure would have a huge impact on the structure of the concept lattice. Therefore a meaningful
Figure 2.
Sub-lattices created through the removal of an attribute from Figure 1 (right). From left to right: removing a, b, c, or d.

identification of relevant attributes restricts to the identification of meaningful equivalence classes [x]_K := {y ∈ M | x′ = y′} for all x ∈ M. Accordingly we consider in the following only clarified contexts. Transferring the relevance of an attribute m ∈ M to its equivalence class is an easy task which can be executed if necessary. So far we are only able to decide the relevance of an attribute but not to discriminate attributes upon their relevancy to the concept lattice. To overcome this limitation we introduce in the following a measure which is able to compare the relevancy of two given attributes in a clarified formal context. We consider the change in the object label distribution {(g, ℓ_K(g)) | g ∈ G} going from K to K_m as characteristic for the relevance of a relevant attribute m. To examine this characteristic in more detail and to make it graspable via a numeric value we propose the following inequality: ∑_{g∈G} ℓ_{K_m}(g) < ∑_{g∈G} ℓ_K(g). This approach offers not only the possibility to verify the existence of a change in the object label distribution but also to measure the extent of this change. We may quantify this via ∑_{g∈G} ℓ_{K_m}(g) / ∑_{g∈G} ℓ_K(g) =: t(m), whence t(m) ≤ 1 for all attributes m ∈ M.

Definition 3.4 (Relative Relevance). Let K = (G, M, I) be a clarified formal context. The attribute m ∈ M is relative relevant to K with

r(m) := 1 − ∑_{g∈G} |{c ∈ B(K_m) | g ∈ ext(c)}| / ∑_{g∈G} |{c ∈ B(K) | g ∈ ext(c)}| = 1 − t(m).

The values of r(m) for an attribute are in [0, 1). We say m ∈ M is more relevant to K than n ∈ M iff r(n) < r(m). Double counting leads to the following proposition.

Proposition 3.5.
Let K = (G, M, I) be a formal context. For all m ∈ M it holds that

r(m) = 1 − ∑_{c ∈ B(K)_m} |ext(c)| / ∑_{c ∈ B(K)} |ext(c)|, where B(K)_m = {c ∈ B(K) | (int(c) \ {m})′ = ext(c)}.

This statement reveals an interesting property of the just defined relative relevance. In fact, an attribute m ∈ M is more relevant to a formal context K if the join preserving sub-lattice, which one obtains by removing m from K, exhibits a smaller sum of all extent sizes. This will enable us to find proper approximations to the relative relevance in Section 3.2.

Example 3.2.
Excluding one attribute from the running example in Figure 1 (right) results in the sub-lattices in Figure 2. The relative relevance of the attributes to the original context is given by r(a) = 0, r(b) = 4/11, r(c) = 3/11, and r(d) = 1/11. By means of r(·) it is also possible to measure the relative relevance of a set N ⊆ M. We simply lift Proposition 3.5 by r(N) = 1 − ∑_{c ∈ B(K)_N} |ext(c)| / ∑_{c ∈ B(K)} |ext(c)| with B(K)_N = {c ∈ B(K) | (int(c) \ N)′ = ext(c)}.
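The relative relevance of Definition 3.4 can be computed directly from the extent label sums. The following Python sketch (ours, a brute-force computation for small contexts, not the authors' implementation) reproduces the values of Example 3.2.

```python
# Sketch (hypothetical helpers, not the authors' code): relative relevance
# r(m) = 1 - sum_g l_{K_m}(g) / sum_g l_K(g), where l is the extent label
# function; by double counting, sum_g l(g) equals the sum of all extent sizes.
from itertools import combinations

def extents(context):
    """All concept extents, i.e., closures A'' of object subsets A."""
    G = list(context)
    M = set().union(*context.values())
    result = set()
    for r in range(len(G) + 1):
        for A in combinations(G, r):
            common = set(M)
            for g in A:
                common &= context[g]                               # A'
            result.add(frozenset(g for g in G if common <= context[g]))  # A''
    return result

def label_sum(context):
    """Sum over all objects g of the extent label l(g) (double counting)."""
    return sum(len(e) for e in extents(context))

def relative_relevance(context, m):
    """r(m) for removing attribute m from the context."""
    reduced = {g: attrs - {m} for g, attrs in context.items()}
    return 1 - label_sum(reduced) / label_sum(context)
```

For the running example the total label sum is 11, so removing b yields r(b) = 4/11, in line with Example 3.2.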
Lemma 3.6. Let K = (G, M, I) be a formal context and S, T ⊆ M attribute sets. Then
i) S ⊆ T ⟹ r(S) ≤ r(T), and
ii) r(S ∪ T) ≤ r(T) + r(S).

Proof. We prove i) by showing ∑_{c ∈ B(K)_S} |ext(c)| ≥ ∑_{c ∈ B(K)_T} |ext(c)|. Since for all c ∈ B(K) we have (int(c) \ T)′ ⊇ (int(c) \ S)′ ⊇ ext(c), we obtain B(K)_S ⊇ B(K)_T, as required. For ii) we will use the identity (⋆): B(K)_S ∩ B(K)_T = B(K)_{S∪T}, which follows from (int(c) \ S)′ = ext(c) ∧ (int(c) \ T)′ = ext(c) ⇐⇒ (int(c) \ (S ∪ T))′ = ext(c) for all c ∈ B(K). This equivalence is true since
(⇒): (int(c) \ (S ∪ T))′ = ((int(c) \ S) ∩ (int(c) \ T))′ = (int(c) \ S)′ ∪ (int(c) \ T)′ = ext(c) ∪ ext(c) = ext(c);
(⇐): from ext(c) ⊆ (int(c) \ S)′ ⊆ (int(c) \ (S ∪ T))′ = ext(c) and ext(c) ⊆ (int(c) \ T)′ ⊆ (int(c) \ (S ∪ T))′ = ext(c) we obtain (int(c) \ S)′ = ext(c) and (int(c) \ T)′ = ext(c).
We now show ii) by proving the inequality ∑_{c ∈ B_S} |ext(c)| + ∑_{c ∈ B_T} |ext(c)| ≤ ∑_{c ∈ B} |ext(c)| + ∑_{c ∈ B_{S∪T}} |ext(c)|, where B_X is short for B(K)_X. Using B_S = (B_S \ B_{S∪T}) ∪ B_{S∪T} with (B_S \ B_{S∪T}) ∩ B_{S∪T} = ∅, and analogously for T, we find with (⋆) the equivalent inequality

∑_{c ∈ B_S\B_{S∪T}} |ext(c)| + ∑_{c ∈ B_T\B_{S∪T}} |ext(c)| + 2·∑_{c ∈ B_{S∪T}} |ext(c)| ≤ ∑_{c ∈ B_S\B_{S∪T}} |ext(c)| + ∑_{c ∈ B_T\B_{S∪T}} |ext(c)| + ∑_{c ∈ B\(B_S∪B_T)} |ext(c)| + 2·∑_{c ∈ B_{S∪T}} |ext(c)|,

which holds since the additional summand ∑_{c ∈ B\(B_S∪B_T)} |ext(c)| is non-negative.

Equipped with the notion of relative relevance and some basic observations we are ready to state the associated computational problem.
We imagine that in real-world applications attribute selection is the task to identify a set N ⊆ M of the most relevant attributes for a given cardinality n ∈ N, i.e., an element from {N ⊆ M | |N| = n ∧ r(N) maximal}. We call such a set N a maximal relevant set.

Problem 3.1 (Relative Relevance Problem (RRP)).
Let K = (G, M, I) be a formal context and n ∈ N with n < |M|. Find a subset N ⊆ M with |N| = n such that r(N) ≥ r(X) for all X ⊆ M with |X| = n.
Solving Problem 3.1 is twofold infeasible. First, as n increases, so does the number of possible subset combinations: the determination of a maximal relevant set requires the computation and comparison of C(|M|, n) different relative relevances, which presents itself infeasible. Secondly, the computation of the relative relevance presumes that the set of formal concepts is computed. This states also an intractable problem for large formal contexts, which are the focus for applications of the proposed relevance selection method. To overcome the first limitation we suggest an iterative approach. Instead of testing every subset of size n we construct N ⊆ M by first considering all singleton sets {m} ⊆ M. Consecutively, in every step, where X is the so far constructed set, we find x ∈ M such that r(X ∪ {x}) ≥ r(X ∪ {m}) for all m ∈ M. This approach requires the computation of only ∑_{i=|M|−n+1}^{|M|} i different relative relevances and their comparisons, which simplifies to n·|M| − (n−1)·n/2. We call a set obtained through this approach an iterative maximal relevant set (IMRS). In fact, the IMRS does not always correspond to the maximal relevant set. In (G, M, I) with G = {1, 2, 3, 4}, M = {a, b, c, d} and I = {(1,a), (1,c), (1,d), (2,a), (2,b), (3,b), (3,c), (4,d)}, b is the most relevant attribute, i.e., r(b) > r(x) for all x ∈ M \ {b}. However, we find r({a,c}) > r({b,x}) for all x ∈ M \ {b}. Hence, the relative relevance of an IMRS indicates a lower bound for the relative relevance of the maximal relevant set.

Motivated by the computational infeasibility of Problem 3.1 we investigate in this section the possibility of approximating the RRP, more specifically the IMRS. Approaches for this approximation have to incorporate both aspects of the relative relevance: the structure of the concept lattice and the distribution of the objects.
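Before turning to the approximation, the greedy construction of an IMRS described above can be sketched as follows (helper names are ours; the closure computation is brute force and only illustrates the greedy step on small contexts):

```python
# Greedy IMRS sketch (not the authors' code): in each step add the attribute
# m that maximizes r(X ∪ {m}) for the set X constructed so far.
from itertools import combinations

def _label_sum(context, removed=frozenset()):
    """Sum of extent labels in the context without the attributes `removed`."""
    ctx = {g: attrs - removed for g, attrs in context.items()}
    G = list(ctx)
    M = set().union(*ctx.values()) if ctx else set()
    extents = set()
    for r in range(len(G) + 1):
        for A in combinations(G, r):
            common = set(M)
            for g in A:
                common &= ctx[g]                                  # A'
            extents.add(frozenset(g for g in G if common <= ctx[g]))  # A''
    return sum(len(e) for e in extents)

def relevance_of_set(context, N):
    """r(N) = 1 - (label sum without N) / (label sum of K)."""
    return 1 - _label_sum(context, frozenset(N)) / _label_sum(context)

def imrs(context, n):
    """Iterative maximal relevant set of size n (greedy construction)."""
    M = set().union(*context.values())
    X = set()
    for _ in range(n):
        best = max(M - X, key=lambda m: relevance_of_set(context, X | {m}))
        X.add(best)
    return X
```

On the counterexample context above the sketch picks b first (the most relevant singleton), while the truly maximal pair is {a, c}, illustrating that the IMRS only provides a lower bound.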
Considering the former is not complicated due to [8, Prop. 30], which states that for any (G, M, I) the lattice B((G, N, I ∩ (G × N))) is join preserving order embeddable into B((G, M, I)) for any N ⊆ M. Thus, this aspect can be represented through the quotient |B(K_{M\N})| / |B(K)|, which is a special case of the maximal common subgraph distance, see [5]. Hence, whenever searching for the largest B((G, N, I ∩ (G × N))) the obvious choice is to optimize for large contra-nominal scales in sub-contexts of (G, M, I). For example, when selecting three attributes in Figure 1 (left) the largest join preserving order embeddable lattice would be generated by the set {b, c, d}. However, the relative relevance of {b, c, g} is significantly larger, in particular, r({b,c,d}) = 17/33 and r({b,c,g}) = 19/33. Considering the second requirement, the distribution of the objects on the concept lattice, the sizes of the concept extents have to be incorporated. Since they are unknown, unless we compute the concept lattice, we need a proxy for estimating their influence. Accordingly, we want to reflect this with the quotient E(K_{M\N}) / E(K), which estimates the change of the object distribution on the concept lattice when selecting a set N ⊆ M. This quotient employs a mapping E from formal contexts to R, K ↦ E(K), which is yet to be found. A natural candidate for this mapping would be information entropy, as introduced by Shannon in [19]. He defined the entropy of a discrete set of probabilities p₁, ..., pₙ as H = −∑ᵢ pᵢ log pᵢ. We adapt this formula to the realm of formal contexts as follows.

Definition 3.7.
Let K = (G, M, I) be a formal context. Then the Shannon object information entropy of K is given as follows:

E_SE(K) = ∑_{g∈G} −(|g″|/|G|) log(|g″|/|G|)

For this entropy function we employ the quotient |g″|/|G|, which reflects the extent sizes of the object concepts of K. Obviously this choice does not consider all concept extents. However, since every extent in a concept lattice is either the extent of an object concept or the intersection of finitely many extents of object concepts, we see that the Shannon object information entropy does relate to all extents to some degree. We found another candidate for E in the literature [17]. The authors there introduced an entropy function which is, roughly speaking, the mean distance of the extents of object concepts to the complete set of objects.

Definition 3.8.
Let K = (G, M, I) be a formal context. Then the object information entropy of K is given as follows:

E_OE(K) = (1/|G|) ∑_{g∈G} (1 − |g″|/|G|)

We directly observe that this entropy decreases as the number of objects having similar attribute sets increases. Furthermore, we recognize an essential difference of E_OE compared to E_SE. The Shannon object information entropy reflects the number of bits necessary to encode the formal context. In contrast, the object information entropy reflects the average number of bits to encode an object from the formal context. To enhance the first grasp of the just introduced functions as well as the relative relevance defined in Definition 3.4 we want to investigate them on well known contextual scales: in particular, the ordinal scale O_n := ([n], [n], ≤), the nominal scale N_n := ([n], [n], =), and the contranominal scale C_n := ([n], [n], ≠), where [n] := {1, ..., n}. Since there is a bijection between the set {1, ..., n} and the extent sizes |g″| in an ordinal scale, we obtain E_SE(O_n) = −∑_{i=1}^{n} (i/n) log(i/n) and E_OE(O_n) = (1/n) ∑_{i=1}^{n} (1 − i/n) = (n−1)/(2n). The former diverges to ∞ whereas the latter converges to 1/2. Based on the linear structure of B(O_n) we conclude that B(K) \ B(K)_m = {(m′, m″)} for all m ∈ M. So the relative relevance of the attribute m ∈ M amounts to r(m) = 1 − (∑_{i=1}^{n} i − |m″|) / ∑_{i=1}^{n} i = 2|m″| / (n·(n+1)). Both the nominal scale as well as the contranominal scale satisfy g″ = {g} for all g ∈ G, for different reasons. We conclude that E_SE and E_OE evaluate equally on N_n and C_n, respectively.
In detail, E_SE(N_n) = E_SE(C_n) = −∑_{g∈G} (1/n) log(1/n) = log(n) and E_OE(N_n) = E_OE(C_n) = (1/n) ∑_{g∈G} (1 − 1/n) = (n−1)/n. For the relative relevance we observe that r(m) = r(n) for all m, n ∈ M in the case of the nominal/contranominal scale. This is due to the fact that every attribute is part of the same number of concepts. For the nominal scale it holds that r(m) = 1 − (2n−1)/(2n) = 1/(2n) for all m ∈ M. Hence, as the number of attributes increases, the relevance of a single attribute converges to zero. The relative relevance of an attribute in the case of the contranominal scale is r(m) = 1 − ∑_{k=0}^{n−1} C(n−1, k)·(n−k) / ∑_{k=0}^{n} C(n, k)·(n−k) for all m ∈ M, where C(·, ·) denotes binomial coefficients.
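Both entropy functions can be stated in a few lines. The sketch below (function names are ours, not the paper's) computes |g″| for every object and evaluates E_SE and E_OE; on the nominal scale N_n it reproduces log(n) and (n−1)/n.

```python
# Sketch of the two entropy functions from Definitions 3.7 and 3.8
# (function names are ours). The logarithm is taken base 2, matching the
# "bits" reading given in the text.
from math import log2

def object_closure_sizes(context):
    """|g''| for every object g: an object h lies in g'' iff g' ⊆ h'."""
    return [sum(1 for other in context.values() if attrs <= other)
            for attrs in context.values()]

def shannon_object_entropy(context):
    """E_SE(K) = sum_g -(|g''|/|G|) * log(|g''|/|G|)."""
    n = len(context)
    return sum(-(s / n) * log2(s / n) for s in object_closure_sizes(context))

def object_information_entropy(context):
    """E_OE(K) = (1/|G|) * sum_g (1 - |g''|/|G|)."""
    n = len(context)
    return sum(1 - s / n for s in object_closure_sizes(context)) / n
```

For the running example of Figure 1 (right) this gives E_OE(K) = 0.5625, the value computed in Example 3.3 below.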
Figure 3.
Relevance of attribute selections through entropy (SE, OE), IMRS (IR), and random selection (RA) for the "Living beings in water" (left) and the zoo context (right).
Example 3.3.
Revisiting our running example from Figure 1 (right): this context has four objects with {B}″ = {B, F, S}, {F}″ = {F}, {D}″ = {F, D} and {S}″ = {S}. Its entropies are given by E_OE(K) = (1/4) ∑_{g∈G} (1 − |g″|/4) = 0.5625 and E_SE(K) ≈ 1.81 (using the binary logarithm). Considering both aspects discussed in this section we now want to introduce a function which shall be capable of approximating the RRP.

Definition 3.9.
Let K = (G, M, I) and K_N := (G, N, I ∩ (G × N)) be formal contexts with N ⊆ M. The entropic relevance approximation (ERA) of N is defined as

ERA(N) := (|B(K_N)| / |B(K)|) · (E(K_N) / E(K)).

First, the ERA compares the number of concepts in a given formal context to the number of concepts in the sub-context on N ⊆ M. This reflects the structural impact when restricting the attribute set. Secondly, a quotient is evaluated where the entropy of K_N is compared to the entropy of K. When using Definition 3.9 for finding a subset N ⊆ M with maximal (entropic) relevance it suffices to compute N such that |B(K_N)| · E(K_N) is maximal. This task is essentially less complicated since we only have to compute B(K_N) and E(K_N) for some comparably small formal context K_N.

To assess the ability of approximating relative relevance through Definition 3.9 we carried out several experiments in the following fashion. For all data sets we computed the iterative maximal relevant subsets of M of sizes one to seven (or ten) in the obvious manner. We decided for those fixed numbers for two reasons. First, using a relative number, e.g., 10% of all attributes, would still lead to an infeasible computation when the initial formal context is very large. Secondly, formal contexts with up to ten attributes permit a plenitude of research methods that are impracticable for larger contexts, in particular, human evaluation.
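A sketch of the ERA computation (helper names are ours; the object information entropy E_OE is used for concreteness, and the naive concept count is only feasible for small sub-contexts):

```python
# ERA sketch (Definition 3.9; helpers are ours, not the authors' code):
# restrict the context to an attribute subset N, compare concept counts,
# and weight by an entropy quotient.
from itertools import combinations

def restrict(context, N):
    """K_N := (G, N, I ∩ (G × N))."""
    return {g: attrs & set(N) for g, attrs in context.items()}

def concept_count(context):
    """|B(K)|: number of distinct extents A'' (naive enumeration)."""
    G = list(context)
    M = set().union(*context.values())
    extents = set()
    for r in range(len(G) + 1):
        for A in combinations(G, r):
            common = set(M)
            for g in A:
                common &= context[g]
            extents.add(frozenset(g for g in G if common <= context[g]))
    return len(extents)

def object_information_entropy(context):
    """E_OE from Definition 3.8, used here as the mapping E."""
    n = len(context)
    return sum(1 - sum(1 for other in context.values() if attrs <= other) / n
               for attrs in context.values()) / n

def era(context, N, entropy=object_information_entropy):
    """ERA(N) = |B(K_N)|/|B(K)| * E(K_N)/E(K)."""
    kn = restrict(context, N)
    return (concept_count(kn) / concept_count(context)) * (entropy(kn) / entropy(context))
```

Since the denominators are constant in N, ranking candidate subsets only requires |B(K_N)| and E(K_N) for the small restricted contexts.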
Figure 4.
Relevance of attribute selections through entropy (SE, OE), IMRS (IR), and random selection (RA) for the mushroom (left) and the wiki44k context (right).
Then we computed subsets of M using ERA, for which we used both introduced entropy functions, and their relative relevance. Finally, we sampled subsets of M randomly at least |M| many times and computed their average relative relevance as well as the standard deviation in relative relevance. A total of 2678 formal contexts were considered in this experimental study. Of those, 2674 contexts were excerpts from the BibSonomy platform as described in [1]. All those contexts are equipped with an attribute set of twelve elements and a varying number of objects. The particular extraction method is described in detail in [4]. For the rest we revisited three data sets well known in the realm of formal concept analysis, i.e., mushroom, zoo, water [6, 8], and additionally a data set wiki44k introduced in [9], which is based on a 2014 Wikidata database dump. The well-known mushroom data set is a collection of 8124 mushrooms described by 119 (scaled) attributes and exhibits 238710 formal concepts. The zoo data set possesses 101 animal descriptions using 43 (scaled) attributes and exhibits 4579 formal concepts. The water data set, more formally "Living beings and water", has eight objects and nine attributes and exhibits 19 formal concepts. Finally, wiki44k has 45021 objects and 101 attributes, exhibiting 21923 formal concepts.

In Figures 3 to 5 we depict the results of our computations. We observe in all experiments that the relative relevance of the subsets found through the iterative approach is an upper bound for the relative relevance of all subsets computed through entropic relevance approximation or random selection, with respect to the same size of subset. In particular we find that IMRS of cardinality seven and above have a relative relevance of at least 0.8. Moreover, the relative relevance of the attribute subsets
SEOERA
Figure 5.
Average distance and standard deviation to IMRS for entropy and randombased selections of | N | attributes for 2674 formal contexts from BibSonomy. selected by both ERA versions (SE or OE) exceed the relative relevance of the ran-domly selected subsets except for the Shannon object information entropy for |N|=1and |N|=2 in the zoo context. Principally we find for contexts containing a smallnumber of attributes (Figure 3) a large increase of the distance between the relativerelevance of the randomly selected attributes and the attribute sets selected throughthe entropy approach. This characteristic manifests in the relative relevance of bothERA selections excelling not only the mean relative relevance of randomly chosenattribute sets but also the standard deviation for subset sizes of | N | = 4 and above.In the case of contexts containing a huge number of attributes this observation canbe made for selections with | N | = 1 , already. Furthermore, the interval between therelative relevance of the attribute subsets selected by both ERA versions and the rela-tive relevance of the randomly selected subsets is significantly larger than in the caseof contexts with small attribute set sizes. In general we may point out that neither ofthe entropies seems preferable over the other in terms of performance. In Figure 5 weshow the results for the experiment with the 2674 formal contexts from BibSonomy.We plotted for all three methods, ERA-OE/SE and random, the mean distance inrelative relevance to the IMRS of the same size together with the standard deviation.We detect a significant difference for randomly chosen and ERA chosen sets with re-spect to their relative relevance. The deviation for both ERA is bound by 0 and 0.12 .In contrast, the relative relevance for randomly selected sets is bound by 0.09 and 0.6. We found in our investigation that attribute sets obtained through the iterativeapproach for relative relevance do have a high relevance value. 
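The two selection strategies compared above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the toy context is invented, and the entropy function below is a generic Shannon entropy over restricted object intents, standing in for the SE/OE functions defined earlier; it is also reused as the score for the random baseline.

```python
import collections
import math
import random
import statistics

# Toy formal context (hypothetical data): objects -> attribute sets.
CONTEXT = {
    "duck":  {"flies", "swims", "feathers"},
    "eagle": {"flies", "feathers", "predator"},
    "carp":  {"swims", "scales"},
    "shark": {"swims", "scales", "predator"},
}
ATTRIBUTES = sorted(set().union(*CONTEXT.values()))

def shannon_entropy(attrs):
    """Shannon entropy of the distribution of objects over their
    intents restricted to `attrs` (a stand-in for SE/OE)."""
    attrs = frozenset(attrs)
    intents = [frozenset(a & attrs) for a in CONTEXT.values()]
    n = len(intents)
    return -sum((c / n) * math.log2(c / n)
                for c in collections.Counter(intents).values())

def greedy_entropy_selection(size):
    """ERA-style iterative pick: repeatedly add the attribute that
    maximizes the entropy of the current selection."""
    chosen = set()
    while len(chosen) < size:
        best = max((m for m in ATTRIBUTES if m not in chosen),
                   key=lambda m: shannon_entropy(chosen | {m}))
        chosen.add(best)
    return chosen

def random_baseline(size, samples=50, seed=0):
    """Mean and standard deviation of the score over `samples`
    randomly drawn attribute subsets of the given size."""
    rng = random.Random(seed)
    scores = [shannon_entropy(rng.sample(ATTRIBUTES, size))
              for _ in range(samples)]
    return statistics.mean(scores), statistics.pstdev(scores)

print(sorted(greedy_entropy_selection(2)))
print(random_baseline(2))
```

On such a toy context the greedy pick separates objects more finely than a typical random draw, mirroring the gap between ERA and the random baseline seen in the experiments.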
Even though their relative relevance is only a lower bound compared to the maximal relevant set, they do exhibit a relative relevance of 0.8 for attribute set sizes seven and above. We conclude from this that the iterative approach is a sufficient solution to the relative relevance problem. Based on this we may deduce that entropic relevance approximation is also a good approximation of a solution to the RRP. In particular, in the large formal contexts investigated in this work the approximation was even better than in the smaller ones.
By defining the relative relevance of attribute sets in formal contexts we introduced a novel notion for attribute selection. This notion respects both the structure of the concept lattice and the distribution of the objects on it. To overcome computational limitations, which arose from the notion of relative relevance, we introduced an approximation based on two different entropy functions adapted to formal contexts. For this we used a combination of two factors: the change in the number of concepts and the change in entropy that arise from the selection of an attribute subset. The experimental evaluation of relative relevance as well as the entropic approximation seems to comply with the theoretical modeling.

We may conclude our work with two open questions. First, even though IMRS seems a good choice for relevant attributes, we suspect that computing the maximal relevant set, with respect to RRP, can be achieved more feasibly than presented in this work. Secondly, so far our justification for RRP is based on theoretical assumptions and a basic experimental study. We imagine, and are curious to see, whether maximal relevant attribute sets are also employable in supervised machine learning setups. For example, one may perceive the task of adding a new object to a given formal context as an instance of such a setup. The question is how capable the context is of accommodating this object within an already existing concept.
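This closing question can be made concrete with the standard FCA derivation operators: a new object fits an already existing concept exactly when its attribute set is a closed intent, i.e., equals its own closure B''. A minimal sketch over a hypothetical toy context (names and data are illustrative):

```python
# Toy formal context (hypothetical data): objects -> attribute sets.
CONTEXT = {
    "duck":  {"flies", "swims", "feathers"},
    "eagle": {"flies", "feathers", "predator"},
    "carp":  {"swims", "scales"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())

def extent(attrs):
    """B': all objects possessing every attribute in B."""
    return {g for g, a in CONTEXT.items() if attrs <= a}

def closure(attrs):
    """B'' = (B')': the smallest concept intent containing B."""
    objs = extent(attrs)
    if not objs:
        return set(ALL_ATTRIBUTES)  # empty extent yields the full intent
    return set.intersection(*(CONTEXT[g] for g in objs))

def fits_existing_concept(new_attrs):
    """A new object described by `new_attrs` matches an existing
    concept iff its attribute set is already a closed intent."""
    return closure(set(new_attrs)) == set(new_attrs)

print(fits_existing_concept({"flies", "feathers"}))  # True
print(fits_existing_concept({"flies"}))              # False
```

Here {"flies"} is not closed (every flying object in the toy context also has feathers), so adding such an object would force a new concept rather than reuse an existing one.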
References

[1] D. Benz et al. "The Social Bookmark and Publication Management System BibSonomy." In: The VLDB Journal.

[2] In: Artificial Intelligence.

[3] In: Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 177–188.

[4] D. Borchmann and T. Hanika. "Some Experimental Results on Randomly Generating Formal Contexts." In: CLA. Ed. by Marianne Huchard and Sergei Kuznetsov. Vol. 1624. CEUR Proceedings. CEUR-WS.org, 2016, pp. 57–69.

[5] H. Bunke and K. Shearer. "A graph distance metric based on the maximal common subgraph." In: Pattern Recognition Letters.

[6] UCI Machine Learning Repository. 2017.

[7] F. Distel and B. Sertkaya. "On the complexity of enumerating pseudo-intents." In: Discrete Applied Mathematics.

[8] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, Berlin, 1999, pp. x+284.

[9] V. T. Ho et al. "Rule Learning from Knowledge Graphs Guided by Embedding Models." In: The Semantic Web – ISWC 2018 – 17th International Semantic Web Conference, Monterey, CA, USA, Proceedings, Part I. 2018, pp. 72–90.

[10] G. John, R. Kohavi, and K. Pfleger. "Irrelevant Features and the Subset Selection Problem." In: Machine Learning Proceedings 1994. Ed. by W. Cohen and H. Hirsh. San Francisco (CA): Morgan Kaufmann, 1994, pp. 121–129.

[11] D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou. "On generating all maximal independent sets." In: Information Processing Letters.

[12] In: Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI'92. AAAI Press, 1992, pp. 129–134.

[13] M. Klimushkin, S. Obiedkov, and C. Roth. "Approaches to the Selection of Relevant Concepts in the Case of Noisy Data." In: Formal Concept Analysis. Ed. by L. Kwuida and B. Sertkaya. Springer Berlin/Heidelberg, 2010, pp. 255–266.

[14] D. Koller and M. Sahami. "Toward Optimal Feature Selection." In: Proceedings of the Thirteenth International Conference on Machine Learning. ICML'96. Bari, Italy: Morgan Kaufmann Publishers Inc., 1996, pp. 284–292.

[15] C. Kumar. "Knowledge Discovery in Data Using Formal Concept Analysis and Random Projections." In: Int. J. Appl. Math. Comput. Sci.

[16] In: Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence. UAI'94. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1994, pp. 399–406.

[17] V. Loia, F. Orciuoli, and W. Pedrycz. "Towards a granular computing approach based on Formal Concept Analysis for discovering periodicities in data." In: Knowledge-Based Systems 146 (2018), pp. 1–11.

[18] S. Obiedkov and V. Duquenne. "Attribute-incremental construction of the canonical implication basis." In: Annals of Mathematics and Artificial Intelligence.

[19] C. E. Shannon. "A Mathematical Theory of Communication." In: The Bell System Technical Journal.

[20] In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003, pp. 856–863.

[21] L. A. Zadeh. "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic." In: