Formalising Concepts as Grounded Abstractions
Stephen Clark, Alexander Lerchner, Tamara von Glehn, Olivier Tieleman, Richard Tanburn, Misha Dashevskiy, Matko Bosnjak
DeepMind, London, UK

January 2021

1 Introduction
The notion of concept has been studied for centuries, by philosophers, linguists, cognitive scientists, and researchers in artificial intelligence (Margolis & Laurence, 1999). There is a large literature on formal, mathematical models of concepts, including a whole sub-field of AI—Formal Concept Analysis—devoted to this topic (Ganter & Obiedkov, 2016). Recently, researchers in machine learning have begun to investigate how methods from representation learning can be used to induce concepts from raw perceptual data (Higgins, Sonnerat, et al., 2018). The goal of this report is to provide a formal account of concepts which is compatible with this latest work in deep learning.

Since the concepts literature is so large, and covers so many disciplines, we will not attempt to survey the whole field, but rather provide links to the parts of the literature which are especially relevant to our own work. Good places to start for an introduction to concepts include Margolis and Laurence (1999, 2015, 2019), Murphy (2002), and Gärdenfors (2014).

The main technical goal of this report is to show how techniques from representation learning can be married with a lattice-theoretic formulation of conceptual spaces. The mathematics of partial orders and lattices is a standard tool for modelling conceptual spaces (Ch.2, Mitchell (1997); Ganter and Obiedkov (2016)); however, there is no formal work that we are aware of which defines a conceptual lattice on top of a representation that is induced using unsupervised deep learning (Goodfellow et al., 2016). Higgins, Sonnerat, et al. (2018) do this to a degree, but here we provide a much more comprehensive and formal account. Gärdenfors (2000) offers a geometric account which fits naturally with representation learning, and we will be drawing some inspiration from Gärdenfors' work, but with more of a focus on how concepts can be ordered. The advantage of partially-ordered lattice structures is that they provide natural mechanisms for use in concept discovery algorithms, through the meets and joins of the lattice. Finally, although we do not provide much background in terms of the concepts literature, we do attempt some rigour in the mathematical presentation of lattices and partial orders, which will be based heavily on Davey and Priestley (2002).

Overall, our aim is to provide a formal framework for developing practical conceptual discovery and reasoning systems which are grounded in perception and action (Harnad, 1990), thereby overcoming a fundamental deficiency in formal representation systems which are either constructed manually by a knowledge engineer (Ch.8, Russell and Norvig (2003)) or induced automatically from purely text-based resources (Banko et al., 2007). Note that the main aim of this report is a mathematical one; how to realise the framework in practice, and how the framework relates to the vast literature on concepts—across philosophy, psychology, linguistics and AI—are questions left largely for future work.

The rest of the report is organised as follows. Section 2 introduces the mathematics of partial orders and lattices by first considering the case of discrete feature values. Section 3 then introduces representation learning into the picture, by defining the notion of an instance space, which is the space that an intelligent agent uses to represent its environment. Instances are the representations over which abstraction takes place, and abstraction is the operation which leads to the conceptual lattice structure. Section 3 also extends the treatment of finite feature values to the continuous case, showing how grouping values into sets naturally maintains a partial order. Section 3.4 provides the main mathematical result of this work, which is that concepts can be defined as elements of a complete partial order (CPO), with instances as limiting cases of concepts (maximal elements of the CPO). Finally, Section 4 suggests ways in which probabilities can be introduced into the mix, which is an important question given that the representation learning techniques we consider are inherently probabilistic.
2 Conceptual Spaces as Discrete Feature Lattices
The classical theory of concepts is based on the idea that concepts are essentially definitions, providing necessary and sufficient conditions for membership of the extension of a concept (Margolis & Laurence, 2019). These conditions are often given in terms of defining features; a standard example is the concept of bachelor having the features human, unmarried and male. One of the issues with the classical view is the question of where these features come from. Here we assume that recent developments in representation learning can answer that question, by inducing a representation space with separable dimensions (or more generally separable sub-spaces) which provide the conceptual features (Higgins et al., 2017; Higgins, Amos, et al., 2018).

We certainly don't want to commit to the classical view in general, in particular because we do not define concepts as definitions. Moreover, there are a number of additional issues with the classical view, which motivate the other main concept theories (the prototype, exemplar, and theory theories (Margolis & Laurence, 2019; Murphy, 2002)). The defining characteristic of our approach is that concepts are grounded abstractions (Higgins, Sonnerat, et al., 2018): grounded because we assume a mechanism for inducing concepts from perceptual input, and this explains how concepts are related to the external world; and abstractions because these allow for efficient, combinatorial planning and reasoning in an intelligent agent (Lake, Ullman, Tenenbaum, & Gershman, 2017).

From the mathematical perspective, we need a formalism which can represent sets of feature-value pairs, and also the operations which combine two such sets. There are a number of areas of AI which have made extensive use of feature structures, for example knowledge representation (Ch.2, Mitchell (1997)), formal concept analysis (Ganter & Obiedkov, 2016), and computational linguistics (Carpenter, 1992). In all these cases, the underlying mathematical structures are based on partial orders, and more specifically lattices. Theoretical computer science is another area that has made extensive use of such structures (Abramsky & Jung, 1994). A useful intuition to start with is the idea that, when combining two concepts, the result should a) be consistent with both of the combining concepts; and b) have no more additional information in it than is already present in those concepts. Any reader familiar with formal frameworks for knowledge representation, or perhaps logic programming, may recognise this as an informal description of unification (Ch.9, Russell and Norvig (2003)), which also relies on the mathematics of partial orders.

Much of this section follows the presentation of lattices and partial orders in Davey and Priestley (2002) (D&P). Note also that there is nothing particularly new in this section, and our presentation of discrete feature lattices is similar to the many other presentations found in the fields mentioned above. However, the standard discrete case forms a useful basis on which to develop some of the continuous conceptual lattices described later.

To begin with, let's assume a finite set of features Feat and a finite set of values Val. In Section 3 we'll relax the finite values assumption and consider an infinite set of (potentially continuous) values. One practical way to obtain a finite set of values, at least from a (totally) ordered infinite set, is to create equivalence classes using boundary cutoffs. For example, suppose we have some feature which has values in ℝ; then we can create a finite number of values by partitioning the real number line into "buckets": bucket 1 has all values less than some threshold x_1, bucket 2 all values in [x_1, x_2), and bucket N all values greater than or equal to x_{N-1}, where x_i < x_k for i < k.
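As a concrete illustration, here is a minimal Python sketch of this bucketing scheme. The function name `bucketise` and the example thresholds are ours, not the report's; it simply assumes an ordered list of cutoffs x_1 < … < x_{N−1}.

```python
import bisect

def bucketise(value: float, thresholds: list[float]) -> int:
    """Map a real value to one of N = len(thresholds) + 1 discrete buckets.

    Bucket 1 holds values below thresholds[0]; bucket i holds values in
    [thresholds[i-2], thresholds[i-1]); bucket N holds values >= thresholds[-1].
    """
    assert thresholds == sorted(thresholds), "thresholds must satisfy x_i < x_k for i < k"
    # bisect_right counts how many thresholds lie at or below the value,
    # which is exactly the 0-indexed bucket; add 1 for 1-indexed buckets.
    return bisect.bisect_right(thresholds, value) + 1

# Three thresholds partition the real line into four buckets.
print(bucketise(0.2, [1.0, 2.0, 3.0]))  # -> 1 (below x_1)
print(bucketise(2.5, [1.0, 2.0, 3.0]))  # -> 3 (in [x_2, x_3))
print(bucketise(9.9, [1.0, 2.0, 3.0]))  # -> 4 (at or above x_3)
```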
To make the discussion more concrete, consider the following set of features: Feat = {Color, Shape, Weight, Position}. Each feature has an associated set of possible values, say:

Val(Color) = {Red, Blue, Green, Black, White},
Val(Shape) = {Circle, Square, Triangle, Diamond},
Val(Weight) = {Light, Medium, Heavy},
Val(Position) = {Center, TopLeft, TopRight, BottomLeft, BottomRight}.

We'll denote the union of all the feature values as Val.

The key aspect of an abstraction is that it is missing some information. Phrases in D&P used to express this notion include greater or lesser information content, and being more or less informative than. One concept, or abstraction, that we can form from the features above is Cannonball, which we might define as a heavy, black circle. Since Cannonball has 3 out of a possible 4 features defined, it would be relatively informative, or have high information content.

Mathematically, the way to express an abstraction based on these intuitions is with a partial map (or partial function). Let X and Y be non-empty sets and f : X → Y a map; f can be thought of as a process which assigns a member f(x) of Y to each x ∈ X, or equivalently as a graph, i.e. the set of argument-value pairs defining the map: {(x, f(x)) | x ∈ X}. "If the values of f are given on some subset S of X, we have partial information towards determining f."

Definition 1. A partial map from X to Y is a map σ : S → Y, where dom σ, the domain of σ, is a subset S of X (S can be ∅). (p.7, D&P)

If dom σ = X, then σ is a map (or a total map) from X to Y. The set of partial maps (which includes the total maps) from X to Y is denoted (X ⇀ Y).

Now we are in a position to define a concept as an abstraction: [Footnote: All mathematical definitions in this section, and some of the accompanying text (in quotes when taken verbatim), are from D&P.] [Footnote: There is a technicality here in that the range of the partial map is partitioned into sets corresponding to different values of the domain (e.g. the value of the Color feature cannot be TopLeft), but we'll gloss over that for now.]

Definition 2. Assuming a finite set of features Feat and a finite set of values Val, a concept C is a partial map from Feat to Val, i.e. a map δ_C : SubFeat → Val, where SubFeat ⊆ Feat.

Equivalently, each concept is a set of feature-value pairs (the corresponding graph). Continuing the earlier example, the concept of a cannonball would be defined as:
Cannonball = {⟨Color, Black⟩, ⟨Shape, Circle⟩, ⟨Weight, Heavy⟩}.

Notice this is an abstraction (partial map) over Feat, since Position has no value.

What sorts of binary combination operations might we want to perform over these sets of feature-value pairs? The obvious ones are set union, intersection and difference. For example:

{⟨Color, Red⟩} ∪ {⟨Shape, Circle⟩} = {⟨Color, Red⟩, ⟨Shape, Circle⟩}
{⟨Color, Black⟩, ⟨Shape, Circle⟩} ∩ {⟨Shape, Circle⟩, ⟨Weight, Heavy⟩} = {⟨Shape, Circle⟩}
{⟨Color, Black⟩, ⟨Shape, Circle⟩} \ {⟨Color, Black⟩} = {⟨Shape, Circle⟩}
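To make the set-based reading concrete, here is a small Python sketch (our own encoding, not the report's) in which a concept is a frozenset of (feature, value) pairs, so the three combinations above are literally the built-in set operations:

```python
# A concept as a set of (feature, value) pairs; plain set operations then
# implement the three combinations shown above. (Note the caveat from the
# text below: a raw union may give one feature two values, so it is not
# always a legitimate concept -- unification, introduced later, handles that.)
Red = frozenset({("Color", "Red")})
Circle = frozenset({("Shape", "Circle")})
BlackCircle = frozenset({("Color", "Black"), ("Shape", "Circle")})
HeavyCircle = frozenset({("Shape", "Circle"), ("Weight", "Heavy")})
Black = frozenset({("Color", "Black")})

print(Red | Circle)               # union: {("Color","Red"), ("Shape","Circle")}
print(BlackCircle & HeavyCircle)  # intersection: {("Shape","Circle")}
print(BlackCircle - Black)        # difference: {("Shape","Circle")}
```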
The intuition behind the first example is that, given the concepts Red and Circle, we can, through the set union operation, form a new concept RedCircle. Similarly, for the second example, from BlackCircle and HeavyCircle, and the application of set intersection, we can form Circle. And finally, from BlackCircle and Black, and the application of set difference, we can form Circle.

What is the algebraic structure underlying these operations? Set union and intersection applied to the elements of the power set of some set provides a textbook example of a lattice, which relies on the notion of a partial order.

Definition 3. Let P be a set. An order or partial order on P is a binary relation ≤ on P such that, for all x, y, z ∈ P,
(i) x ≤ x,
(ii) x ≤ y and y ≤ x imply x = y,
(iii) x ≤ y and y ≤ z imply x ≤ z. (p.2, D&P)

These conditions define a partial order as a relation that is i) reflexive, ii) antisymmetric, and iii) transitive. A set P equipped with an order relation ≤ is said to be a (partially) ordered set, or poset.

There are other types of order resulting from different sets of constraints. For example, a relation ≤ on a set P which is reflexive and transitive but not necessarily antisymmetric is a quasi-order or pre-order. A partially ordered set with the additional condition that, for all x, y ∈ P, either x ≤ y or y ≤ x (i.e. any two elements of P are comparable), is called a chain or linearly ordered set or totally ordered set. If two elements x and y are not comparable in the order, i.e. x ≰ y and y ≰ x, then we write x ∥ y.
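As a quick sanity check of Definition 3, the following Python sketch (helper names ours) verifies by brute force that the subset relation on a small powerset satisfies all three axioms:

```python
from itertools import chain, combinations

def powerset(xs):
    """All subsets of xs, as frozensets."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(xs, r) for r in range(len(xs) + 1))]

def is_partial_order(elements, leq) -> bool:
    """Brute-force check of Definition 3: reflexive, antisymmetric, transitive."""
    reflexive = all(leq(x, x) for x in elements)
    antisym = all(x == y for x in elements for y in elements
                  if leq(x, y) and leq(y, x))
    trans = all(leq(x, z) for x in elements for y in elements for z in elements
                if leq(x, y) and leq(y, z))
    return reflexive and antisym and trans

P = powerset({"a", "b", "c"})
print(is_partial_order(P, lambda x, y: x <= y))  # subset relation -> True
```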
Definition 4. Bottom and top. Let P be an ordered set. We say P has a bottom element if there exists ⊥ ∈ P (called bottom) with the property that ⊥ ≤ x for all x ∈ P. Dually, P has a top element if there exists ⊤ ∈ P such that x ≤ ⊤ for all x ∈ P. (p.15, D&P)

Consider the set P = P(X), the powerset of some set X; then ⟨P; ⊆⟩ is a partial order. It is easy to check that the subset relation is reflexive, antisymmetric, and transitive. We also have that in ⟨P; ⊆⟩, ⊥ = ∅ and ⊤ = X.

In order to progress to lattices, we need the notions of least upper bound and greatest lower bound. Here is how these are defined in D&P:
Definition 5. Let P be an ordered set and let S ⊆ P. An element x ∈ P is an upper bound of S if s ≤ x for all s ∈ S. A lower bound is defined dually. The set of all upper bounds of S is denoted by S^u (read as 'S upper') and the set of all lower bounds by S^l (read as 'S lower'):

S^u = {x ∈ P | (∀s ∈ S) s ≤ x}  and  S^l = {x ∈ P | (∀s ∈ S) s ≥ x}

"If S^u has a least element x, then x is called the least upper bound of S. Equivalently, x is the least upper bound of S if
(i) x is an upper bound of S, and
(ii) x ≤ y for all upper bounds y of S.
Dually, if S^l has a greatest element x, then x is called the greatest lower bound of S. Since least elements and greatest elements are unique, least upper bounds and greatest lower bounds are unique when they exist." (p.33, D&P)

There are some alternative notations and terminology for referring to bounds. The least upper bound of S is also called the supremum of S and is denoted by sup S; the greatest lower bound of S is also called the infimum of S and is denoted inf S. The notation and terminology we will use in the rest of the report is based on the terms meet and join. We will write x ∨ y ('x join y') instead of sup{x, y} and x ∧ y ('x meet y') instead of inf{x, y}. We can also write ⋁S (the 'join of S') and ⋀S (the 'meet of S') instead of sup S and inf S.

Lattices are particular cases of ordered sets in which x ∨ y and x ∧ y exist for all x, y ∈ P. These are the mathematical structures that will form the basis of all the conceptual feature lattices described in the rest of the report.
Definition 6. Let P be a non-empty ordered set.
(i) If x ∨ y and x ∧ y exist for all x, y ∈ P, then P is called a lattice.
(ii) If ⋁S and ⋀S exist for all S ⊆ P, then P is called a complete lattice. (p.34, D&P)

As an example, for any set X, and some indexing set I which picks out elements of P(X), the poset ⟨P(X); ⊆⟩ is a complete lattice in which

⋁{A_i | i ∈ I} = ⋃{A_i | i ∈ I}
⋀{A_i | i ∈ I} = ⋂{A_i | i ∈ I}.

Figure 1 shows a diagram of the subset relation, in the form of what is known as a Hasse diagram. The lines represent the covering relation underlying the partial order, with the ordering going upwards on the page, and with implied transitivity. So the fact that there is a line from {b} to {a, b}, for example (since {b} ⊆ {a, b}), and from {a, b} to {a, b, c} (since {a, b} ⊆ {a, b, c}), implies that {b} ⊆ {a, b, c}. [Footnote: We could orientate the diagram in the other direction, but having the lesser elements at the bottom is consistent with D&P, and fits with the intuition that lesser in the ordering corresponds to lower on the page. Another way to reverse the orientation is to use the dual relation, in this case the superset relation.]

It's easy to see from the diagram why ⟨P(X); ⊆⟩ is a lattice: take any pair in the diagram, and follow the lines upward from each element of the pair. In all cases, the lines will intersect, and where there is more than one intersection point, one of those points will be lower than the other intersection points; this is the least upper bound of the pair (or the join). In the most extreme case, the join will be at the top (⊤ = {a, b, c}). A similar comment applies to following lines downward, in which case two of the lines will intersect at the greatest lower bound, or the meet—in the most extreme case at the bottom (⊥ = ∅).

A lattice with a top ⊤ and bottom ⊥ element is called bounded. A finite lattice L is automatically bounded, with ⊤ = ⋁L and ⊥ = ⋀L. All finite lattices are also complete.

2.3.1 Set Difference

We might also like to apply set difference (also known as set minus) to two concepts (as we did in the BlackCircle example above), and hence give a definition of set difference in terms of partial orders. An initial thought might be that we can define set difference in terms of union and intersection, but that is not possible: we also need set complement. D&P do define complements in lattices, but do not deal with the case of set difference. The definition of complements is as follows.
Definition 7. Complements. Let L be a bounded lattice with ⊥ and ⊤ elements. For x ∈ L, we say z is a complement of x if x ∧ z = ⊥ and x ∨ z = ⊤. Note that complements are not necessarily unique, and some elements may have no complement.

To get to set difference we need to extend the definition to relative complements, which we can do as follows (Partee et al., 1990).
Definition 8. Relative complements. Let L be a bounded lattice with ⊥ and ⊤ elements. For x, y ∈ L, where x ≤ y, we say z is a complement of x relative to y if z is a complement of x in the sub-partial order below y, i.e. x ∧ z = ⊥ and x ∨ z = y.

Now consider the lattice ⟨P(U); ⊆⟩, for some set U (an example of which is in Figure 1), and the set difference Y − X for X, Y ∈ P(U) and X ⊆ Y. The complement of X relative to Y is the set difference in this case:

Y \ X = {x ∈ U | x ∈ Y, x ∉ X}

Note that X ∩ (Y \ X) = ∅ and X ∪ (Y \ X) = Y.

In the more general case, we may not have x ≤ y (or X ⊆ Y). For example, {a, b} \ {b, c} = {a} according to the definition of set difference above. In this case, we take the complement of y ∧ x relative to y. So for {a, b} \ {b, c}, we take the complement of {a, b} ∩ {b, c} = {b}, relative to {a, b}, which is {a}.

Let's go back to the concept example from earlier, and, in order that we can easily draw a diagram, we'll reduce the number of values for each feature:
Val(Color) = {Red, Blue},
Val(Shape) = {Circle, Square},
Val(Weight) = {Light, Heavy},
Val(Position) = {Top, Bottom}.

We would like to order the concepts to produce a diagram like that in Figure 1, which essentially tells us how to combine two concepts (by finding either the meet or the join). But there's a problem: what do we do in the case where the values of a feature clash? For example, what is the order between the concepts Cannonball and RedBalloon:

{⟨Color, Black⟩, ⟨Shape, Circle⟩, ⟨Weight, Heavy⟩}
{⟨Color, Red⟩, ⟨Shape, Circle⟩, ⟨Weight, Light⟩} ?

[Figure 2: Feature (meet-)semilattice with the subset relation over feature-value pairs (partial maps).]

D&P provide the following order, which neatly captures the idea that, not only do we want to take set unions (and intersections) of the feature-value pairs, but we also need the combining concepts (partial maps) to be compatible.
Definition 9. The set of partial maps (X ⇀ Y) is ordered as follows. Given σ, τ ∈ (X ⇀ Y), define σ ≤ τ if and only if dom σ ⊆ dom τ and σ(x) = τ(x) for all x ∈ dom σ. Equivalently, σ ≤ τ if and only if graph σ ⊆ graph τ in P(X × Y). (p.7, D&P)

The first formulation says that, for σ ≤ τ, if a feature f has a value in σ, then f also has a value in τ and σ(f) = τ(f). The second, equivalent formulation says that the set of feature-value pairs defining σ is a subset of the set of feature-value pairs defining τ. So note the similarity with how the subset relation defined the lattice of set unions and intersections given earlier.

Figure 2 shows part of the Hasse diagram for the set of feature values listed at the start of this subsection. It is incomplete in that not all nodes are shown, and, for those nodes that are in the diagram, not all links between nodes are shown. The blue letters correspond to the 4 features, and the red letters to the values. Note again the convention—following D&P—of having the least elements towards the bottom of the page.

The lattice has a bottom element (⊥): the "empty" concept at the bottom of the diagram (and if we consider each concept as a set of feature-value pairs, then it is correct to say ⊥ = ∅). But note how there is no single top element (⊤), since there is no unique way of filling out all the feature values which leads to a set of feature-value pairs which is a superset of all concepts. The structure in Figure 2 is a meet-semilattice, since every pair of elements has a meet (greatest lower bound), but not every pair has a join (least upper bound).

It's perhaps worth giving the bottom element a name in our case, so let's call it the universal concept.

Definition 10. The universal concept is a partial map σ from Feat to Val where dom σ = ∅.

Earlier we referred to this concept as the "empty" concept, but "universal" concept is better since intuitively this concept applies to all instances.

How do we combine pairs of concepts in this structure? For the equivalent of set intersection, which corresponds to going down the page, follow the links from the two concepts until they meet (which is guaranteed to happen in a meet-semilattice). Since any node can have more than one child in the diagram, there are potentially many paths to follow, and potentially many ways for paths to cross, but there will be one unique meeting point which is the highest in the lattice (since uniqueness of the greatest lower bound is guaranteed).

For the equivalent of set union, follow the links from the two concepts upwards until they join (which is not guaranteed to happen in a meet-semilattice). If there are no joining points, then the operation is undefined. If there is at least one joining point, then one of them will be uniquely the lowest in the lattice.
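The following Python sketch (our own encoding, not the report's) renders these ideas executable: a concept is a dict from feature names to values, Definition 9's ordering becomes `subsumes`, and the meet and join just described become two small functions, with the join returning None when no join exists:

```python
from typing import Optional

Concept = dict  # a partial map from feature names to values

def subsumes(sigma: Concept, tau: Concept) -> bool:
    """Definition 9: sigma <= tau iff dom(sigma) is a subset of dom(tau)
    and the two maps agree on dom(sigma)."""
    return all(f in tau and tau[f] == v for f, v in sigma.items())

def meet(c1: Concept, c2: Concept) -> Concept:
    """Greatest lower bound: the shared feature-value pairs.
    The empty dict is the bottom/universal concept."""
    return {f: v for f, v in c1.items() if c2.get(f) == v}

def join(c1: Concept, c2: Concept) -> Optional[Concept]:
    """Least upper bound: union of the pairs, or None if a feature clashes
    (i.e. the join does not exist in the meet-semilattice)."""
    if any(f in c2 and c2[f] != v for f, v in c1.items()):
        return None
    return {**c1, **c2}

cannonball = {"Color": "Black", "Shape": "Circle", "Weight": "Heavy"}
red_balloon = {"Color": "Red", "Shape": "Circle", "Weight": "Light"}
print(subsumes({"Shape": "Circle"}, cannonball))  # True
print(meet(cannonball, red_balloon))              # {'Shape': 'Circle'}
print(join(cannonball, red_balloon))              # None: Color clashes
```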
In the case where there are no clashes in two concepts (i.e. no features getting different values), set difference is defined as one would expect (as the set difference of the sets of feature-value pairs). What about when there are clashes? For example, what is the result of Cannonball \ Balloon, where the concepts are, respectively:

{⟨Color, Black⟩, ⟨Shape, Circle⟩, ⟨Weight, Heavy⟩} \ {⟨Shape, Circle⟩, ⟨Weight, Light⟩} ?

Section 2.3.1 provided the solution, in terms of relative complements. First we take the meet of the two concepts, which is {⟨Shape, Circle⟩}:

{⟨Color, Black⟩, ⟨Shape, Circle⟩, ⟨Weight, Heavy⟩} ∧ {⟨Shape, Circle⟩, ⟨Weight, Light⟩} = {⟨Shape, Circle⟩}

Then we take the complement of {⟨Shape, Circle⟩} relative to Cannonball:

compl. of {⟨Shape, Circle⟩} rel. to {⟨Color, Black⟩, ⟨Shape, Circle⟩, ⟨Weight, Heavy⟩} = {⟨Color, Black⟩, ⟨Weight, Heavy⟩}
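Continuing the dict encoding from the earlier sketch, this two-step recipe—meet first, then complement relative to the first concept—can be written directly (again our own illustrative code):

```python
Concept = dict  # partial map from features to values, as before

def meet(c1: Concept, c2: Concept) -> Concept:
    return {f: v for f, v in c1.items() if c2.get(f) == v}

def difference(c: Concept, d: Concept) -> Concept:
    """C \\ D as the complement of (C meet D) relative to C: drop from C
    exactly the pairs it shares with D."""
    shared = meet(c, d)
    return {f: v for f, v in c.items() if f not in shared}

cannonball = {"Color": "Black", "Shape": "Circle", "Weight": "Heavy"}
balloon = {"Shape": "Circle", "Weight": "Light"}
# Only <Shape, Circle> is shared, so it alone is removed:
print(difference(cannonball, balloon))  # {'Color': 'Black', 'Weight': 'Heavy'}
```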
Note that set difference is "well-behaved" here also, as was the case for intersection (but not union): for concepts C and D, C \ D is the set of feature-value pairs that are in C but not D.

As discussed, the feature lattice in Figure 2 does not have a top element (⊤). We could just insert one, with all the total maps from Feat to Val (the top row in the diagram) pointing at ⊤. Then we would have a lattice, and not just a semilattice. One argument for adding a top element comes from considering the corresponding set of instances (the extension of a concept). Here it is natural to think of the top element—which we might call the impossible concept—as having the empty set as its corresponding set of instances. However, ⊤ would not correspond to a legitimate concept, since it would not be a graph of a partial map; hence we choose not to have a top in our feature lattices.
D&P do not name the ≤ ordering relation in Definition 9. It would be useful to have a name, since we're going to consider ways to extend this relation below. One option is to adopt the name used in other areas of computer science, including logic programming and computational linguistics, which is subsumption, usually denoted ⊑.

We can also name the equivalents of the set union and intersection operations, again since we'd like to extend these to the case where the feature values themselves are ordered (and not just related through equality). The extension of the union operation is called unification.

Definition 11. The unification of two feature structures (partial maps) F and F′, F ⊔ F′, is the join of F and F′ in the set of feature structures ordered by subsumption (⊑). If F ∨ F′ is undefined, we say the unification has failed.

We can also define an extension of the intersection operation. There does not appear to be an accepted term in the literature, but one candidate is generalisation.

Definition 12. The generalisation of two feature structures (partial maps) F and F′, F ⊓ F′, is the meet of F and F′ in the set of feature structures ordered by subsumption (⊑).

2.4.4 Some Useful Intuitions

One reasonable question to ask at this point is: if a partial map is just a set of feature-value pairs, then why are the relevant operations not just the usual set union, intersection and difference on those sets of pairs? In fact, for the equivalent of intersection (generalisation) and set difference, they are: the generalisation of two concepts X and Y just is the intersection of the corresponding sets of pairs; similarly for set difference (as we have seen in the examples above). In terms of Fig. 2, "moving down" the diagram maintains these set operations.

The only change is when "moving up" the diagram, when applying unification (or taking unions). Since the values for a feature can differ across concepts, simply taking the union could result in a set of feature-value pairs where one feature has more than one value, which is not a partial map over Feat, and so not a concept.

The feature structures we have been describing are an instance of a well-studied structure in pure mathematics, which turns out to have many applications, namely an intersection structure. D&P (p.48) provide a definition and more discussion of the potential applications. As D&P say:

Intersection structures which occur in computer science are usually topless while those in algebra are almost invariably topped. In a complete lattice . . . of this type, the meet is just set intersection, but in general the join is not set union. (p.48, D&P)
2.4.5 Disjunctive Feature Lattices

We finish off this section by considering disjunctive feature lattices with a finite set of values. We might want to specify that a feature can have one value or another. For example, a swan can be black or white; a snooker ball can be white, red, yellow, green, brown, blue, pink or black. In the case of finite feature values, the natural way to represent the alternatives is with a set, in which case the value space becomes a power set.

Definition 13. Let Val be a finite set of values; then the disjunctive feature set associated with Val is Val_disj = P(Val) \ {∅}. Val_disj is ordered according to the superset relation.

Note that the disjunctive set of values does not contain the empty set, and the ordering is according to the superset, not subset, relation. Figure 3 shows the meet-semilattice defined by this ordering for an example value set of discrete colors. Why is the ordering the superset, and not subset, relation? If we consider the corresponding sets of instances (the concept extensions), then we'd like these sets to get smaller when moving "up" the ordering. Since the sets in Figure 3 represent disjunctions, the sets of instances do get smaller when going from the bottom to the top of the semilattice.

The semilattice does have a bottom—the universal set of colors—but no top.
[Figure 3: The meet-semilattice for Color_disj, with the set of possible colors Color = {Black, White, Red, Blue}.]

Why not have the empty set as the top of the semilattice, and then have the full powerset lattice? The reason is that we'd like some unifications (finding joins) to be undefined, for example:

{⟨Color, {Black, White}⟩} ⊔ {⟨Color, {Red, Blue}⟩} = Undefined.

If the color semilattice had a top, the unification above would be defined, with the resulting color value being the empty set. [Footnote: It's interesting to consider what the interpretation of the empty set would be in this case, if we were to add it: might it signify the "positive" absence of color, or a concept where the lack of color is important (as opposed to color simply not being specified)?]

Consider again the above case, but this time with generalisation (finding the meet) rather than unification:

{⟨Color, {Black, White}⟩} ⊓ {⟨Color, {Red, Blue}⟩} = {⟨Color, {Black, White, Red, Blue}⟩}.

The value of the color feature is now the universe of colors, i.e. any color is possible. This case is interesting because we can ask whether a disjunction over the universe of colors is any different semantically to the unspecified feature value in our original feature lattice (Figure 2). Or, to give a particular example, what is the difference between the following two concepts,
where Unspecified means there is no feature-value pair for this particular feature:

{⟨Color, {Black, White, Red, Blue}⟩} = {⟨Color, Unspecified⟩} ?

If the answer is that there is no difference, then the feature semilattice in Figure 2 can be replaced with one where all concepts are fully specified—so there is only the top row in the diagram—but the values themselves have a semilattice structure, as in Figure 3. (Imagine filling in all the empty slots in each partial map with the universe of colors.) One possible reason to maintain a difference, and hence maintain both lattice structures, is to argue that semantically there is a difference between a concept where the color is important, but it can be any color, and one where the color is irrelevant and hence unspecified. A counter-argument is to note that the extensions—the sets of instances to which the concepts apply—are the same in both cases. We choose to keep the partial maps since these extend naturally to the conceptual lattices in later sections, where there may not be a universal set covering all values (for example, there is no finite real interval covering all possible intervals; see Section 3.2.2).
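A minimal sketch of the disjunctive value space, under our own function names: values are frozensets ordered by ⊇, so per-feature joins are intersections (failing on the empty set) and meets are unions:

```python
from typing import Optional

def join_value(v1: frozenset, v2: frozenset) -> Optional[frozenset]:
    """Join of two disjunctive values under the superset ordering:
    set intersection, undefined (None) when the sets are disjoint."""
    v = v1 & v2
    return v if v else None

def meet_value(v1: frozenset, v2: frozenset) -> frozenset:
    """Meet of two disjunctive values: set union."""
    return v1 | v2

bw = frozenset({"Black", "White"})
rb = frozenset({"Red", "Blue"})
print(join_value(bw, rb))   # None: unification fails, as in the text
print(meet_value(bw, rb))   # the universe {Black, White, Red, Blue}
print(join_value(bw, frozenset({"White", "Red"})))  # frozenset({'White'})
```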
3 Abstractions over Instances
The standard presentation of feature lattices in the previous section assumed a finite set of features and values. But where do these features come from? Classical approaches to AI have typically assumed that the features are either provided manually by a knowledge engineer (Ch.8, Russell and Norvig (2003)) or induced automatically from text-based resources (e.g. Banko et al. (2007)). However, it is now recognised that one of the difficulties in the whole GOFAI enterprise was the challenge of connecting such features to perception and action (Cantwell Smith, 2019), which is necessary in order that the features be properly grounded (Harnad, 1990). [Footnote: The recent successes of large-scale language models such as GPT-3 suggest that such grounding may not be necessary for purely text-based applications such as machine translation or summarisation, but if the goal is an embodied, situated agent perceiving and acting in a world, then clearly its language use needs to be grounded in that world.]

Recent progress in unsupervised representation learning, especially techniques such as the variational autoencoder (VAE) (Rezende et al., 2014; Kingma & Welling, 2014), holds promise for inducing features automatically from raw perceptual data. In this report we consider the VAE and its variants, such as β-VAE (Higgins et al., 2017), to provide the representation space over which abstraction takes place and upon which conceptual spaces are built. β-VAE in particular has been designed with an inductive bias towards separable representations.

The representation space, or instance space, is the internal space that an intelligent agent uses to represent its environment. We will assume that the representation space has separable dimensions (or more generally separable sub-spaces), which is necessary for the process of abstraction—dropping whole feature dimensions—to be meaningful and useful for building combinatorial abstractions. Since we are considering the representation space to be defined by a β-VAE, we assume that points in the space are arrays of real values. [Footnote: In general, we may not wish to restrict the representation space in this way. For example, we may have a representation learning process which provides discrete feature values, rather than reals, but from here onwards we only consider real-valued spaces.]

Definition 14. An instance in representation space is a point x ∈ ℝ^K.

The probabilistic nature of the VAE means that the instances have some uncertainty associated with them. For now, we consider the deterministic case in which instances are known with certainty; suggestions for how to deal with a posterior distribution over instances are given in Section 4.

Instances correspond to the situations to which concepts apply and, as we will see later, provide extensions to concepts. Why stop at representational instances? Do we not need extensions to be situations in the world? One reason to have instances be agent-internal is that this fits with our philosophy of a mentalist semantics where "meanings are in the head". A second reason is that concepts can be generated on the fly, such as PinkElephant. If extensions were in the world, the extension of this concept would be the empty set.

Once we are committed to this mentalist stance, we need to explain how concepts are grounded in the world. Here we say that concepts inherit their grounding from the base concepts. The features that are likely to be induced by a representation learning algorithm such as β-VAE will be the basic features underlying color, shape, position, and so on. Gärdenfors (2000) uses the term quality dimensions for such features, with examples such as temperature, brightness, weight, and pitch. There is no suggestion that more complex concepts such as Tiger would necessarily be members of the conceptual lattices described in this report. Rather, complex concepts would need to be built up out of the base concepts, which we consider to have the combinatorial properties needed to create the exponentially large set of concepts that make up the human conceptual system. The meets and joins of the conceptual lattices provide a minimal mechanism for combining base concepts and moving beyond single elements from the basic feature set (consider the Cannonball example from earlier), but the question of what the more complex combinatorial processes are that would be required to build Tiger is left for future work, as is an account of how base concepts, founded on perception, can form the basis for abstract concepts such as Democracy (Lakoff & Johnson, 1980).

One last point is that the presentation here is not intended to provide a recipe for how to develop base conceptual spaces in practice; it is merely explanatory, showing what the possible spaces are in theory. It is likely that some of the spaces, e.g. the one based on arbitrary sets of points, will not be very useful to an actual agent.
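Definition 14 can be made concrete with a few lines of Python (entirely our own sketch: the vector below is a random stand-in for, say, the mean of a β-VAE posterior, since no trained encoder is involved):

```python
import random

K = 4  # dimensionality of the (hypothetical) representation space

# An instance is simply a point in R^K -- here a random stand-in for the
# mean of a beta-VAE posterior; no actual encoder is involved.
instance = [random.gauss(0.0, 1.0) for _ in range(K)]

# A point concept is a partial map from dimension indices to real values;
# abstraction drops whole dimensions from an instance.
def abstract(point: list[float], keep: set[int]) -> dict[int, float]:
    return {i: x for i, x in enumerate(point) if i in keep}

full = abstract(instance, set(range(K)))  # a total map: an instance itself
concept = abstract(instance, {0, 2})      # dimensions 1 and 3 abstracted away
print(full, concept, sep="\n")
```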
Figure 4 shows the meet-semilattice that results when features are removed from real-valued instances. The top row contains the instances (of which there are infinitely many), which are fully-specified feature structures where each feature dimension f_i has a real value (the small circles in the diagram). The second row contains all those concepts where a single feature has been removed, and so on down to the bottom row containing concepts with just a single feature specified, and the bottom element sitting beneath all the concepts in the partial order. In fact, this picture is essentially the same as that in Figure 2, but with the discrete feature values replaced with real numbers. We will refer to an element in this real-valued lattice as a point concept. [Footnote: The examples in this report tend to focus on the visual modality, but the representation space can cover a number of modalities, including action/motor/force representations, representation of internal body states, and representations of emotions.]

What might be wrong with this picture as a proposal for a conceptual lattice that an agent could use in practice? Consider two instances I and J where the values on some feature dimension are close, say 3.15 and 3.16 on a "color" dimension. [Footnote: Since the feature dimensions are being discovered automatically by a representation learning algorithm, the interpretation of a dimension as corresponding to some aspect of color, say hue, would need to be carried out by a human through analysis of the representation space.] If we take the meet of I and J, which is the operation that determines what two instances have in common (set intersection in the discrete case), then, assuming that no other feature values are equal, the result would be the empty set, or the ⊥ concept. [Footnote: Finding meets is assumed in this report to be the operation that would form the basis of any concept discovery algorithm, given a set of instances as input, but in practice a number of other factors would need to be taken into account, such as how useful a potential concept is to an agent.] However, it feels as though two values which are this close (assuming a suitable range of color values) should be treated as equal as far as any concept discovery algorithm is concerned. The difficulty is that, whilst the VAE has provided us with feature dimensions for which abstraction is meaningful—i.e. it has provided some useful structure "across" the feature dimensions—there is very little structure "along" or "within" the feature dimensions themselves, each of which is just the real number line. A natural mechanism for injecting some structure into the feature values is to group them into sets.

3.2 Grouping Feature Values into Sets

There are many ways in which real values can be grouped into sets. In order to maintain the lattice structure with any grouping mechanism, we need to a) define a partial order on the value sets themselves; and b) extend the definition of the partial order on partial maps (Defn. 9). Let's extend the definition first:
Definition 15. The set of partial maps (X ⇀ Y) is ordered as follows. Given σ, τ ∈ (X ⇀ Y), define σ ≤ τ if and only if dom σ ⊆ dom τ and σ(x) ≤ τ(x) for all x ∈ dom σ.

All that has changed is that the equality relation between the values σ(x), τ(x) has been replaced with an ordering relation. Note that, because of this change, the equivalent formulation in terms of subsets of the corresponding graphs (sets of feature-value pairs) no longer holds.

3.2.1 Arbitrary Sets of Points

Perhaps the simplest grouping mechanism is to have just arbitrary finite sets of reals as feature values. If we were to use the terminology introduced in Section 2.4.3, then for b) above we would say that the subsumption relation over features needs defining also for these value sets. In the case of arbitrary sets of reals, the natural subsumption ordering is given by set inclusion.
As an example for our hypothetical color dimension, the set of reals corresponding to DarkRed would lie above the set of reals corresponding to Red (assuming that the former is a proper subset of the latter).

What happens to meets and joins in this case? It is useful now to think of meets and joins happening "within" each feature dimension, as well as "across" dimensions. Within each feature dimension, meets are given by set union and joins by set intersection. Note that this is effectively in the "opposite direction" to meets and joins "across" the features (for the discrete value case), where intersections "happened up the page" and unions "down the page". The reason that the value sets get larger when moving down the page is that larger sets correspond to larger sets of instances, and extensions grow larger when moving down the order. The limiting case of an instance, which can be thought of as a singleton set for each dimension, is at the top of the order.

One useful intuition to take from this space is how concepts are disjunctions of point concepts. So a point concept is a conjunction of feature values, where the feature values are real numbers, and the conjunction is being taken "across" the feature dimensions. The disjunctions happen "within" the dimensions, and arise from grouping values into sets. To give another color example, consider again the concept red, corresponding to a set of reals on a single color dimension; then red can be thought of as a disjunction of all particular shades of red (which are all point concepts).

Another useful intuition is that the concepts are efficient descriptions of concept extensions, i.e. the set of instances to which a concept applies; later in Section 3.3 we'll call these descriptions concept intensions (and we will also formally define a concept's extension in Section 3.4). The descriptions are efficient because, for the missing dimensions in any abstraction, there is an assumption that all points along those dimensions form part of the extension, so there is no need to explicitly store or represent those points.

Let's consider taking the meet of two instances in this conceptual space. Let the two instances be

Inst1 = {(f_1, u_1), (f_2, u_2), . . . , (f_n, u_n)},
Inst2 = {(f_1, v_1), (f_2, v_2), . . . , (f_n, v_n)},

then

meet(Inst1, Inst2) = {(f_1, {u_1, v_1}), (f_2, {u_2, v_2}), . . . , (f_n, {u_n, v_n})}.

First note how the instances are "complete" point concepts (without any abstraction having taken place), and how the resulting concept from the meet operation can be represented as a disjunction of point concepts (in this case a disjunction of instances). [Footnote: Note that the instances in the disjunction are not just Inst1 and Inst2, but all instances formed by taking conjunctions of the u's and v's.] An equivalent way to represent the concept is not as a disjunction of conjunctions, but as a conjunction of disjunctions, which shows how the values along each feature dimension are grouped into sets. The meet can be written as

(u_1 or v_1) and (u_2 or v_2) . . . and (u_n or v_n).

The meet operation has grouped the values into sets, but note that no abstraction has taken place, since none of the dimensions have been dropped. The problem, from the perspective of wanting to find useful abstractions, is that no restrictions have been placed on the value sets. A natural restriction is to only allow sets as values in which the maximum and minimum elements are within some distance ε of each other. That way, only values which are relatively close, or "similar", will get grouped together.
If we now assume, for this particular case, that |u_i − v_i| < ε for i = 1 only, then meet(Inst1, Inst2) = {(f_1, {u_1, v_1})}. Notice how placing this restriction on the feature values has neatly led to a process of abstraction happening "for free" when performing the meet operation. Figure 5 shows the outcome of the meet operation when there is no restriction on the sets, whereas Figure 6 shows what happens when the sets only contain values that are sufficiently close.
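Here is that ε-restricted meet as a short Python sketch (our own function and feature names): dimensions whose two values differ by ε or more are simply dropped, so the abstraction falls out of the meet itself:

```python
def eps_meet(inst1: dict, inst2: dict, eps: float) -> dict:
    """Meet of two instances (total maps feature -> real): group the two
    values into a set per dimension, but keep only dimensions on which
    the values are within eps -- abstraction happens 'for free'."""
    return {f: frozenset({u, inst2[f]})
            for f, u in inst1.items()
            if abs(u - inst2[f]) < eps}

inst1 = {"f1": 3.15, "f2": 0.10, "f3": -2.0}
inst2 = {"f1": 3.16, "f2": 5.00, "f3": 4.0}
# Only f1's values are within eps, so only f1 survives the meet:
print(eps_meet(inst1, inst2, eps=0.5))  # {'f1': frozenset({3.15, 3.16})}
```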
Are there any limitations of this conceptual space? Meets appear to work well (with the maximum distance requirement), and the grouping operation adds minimal additional structure to the original instance space. There is a problem, however, with the join operation. To provide some intuition, and adapting an example from earlier, suppose that we have the following two concepts:

{⟨Color, Black_1⟩, ⟨Shape, Circle⟩},
{⟨Color, Black_2⟩, ⟨Weight, Heavy⟩}.

We'd like to unify these two concepts, i.e. take the join, in order to create the concept Cannonball (a heavy black circle). However, the join will only be defined if Black_1 ∩ Black_2 is non-empty. Note that meets work fine, even in the general case: the meet of these two concepts is {⟨Color, Black_1 ∪ Black_2⟩}. But for the joins, it feels as though an additional assumption is needed to get the outcome we'd like. If Black_1 and Black_2 are disjoint, but contain values that are close, then we'd like the unification of these two concepts to be well-defined.

3.2.2 Closed Intervals

We made the assumption above that only points that are sufficiently close along a dimension can form part of a concept. A natural extension of that assumption is the following:
Proposition 1. Convexity condition: If points A and B (within a feature dimension) both form part of a concept C, then all points between A and B are also part of C.

There is a close link here with Gärdenfors (2000), who also imposes a convexity condition (see Section 3.3.2 below). However, Gärdenfors assumes such a condition applies to all concepts, whereas we are only applying it to base concepts (as explained in the introduction to this section). There is no suggestion that more complex concepts need to be convex; in fact, the disjunctive and negated concepts described below, for example, are not.

Proposition 1 leads to feature values which are closed intervals of real values, i.e. Val = {[x, y] | x, y ∈ ℝ, x ≤ y}, with the following ordering.

Definition 16. The set of real-valued intervals is ordered as follows. Given ρ, π ∈ {[x, y] | x, y ∈ ℝ, x ≤ y}, define ρ ≤ π if and only if ρ ⊇ π. Let ρ = [ρ_1, ρ_2] and π = [π_1, π_2]; then equivalently ρ ≤ π if and only if ρ_1 ≤ π_1 and π_2 ≤ ρ_2.

So if we think of an interval as a set of real numbers, then one interval ρ is below another π in the order if and only if π is a subset of ρ, which means that π is fully contained within ρ.

One potential confusion here is that, if we think of the ordering on the features in Definition 15, i.e. the ordering on the domains of the maps, then the concepts become more specific as the domains get larger; whereas, for the values as intervals, the ordering is the reverse: intervals which are larger and contain other intervals are less specific, and hence earlier in the ordering (as was the case for the arbitrary sets of points). One way to clarify the confusion is to consider how informative the respective sets are. On the features side, a concept which has many features is more informative than one that doesn't, in the sense that it corresponds to a smaller set of possible instances, and determines more aspects of the instance space. Conversely, a feature value with a large range is less informative than a value with a small range, since the larger range corresponds to a larger set of possible instances.

What happens to meets and joins in this case? The meet of two intervals (within a feature dimension) is the convex hull and the join is the intersection (leading to an undefined join if the intersection is empty).
Proposition 2. Let R be the set of real-valued intervals, Feat a finite set of features, and Val = R. Assuming the ordering on R from Definition 16, and given F = {(f_i, v_i) | f_i ∈ Feat, v_i ∈ Val}_i and F′ = {(f′_i, v′_i) | f′_i ∈ Feat, v′_i ∈ Val}_i, then F ⊔ F′ (if defined) is:

{(f_i, v_i) | f_i ∈ dom F, f_i ∉ dom F′} ∪ {(f′_i, v′_i) | f′_i ∈ dom F′, f′_i ∉ dom F} ∪ {(f_i, v_i ∩ v′_i) | (f_i, v_i) ∈ F, (f_i, v′_i) ∈ F′}

The unification F ⊔ F′ will be undefined if F and F′ share a feature where the corresponding values do not overlap (i.e. the intersection of the corresponding intervals is the empty set). Where F and F′ share a feature where the corresponding values do overlap, the value of the unification for that feature will be the intersection of the two intervals.
Proposition 3. Assuming the ordering on R from Definition 16, and F and F′ as above, then F ⊓ F′ is:

{(f, [min{a_1, a_2}, max{b_1, b_2}]) | (f, [a_1, b_1]) ∈ F, (f, [a_2, b_2]) ∈ F′}

The generalisation F ⊓ F′ will be the empty set, i.e. the universal concept, if F and F′ do not share any features in common. Where F and F′ do share a feature, the value of the generalisation for that feature will be the convex hull of the two intervals, i.e. the smallest interval containing both.

In order for the meet operation to have the capacity to return an abstraction, i.e. for some dimensions to be dropped, the intervals again need to be restricted in length, as for the arbitrary sets of points. In this case, the meet only returns an interval if the convex hull is within the limit (and if no convex hull is within the limit across all dimensions then the meet is the universal concept at the bottom of the ordering).
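Propositions 2 and 3 translate directly into code. The sketch below (our own encoding; the length limit on meets is omitted for brevity) represents intervals as (lo, hi) pairs:

```python
from typing import Optional

Interval = tuple  # (lo, hi) with lo <= hi, a closed interval of reals
Concept = dict    # partial map from feature names to intervals

def unify(f1: Concept, f2: Concept) -> Optional[Concept]:
    """Proposition 2: keep unshared features, intersect shared intervals;
    fail (None) if any shared pair of intervals is disjoint."""
    out = dict(f1)
    for feat, (a2, b2) in f2.items():
        if feat not in out:
            out[feat] = (a2, b2)
            continue
        a1, b1 = out[feat]
        lo, hi = max(a1, a2), min(b1, b2)
        if lo > hi:
            return None  # disjoint intervals: unification fails
        out[feat] = (lo, hi)
    return out

def generalise(f1: Concept, f2: Concept) -> Concept:
    """Proposition 3: convex hull of the intervals on shared features;
    the empty dict is the universal concept."""
    return {feat: (min(f1[feat][0], f2[feat][0]), max(f1[feat][1], f2[feat][1]))
            for feat in f1.keys() & f2.keys()}

c1 = {"color": (0.30, 0.40), "weight": (0.90, 1.00)}
c2 = {"color": (0.35, 0.50), "shape": (0.00, 0.10)}
print(unify(c1, c2))       # color -> (0.35, 0.40), plus both unshared features
print(generalise(c1, c2))  # {'color': (0.30, 0.50)}
```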
Section 2.4.5 described a conceptual space of disjunctive concepts when the feature values are discrete. In this section we extend that analysis to the case of closed-interval values, as well as negation.

Disjunctive Concepts

The first point to make is that all the conceptual spaces we have considered so far are essentially disjunctive, in the sense that all concepts can be thought of as "conjunctions of disjunctions", as well as disjunctions of point concepts, as described in Section 3.2.1. The fact that the conceptual space has already been factorised into separable dimensions immediately leads to this form.

So how do we add disjunctive concepts to the space of interval values? Since we can have disjunctions of point concepts, the space of arbitrary sets of points is required. Note how this contains all sets of closed intervals, allowing disjunctions of property intervals, for example green or blue or red (assuming each color is a closed interval on some color dimension). However, didn't we describe a problem when taking joins in such a space, leading to the convexity requirement? If so, we wouldn't necessarily want the disjunctive space to be an agent's primary conceptual space for concept discovery and reasoning.

The answer is to only take meets, as part of some concept discovery process, on the original space containing the closed intervals. Intuitively this makes sense: suppose I have two similar instances I and J (along some dimension, i.e. I, J ∈ ℝ). If I'm allowed to form disjunctive concepts during the initial discovery phase, then {I, J} would be a reasonable concept to form. However, if the convexity condition is being applied, then [I, J] would be the outcome. Suppose further that there are two similar instances K and L, which are distant from I and J; then again we can form [K, L], but it is only after forming these two concepts that we can consider forming the disjunctive concept {[I, J], [K, L]}.

To provide some intuition about how this space works, consider the following set of concepts, all represented as closed intervals along a single color dimension:

{blue, red, green, dark-red}

Suppose further that all these concepts are disjoint (e.g. blue ∩ red = ∅), except red and dark-red, where dark-red ⊂ red. Here are some examples of meets (∧) and joins (∨) in this space:

blue ∨ red = undefined
blue ∧ red = {blue, red}
red ∨ dark-red = dark-red
red ∧ dark-red = red
{red, blue} ∨ dark-red = dark-red
{red, blue} ∧ dark-red = {red, blue}
{red, blue} ∧ green = {red, blue, green}

Note that the maximum-length condition is not being applied in any of these cases, since it does not apply to the disjunctive space.

In Section 3.3 below we define a concept as a conjunction of properties, where, for the conceptual space based on intervals, a property is a closed interval along some feature dimension. This definition can naturally be extended to the disjunctive space so that disjunctive concepts are conjunctions of sets of properties.
Negated Concepts
Negations of concepts can be formed in the obvious way through taking set complements of specified feature values. For example, consider a concept C which has the value [c_1, c_2] along some feature dimension; the concept ¬C will have the value (−∞, c_1) ∪ (c_2, +∞). We can also introduce this additional "negated" space into the partial order, simply through the subset relation. Examples of relationships in this partial order (⊆), assuming the color intervals used above, include:

red ⊆ ¬blue, ¬red ⊆ ¬dark-red, ¬(red or blue) ⊆ ¬red.

However, again we need to be careful when taking meets in this new space: consider two concepts I and J that are far apart; in the original space of maximum-length intervals the convex hull would be undefined, and hence the meet would be the universal concept. But with the addition of the negated space below the original space, there are now many lower bounds for I and J. For example, there are many ways for red and blue to now meet in the negated space, such as not-green, not-yellow, and so on, but none of these is a greatest lower bound. Again the solution is to not allow the initial concept discovery process to operate in this negated space, which again makes intuitive sense: it is only after we have found some "positive" concepts that we can start to consider negative ones.

As a special case of the closed-intervals conceptual space, we imposed a length restriction on the intervals so that finding meets could lead to some abstraction taking place. A further restriction would be to only allow intervals from a partition of the real number line, as described in Section 2.1. The partitions could be different for each feature dimension, but the key point is that there are now only a finite number of values for each feature. In fact, this corresponds to the discrete feature values case set out in Section 2.

[Figure 7: The conceptual space with equivalence classes as values, and some features removed.]

Figure 7 shows the conceptual lattice in this case. Here the order at the very top of the diagram—relating instances to concepts with all features defined—is determined by the partition function, and the remaining order is given by the anti-chain in which none of the values are related to any other values. [Footnote: Technically the intervals are half-closed in this case.]
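A tiny sketch of the negated values (our own helper names): the complement of a closed interval is a pair of open rays, and membership in ¬C's value is just falling outside the interval:

```python
import math

def negate_interval(lo: float, hi: float):
    """Complement of the closed interval [lo, hi]: the pair of open rays
    (-inf, lo) and (hi, +inf)."""
    return ((-math.inf, lo), (hi, math.inf))

def in_negation(x: float, lo: float, hi: float) -> bool:
    """x lies in the negated value iff it falls strictly outside [lo, hi]."""
    return x < lo or x > hi

red = (0.30, 0.40)
print(negate_interval(*red))    # ((-inf, 0.3), (0.4, inf))
print(in_negation(0.35, *red))  # False: 0.35 is a shade of red
print(in_negation(0.70, *red))  # True: outside red, so in not-red
```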
We have defined a concept's extension as the set of instances to which the concept applies (this will be made more precise in Section 3.4), but what about its intension? As discussed in Section 3.2.1, the efficient descriptions of a concept's extension—obtained via abstracting away whole separable dimensions—are intensional descriptions, so can we provide a definition? First, let's give a name to the values of a conceptual space, for example the arbitrary sets of points in Section 3.2.1 or the closed intervals in Section 3.2.2. And as a reminder, here is how a concept was defined in Section 2:
Definition 17. (Repeat of Defn. 2) Assuming a set of features Feat and a set of values Val, a concept C is a map δ_C : SubFeat → Val, where SubFeat ⊆ Feat.

The set of values is determined by the particular conceptual space in question. Examples of Val we have considered so far include equivalence classes from a partition of the real number line, arbitrary sets of real values, and closed intervals of reals (with a limited maximum length). Intuitively, these values act like properties; to give a concrete example, assuming color can be represented on a single real-valued dimension, dark-red would correspond to a particular closed interval. Hence let's call these values properties:
Definition 18. A conceptual space is made up of a set of feature dimensions {Feat_i}_i, and each dimension has a corresponding set of values Val_i. A value P ∈ Val_i is a property.

Now we can provide an alternative, but equivalent, definition of a concept (or a concept's intension) to the one given above.
Definition 19. Assuming a set of feature dimensions {Feat_i}_i and corresponding sets of values {Val_i}_i, a concept C is a set of properties P(C) = {(P, j) | P ∈ Val_j, j ∈ F(C)}, where F(C) is the set of features that are "on" for concept C.

A concept can now also be thought of as a conjunction of properties. As a special case, a point concept is a conjunction of what we might call point properties.

In Formal Concept Analysis (FCA), a concept is defined as a pair, consisting of an intension and an extension (Ch.3, D&P). However, in FCA the intensions (or what D&P call intents) are sets of binary attributes. Figure 8 gives an example from D&P. The extension (or what D&P call the extent) of an attribute is the set of objects that possess that attribute. For example, the extent of has-no-moon is {Mercury, Venus}.

[Figure 8: A set of attributes for the planets (example from p.65, D&P).]

From this simple mathematical structure, a rich theory of formal concepts emerges, built around the mathematics of partial orders from Section 2. In particular, there are some intimate relationships between the topped ∩-structures that we mentioned briefly in Section 2.4.4, closure operators (which have close connections with the CPOs described in Section 3.4), and so-called Galois connections (p.68, D&P). Developing these relationships for our own concepts theory would be an interesting avenue for further mathematical work.

One key difference in our treatment of concepts is that we don't assume binary (or even discrete) attributes, and do not assume that these are given in advance. The only constraints on properties that we have imposed in this section, and that will be made more precise in Section 3.4, are that a) the properties on each feature dimension are ordered; and b) instances (along a feature dimension) are limiting cases of properties.

Another difference is the relationship between intensions and extensions. In our formalisation, all combinations of available feature values define an extension; for example, {⟨Color, Pink⟩, ⟨Shape, Elephant⟩} would be a perfectly reasonable concept for us, and would have an extension defined by these two feature values (with the other features varying freely). In contrast, in FCA, the extension corresponding to this set of attributes may well be empty, depending on the objects available for making up the extensions (i.e. there may not be any pink elephants in the set of objects).
Gärdenfors' theory of conceptual spaces (Gärdenfors, 2000) shares some similarities with, and has some differences from, the theory being described here, so it is worth considering what those similarities and differences are. First, Gärdenfors emphasises the fact that his theory is a geometric theory, with any conceptual space being defined by a set of dimensions. Gärdenfors calls these dimensions quality dimensions, but they are essentially the same, mathematically at least, as the feature dimensions we have been using. Examples of quality dimensions that Gärdenfors gives include temperature, brightness, weight, and pitch.

A key notion for Gärdenfors is the idea that quality dimensions can be either integral or separable. A set of dimensions is integral if "one cannot assign an object a value on one dimension without giving it a value on another" (p.24, Gärdenfors (2000)). Examples of integral dimensions are {hue, brightness} and {pitch, loudness}. Dimensions which are not integral are separable. In the exposition of our concepts theory we have been assuming the notion of separable dimensions, but note that Gärdenfors' definition is different to that given in, for example, Higgins, Amos, et al. (2018), which is based on invariant transformations of world state.

Another key notion for Gärdenfors is the idea of a natural property, which is defined as a convex region of a domain; a natural concept is then defined as a set of natural properties. We have also defined a base concept above as a set of properties, and the particular case of the closed-interval properties arose from a convexity condition being applied. However, there is no suggestion in our presentation that all concepts are convex; indeed, the disjunctive and negated concepts from Section 3.2.2 are examples that are not. Another area in which our presentation differs from Gärdenfors is in our clear separation between a concept's intension and extension. And finally, Gärdenfors does not consider how the properties in his conceptual spaces could be ordered, whereas partial orders lie at the heart of our formalisation.

The final part of this section brings all the lattice-theoretic ideas together into a more precise definition of a conceptual space, by offering the following proposition, which is the main mathematical proposal of this work.
Proposition 4. A conceptual space of base concepts is a complete partial order (CPO) where the maximal elements of the CPO are instances in representation space, and the bottom element of the CPO is the universal concept.
This subsection first provides the mathematical definitions required to understand the notion of a CPO, and then informally shows how the conceptual spaces we have defined are all examples of CPOs. The following description and definitions are taken from Abramsky and Jung (1994) (A&J) and D&P.

The definition of a CPO relies on the notion of a directed set.

Definition 20. Let S be a non-empty subset of an ordered set P. Then S is said to be directed if, for every pair of elements x, y ∈ S, there exists z ∈ S such that z ∈ {x, y}ᵘ, where {x, y}ᵘ is the set of upper bounds of {x, y} (p.148, D&P).

Simple examples of directed sets are chains, since every pair of elements in a chain is related, and so the maximum of any two elements is an upper bound on the pair. Another example is the set of finite subsets of an arbitrary set (ordered by the subset relation), where an upper bound of any pair of subsets is provided by their union.

CPOs are partial orders in which each directed subset has a join.

Definition 21. We say that an ordered set P is a CPO if (i) P has a bottom element ⊥, and (ii) ⨆D (:= ⋁D) exists for each directed subset D of P. (p.175, D&P)

Note that the join of a directed subset does not have to be in the subset; it just needs to exist in P. If (ii) is satisfied but not (i), i.e. there is no bottom element, then the partial order is often called a directed-complete partial order (DCPO).

A&J (p.15) provide some examples of (D)CPOs, noting that every finite partially ordered set is a DCPO. An instructive example of a partial order that is not a DCPO is the set of natural numbers with the usual order. This set is directed, since it is a chain, and every finite directed subset has a join (the maximum element), but the whole set itself, for example, does not have a join.
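As a quick illustration of Definition 20, the following sketch (ours, with invented element names) checks directedness in a flat order of the kind used in Section 3.4.1 below:

    from itertools import combinations
    from typing import Callable, Set

    def is_directed(subset: Set[str], leq: Callable[[str, str], bool]) -> bool:
        """Definition 20: every pair in `subset` has an upper bound within it."""
        if not subset:
            return False                      # directed sets are non-empty
        return all(
            any(leq(x, z) and leq(y, z) for z in subset)
            for x, y in combinations(subset, 2)
        )

    # The flat conceptual CPO: a bottom element (the universal concept) below
    # an anti-chain of instances; element names are invented for illustration.
    leq = lambda a, b: a == b or a == "universal"

    assert is_directed({"universal", "i1"}, leq)   # any chain is directed
    assert not is_directed({"i1", "i2"}, leq)      # an anti-chain pair has no upper bound
    # Every finite poset is a DCPO (A&J), so this finite flat order, having
    # a bottom element, is a CPO: each directed subset here has a join.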
There is an alternative formulation of CPOs in terms of chains.

Definition 22. Let P be an ordered set. Then P is a CPO if and only if each chain has a least upper bound in P. (p.176, D&P)

This formulation will provide some useful intuition in the context of our conceptual CPOs. It also explains why every CPO has a bottom element: since each chain in a CPO has a join, there must be a join for the empty chain (which is the empty set), and hence there must be a bottom element, since the join of the empty set is ⊥ (p.179, D&P). We'd like to commit to the existence of ⊥ in our conceptual spaces, since this corresponds to what we are calling the universal concept, the concept that applies to all instances, and every conceptual space should have one.

A CPO P has at least one maximal element, i.e. Max P ≠ ∅ (p.229, D&P; p.73, Priestley (2002)).

Definition 23. Let P be an ordered set, and let Q ⊆ P. Then a ∈ Q is a maximal element of Q if a ≤ x and x ∈ Q imply a = x. We denote the set of maximal elements of Q by Max Q. (p.16, D&P)

In our case, if the conceptual space is P, then Max P is the set of instances (Prop. 4). What does it mean for these elements to be maximal? Intuitively, these are the elements that sit at the top of the feature (semi-)lattice, with any other element in the lattice an abstraction and/or grouping of these instances, as determined by the partial order. Hence instances are fully-specified feature-value pairs, in the sense that every feature has a value, and every value is maximally informative.

As promised, we can now formally define a concept's extension.
Definition 24. The extension of a concept C in a conceptual CPO P is the set of instances {x ∈ Max P | x ≥ C}.

Section 3.4.1 presents our example conceptual spaces in terms of CPOs, informally demonstrating in each case how the partial order results in a CPO with the instances as maximal elements. Some of this presentation is repeated from earlier. We will start with the set of instances, and consider minimal ways in which we can form CPOs by a) abstracting away from those instances by removing features, and b) grouping the real values into sets of various kinds.
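Definitions 23 and 24 are easy to operationalise on a finite poset; the following sketch (ours, with invented element names) computes Max P and a concept's extension:

    from typing import Callable, Dict, Set

    def maximal(P: Set[str], leq: Callable[[str, str], bool]) -> Set[str]:
        """Max P: elements with nothing strictly above them (Definition 23)."""
        return {a for a in P if all(a == x or not leq(a, x) for x in P)}

    def extension(C: str, P: Set[str], leq: Callable[[str, str], bool]) -> Set[str]:
        """Definition 24: the instances (maximal elements) lying above C."""
        return {x for x in maximal(P, leq) if leq(C, x)}

    # A toy conceptual CPO (names invented): bottom = universal concept,
    # one intermediate concept "red", and three instances.
    above: Dict[str, Set[str]] = {
        "universal": {"red", "i1", "i2", "i3"},
        "red": {"i1", "i2"},
    }
    leq = lambda a, b: a == b or b in above.get(a, set())

    P = {"universal", "red", "i1", "i2", "i3"}
    assert maximal(P, leq) == {"i1", "i2", "i3"}
    assert extension("red", P, leq) == {"i1", "i2"}
    assert extension("universal", P, leq) == {"i1", "i2", "i3"}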
The Universal Concept
Note that the set of instances themselves does not form a CPO, since we must have a bottom element. Hence perhaps the most trivial conceptual CPO is the one containing only the instances and the universal concept, shown in Figure 9, which involves abstracting completely along the feature axis, by removing all features. This results in what is known as a flat ordered set, in which each instance is only related to the bottom element (the universal concept). Note that the instances themselves are not ordered relative to each other, and hence form an anti-chain.

The Conceptual CPO of Point Concepts
Now let's perform an abstraction as above, but this time removing just some of the features. Concepts are now single points in a subspace of the representation space, and removing features amounts to taking a projection in the representation space. The partial order is given by the subset relation over the feature-value pairs. Hence different concepts with the same set of features will form an anti-chain (since in this conceptual CPO there is no way of comparing values). (Note that, since the instances are always the maximal elements according to Prop. 4, they will always form an anti-chain in any conceptual CPO.) Figure 10 (a repeat of an earlier figure) shows this conceptual CPO.
The Conceptual CPO of Equivalence Classes
Now let's group values together, by putting the values on each feature dimension into equivalence classes, as described in Section 3.2.3. In fact, if we consider each feature dimension, and each bucket, separately, then this creates a flat order, like that in Figure 9, but with the bottom element replaced with the bucket. If we also perform abstraction by removing some features, then we have the conceptual CPO in Figure 11 (a repeat of Fig. 7). Here the order at the very top of the diagram (relating instances to concepts with all features defined) is determined by the partition function, and the remaining order is given by the subset relation over feature-value pairs. Note again how any chain in the partial order has a join (the top element of the chain), and how every concept sits below a number of instances.

[Figure 11: The conceptual CPO with equivalence classes as values, and some features removed.]

This conceptual space is a useful one to consider when providing more intuition around the idea of a CPO. What are the directed subsets in this CPO? These are the sets of concepts which are all consistent with each other, i.e. there are no clashes on any of the feature dimensions. Moreover, the subset must be closed when considering upper bounds, in the sense that each pair of concepts in the subset must have an upper bound in the subset (i.e. when following links upwards in the feature lattice, there must be at least one point in the subset where the links cross). And finally, the subset itself must have a join, i.e. a least upper bound in the CPO. The join does not have to be in the subset, but in this case it always will be (it will be the maximal element of the subset) because of the "finite" nature of the CPO.
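Here is a small sketch (ours; the bucket width, feature names, and values are all invented) of how a partition function and feature removal generate this CPO's order:

    import math

    def partition_value(x: float, width: float = 0.5):
        """Map a real value to its (half-closed) bucket [k*width, (k+1)*width)."""
        k = math.floor(x / width)
        return (k * width, (k + 1) * width)

    def abstract(instance: dict, keep: set) -> dict:
        """Form a concept from an instance: bucket each value and drop the
        features outside `keep` (abstraction along the feature axis)."""
        return {f: partition_value(v) for f, v in instance.items() if f in keep}

    def leq(concept: dict, instance: dict) -> bool:
        """Concept ≤ instance iff the instance's values fall in the concept's buckets."""
        return all(lo <= instance[f] < hi for f, (lo, hi) in concept.items())

    i1 = {"color": 0.91, "size": 0.27, "weight": 0.66}   # invented features
    i2 = {"color": 0.93, "size": 0.74, "weight": 0.61}

    c = abstract(i1, keep={"color", "weight"})           # drop 'size'
    assert leq(c, i1) and leq(c, i2)                     # both instances lie above c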
The Conceptual CPO of Closed Real-Valued Intervals
Finally we can group real values into closed intervals, and order those intervals by the inclusion (superset) relation. Note how the point values in the instances are limiting cases of the closed intervals, with a point x ∈ ℝ corresponding to the interval [x, x]. This conceptual space is a special case of the more general space in which real values are grouped into arbitrary sets, where again the point values in the instances are limiting cases, with a point x ∈ ℝ corresponding to the singleton set {x}.

An interesting extension of this theoretical section would be to demonstrate that all our conceptual CPOs are also examples of domains, a key notion in theoretical computer science (Abramsky & Jung, 1994). For example, the trivial CPO in Figure 9 is an example of an algebraic domain. In fact, all but the conceptual CPO of closed intervals are algebraic domains, with the closed-interval case a continuous domain.
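As a minimal illustration of this inclusion order (our own sketch, with invented colour intervals), wider intervals sit lower and points [x, x] sit at the top:

    def interval_leq(c, d):
        """c ≤ d iff c's interval contains d's: a wider interval is more
        general and therefore lower in the order."""
        (a, b), (a2, b2) = c, d
        return a <= a2 and b2 <= b

    point = lambda x: (x, x)   # a point value as the limiting case [x, x]

    red = (2.0, 4.0)           # invented colour interval
    dark_red = (3.0, 4.0)

    assert interval_leq(red, dark_red)     # red is below (more general than) dark-red
    assert interval_leq(red, point(3.5))   # instances sit at the top of the order
    assert not interval_leq(dark_red, red)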
Probabilities

There are a number of ways in which probabilities can feature in the formalisation, all depending on how the probabilities are interpreted. First, there is the posterior distribution from, for example, a β-VAE. Since the instance space is assumed to come from a representation learning method such as β-VAE, we need an account of how those distributions propagate through the conceptual spaces. Second, there can be statistical correlations across feature dimensions; to give a concrete example, the color of an apple may not be independent of its taste (Gärdenfors, 2000). And finally, there is a potential use for distributions in injecting some "fuzziness" into the conceptual spaces described so far, which have all assumed hard boundaries, or binary membership conditions, for a concept's properties. In this document we will focus on the first case, leaving an account of the potential use of probabilities for fuzziness for future work. (An interesting, and largely uninvestigated, question arises in the last case, which is how to order distributions (van de Wetering, 2017) so that they can form a lattice structure.)
Given an input x, for example an image or a video, a β-VAE defines a posterior distribution p(z|x) over instances z, where the distribution is constrained to be a multivariate Gaussian with a diagonal covariance matrix. (We are following the standard convention of graphical models in using x to denote the input and z the hidden variable, which means that an instance is now denoted z, and the instance space 𝒵, rather than x as before.) How should we interpret this posterior? It's the uncertainty associated with the instance space, given this particular input. Note that we assume there is one true underlying z which generated x, but the stochastic nature of the VAE means we can't be sure which one.

This raises a challenge for concept discovery, and more specifically for finding meets, since so far we've been assuming that the instances are provided to us deterministically. But in the probabilistic setting we're given two (or more generally a set of) inputs, and we'd like to find some concepts which apply to these inputs, by finding meets. However, we don't know for sure what the underlying instances are, since we only have a posterior over the instance space for each input. Hence we need a procedure for finding least general concepts which apply to some inputs with high probability, what we might call probabilistic meets, and a way of propagating probabilities through hierarchical conceptual spaces.

The instance space now has a set of conditional probabilities associated with it (one for each input, p(z|x)), and so each instance now comes with some uncertainty associated with it. Since the covariance matrix of p(z|x) is assumed to be diagonal, the feature dimensions of z can be treated independently, given the input, for example when calculating the probability of a concept, or finding the most probable concept. So when discussing p(C|x), for example, for some concept C, this can either mean the full concept, or C along one dimension (i.e. the probability of a property). Which option is being used should be clear from the context.

How are the probabilities over instances propagated into the conceptual space? If we interpret p(C|x) as the probability that C is true given the input (i.e. that C applies to x), then probabilities of concepts are given by weighted sums over instances, with the probabilities increasing monotonically with increases in the size of a concept's extension. Let x be an input, z an instance, C a concept, and Z(C) the extension of concept C; then:

    p(C|x) = ∫_{z ∈ 𝒵} p(z|x) p(C|z) dz      (1)
           = ∫_{z ∈ Z(C)} p(z|x) dz.         (2)

Equation (1) follows from the rules of probability and the fact that C is independent of x given z. (2) follows from (1) since all our conceptual spaces considered so far have been deterministic, in the sense that an instance z is either in the extension of a concept C or not, with probability 1 or 0.

If we denote the universal concept by 𝒰, then p(𝒰|x) = p(𝒰|z) = 1. We also have that, for a conceptual space with a partial order ⊑, if concept C₁ ⊏ C₂, then p(C₁|x) ≥ p(C₂|x) and p(C₁|z) ≥ p(C₂|z).
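Since the posterior factorises over dimensions, Equation (2) reduces on each dimension to the Gaussian mass inside a property's interval. The following sketch (ours, with invented numbers) computes p(C|x) for two nested colour properties:

    from math import erf, sqrt

    def gauss_cdf(x: float, mu: float, sigma: float) -> float:
        """Standard closed form for the Gaussian CDF."""
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    def prob_property(interval, mu, sigma):
        """Equation (2) on one dimension: p(C|x) is the mass the Gaussian
        posterior p(z|x) places on the property's interval."""
        lo, hi = interval
        return gauss_cdf(hi, mu, sigma) - gauss_cdf(lo, mu, sigma)

    # invented posterior for one colour dimension, and two properties
    mu, sigma = 3.4, 0.2
    red, dark_red = (2.0, 4.0), (3.0, 4.0)     # dark-red ⊂ red

    p_red = prob_property(red, mu, sigma)
    p_dark_red = prob_property(dark_red, mu, sigma)
    assert p_red >= p_dark_red                 # monotonicity in extension size
    print(f"p(red|x) = {p_red:.4f}, p(dark-red|x) = {p_dark_red:.4f}")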
As noted above, so far we have been assuming that, for the purposes of concept discovery from instances, the instances are provided with certainty, and then finding the meet, which could form part of a larger concept discovery process, is a deterministic operation which returns the least general concept consistent with the input instances. But what if there is a posterior distribution over the instances? This section outlines a Bayesian framework for thinking about finding meets probabilistically, assuming the conceptual space where the properties are real-valued closed intervals (with a maximal length).

Given two inputs x₁, x₂ (e.g. two images or videos), and an instance space 𝒵, finding a probabilistic meet means informally finding a meet which is highly probable according to the posteriors p(z|x₁) and p(z|x₂). Let's continue with our interpretation of the C random variable, where C has the value T if concept C holds, and F otherwise. As above, C can be considered to refer to a single feature dimension, and have the value T if a particular property holds for that dimension. Now the goal is to, intuitively, find the least general property that has the highest probability of holding for both inputs. First of all, let's frame this problem as simply finding the most probable concept (i.e. the concept that is most likely to apply to both inputs). Since the feature dimensions are conditionally independent (given the input), the optimisation can be carried out separately for each dimension.

[Figure 12: Graphical model showing how C relates to both inputs (on the left), and multiple inputs (on the right).]

Let 𝒞 be the conceptual space, Z(C) the extension of concept C, 𝒞_max the set of maximum-length property intervals in 𝒞, and assume C has the value T from (4) onwards; then:

    C_opt = argmax_{C ∈ 𝒞\𝒰} p(C = T | x₁, x₂)                                  (3)
          = argmax_{C ∈ 𝒞\𝒰} ∫_{z₁, z₂} p(C, z₁, z₂ | x₁, x₂)                   (4)
          = argmax_{C ∈ 𝒞\𝒰} ∫_{z₁, z₂} p(z₁|x₁) p(z₂|x₂) p(C|z₁, z₂)           (5)
          = argmax_{C ∈ 𝒞_max} ∫_{z₁, z₂} p(z₁|x₁) p(z₂|x₂) p(C|z₁, z₂)         (6)
          = argmax_{C ∈ 𝒞_max} ∫_{z₁ ∈ Z(C)} p(z₁|x₁) ∫_{z₂ ∈ Z(C)} p(z₂|x₂)    (7)

First note that the maximisation is over 𝒞\𝒰, since the universal concept is always the most probable (with probability 1) and we don't necessarily want to return that. Since we're performing the optimisation for each feature dimension independently, 𝒰 here is with respect to just one dimension, what we might call the universal property. Line (4) follows from (3) by the rules of probability. (5) follows from (4) by the rules of probability and the conditional independence assumptions implied by Figure 12. (6) follows from (5) since the probability of a concept being true increases monotonically with the size of the concept's extension, and 𝒞_max is the set of concepts with the largest extensions (ignoring 𝒰).
And finally, (7) follows from (6) since the conditional probability of a concept being true is 0 outside of the concept's extension (hence the integrals over Z(C)) and is 1 within the extension (hence the dropping of p(C|z₁, z₂)).

Figure 13 gives some intuition behind the final equation above, showing how a maximum-length property interval can capture more or less of the probability mass from the two posteriors as it moves along the feature dimension.

[Figure 13: Moving the maximum-length interval along the x-axis captures more or less of the probability mass from the two curves.]

One problem with this approach is that no abstraction will occur, since we are guaranteed a non-zero probability for each dimension (the Gaussians corresponding to the two inputs will overlap to at least some degree on every dimension). An obvious solution is to only return a property for a dimension when the probability of that property holding for both inputs is above some threshold.

Another problem is that it always returns a maximum-length interval, whereas in practice we may want to return a smaller property. For example, consider a case where the two Gaussians are relatively peaked and the means close together; in this case we'd like to return a much smaller property interval, since this will still contain most of the probability mass from both Gaussians. Here a natural solution would be to find the smallest property interval with a probability above some threshold. (Whether there is guaranteed to be a unique smallest interval whose probability is above some threshold is left as an exercise for future work.) Since we're now searching for the smallest such interval, what we have called a "probabilistic meet", this is similar to finding the meet in the deterministic setting.

Figure 14 shows how a thresholded "probabilistic meet" might look for different pairs of Gaussians, assuming a relatively high threshold (say 90%). The maximum length for an interval is denoted 2ε.

[Figure 14: Minimum-threshold probabilistic meets for different Gaussian pairs, where 2ε is the maximum length for an interval.]

Example a) is a case where the means are similar and the variances are small, resulting in a short interval as the meet. Example b) has the two means relatively far apart, but still within the 2ε maximum range, and so the majority of the mass can be captured by a single interval. Example c) has both means close together, but one of the Gaussians has a large variance, and so there is no interval of length less than 2ε which captures enough of the mass; in this case there is no meet. Example d) has one of the curves with a relatively large variance, but not so large that most of the mass cannot be captured in a single interval (since the means are also relatively close). And finally, e) is a case where the two variances are relatively small, but the means are too far apart to give a high-probability meet.

A necessary extension is to consider the case where there are more than two inputs (the right side of Figure 12). This is potentially non-trivial, however, since, for the properties-as-intervals conceptual space, the meet of a set of instances only depends on the outermost instances (because of the convexity condition). Hence this extension is left for future work.

A further extension of the two approaches considered so far would be to treat the concept C probabilistically as well, given the two instances. That way it may be possible to have the most probable concept correspond to a reasonable probabilistic meet, but it would require a suitable prior over C, which we also leave for future work.
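To make Equation (7) and the thresholded variant concrete, here is a brute-force sketch (our own; the posteriors, ε, threshold, and step size are all invented, and a real system would use a smarter search) that scans interval placements on one dimension:

    from math import erf, sqrt

    def gauss_cdf(x, mu, sigma):
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

    def mass(lo, hi, mu, sigma):
        """Posterior mass a Gaussian places on the interval [lo, hi]."""
        return gauss_cdf(hi, mu, sigma) - gauss_cdf(lo, mu, sigma)

    def best_max_length_interval(post1, post2, eps=0.5, step=0.01):
        """Equation (7): slide a maximum-length (2*eps) interval along the
        dimension, maximising the product of the two posterior masses."""
        (m1, s1), (m2, s2) = post1, post2
        lo, hi = min(m1, m2) - 2 * eps, max(m1, m2) + 2 * eps
        best, best_score, a = None, -1.0, lo
        while a + 2 * eps <= hi:
            score = mass(a, a + 2 * eps, m1, s1) * mass(a, a + 2 * eps, m2, s2)
            if score > best_score:
                best, best_score = (a, a + 2 * eps), score
            a += step
        return best, best_score

    def probabilistic_meet(post1, post2, eps=0.5, threshold=0.9, step=0.01):
        """Smallest interval of length <= 2*eps whose mass under *both*
        posteriors exceeds the threshold; None when no such interval exists."""
        (m1, s1), (m2, s2) = post1, post2
        lo, hi = min(m1, m2) - 2 * eps, max(m1, m2) + 2 * eps
        length = step
        while length <= 2 * eps:
            a = lo
            while a + length <= hi:
                if (mass(a, a + length, m1, s1) > threshold and
                        mass(a, a + length, m2, s2) > threshold):
                    return (a, a + length)    # first hit at the shortest length
                a += step
            length += step
        return None

    # invented one-dimensional posteriors (mu, sigma)
    print(best_max_length_interval((0.0, 0.05), (0.1, 0.05)))
    print(probabilistic_meet((0.0, 0.05), (0.1, 0.05)))   # short interval (cf. case a)
    print(probabilistic_meet((0.0, 0.05), (3.0, 0.05)))   # None: means too far apart (case e)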
Conclusion

A key assumption in this work is that the representation learning algorithm provides separable feature dimensions (or more generally separable subspaces) so that whole dimensions can be dropped in the process of abstraction. Whether learning methods such as β-VAE have the appropriate biases to provide suitable feature dimensions is a question that requires substantial experimentation, continuing work such as that in Higgins, Sonnerat, et al. (2018).

Furthermore, once we have the separable feature dimensions, the process of concept discovery partly involves searching for sets of common properties across instances, which we have formalised as finding the meet of a set of instances. Continuing with our ongoing example, if a number of images from a platform video game all contain a heavy black ball, with the position and size of the ball varying, then removing the position and size features, but retaining weight, color and shape, could lead to the concept of a cannonball (assuming this particular feature set). However, it is likely that the weight, mass and color will vary slightly across these instances, and so it is not sufficient to search for identical real values. Hence for the real-valued feature spaces we have been considering, we had to insert additional structure into the feature lattices by grouping values into sets of reals, which then became the properties of the feature space. Whether a representation learning algorithm could induce this additional structure, as well as the separable feature dimensions, is also a question for future work.

Another place where our theorising in this report has raised interesting practical questions is in the interplay between conceptual spaces and probability distributions, especially the posteriors from a β-VAE. Section 4 set out some initial ideas in this direction, but there is still much to be done, especially on the potential role of probabilities in injecting some "fuzziness" into the conceptual system. Note that the two conceptual spaces that have featured most prominently in this report, one which has discrete feature values from a partition of the reals, and one which has values as closed intervals, both have hard property boundaries, in the sense that two concepts can have values arbitrarily close along some feature dimension, but still have different property values for that feature. Whether this is a desirable feature of the space of base concepts, given the apparent "fuzziness" of the human conceptual system (Laurence & Margolis, 1999), is open to debate.

The key mathematical insight from this work is encapsulated in Proposition 4 from Section 3.4, repeated here:

Proposition 5. (Repeat of Prop. 4) A conceptual space of base concepts is a complete partial order (CPO) where the maximal elements of the CPO are instances in representation space, and the bottom element of the CPO is the universal concept.
A CPO beautifully captures the idea that instances are maximally informative concepts, with the least informative concept, the Universal Concept, sitting at the bottom of the CPO. It also neatly formalises the way in which abstraction over instances, by dropping whole feature dimensions, leads to concepts which lie below the relevant instances in the partial order, with the instances above a concept forming the concept's extension. Grouping of real values into property sets can also be easily accommodated in the partial order. A further advantage of the CPO is that the lattice structure comes with meets and joins which provide a natural mechanism for base concept combination. In practice, more complex mechanisms will be required for concept discovery and creation beyond base concepts. The hope is that this report has provided a sound theoretical basis on which to carry out such work.
Acknowledgements
Thanks to Richard Evans and the DeepMind Concepts team for useful feedback.
References
Abramsky, S., & Jung, A. (1994). Domain theory. In S. Abramsky, D. M. Gabbay, & T. S. E. Maibaum (Eds.), Handbook of logic in computer science (Vol. 3). Oxford: Clarendon Press.

Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 2670–2676). Morgan Kaufmann Publishers Inc.

Cantwell Smith, B. (2019). The promise of artificial intelligence: Reckoning and judgment. The MIT Press.

Carpenter, B. (1992). The logic of typed feature structures. Cambridge University Press.

Davey, B. A., & Priestley, H. A. (2002). Introduction to lattices and order (Second ed.). Cambridge University Press.

Ganter, B., & Obiedkov, S. (2016). Conceptual exploration. Springer.

Gärdenfors, P. (2000). Conceptual spaces: The geometry of thought. The MIT Press.

Gärdenfors, P. (2014). The geometry of meaning. The MIT Press.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. The MIT Press.

Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42, 335–346.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., & Lerchner, A. (2018). Towards a definition of disentangled representations. arXiv:1812.02230.

Higgins, I., Matthey, L., Pal, A., Burgess, C. P., Glorot, X., Botvinick, M., ... Lerchner, A. (2017). β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of ICLR 2017.

Higgins, I., Sonnerat, N., Matthey, L., Pal, A., Burgess, C. P., Bošnjak, M., ... Lerchner, A. (2018). SCAN: Learning hierarchical compositional visual concepts. In Proceedings of ICLR 2018.

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the international conference on learning representations (ICLR 2014).

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Lakoff, G., & Johnson, M. (1980). Metaphors we live by. The University of Chicago Press.

Laurence, S., & Margolis, E. (1999). Concepts and cognitive science. In E. Margolis & S. Laurence (Eds.), Concepts: Core readings. The MIT Press.

Margolis, E., & Laurence, S. (Eds.). (1999). Concepts: Core readings. The MIT Press.

Margolis, E., & Laurence, S. (Eds.). (2015). The conceptual mind: New directions in the study of concepts. The MIT Press.

Margolis, E., & Laurence, S. (2019). Concepts. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Summer 2019 ed.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/sum2019/entries/concepts/.

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Murphy, G. L. (2002). The big book of concepts. The MIT Press.

Partee, B. H., ter Meulen, A., & Wall, R. E. (1990). Mathematical methods in linguistics. Kluwer Academic Publishers.

Priestley, H. (2002). Ordered sets and complete lattices. In R. Backhouse, R. Crole, & J. Gibbons (Eds.), Algebraic and coalgebraic methods in the mathematics of program construction. Berlin, Heidelberg: Springer.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st international conference on machine learning (pp. 1278–1286).

Russell, S., & Norvig, P. (2003). Artificial intelligence: A modern approach (second edition). Prentice Hall.

van de Wetering, J. (2017).