Optimal Collusion-Free Teaching∗

David Kirkpatrick (kirk@cs.ubc.ca), Department of Computer Science, University of British Columbia
Hans U. Simon (hans.simon@rub.de), Department of Mathematics, Ruhr-University Bochum
Sandra Zilles (zilles@cs.uregina.ca), Department of Computer Science, University of Regina
Abstract
Formal models of learning from teachers need to respect certain criteria to avoid collusion. The most commonly accepted notion of collusion-freeness was proposed by Goldman and Mathias (1996), and various teaching models obeying their criterion have been studied. For each model M and each concept class C, a parameter M-TD(C) refers to the teaching dimension of concept class C in model M, defined to be the number of examples required for teaching a concept, in the worst case over all concepts in C.

This paper introduces a new model of teaching, called no-clash teaching, together with the corresponding parameter NCTD(C). No-clash teaching is provably optimal in the strong sense that, given any concept class C and any model M obeying Goldman and Mathias's collusion-freeness criterion, one obtains NCTD(C) ≤ M-TD(C). We also study a corresponding notion NCTD+ for the case of learning from positive data only, establish useful bounds on NCTD and NCTD+, and discuss relations of these parameters to the VC-dimension and to sample compression.

In addition to formulating an optimal model of collusion-free teaching, our main results are on the computational complexity of deciding whether NCTD+(C) = k (or NCTD(C) = k) for given C and k. We show some such decision problems to be equivalent to the existence question for certain constrained matchings in bipartite graphs. Our NP-hardness results for the latter are of independent interest in the study of constrained graph matchings.

Keywords: machine teaching, constrained graph matchings, sample compression
1. Introduction
Models of machine learning from carefully chosen examples, i.e., from teachers, have gained increased interest in recent years, due to various application areas, such as robotics (Argall et al., 2009), trustworthy AI (Zhu et al., 2018), and pedagogy (Shafto et al., 2014). Machine teaching is also related to inverse reinforcement learning (Ho et al., 2016), to sample compression (Moran et al., 2015; Doliwa et al., 2014), and to curriculum learning (Bengio et al., 2009). The paper at hand is concerned with abstract notions of teaching, as studied in computational learning theory.

A variety of formal models of teaching have been proposed in the literature, for example, the classical teaching dimension model (Goldman and Kearns, 1995), the optimal teacher model (Balbach, 2008), recursive teaching (Zilles et al., 2011), or preference-based teaching (Gao et al., 2017). In each of these models, a mapping T (the teacher) assigns a finite set T(C) of correctly labelled examples to a concept C in a concept class C in a way that the learner can reconstruct C from T(C). Intuitively, unfair collusion between the teacher and the learner should not be allowed in any formal model of teaching. For example, one would not want the teacher and learner to agree on a total order over the domain and a total order over the concept class and then to simply use the i-th instance in the domain for teaching the i-th concept, irrespective of the actual structure of the concept class.

However, there is no general definition of what constitutes collusion, and of what constitutes desirable or undesirable forms of learning. In this manuscript, we focus on a notion of collusion that was proposed by Goldman and Mathias (1996) and that has been adopted by the majority of teaching models studied in the literature.

∗. This is an extended version of (Kirkpatrick et al., 2019).
In a nutshell, Goldman and Mathias's model demands that (i) the examples in T(C) are labelled consistently with C, and (ii) if the learner correctly identifies C from T(C), then it will also identify C from any superset S of T(C), as long as the sample set S remains consistent with C. In other words, adding more information about C to T(C) will not divert the learner to an incorrect hypothesis.

Most existing abstract models of machine teaching are collusion-free in this sense. Historically, some of these models were designed in order to overcome weaknesses of the previous models. For example, the optimal teacher model by Balbach (2008) is designed to overcome limitations of the classical teaching dimension model, and was likewise superseded by the recursive teaching model (Zilles et al., 2011). The latter again was inapplicable to many interesting infinite concept classes, which gave rise to the model of preference-based teaching (Gao et al., 2017). Each model strictly dominates the previous one in terms of the teaching complexity, i.e., the worst-case number of examples needed for teaching a concept in the underlying concept class C. In this context, one quite natural question has been ignored in the literature to date: what is the smallest teaching complexity that can be achieved under Goldman and Mathias's condition of collusion-freeness? This is exactly the question addressed in this paper.

Our first contribution is the formal definition of a collusion-free teaching model that has, for every concept class C, the provably smallest teaching complexity among all collusion-free teaching models. We call this model no-clash teaching, since its core property, which turns out to be characteristic for collusion-freeness, requires that no pair of concepts are consistent with the union of their teaching sets.
A similar property was used once in the literature in the context of sample compression schemes (Kuzmin and Warmuth, 2007), and dubbed the non-clashing property.

For example, consider a concept class (i.e., set system) C over the instance space {0, 1, 2, 3}, consisting of the four concepts of the form {i, (i+1) mod 4} for 0 ≤ i ≤ 3. Then no-clash teaching is possible by assigning the singleton set {(i, 1)} (interpreted as the information "i belongs to the target concept") as a teaching set to the concept {i, (i+1) mod 4}; no two distinct concepts are consistent with the union of their assigned teaching sets. Thus, in the no-clash setting, each concept in C can be taught with a single example. By comparison, consider the classical teaching dimension model, in which a teaching set for a given concept is required to be inconsistent with all other concepts in the concept class (Goldman and Kearns, 1995). It is not hard to see that, under such constraints, no concept in C can be taught with a single example; a smallest teaching set for concept {i, (i+1) mod 4} would then be {(i, 1), ((i+1) mod 4, 1)}.

We call the worst-case number of examples needed for non-clashing teaching of any concept C in a given concept class C the no-clash teaching dimension of C, abbreviated NCTD(C), and we study a variant NCTD+(C) in which teaching uses only positive examples. In the example above, NCTD = NCTD+ = 1, while the classical teaching dimension is 2.

The value NCTD(C) being the smallest collusion-free teaching complexity parameter of C makes it interesting for several reasons.

(1) NCTD represents the limit of data efficiency in teaching when obeying Goldman and Mathias's notion of collusion-freeness. Therefore the study of NCTD has the potential to further our understanding of how collusion-freeness constrains teaching. It will also help to compare other notions of collusion-freeness (see, e.g., (Zilles et al., 2011)) to that of Goldman and Mathias.

(2) An open question in computational learning theory is whether the VC-dimension (VCD) (Vapnik and Chervonenkis, 1971), which characterizes the sample complexity of learning from randomly chosen examples, also characterizes teaching complexity for some reasonable notion of teaching. Recently, the first strong connections between teaching and VCD were established, culminating in an upper bound on the recursive teaching dimension (RTD) that is quadratic in VCD (Hu et al., 2017), but it remains open whether this bound can be improved to be linear in VCD. Obviously, NCTD is now a much stronger candidate for a linear relationship with VCD than RTD is. In fact, there is no concept class known yet for which NCTD exceeds VCD.

(3) The problem of relating teaching complexity to VCD is connected to the famous open problem of determining whether VCD is an upper bound on the size of the smallest possible sample compression scheme (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995) of a concept class. Some interesting relations between sample compression and teaching have been established for RTD (Moran et al., 2015; Doliwa et al., 2014; Darnstädt et al., 2016). The study of NCTD can potentially strengthen such relations.

In addition, an important contribution of our paper is to link NCTD to the extensively developed theory of constrained graph matching. We show that the question whether NCTD+ = 1 is equivalent to a very natural constrained bipartite matching problem which has apparently not yet been studied in the literature. We proceed by proving that this particular matching problem is NP-hard; this result generalizes to larger values of NCTD+ as well as to NCTD. By comparison, the question whether RTD+ = 1 or RTD = 1 can be answered in linear time.

To sum up, our new notion of optimal collusion-free teaching is of relevance to the study of important open problems in computational learning theory as well as of fundamental graph-theoretic decision problems, and therefore appears to be worth studying in more detail.
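The introductory example can be checked mechanically. The following Python sketch (the helper names are ours, not from the paper) verifies both claims: the singleton assignment described above is non-clashing, while no single example constitutes a classical teaching set for this class.

```python
from itertools import combinations

X = range(4)
concepts = [frozenset({i, (i + 1) % 4}) for i in range(4)]   # {0,1},{1,2},{2,3},{3,0}
T = {C: {(i, 1)} for i, C in enumerate(concepts)}            # teach {i,(i+1) mod 4} via (i,1)

def consistent(C, sample):
    return all((x in C) == bool(label) for x, label in sample)

# No-clash: no two distinct concepts are both consistent with the union of
# their assigned teaching sets.
assert all(not (consistent(C, T[C] | T[D]) and consistent(D, T[C] | T[D]))
           for C, D in combinations(concepts, 2))

# Classical TD: a teaching set must rule out *all* other concepts, and no
# single example does so in this class.
def is_teaching_set(C, sample):
    return consistent(C, sample) and \
           all(not consistent(D, sample) for D in concepts if D != C)

assert not any(is_teaching_set(C, {(x, l)})
               for C in concepts for x in X for l in (0, 1))

# The two positive examples {(i,1), ((i+1) mod 4, 1)} do form a teaching set.
assert all(is_teaching_set(C, {(x, 1) for x in C}) for C in concepts)
```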
2. Preliminaries
Given a domain X, a concept over X is a subset C ⊆ X, and we usually denote by C a concept class over X, i.e., a set of concepts over X. Implicitly, we identify a concept C over X with a mapping C : X → {0, 1}, where C(x) = 1 iff x ∈ C. By VCD(C), we denote the VC-dimension of C.

A labelled example is a pair (x, ℓ) ∈ X × {0, 1}, and it is consistent with a concept C if C(x) = ℓ. Likewise, a set S of labelled examples over X, which is also called a sample set, is consistent with C if every element of S is consistent with C. An example with the label ℓ = 1 is a positive example, while ℓ = 0 is the label of a negative example.

Intuitively, the notion of teaching refers to compressing any concept in a given concept class to a consistent sample set.

Definition 1
Let C be a concept class over a domain X. A teacher mapping for C is a mapping T on C such that, for all C ∈ C, T(C) is a finite sample set S ⊆ X × {0, 1} that is consistent with C.

The classical models of teaching require that the concept C ∈ C be the only concept in C that is consistent with T(C), for any C ∈ C (Shinohara and Miyano, 1991; Goldman and Kearns, 1995). This led to the definition of the well-known teaching dimension parameter.

Definition 2 (Shinohara and Miyano (1991); Goldman and Kearns (1995))
Let C be a concept class over a domain X and C ∈ C be a concept. A teaching set for C (with respect to C) is a sample set S such that C is the only concept in C consistent with S. The teaching dimension of C in C, denoted by TD(C, C), is the size of the smallest teaching set for C with respect to C. The teaching dimension of C is then defined as TD(C) = sup{TD(C, C) | C ∈ C}.

For example, let C be a concept class over a domain X of exactly m elements, containing the empty concept, all singleton concepts over X, and no other concepts. Then TD({x}, C) = 1 for each singleton concept {x}, since {(x, 1)} serves as a teaching set for {x}. By comparison, TD(∅, C) = m, since any set of up to m − 1 negative examples is consistent with some singleton concept, so that all m negative examples need to be presented in order to identify the empty concept. Consequently, TD(C) = sup{TD(C, C) | C ∈ C} = m.

As mentioned in the introduction, various notions of teaching have been proposed in the literature. The one that is most relevant to our work is the model of preference-based teaching. In this model, intuitively, a preference relation on C is used to reduce the size of teaching sets. In particular, a concept C need no longer be the only concept consistent with its teaching set T(C); it suffices if C is the unique most preferred concept in C that is consistent with T(C). In order to avoid cyclic preferences, the preference relation is required to form a partial order over C.

Definition 3 (Gao et al. (2017))
Let C be a concept class over a domain X and ≻ any binary relation that forms a strict (possibly non-total) order over C. We say that concept C is preferred over concept C′ (with respect to ≻) if C ≻ C′. The preference-based teaching dimension of C with respect to C and ≻, denoted by PBTD(C, C, ≻), is the size of the smallest sample set S such that

1. S is consistent with C, and
2. C ≻ C′ for all C′ ∈ C \ {C} such that S is consistent with C′.

We write PBTD(C, ≻) = sup{PBTD(C, C, ≻) | C ∈ C}. Finally, the preference-based teaching dimension of C, denoted by PBTD(C), is defined by

PBTD(C) = min{PBTD(C, ≻) | ≻ ⊆ C × C and ≻ forms a strict order on C}.

An interesting variant of preference-based teaching is obtained when disallowing negative examples in teaching. Learning from positive examples only has been studied extensively in the computational learning theory literature, see, e.g., (Denis, 2001; Angluin, 1980), and is motivated by studies on language acquisition (Wexler and Culicover, 1980) or, more recently, by problems of learning user preferences from a user's interactions with, say, an e-commerce system (Schwab et al., 2000), as well as by problems in bioinformatics (Wang et al., 2006).
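For small finite classes, the parameter TD(C, C) from Definition 2 can be computed by exhaustive search. The sketch below (the function names are ours) reproduces the values from the singleton-class example above:

```python
from itertools import combinations

def consistent(C, sample):
    return all((x in C) == bool(label) for x, label in sample)

def teaching_dim(C, cls, X):
    """TD(C, cls): size of a smallest sample consistent with C and no other concept."""
    examples = [(x, 1 if x in C else 0) for x in X]   # only consistent examples can help
    for k in range(len(examples) + 1):
        for sample in combinations(examples, k):
            if all(not consistent(D, sample) for D in cls if D != C):
                return k

m = 4
X = range(m)
cls = [frozenset()] + [frozenset({x}) for x in X]     # empty concept plus all singletons
assert teaching_dim(frozenset({0}), cls, X) == 1      # {(0, 1)} isolates {0}
assert teaching_dim(frozenset(), cls, X) == m         # all m negative examples needed
```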
Definition 4 (Gao et al. (2017))
Let C be a concept class over a domain X. The positive preference-based teaching dimension of C, denoted by PBTD+(C), is defined analogously to PBTD(C), where the sets S in Definition 3 are required to contain only positive examples.
In the same way, one can define the notion TD+. The following property, proven by Gao et al. (2017), is crucial when computing the PBTD and PBTD+ of finite classes.

Proposition 5 (Gao et al. (2017)) Let C be a finite concept class. If PBTD(C) = d, then C contains some C with TD(C, C) ≤ d. If PBTD+(C) = d, then C contains some C with TD+(C, C) ≤ d.

This result immediately implies that PBTD and the well-known notion of RTD coincide for finite concept classes, and so do PBTD+ and RTD+.
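For finite classes, the RTD (and hence, by the coincidence just stated, the PBTD) can be computed by the standard recursive-peeling procedure behind the canonical teaching plan of Zilles et al. (2011): repeatedly remove all concepts whose teaching dimension within the remaining class is minimal. A brute-force sketch, with helper names of our own choosing:

```python
from itertools import combinations

def consistent(C, sample):
    return all((x in C) == bool(label) for x, label in sample)

def teaching_dim(C, cls, X):
    examples = [(x, 1 if x in C else 0) for x in X]
    for k in range(len(examples) + 1):
        for s in combinations(examples, k):
            if all(not consistent(D, s) for D in cls if D != C):
                return k

def recursive_td(cls, X):
    """Peel off the easiest-to-teach concepts layer by layer; the largest
    layer value encountered is the RTD of the class."""
    remaining, rtd = list(cls), 0
    while remaining:
        dims = {C: teaching_dim(C, remaining, X) for C in remaining}
        layer = min(dims.values())
        rtd = max(rtd, layer)
        remaining = [C for C in remaining if dims[C] > layer]
    return rtd

X = range(4)
cls = [frozenset()] + [frozenset({x}) for x in X]
assert recursive_td(cls, X) == 1      # TD(cls) = 4, but RTD = PBTD = 1
```

This also illustrates the gap between TD and RTD: once the singletons are removed, the empty concept is taught for free within the remaining one-element class.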
3. Collusion-free Teaching and the Non-Clashing Property
While there is no objective measure of how "reasonable" a formal model of teaching is, the literature offers some notions of what constitutes an "acceptable" model of teaching, i.e., one in which the teacher and learner do not collude. So far, the notion of collusion-free teaching that has found the most positive resonance in the literature is the one defined by Goldman and Mathias.
Definition 6 (Goldman and Mathias (1996))
Let C be a concept class over X and T a teacher mapping on C. Let L be a learner mapping that assigns to each set of labelled examples a concept over X. The pair (T, L) is successful on C if L(T(C)) = C for all C ∈ C. The pair (T, L) is collusion-free on C if L(S) = L(T(C)) for any C ∈ C and any set S of labelled examples such that S is consistent with C and S contains T(C).

Intuitively, Goldman and Mathias's definition captures the idea that a learner conjecturing concept C will not change its mind when given additional information consistent with C. For example, teacher-learner pairs following the classical teaching dimension model, Balbach's optimal teacher model, the recursive teaching model, or the preference-based teaching model are always collusion-free according to Definition 6. Of these models, the classical teaching dimension model is the one imposing the most constraints on the mapping T, followed by Balbach's optimal teaching, recursive teaching, and preference-based teaching in that order. Consequently, the "teaching complexity" among these models is lowest for preference-based teaching; if every concept in a concept class C can be taught with at most z examples in any of these models, then every concept in C can be taught with at most z examples in the preference-based model.

One can still argue that the preference-based model is unnecessarily constraining. Preference-based teaching of a concept class C relies on a preference relation that induces a strict order on C. However, this strict order is used by the learner only after the teaching set has been communicated, since the learner chooses the unique most preferred concept among those consistent with the set of examples provided by the teacher.
One might consider loosening the constraints by, for example, demanding only that the set of concepts consistent with any chosen teaching set be ordered under the chosen preference relation (rather than requiring acyclic preferences over the whole concept class). In the same vein, one could relax more conditions; every relaxation might result in a more powerful model of teaching satisfying the collusion-free property.

In this manuscript, we will define the provably most powerful model of teaching that is collusion-free in the sense proposed by Goldman and Mathias (1996), namely a model that adheres to no other constraints on the teacher-learner pairs (T, L) than those given by Goldman and Mathias: (i) T is a teacher mapping; (ii) (T, L) is successful on C; and (iii) (T, L) is collusion-free on C.
1. The RTD, short for "recursive teaching dimension", is a well-studied teaching parameter defined by Zilles et al. (2011).
Definition 7
Let C be a concept class and T be a teacher mapping on C. We say that T is non-clashing (on C) if and only if there are no two distinct C, C′ ∈ C such that both T(C) is consistent with C′ and T(C′) is consistent with C.

It turns out that, for a teacher mapping T, the non-clashing property is equivalent to the existence of a learner mapping L such that (T, L) is successful and collusion-free:

Theorem 8
Let C be a concept class over the instance space X. Let T be a teacher mapping on C. Then the following two conditions are equivalent:

1. T is non-clashing on C.
2. There is a mapping L : 2^(X×{0,1}) → C such that (T, L) is both successful and collusion-free on C.

Proof
First, suppose T is a non-clashing teacher mapping, and define L as follows. Given any set S of labelled examples as input, L checks for the existence of a concept C ∈ C such that T(C) ⊆ S and C is consistent with S. If such a concept C is found, L returns an arbitrary such C; otherwise L returns some default concept in C.

To show that (T, L) is successful and collusion-free, suppose there is some concept C ∈ C such that a given set S of labelled examples is consistent with C and contains T(C). We claim that then such C is uniquely determined. For if there were two distinct concepts C, C′ ∈ C consistent with S such that T(C) ∪ T(C′) ⊆ S, then T(C′), being a subset of S, would be consistent with C and, likewise, T(C) would be consistent with C′, in contradiction to the non-clashing property of T. From the definition of L, it then follows that (T, L) is successful and collusion-free.

Second, suppose T is a teacher mapping and there is a mapping L such that (T, L) is successful and collusion-free, i.e., for all C ∈ C, we have L(S) = L(T(C)) = C whenever S is consistent with C and contains T(C). To see that T is non-clashing, suppose two concepts C, C′ ∈ C are both consistent with T(C) ∪ T(C′). Then C = L(T(C)) = L(T(C) ∪ T(C′)) = L(T(C′)) = C′.

Consequently, teaching with non-clashing teacher mappings is, in terms of the worst-case number of examples required, the most efficient model that obeys Goldman and Mathias's notion of collusion-freeness. We hence define the notion of no-clash teaching dimension as follows.

Definition 9
Let C be a concept class over the instance space X. Let T : C → 2^(X×{0,1}) be a non-clashing teacher mapping. The order of T on C, denoted by ord(T, C), is then defined by ord(T, C) = sup{|T(C)| : C ∈ C}. The No-Clash Teaching Dimension of C, denoted by NCTD(C), is defined as

NCTD(C) = min{ord(T, C) | T is a non-clashing teacher mapping for C}.

From Theorem 8 we obtain that, for every concept class C,

NCTD(C) = min{ord(T, C) | there exists an L s.t. (T, L) is successful and collusion-free on C}.

As in the case of preference-based teaching, it is natural to study a variant of non-clashing teaching that uses positive examples only.

Definition 10
Let C be a concept class over the domain X. A teacher mapping T is called positive on C if T(C) ⊆ X × {1} for all C ∈ C. We then define

NCTD+(C) = min{ord(T, C) | T is a positive non-clashing teacher mapping for C}.

Furthermore, for finite domains X, it will be helpful to have the notion of average no-clash teaching dimension:

Definition 11
Let C be a concept class over the finite domain X. The Average No-Clash Teaching Dimension of C, denoted by ANCTD(C), is defined as

ANCTD(C) = min{ (1/|C|) · Σ_{C∈C} |T(C)| : T is a non-clashing teacher mapping for C }.

Remark 12
It follows immediately from the pigeon-hole principle that NCTD(C) ≥ ⌈ANCTD(C)⌉.

In the following we describe a natural normal form for non-clashing teacher mappings. T′ is said to be an extension of T if T(C) ⊆ T′(C) holds for every C ∈ C. Clearly, if T′ is an extension of T and T is non-clashing, then T′ is non-clashing.

Proposition 13 (a) Let T be a non-clashing teacher mapping for C. Then there is a non-clashing teacher mapping T′ for C such that |T′(C)| = ord(T, C) for all C ∈ C.
(b) Let T be a positive non-clashing teacher mapping for C. Then there is a positive non-clashing teacher mapping T′ for C such that |T′(C)| = min{|C|, ord(T, C)} for all C ∈ C.

While many of our definitions and results apply to both finite and infinite concept classes, except where explicitly stated otherwise, we will hereafter assume that X (and C) are finite.
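Definition 9 suggests a (hopelessly exponential, but instructive) brute-force computation for tiny classes: enumerate candidate teaching-set assignments of bounded size and test the non-clashing condition. A sketch, with helper names of our own choosing:

```python
from itertools import combinations, product

def consistent(C, sample):
    return all((x in C) == bool(label) for x, label in sample)

def nctd(cls, X):
    """Smallest d admitting a non-clashing teacher mapping whose teaching
    sets all have size <= d (exhaustive search; tiny classes only)."""
    concepts = list(cls)
    examples = list(product(X, (0, 1)))
    for d in range(len(examples) + 1):
        # candidate teaching sets per concept: consistent samples of size <= d
        cand = [[set(s) for k in range(d + 1) for s in combinations(examples, k)
                 if consistent(C, s)] for C in concepts]
        for choice in product(*cand):
            ok = all(not (consistent(A, choice[i] | choice[j]) and
                          consistent(B, choice[i] | choice[j]))
                     for (i, A), (j, B) in combinations(list(enumerate(concepts)), 2))
            if ok:
                return d

P2 = [frozenset(S) for S in ([], ['a'], ['b'], ['a', 'b'])]
assert nctd(P2, ['a', 'b']) == 1   # the powerset of a two-element domain
```

The last assertion anticipates a fact established in Section 4 (Remark 18): the powerset over a two-element domain admits a non-clashing teacher mapping of order 1.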
4. Lower Bounds on NCTD and NCTD+

To establish lower bounds on NCTD and NCTD+ for finite concept classes, we first show that NCTD(C) must be at least as large as the smallest d satisfying |C| ≤ 2^d · (|X| choose d). A similar statement then follows for NCTD+. In fact, we prove a slightly stronger result, replacing |X| with a potentially smaller value:

Definition 14
We define X_T ⊆ X as the set of instances that are part of a labelled example in a teaching set T(C) for some C ∈ C. Moreover, we define

X(C) = min{ |X_T| : T is a non-clashing teacher mapping for C with ord(T, C) = NCTD(C) }.

Intuitively, X(C) is the smallest number of instances that must be employed by any optimal non-clashing teacher mapping for C. Likewise, we define X+(C) for positive non-clashing teaching.

Theorem 15
Let C be any concept class.

1. If NCTD(C) = d, then |C| ≤ 2^d · (X(C) choose d).
2. If NCTD+(C) = d, then |C| ≤ Σ_{i=0}^{d} (X+(C) choose i).

Proof To prove statement 1, let X′ be a subset of size X(C) of X. Let C ↦ T(C) ⊆ X′ × {0, 1} be a consistent and non-clashing mapping which witnesses that NCTD(C) = d, and let L be the mapping such that L(T(C)) = C for all C ∈ C. By Proposition 13, one may assume without loss of generality that |T(C)| = d for all C ∈ C. Since T is an injective mapping and there are only 2^d · (X(C) choose d) labelled teaching sets at our disposal, the claim follows.

Statement 2 is proven analogously, taking into consideration that, in the NCTD+ case, we do not have an analogous statement to Proposition 13, since a concept does not in general contain d or more elements. Note that the formula has no factors 2^i since there are no options for labelling the instances in any set T(C).

We will next establish a useful lower bound on NCTD(C), as well as a related lower bound on NCTD+(C), based on the number of neighbors of any concept in C.

A concept C′ ∈ C is a neighbor of concept C ∈ C if it differs from C on exactly one instance, i.e., if the symmetric difference C Δ C′ := (C \ C′) ∪ (C′ \ C) has size one. The degree of C ∈ C, denoted as deg_C(C), is defined as the number of neighbors of C in C. The average degree of concepts in C is then denoted by

deg_avg(C) := (1/|C|) · Σ_{C∈C} deg_C(C).

The dominance of C ∈ C, denoted as dom_C(C), is defined as the number of smaller neighbors of C in C, i.e., neighbors that contain exactly one fewer instance than C.

Theorem 16
Every concept class C over a finite domain satisfies (i) ANCTD(C) ≥ (1/2) · deg_avg(C) and (ii) NCTD(C) ≥ ⌈(1/2) · deg_avg(C)⌉.
For assertion (i), let T be any non-clashing teacher mapping for C. If C₁ and C₂ are neighbors, say C₁ Δ C₂ = {x_i}, then at least one of the sets T(C₁), T(C₂) must contain a labelled example involving x_i. We obtain Σ_{C∈C} |T(C)| ≥ (1/2) · Σ_{C∈C} deg_C(C) = |C| · (1/2) · deg_avg(C). Assertion (ii) is immediate from (i), by Remark 12.

Theorem 17
Every concept class C over a finite domain satisfies NCTD+(C) ≥ max_{C∈C} dom_C(C).
If the smaller neighbor C′ of C ∈ C differs from C on instance x_i, then (x_i, 1) must be used in teaching C. Hence, every C ∈ C must have a positive teaching set of size at least dom_C(C).

Although the lower bounds in Theorems 16 and 17 are not expected to be attained very often, the following remark shows that they are sometimes tight:

Remark 18
Let P_m be the powerset over the domain {x₁, . . . , x_m}. Since every concept in P_m has degree m, clearly deg_avg(P_m) = m. It follows from Theorem 16 that ANCTD(P_m) ≥ m/2 and hence NCTD(P_m) ≥ ⌈m/2⌉. Furthermore, since dom_{P_m}({x₁, . . . , x_m}) = m, it follows from Theorem 17 that NCTD+(P_m) ≥ m. But the positive mapping T that maps S ∈ P_m to S × {1} is trivially non-clashing, and hence NCTD+(P_m) = m and ANCTD(P_m) = m/2. As the mapping T given by

∅ ↦ {(a, 0)}, {a} ↦ {(b, 0)}, {b} ↦ {(b, 1)}, {a, b} ↦ {(a, 1)}

is non-clashing for P₂, it follows that NCTD(P₂) = 1. As we shall see in Theorem 23, this generalizes to NCTD(P_m) = ⌈m/2⌉.

We note that the maximum degree of a concept in C is in general not an upper bound on NCTD(C). For example, if we consider the concept class C consisting of subsets of size k of some domain of size n, then all concepts in C have degree zero yet, for n sufficiently large, NCTD(C) = k (since for large enough n the size of C exceeds the number of possible teaching sets in a normal-form teaching mapping T for C with ord(T, C) < k).
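The claims of Remark 18 for m = 2 can be checked directly. The sketch below (our own helper names; the labels in the mapping are the ones forced by consistency) verifies that the stated mapping for P₂ is a consistent, non-clashing teacher mapping of order 1, attaining the degree bound of Theorem 16:

```python
from itertools import combinations

domain = ['a', 'b']
P2 = [frozenset(S) for S in ([], ['a'], ['b'], ['a', 'b'])]

# The order-1 mapping from Remark 18, with the labels forced by consistency:
T = {P2[0]: {('a', 0)}, P2[1]: {('b', 0)},
     P2[2]: {('b', 1)}, P2[3]: {('a', 1)}}

def consistent(C, sample):
    return all((x in C) == bool(label) for x, label in sample)

assert all(consistent(C, T[C]) for C in P2)        # T is a teacher mapping ...
assert all(not (consistent(C, T[C] | T[D]) and     # ... and it is non-clashing
                consistent(D, T[C] | T[D]))
           for C, D in combinations(P2, 2))

# Theorem 16: every concept of P_2 has degree 2, so NCTD(P_2) >= ceil(2/2) = 1,
# and ord(T) = 1 shows the bound is attained.
def degree(C, cls, X):
    return sum((C ^ frozenset({x})) in cls for x in X)

assert all(degree(C, P2, domain) == 2 for C in P2)
assert max(len(t) for t in T.values()) == 1
```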
5. Sub-additivity of NCTD and NCTD+

In this section, we will show that the NCTD is sub-additive with respect to the free combination of concept classes. As an application of this result, we will determine the NCTD of the powerset over any finite domain X. While the powerset is a rather special concept class, knowing its NCTD will turn out useful to obtain a variety of further results.
Definition 19
Let C₁ and C₂ be concept classes over disjoint domains X₁ and X₂, respectively. Then the free combination C₁ ⊔ C₂ of C₁ and C₂ is a concept class over the domain X₁ ∪ X₂, defined by C₁ ⊔ C₂ = {C₁ ∪ C₂ | C₁ ∈ C₁ and C₂ ∈ C₂}.

Lemma 20
Let C = C₁ ⊔ C₂ be the free combination of C₁ and C₂. Moreover, for i = 1, 2, let T_i be a non-clashing mapping for C_i. Then, for T (on C₁ ⊔ C₂) defined by setting T(C₁ ∪ C₂) = T₁(C₁) ∪ T₂(C₂), we have that T is a non-clashing teacher mapping for C₁ ⊔ C₂. Moreover, as witnessed by T, NCTD acts sub-additively on ⊔, i.e.,

NCTD(C₁ ⊔ C₂) ≤ NCTD(C₁) + NCTD(C₂).   (1)

Proof
Suppose that concepts C₁, C₁′ ∈ C₁ and C₂, C₂′ ∈ C₂ give rise to distinct concepts C₁ ∪ C₂ and C₁′ ∪ C₂′ ∈ C₁ ⊔ C₂ that clash under T. (Without loss of generality we can assume that C₁ ≠ C₁′.) Then C₁′ ∪ C₂′ is consistent with T₁(C₁) ∪ T₂(C₂) and C₁ ∪ C₂ is consistent with T₁(C₁′) ∪ T₂(C₂′). Hence C₁′ is consistent with T₁(C₁) and C₁ is consistent with T₁(C₁′), that is, the distinct concepts C₁ and C₁′ in C₁ clash under the mapping T₁.

As we shall see, NCTD sometimes acts strictly sub-additively on ⊔; in particular, the composition of optimal mappings for C₁ and C₂ is not necessarily an optimal mapping for C₁ ⊔ C₂. In contrast, ANCTD acts additively on ⊔:

Lemma 21
Let C = C₁ ⊔ C₂ be the free combination of C₁ and C₂. Moreover, for i = 1, 2, let T_i be a non-clashing mapping for C_i. Then, for T (on C₁ ⊔ C₂) defined by setting T(C₁ ∪ C₂) = T₁(C₁) ∪ T₂(C₂), we have that T is a non-clashing teacher mapping for C₁ ⊔ C₂. Moreover, as witnessed by T, ANCTD acts additively on ⊔, i.e.,

ANCTD(C₁ ⊔ C₂) = ANCTD(C₁) + ANCTD(C₂).   (2)

Proof The proof of Lemma 20 above shows that ANCTD acts sub-additively on ⊔, that is,

ANCTD(C₁ ⊔ C₂) ≤ ANCTD(C₁) + ANCTD(C₂).

It remains to show that

ANCTD(C₁ ⊔ C₂) ≥ ANCTD(C₁) + ANCTD(C₂).   (3)

To this end, let X₁ (resp. X₂) be the domain of concept class C₁ (resp. C₂) and suppose that T is a non-clashing teacher mapping on C such that ANCTD(C) = (1/|C|) · Σ_{C∈C} |T(C)|. The following calculation makes use of the fact that, for every fixed choice of C₂ ∈ C₂, the mapping C₁ ↦ T(C₁ ∪ C₂) ∩ (X₁ × {0, 1}) is a non-clashing teacher mapping on C₁ (and an analogous remark holds, for reasons of symmetry, when the roles of C₁ and C₂ are exchanged):

ANCTD(C) = (1/(|C₁| · |C₂|)) · Σ_{C₁∈C₁} Σ_{C₂∈C₂} |T(C₁ ∪ C₂)|
         = (1/|C₂|) · Σ_{C₂∈C₂} (1/|C₁|) · Σ_{C₁∈C₁} |T(C₁ ∪ C₂) ∩ (X₁ × {0, 1})|
           + (1/|C₁|) · Σ_{C₁∈C₁} (1/|C₂|) · Σ_{C₂∈C₂} |T(C₁ ∪ C₂) ∩ (X₂ × {0, 1})|
         ≥ ANCTD(C₁) + ANCTD(C₂).

Remark 22
In Lemma 20, if T₁ and T₂ are positive non-clashing mappings, then the same proof shows that T (a positive non-clashing mapping) witnesses the fact that NCTD+ also acts sub-additively on ⊔, i.e.,

NCTD+(C₁ ⊔ C₂) ≤ NCTD+(C₁) + NCTD+(C₂).   (4)

Furthermore, since ⊔ is associative, it follows immediately that, for any concept class C, if C^k := C₁ ⊔ . . . ⊔ C_k, where C_i := {C × {i} | C ∈ C} for i = 1, . . . , k, then

ANCTD(C^k) = k · ANCTD(C)   (5)

and

NCTD(C^k) ≤ k · NCTD(C) and NCTD+(C^k) ≤ k · NCTD+(C).   (6)

We have already seen, in Remark 18, that ANCTD(P_m) = m/2, NCTD+(P_m) = m and NCTD(P_m) ≥ ⌈m/2⌉, where P_m denotes the powerset over the domain {x₁, . . . , x_m}. The sub-additivity results above can be applied in order to determine NCTD(P_m) exactly as well.

Theorem 23
Let P_m be the powerset over the domain {x₁, . . . , x_m}. Then NCTD(P_m) = ⌈m/2⌉.

Proof It remains to show that NCTD(P_m) ≤ ⌈m/2⌉. It suffices to verify this upper bound for even m. But, when m is even, NCTD(P_m) = NCTD(P₂^(m/2)) ≤ m/2 follows from (6) and the fact that NCTD(P₂) = 1 (cf. Remark 18).

Since the NCTD of any concept class over a domain X is trivially upper bounded by the NCTD of the powerset over X, this result in particular implies that ⌈|X|/2⌉ is an upper bound on the NCTD of any concept class over a domain X.

A further consequence of Theorem 23 is that NCTD is sometimes strictly sub-additive with respect to free combination, i.e., that inequality (1) is sometimes strict. An example for that is the free combination P_m ⊔ P_m of two copies of P_m for odd m. Since the domain of P_m ⊔ P_m has size 2m, we obtain NCTD(P_m ⊔ P_m) = m, while NCTD(P_m) + NCTD(P_m) = 2⌈m/2⌉ = 2 · (m+1)/2 = m + 1.

Another situation (that we will exploit later) where NCTD+ acts strictly additively on ⊔ is captured in the following:

Lemma 24
Let P_m be the powerset over the domain {x_1, ..., x_m} and let C be a concept class with domain X disjoint from {x_1, ..., x_m}. Then, NCTD+(P_m ⊔ C) = NCTD+(P_m) + NCTD+(C).

Proof
By (4) it suffices to show that
NCTD+(P_m ⊔ C) ≥ NCTD+(P_m) + NCTD+(C). Theorem 17 implies that, for each C_i ∈ C, any positive non-clashing mapping T for P_m ⊔ C must use m = NCTD+(P_m) examples from {x_1, ..., x_m} to teach the single concept {x_1, ..., x_m} ∪ C_i within the concept class P_m ⊔ {C_i}. So the only way that T could use fewer than m + NCTD+(C) examples in total for each concept of the form {x_1, ..., x_m} ∪ C_i is if each such concept is taught with exactly m examples from {x_1, ..., x_m}, and hence fewer than NCTD+(C) examples from X, a contradiction.

Furthermore, it is easily seen that the average degree acts additively on ⊔:

Lemma 25
Let C_1 and C_2 be concept classes over disjoint and finite domains. Then the following holds:

deg_avg(C_1 ⊔ C_2) = deg_avg(C_1) + deg_avg(C_2). (7)

Proof
Let C := C_1 ⊔ C_2. The concepts in C that are neighbors of C_1 ∪ C_2 ∈ C are precisely the concepts of the form C_1 ∪ C_2′ or C_1′ ∪ C_2, where C_1′ is a neighbor of C_1 in C_1 and C_2′ is a neighbor of C_2 in C_2. Hence deg_C(C_1 ∪ C_2) = deg_{C_1}(C_1) + deg_{C_2}(C_2). Moreover, |C| = |C_1| · |C_2|. It follows that

Σ_{C ∈ C} deg_C(C) = Σ_{C_1 ∈ C_1} Σ_{C_2 ∈ C_2} deg_C(C_1 ∪ C_2) = |C_2| · Σ_{C_1 ∈ C_1} deg_{C_1}(C_1) + |C_1| · Σ_{C_2 ∈ C_2} deg_{C_2}(C_2).

Division by |C_1| · |C_2| immediately yields (7).

The free combination of classes with a tight degree lower bound is again a class with a tight degree lower bound:

Corollary 26
Let C_1 and C_2 be two concept classes over disjoint and finite domains, and let C = C_1 ⊔ C_2. Then NCTD(C_i) = 1/2 · deg_avg(C_i) for i = 1, 2 implies that NCTD(C) = 1/2 · deg_avg(C).

Proof
The assertion is evident from the chain of inequalities:
NCTD(C) ≤ NCTD(C_1) + NCTD(C_2) = 1/2 · deg_avg(C_1) + 1/2 · deg_avg(C_2) = 1/2 · deg_avg(C), where the inequality is (1), the first equality is the hypothesis, and the second equality is (7); combined with the lower bound NCTD(C) ≥ 1/2 · deg_avg(C) from Theorem 16, this yields the assertion.
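Lemma 25 and Corollary 26 are easy to check mechanically on small classes. The sketch below assumes, as in the definition of deg_avg, that two concepts are neighbors iff they differ on exactly one instance (the one-inclusion graph); the helper names deg_avg and free_combination are ours.

```python
from itertools import combinations

def deg_avg(concepts):
    # Average degree in the one-inclusion graph: two concepts are
    # adjacent iff they differ on exactly one instance.
    concepts = list(concepts)
    deg = [0] * len(concepts)
    for i, j in combinations(range(len(concepts)), 2):
        if len(concepts[i] ^ concepts[j]) == 1:
            deg[i] += 1
            deg[j] += 1
    return sum(deg) / len(concepts)

def free_combination(class1, class2):
    # C1 ⊔ C2: all unions of one concept from each class, over disjoint domains.
    return {c1 | c2 for c1 in class1 for c2 in class2}

powerset2 = {frozenset(s) for s in [(), ('a',), ('b',), ('a', 'b')]}
chain3 = {frozenset(s) for s in [(), (1,), (1, 2), (1, 2, 3)]}

lhs = deg_avg(free_combination(powerset2, chain3))
rhs = deg_avg(powerset2) + deg_avg(chain3)
assert lhs == rhs == 3.5  # Lemma 25: 2.0 + 1.5
```

Here the free combination of the powerset over two instances (average degree 2) and a four-concept chain (average degree 3/2) has average degree exactly 7/2, as (7) predicts.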
6. Relation to Other Learning-theoretic Parameters
In this section, we set
NCTD in relation to
PBTD and
VCD, as well as to the smallest possible size of a sample compression scheme for a given concept class.
6.1 PBTD and VCD
Since preference-based teaching is collusion-free (Gao et al., 2017), we obtain the following bounds.
Proposition 27
Let C be any concept class. Then NCTD( C ) ≤ PBTD( C ) and NCTD + ( C ) ≤ PBTD + ( C ) . Remark 28
The first inequality in Proposition 27 is sometimes strict, as witnessed by Theorem 23, which states that
NCTD(P_m) = ⌈m/2⌉. By comparison, PBTD(P_m) = m. In particular, this yields a family of concept classes of strictly increasing NCTD for which
PBTD exceeds
NCTD by a factor of 2. The fact that the second inequality in Proposition 27 is sometimes strict is witnessed by the simple class C described in the introduction, with NCTD+(C) = 1. Since no concept in C has a positive teaching set of size 1, Proposition 5 implies PBTD+(C) = 2. In particular, these examples witness that Proposition 5 does not hold for non-clashing teaching.

Results from the literature can now be combined in a straightforward way in order to formulate an upper bound on NCTD in terms of the VC-dimension.
Proposition 29
NCTD( C ) is upper-bounded by a function quadratic in VCD( C ) . Proof
PBTD is known to lower-bound the recursive teaching dimension (Gao et al., 2017). Hu et al. (2017) proved that, when
VCD(C) = d, the recursive teaching dimension of C is upper-bounded by a function quadratic in d. By Proposition 27, the same upper bound applies to NCTD. However,
VCD can also be arbitrarily larger than
NCTD, a result that follows immediately from the corresponding result for TD:

Proposition 30 (Goldman and Kearns (1995))
Let k ∈ N, k ≥ 1. Then there exists a finite concept class C such that TD+(C) = TD(C) = 1 and VCD(C) = k.
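Proposition 30 can be illustrated by brute force. The following sketch uses one standard construction of this flavor, not necessarily the one of Goldman and Kearns (1995): every subset S of a k-element shattered part becomes a concept S ∪ {w_S} with a private witness instance w_S (the names w0, w1, ... are ours), so the single positive example (w_S, 1) already teaches the concept, while the shattered part keeps the VC-dimension at k.

```python
from itertools import chain, combinations

def consistent(concept, sample):
    # A concept agrees with every labelled example in the sample.
    return all((x in concept) == label for x, label in sample)

def teaching_dim(concepts, domain):
    # TD(C): worst case, over concepts, of the smallest sample that is
    # consistent with that concept and with no other concept in the class.
    def td(c):
        for k in range(len(domain) + 1):
            for xs in combinations(domain, k):
                sample = [(x, x in c) for x in xs]
                if [d for d in concepts if consistent(d, sample)] == [c]:
                    return k
    return max(td(c) for c in concepts)

def vc_dim(concepts, domain):
    # VCD(C): size of the largest subset of the domain shattered by C.
    def shattered(s):
        return len({frozenset(c & set(s)) for c in concepts}) == 2 ** len(s)
    return max(k for k in range(len(domain) + 1)
               if any(shattered(s) for s in combinations(domain, k)))

k = 2
shattered_part = [1, 2]
subsets = [set(s) for s in chain.from_iterable(
    combinations(shattered_part, r) for r in range(k + 1))]
# Concept i is subsets[i] plus its private witness instance w_i.
concepts = [frozenset(s | {f'w{i}'}) for i, s in enumerate(subsets)]
domain = shattered_part + [f'w{i}' for i in range(len(subsets))]

assert teaching_dim(concepts, domain) == 1
assert vc_dim(concepts, domain) == 2
```

The same brute-force helpers work for any small finite class, which makes them handy for testing conjectured separations between teaching parameters and VCD.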
So far, there is no concept class for which
VCD is known to exceed
NCTD. Note that any such concept class would have to fulfill
PBTD > VCD as well. We tested those classes for which
PBTD > VCD is known from the literature, but found that all of them satisfy
NCTD ≤ VCD. As an example, here we present “Warmuth’s class.” This concept class, shown in Table 1, was communicated by Manfred Warmuth and proven by Darnstädt et al. (2016) to be the smallest concept class for which
PBTD exceeds
VCD . In particular,
VCD(C_W) = 2 while PBTD(C_W) = 3.

Table 1: Warmuth’s class C_W, with the highlighted entries (in bold) corresponding to the images of a positive non-clashing teacher mapping. The domain of this class is {x_1, ..., x_5}, and it contains 10 concepts, named C_1 through C_5 and C_1′ through C_5′.

Proposition 31
NCTD( C W ) = NCTD + ( C W ) = 2 . Proof
The highlighted labels in Table 1 correspond to a positive non-clashing mapping of order 2 for C_W, which immediately shows that NCTD+(C_W) ≤ 2 and thus NCTD(C_W) ≤ 2. For the lower bound, write the concepts, up to renaming, as C_i = {x_i, x_{i+1}} and C_i′ = {x_i, x_{i+1}, x_{i+3}}, with indices taken modulo 5; in particular, C_1 and C_1′ differ only on the instance x_4. To show that NCTD(C_W) ≥ 2, suppose by way of contradiction that NCTD(C_W) = 1. Then there is a non-clashing teacher mapping T that assigns every concept in C_W a teaching set of size 1.

Since C_1 and C_1′ differ only on the instance x_4, the mapping T must fulfill either T(C_1) = {(x_4, 0)} or T(C_1′) = {(x_4, 1)}.

Case 1. T(C_1) = {(x_4, 0)}. Since C_2 is consistent with T(C_1), the teaching set for C_2 must be inconsistent with C_1. In particular, T(C_2) ≠ {(x_5, 0)}. This implies T(C_2′) = {(x_5, 1)}, since x_5 is the only instance on which C_2 and C_2′ disagree. By an analogous argument concerning C_5 and C_5′, one obtains T(C_5′) = {(x_3, 1)}. Now T has a clash on C_2′ and C_5′, which is a contradiction.

Case 2. T(C_1′) = {(x_4, 1)}. One argues as in Case 1, with C_3′ and C_4′ in place of C_2 and C_5, yielding T(C_3) = {(x_1, 0)} and T(C_4) = {(x_2, 0)}. This is a clash, resulting in a contradiction.

As both cases result in a contradiction, we have NCTD(C_W) > 1 and thus NCTD(C_W) = 2. Since NCTD+ is an upper bound on NCTD, we also have
NCTD+(C_W) = 2.

While the general relationship between NCTD and
VCD remains open, it turns out that
NCTD(C) is upper-bounded by VCD(C) when C is a finite maximum class. For a finite instance space X, a concept class C of VC dimension d is called maximum if its size |C| meets Sauer’s upper bound Σ_{i=0}^{d} (|X| choose i) (Sauer, 1972) with equality. Recently, Chalopin et al. (2018) showed that every finite maximum class C admits a so-called representation map, i.e., a function r that maps every concept in C to a set of at most d (= VCD(C)) instances, in a way that no two distinct concepts C, C′ ∈ C agree on all the instances in r(C) ∪ r(C′). By definition, any representation map is, translated into our setting, simply a non-clashing teacher mapping of order d for C. Therefore, the result by Chalopin et al. implies that NCTD(C) ≤ VCD(C) for finite maximum C.

6.2 Sample Compression

Intuitively, a sample compression scheme (Littlestone and Warmuth, 1986) for a (possibly infinite) concept class C provides a lossless compression of every set S of labeled examples for any concept in C in the form of a subset of S. It was proven that the existence of a finite upper bound on the size of the compression sets is equivalent to PAC-learnability, i.e., to finite VC-dimension (Moran and Yehudayoff, 2016; Littlestone and Warmuth, 1986). Open for over 30 years now is the question of how closely such an upper bound can be related to the VC-dimension.

Formally, a sample compression scheme of size k for a concept class C over X is a pair (f, g) of mappings, where, for every sample set S consistent with some concept C ∈ C, (i) f maps S to a subset f(S) ⊆ S with |f(S)| ≤ k; and (ii) g maps the compressed set f(S) to a concept C′ over X (not necessarily in C) that is consistent with S. By CN(C) we denote the size of the smallest-size sample compression scheme for C.
The open question then is whether CN(C) is upper-bounded by (a function linear in) VCD(C).

Some connections between sample compression and teaching have been established in the literature (Doliwa et al., 2014; Darnstädt et al., 2016). The non-clashing property bears some similarities to sample compression and has in fact been used in the context of unlabelled sample compression (in which f(S) is an unlabelled set) (Kuzmin and Warmuth, 2007; Chalopin et al., 2018). It is thus natural to ask whether CN is an immediate upper or lower bound on NCTD. Below, we answer this question negatively.
Proposition 32
1. For every k ∈ N, k ≥ 1, there is a concept class C such that NCTD(C) = PBTD(C) = 1 but CN(C) > k.

2. Let P_m be the powerset over a domain of size m, where m ≥ 3 is odd. Then CN(P_m) < NCTD(P_m) and CN(P_m) < PBTD(P_m).

Proof
Statement 1 is due to Proposition 30, which implies the existence of a concept class C with NCTD(C) = PBTD(C) = 1 and VCD(C) = 5k. Then CN(C) > k follows from a result by Floyd and Warmuth (1995) that states that no concept class of VC-dimension d has a sample compression scheme of size at most d/5.

Statement 2 follows from the obvious fact that PBTD(P_m) = m, in combination with Theorem 23, as well as with a result by Darnstädt et al. (2016) that shows CN(P_m) ≤ ⌊m/2⌋, for any m ≥ 1.

Note that the compression function f in a sample compression scheme for C trivially induces a teacher mapping T_f defined by T_f(C) = f({(x, C(x)) | x ∈ X}). The decompression mapping g then satisfies g(T_f(C)) = C for all C ∈ C. Hence (T_f, g) is a successful teacher-learner pair. Proposition 32.2 now states that there are concept classes for which the teacher-learner pairs (T_f, g) induced by any optimal sample compression scheme necessarily display collusion. In other words, optimal sample compression yields collusive teaching. An interesting problem is to find more examples of concept classes for which optimal sample compression yields collusive teachers.
2. When m = 5k for some k ≥ 1, Darnstädt et al. (2016) even show that CN(P_m) ≤ k; hence there is a family of concept classes with CN < NCTD for which the gap between CN and NCTD grows linearly with the size of the instance space.

Note that, in an unlabelled sample compression scheme, the representation map from C to a subset of X must be injective, so that any two concepts in C remain distinguishable after compression. In other words, the non-clashing teacher mappings induced by representation maps are repetition-free, i.e., for any two distinct concepts C, C′ ∈ C, the instance sets {x ∈ X | (x, l) ∈ T(C) for some l ∈ {0, 1}} and {x ∈ X | (x, l′) ∈ T(C′) for some l′ ∈ {0, 1}} are distinct. Requiring no-clash teacher mappings to be repetition-free would be a limitation, as the example of the powerset over any set of m instances, m ≥ 2, shows. In this case, no-clash teaching can be done with teacher mappings of order ⌈m/2⌉, but it is not hard to see that the best possible repetition-free no-clash teacher mapping is of order m.
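For the powerset over two instances, the claimed gap can be confirmed by exhaustive search. The sketch below (helper names are ours) computes the smallest order of a non-clashing teacher mapping, optionally restricted to repetition-free mappings whose teaching sets have pairwise distinct instance sets.

```python
from itertools import combinations, product

def consistent(concept, sample):
    return all((x in concept) == label for x, label in sample)

def non_clashing(mapping):
    # No two distinct concepts may both be consistent with each other's
    # teaching sets (Goldman–Mathias collusion-freeness).
    items = list(mapping.items())
    for (c, t), (d, s) in combinations(items, 2):
        if consistent(c, s) and consistent(d, t):
            return False
    return True

def min_order(concepts, domain, repetition_free=False):
    # Smallest k admitting a non-clashing mapping with |T(C)| <= k;
    # optionally require pairwise-distinct instance sets (repetition-free).
    for k in range(len(domain) + 1):
        options = []
        for c in concepts:
            opts = []
            for r in range(k + 1):
                for xs in combinations(domain, r):
                    opts.append(tuple((x, x in c) for x in xs))
            options.append(opts)
        for choice in product(*options):
            if repetition_free and len({frozenset(x for x, _ in t)
                                        for t in choice}) < len(choice):
                continue
            if non_clashing(dict(zip(concepts, choice))):
                return k
    return None

p2 = [frozenset(s) for s in [(), (1,), (2,), (1, 2)]]
assert min_order(p2, [1, 2]) == 1                        # NCTD(P_2) = 1
assert min_order(p2, [1, 2], repetition_free=True) == 2  # repetition-free needs order m
```

For P_2 the unrestricted optimum is ⌈2/2⌉ = 1, while the repetition-free optimum is 2 = m: only three distinct instance sets of size at most 1 are available for four concepts.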
7. Complexity of Decision Problems Related to No-clash Teaching
In this section, we address the complexity of the problem of deciding whether or not every concept in a given finite concept class can be taught with a non-clashing teaching set of size at most k, for some specified k ≥ 1. Surprisingly perhaps, such decision problems are NP-hard, even when k = 1 and teaching is done using positive examples only. In contrast, we show in subsection 7.5 that the corresponding decision problems for PBTD (equivalently, for
RTD) have polynomial time solutions. We show an equivalence between the most highly constrained such decision problem (testing if
NCTD+ = 1, for a given concept class) and a natural (but apparently not previously studied) constrained bipartite matching problem that is related to the well-studied notion of induced matchings.

The following, an immediate consequence of Proposition 13, allows us to restrict our complexity analysis to certain normalized concept classes.

Proposition 33
Let C be any non-trivial concept class over a finite domain, with at least two non-empty concepts. Then, NCTD + ( C ) = NCTD + ( C \ {∅} ) . Proof
Let C′ denote C \ {∅}. If C′ = C there is nothing to show. So, suppose that C contains the empty concept. If NCTD+(C) = k then trivially NCTD+(C′) ≤ k.

For the converse, suppose that NCTD+(C′) = k, as witnessed by a normal-form mapping T (cf. Proposition 13(b)). Since T does not assign the empty set to any concept, one can obviously extend T to assign the empty set to the empty concept and thus teach all of C without clashes using no negative examples and with teaching sets of size at most k. (There are no clashes, because the empty concept cannot be consistent with any of the teaching sets that use at least one positive example.)

Our goal in the remainder of this section is to set out hardness results for testing NCTD = k? and NCTD+ = k?, for fixed k ≥ 1. We begin by establishing that testing NCTD+ = 1?, for a given concept class C, is NP-hard. Other results follow by reduction from the NCTD+ = 1? decision problem. (It is straightforward to confirm that all of the decision problems NCTD ≤ k? and NCTD+ ≤ k? are in NP.)

7.1 Testing if NCTD+ = 1 is NP-hard

Let (C, X) be an instance of the NCTD+ = 1 decision problem. By Propositions 13 and 33, we can assume that C does not contain the empty set, and that positive teacher mappings realizing NCTD+ = 1 are restricted to those that use exactly one positive instance for each concept.

We start by observing that (C, X) can be viewed as a bipartite graph B_{C,X}, with vertex classes C (black vertices) and X (white vertices) and an edge from C_i ∈ C to x_j ∈ X whenever x_j ∈ C_i. Under our assumptions, it follows that deciding if C has NCTD+ = 1 is equivalent to deciding if B_{C,X} admits a matching M such that (i) M saturates all of the black vertices, and (ii) no two edges of M are part of a 4-cycle in B_{C,X}.
(Condition (i) ensures that each concept in C has an associated positive teaching set of size 1, and condition (ii) ensures that the resulting teacher mapping is non-clashing.)

We refer to the problem of deciding if a given bipartite graph B with vertex partition (V_b, V_w) admits a matching M such that (i) M saturates all of the vertices in V_b, and (ii) no two edges of M are part of a 4-cycle in B, as the Non-Clashing Bipartite Matching Problem.
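On small instances, the Non-Clashing Bipartite Matching Problem can be decided directly from conditions (i) and (ii) by brute force. In the sketch below (our own helper, not from the paper), a matching edge (C_i, x_j) is encoded by assigning the instance x_j to the concept C_i, and two matching edges (C, x), (C′, x′) lie on a common 4-cycle exactly when x′ ∈ C and x ∈ C′.

```python
from itertools import permutations

def has_non_clashing_matching(concepts, domain):
    # Try every assignment of pairwise-distinct instances to the concepts:
    # (i) requiring x in C for each pair makes the assignment a matching of
    # B_{C,X} that saturates the black (concept) vertices, and (ii) a pair
    # of matching edges (C, x), (C', x') lies on a 4-cycle iff x' in C and
    # x in C', which would correspond to a clash of the induced teacher.
    for xs in permutations(domain, len(concepts)):
        if all(x in c for c, x in zip(concepts, xs)):
            if all(not (xs[j] in concepts[i] and xs[i] in concepts[j])
                   for i in range(len(concepts)) for j in range(i)):
                return True
    return False

# NCTD+ = 1 holds for a small "chain" class ...
chain = [frozenset(s) for s in [(1,), (1, 2), (2, 3)]]
assert has_non_clashing_matching(chain, [1, 2, 3])

# ... but fails when the black vertices cannot even be saturated.
clash = [frozenset(s) for s in [(1,), (2,), (1, 2)]]
assert not has_non_clashing_matching(clash, [1, 2])
```

This exhaustive check is exponential in general, which is consistent with the NP-hardness result that follows.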
The NP-hardness of deciding NCTD+ = 1? is thus an immediate consequence of the following:
Theorem 34
The Non-Clashing Bipartite Matching Problem is NP-hard.
The proof of Theorem 34 is by reduction from the familiar NP-hard problem 3-SAT. The details are given in Appendix A.

Remark 35
The reduction produces a bipartite graph whose vertices have degree bounded by five. One can conclude then that testing
NCTD+ = 1 is NP-hard even if concepts contain at most five instances, and instances are contained in at most five concepts. It is natural to ask to what extent this can be tightened. In Appendix B.1, we describe a modification of the reduction that produces a bipartite graph whose vertices have degree bounded by three, from which it follows that testing NCTD+ = 1 is NP-hard even if concepts contain at most three instances, and instances are contained in at most three concepts. On the other hand, if either (i) all concepts have at most two instances, or (ii) all instances are contained in at most two concepts, the bipartite graph B_{C,X} has the property that the degree of all vertices in one of its two parts is at most two. In this case, it follows immediately from the algorithm in Appendix B.2 that testing NCTD+ = 1 can be done in polynomial time.

7.2 Testing if NCTD = 1 is NP-hard
We reduce the
NCTD+ = 1 decision problem to the NCTD = 1 decision problem. Let (C, X) be an instance of the NCTD+ = 1 decision problem. As before, we will assume (following Proposition 33) that C does not contain the empty set. We make two disjoint copies (C_1, X_1) and (C_2, X_2) of (C, X), and take their union, denoted 2C, to be an instance of the NCTD = 1 decision problem. We argue that
NCTD(2C) = 1 if and only if NCTD+(C) = 1.

It is clear that NCTD+(C) = 1 implies NCTD(2C) = 1. For the converse, suppose that a teacher mapping T provides an NCTD = 1 solution of the concept class 2C that, among all such mappings, uses the fewest negative examples.

Note that all concepts in one component concept class are consistent with (necessarily negative) examples drawn from the opposite domain, and inconsistent with positive examples drawn from the opposite domain. Thus, the minimality of T ensures that negative examples used for any component concept class C_i are drawn from its associated domain X_i (otherwise any such negative example could be replaced by a positive example for the corresponding concept, without creating a clash). But, for the same reason, it cannot be that for both concept classes C_i there exist one or more concepts whose teaching set uses a negative example drawn from the associated domain X_i, since any pair of concepts from different classes taught in this way would necessarily clash. It follows that T must use only positive examples (necessarily from the associated domain X_i) for teaching concepts in at least one of the two component concept classes C_i; in this sense it must provide an NCTD+ = 1 solution of the instance (C, X).

7.3 Testing if NCTD+ = k is NP-hard, for k > 1

Again we describe a reduction from the
NCTD+ = 1 decision problem. Given an instance of the NCTD+ = 1 decision problem, specifically a pair (C, X), where C is a concept class over the finite domain X disjoint from {x_1, ..., x_{k−1}}, we construct the concept class P_{k−1} ⊔ C. By Lemma 24, we know that NCTD+(P_{k−1} ⊔ C) = (k − 1) + NCTD+(C), so NCTD+(C) = 1 if and only if NCTD+(P_{k−1} ⊔ C) = k.

7.4 Testing if NCTD = k is NP-hard, for k > 1

Again we describe a reduction from the
NCTD+ = 1 decision problem. Let C be a concept class over the finite domain X, disjoint from {x_1, ..., x_{2k−2}}. We construct the composite concept class 4C := 2(2C) as in subsection 7.2. By the reduction of that subsection, it will suffice to argue that NCTD(2C) = 1 if and only if NCTD(P_{2k−2} ⊔ 4C) = k.

First note that, by Theorem 23 and the sub-additivity of NCTD (equation (1)),
NCTD(2C) = 1 implies that NCTD(P_{2k−2} ⊔ 4C) ≤ k. In addition, NCTD(2C) = 1 implies that ANCTD(4C) > 0 which, together with ANCTD(P_{2k−2}) = k − 1 (Remark 18), implies ANCTD(P_{2k−2}) + ANCTD(4C) > k − 1. This in turn implies ANCTD(P_{2k−2} ⊔ 4C) > k − 1, by Lemma 21, from which we immediately conclude that NCTD(P_{2k−2} ⊔ 4C) > k − 1, and hence NCTD(P_{2k−2} ⊔ 4C) = k.

For the converse, we have seen that NCTD(P_{2k−2} ⊔ 4C) = k implies ANCTD(P_{2k−2} ⊔ 4C) ≤ k (trivially). This in turn implies ANCTD(P_{2k−2}) + ANCTD(4C) ≤ k, by Lemma 21, and hence ANCTD(4C) ≤ 1, by Remark 18. But ANCTD(4C) ≤ 1 implies NCTD(2C) = 1, since (i) 4C can be viewed as two copies of 2C, and (ii) any teacher mapping for 4C realizing ANCTD(4C) ≤ 1 uses the empty set as a teaching set at most once, and hence uses a teaching set of size greater than 1 in at most one of the two copies of 2C.

7.5 PBTD ≤ k? (equivalently RTD ≤ k?) has a polynomial-time solution
PBTD ≤ k ? as a constrained matching problem in asuitably defined bipartite graph. Definition 36
Let C be a concept class over domain X . Let ≤ k ≤ |X | be an integer. Let S k = ( X × { , } ) k be the family of labeled samples of size k . Then G k = G k ( C ) denotes thebipartite graph with vertex classes C and S k and an edge between C ∈ C and S ∈ S k iff C isconsistent with S . The following result is a slight extension of a well known result in the theory of constrainedmatchings: 17 heorem 37
Let C be a concept class of size m over domain X and let G = G k ( C ) . Then thefollowing statements are equivalent:1. There exists a matching M of size m in G such that G contains no alternating cycles withrespect to M (called a uniquely restricted matching of size m (Golumbic et al., 2001)).2. There exists a matching M of size m in G such that every M -induced subgraph contains avertex of degree .3. There exist m distinct vertices (samples) S , . . . , S m ∈ S k and an ordering C , . . . , C m ofthe vertices (=concepts) in C such that the following holds: • There is an edge between C i and S i (i.e., C i is consistent with S i ). • If there is an edge between C i and S j (i.e., if C i is consistent with S j ), then i ≤ j .4. PBTD( C ) ≤ k . Proof ⇔
2. and 2. ⇔
3. are well known equivalences in the theory of constrained matchings. SeeTheorem . in (Golumbic et al., 2001).3. ⇒ C j is preferred over C i whenever j > i . This preference relation is a witness of PBTD( C ) ≤ k .4. ⇒ PBTD( C ) ≤ k exists, then any linearextension of it will satisfy 3. Remark 38
The second statement in Theorem 37 can be strengthened as follows: there exists a matching M of size m in G such that every M-induced subgraph contains a vertex of degree 1 in each of the two vertex classes, cf. (Cechlárová, 1991).

The above characterization of classes with a recursive teaching dimension of at most k easily leads to a linear time algorithm for the corresponding decision problem:

Corollary 39
1. Let G be a bipartite graph with vertex classes V_1 and V_2 such that |V_1| ≤ |V_2|. Let |G| denote the size (= number of vertices plus number of edges) of G. There is an algorithm that runs in time O(|G|) and returns a uniquely restricted matching of size |V_1| (provided it exists).
2. Suppose that the graph G_k(C) associated with a concept class C is given. Then there is an algorithm for checking whether PBTD(C) ≤ k whose run time is linear in the size of G_k(C). Moreover, if PBTD(C) ≤ k, it returns a preference relation that witnesses this fact.

The second part of the corollary is immediate from the first part and Theorem 37. The first part of the corollary is based on the simple idea of initializing the matching M with the empty set and then iteratively doing the following:
3. See (Cechlárová, 1991) for a similar algorithm (based on a similar characterization) that decides in linear time whether a bipartite graph has a unique maximum matching.
4. The more general problem of deciding whether there exists a uniquely restricted matching of size k, with k being part of the input, is known to be NP-complete (Golumbic et al., 2001).
1. If V_1 = ∅, then return M and stop.
2. If V_2 does not contain any vertex of degree 1, then return an error message (indicating that there exists no uniquely restricted matching of size |V_1| in G) and stop. Otherwise, pick a vertex of degree 1 from V_2, say vertex v with u as its unique neighbor in V_1.
3. Insert the edge (u, v) into M, remove u from V_1 and v from V_2, and update the degrees of the vertices which are adjacent to either u or v.

Note that V_1 and V_2 are dynamically changed in the course of the algorithm. The following conditions are easily shown to be satisfied after a run through the main loop provided that they are satisfied immediately before entering the loop:

1. There exists a uniquely restricted matching of size |V_1| for the subgraph induced by V_1 ∪ V_2.
2. Let V_1′ and V_2′ denote the vertices that have been removed from V_1 and V_2, respectively. Then there is a unique perfect matching for the subgraph induced by V_1′ ∪ V_2′.

The correctness of the algorithm directly follows from these invariance conditions.

We briefly note that the results described in this section also hold, mutatis mutandis, for PBTD+ in place of PBTD.
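The loop above can be rendered as follows; this is a simplified quadratic-time sketch rather than the O(|G|) implementation of Corollary 39, and it assumes (following Remark 38) that the degree-1 vertex is sought in V_2, with its unique neighbor taken from V_1.

```python
def uniquely_restricted_matching(v1, v2, edges):
    # Greedy loop from Corollary 39: repeatedly match a degree-1 vertex of
    # V2 to its unique neighbor in V1; fail if V1 is nonempty but no such
    # vertex exists.  Edges are given as (u, v) pairs with u in V1, v in V2.
    v1, v2 = set(v1), set(v2)
    adj = {v: {u for u in v1 if (u, v) in edges} for v in v2}
    matching = []
    while v1:
        leaf = next((v for v in v2 if len(adj[v]) == 1), None)
        if leaf is None:
            return None  # no uniquely restricted matching of size |V1|
        u = adj[leaf].pop()
        matching.append((u, leaf))
        v1.remove(u)
        v2.remove(leaf)
        del adj[leaf]
        for v in v2:
            adj[v].discard(u)
    return matching

# A path x-a-y-b-z admits one: both x and z have degree 1.
edges = {('a', 'x'), ('a', 'y'), ('b', 'y'), ('b', 'z')}
m = uniquely_restricted_matching(['a', 'b'], ['x', 'y', 'z'], edges)
assert m is not None and len(m) == 2

# A 4-cycle a-x-b-y-a has perfect matchings, but each one admits an
# alternating cycle, so the procedure (correctly) reports failure.
c4 = {('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')}
assert uniquely_restricted_matching(['a', 'b'], ['x', 'y'], c4) is None
```

Achieving the linear running time of Corollary 39 requires maintaining vertex degrees and a queue of degree-1 vertices incrementally instead of rescanning V_2 in every iteration.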
8. Conclusions
No-clash teaching represents the limit of data efficiency that can be achieved in teaching settings obeying Goldman and Mathias's notion of collusion-freeness. Therefore, it is the single most promising collusion-free teaching model to shed light on two open problems in computational learning theory, namely (i) to find a teaching complexity parameter that is upper-bounded by a function linear in
VCD, and (ii) to establish an upper bound on the size of smallest sample compression schemes that is linear in
VCD. If any collusion-free teaching model yields a complexity upper-bounded by (a function linear in)
VCD, then no-clash teaching does. Likewise, if any collusion-free model is powerful enough to compress concepts as efficiently as sample compression schemes do, then no-clash teaching is.

The most fundamental open question resulting from our paper is probably whether
NCTD is upper-bounded by
VCD in general. Furthermore, our results introduce some intriguing connections between
NCTD and the well-studied field of constrained matching in bipartite graphs that may open up a line of study that relates teaching complexity, as well as sample compression and
VCD , to fundamental issues in matchingtheory.
References
Dana Angluin. Inductive inference of formal languages from positive data.
Information and Control, 45(2):117–135, 1980.
5. This can be justified by Remark 38 and the invariance conditions below.
6. The corresponding bipartite graph has as its second vertex class the family of positive samples of order at most k.

Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

Frank Balbach. Measuring teachability using variants of the teaching dimension.
Theoretical Computer Science, 397(1–3):94–113, 2008.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML, pages 41–48, 2009.

Katarina Cechlárová. The uniquely solvable bipartite matching problem. Operations Research Letters, 10(4):221–224, 1991.

Jérémie Chalopin, Victor Chepoi, Shay Moran, and Manfred K. Warmuth. Unlabeled sample compression schemes and corner peelings for ample and maximum classes. ArXiv, abs/1812.02099, 2018. URL http://arxiv.org/abs/1812.02099.

Malte Darnstädt, Thorsten Kiss, Hans Ulrich Simon, and Sandra Zilles. Order compression schemes. Theor. Comput. Sci., 620:73–90, 2016.

François Denis. Learning regular languages from simple positive examples. Machine Learning, 44(1/2):37–66, 2001.

Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression. J. Mach. Learn. Res., 15:3107–3131, 2014.

Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension.
Machine Learning, 21(3):1–36, 1995.

Ziyuan Gao, Christoph Ries, Hans Ulrich Simon, and Sandra Zilles. Preference-based teaching.
J. Mach. Learn. Res., 18(31):1–32, 2017.

Sally A. Goldman and Michael J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

Sally A. Goldman and H. David Mathias. Teaching a smarter learner. J. Comput. Syst. Sci., 52(2):255–267, 1996.

Martin C. Golumbic, Tirza Hirst, and Moshe Lewenstein. Uniquely restricted matchings. Algorithmica, 31(2):139–154, 2001.

Mark K. Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L. Austerweil. Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3027–3035, 2016.

Lunjia Hu, Ruihan Wu, Tianhong Li, and Liwei Wang. Quadratic upper bound for recursive teaching dimension of finite VC classes. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1147–1156, 2017.

David Kirkpatrick, Hans U. Simon, and Sandra Zilles. Optimal collusion-free teaching. In Proceedings of Machine Learning Research (ALT 2019), volume 98, 2019.

Dima Kuzmin and Manfred K. Warmuth. Unlabeled compression schemes for maximum classes.
J. Mach. Learn. Res., 8:2047–2081, 2007.

Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Technical Report, UC Santa Cruz, 1986.

Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. J. ACM, 63(3):21:1–21:10, 2016.

Shay Moran, Amir Shpilka, Avi Wigderson, and Amir Yehudayoff. Teaching and compressing for low VC-dimension. CoRR, abs/1502.06187, 2015.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

Ingo Schwab, Wolfgang Pohl, and Ivan Koychev. Learning to recommend from positive evidence. In Proceedings of the 5th International Conference on Intelligent User Interfaces, IUI, pages 241–247, 2000.

Patrick Shafto, Noah D. Goodman, and Thomas L. Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples.
Cognitive Psychology, 71:55–89, 2014.

Ayumi Shinohara and Satoru Miyano. Teachability in computational learning. New Gen. Comput., 8:337–348, 1991.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theor. Probability and Appl., 16(2):264–280, 1971.

Chunlin Wang, Chris H. Q. Ding, Richard F. Meraz, and Stephen R. Holbrook. PSoL: a positive sample only learning algorithm for finding non-coding RNA genes. Bioinformatics, 22(21):2590–2596, 2006.

Kenneth Wexler and Peter W. Culicover. Formal principles of language acquisition. MIT Press, 1980.

Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. CoRR, abs/1801.05927, 2018.

Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning. J. Mach. Learn. Res., 12:349–384, 2011.
Appendix A. Proof of Theorem 34
Proof
We describe a parsimonious reduction from the familiar NP-hard problem 3-SAT, an instance of which is a set D = {D_1, ..., D_m} of clauses, each of which is a disjunction of three literals drawn from an underlying set V = {V_1, ..., V_n} of variables. Specifically, given an instance D of 3-SAT, we construct a bipartite graph B_D (vertices are either black or white, and all edges join a black vertex to a white vertex) that admits a matching M such that (i) M saturates all of the black vertices, and (ii) no two edges of M are part of a 4-cycle in B_D, if and only if the instance D is satisfiable.

To this end, we first associate with each variable V_i a variable gadget: a ring of 4m vertices, with alternating subscripted labels v_i and w_i, emphasizing its bipartite nature (cf. Figure 1(a)). A matching that saturates all of the v_i-vertices (black) of this gadget is of one of two types, illustrated in Figure 1(b) and (c), which we associate with the two possible truth assignments to V_i.

Figure 1: Variable gadget: (a) the ring of v_i- and w_i-vertices; (b) the matching associated with V_i ≡ true; (c) the matching associated with V_i ≡ false.

We associate with each clause D_j a clause gadget consisting of 10 vertices, with subscripted labels p_j, q_j, r_j and s_j (cf. Figure 2(a)). It is straightforward to confirm that any matching that saturates all of the r_j- and q_j-vertices (black) must use exactly one of the three p_j q_j-edges, illustrated in Figure 2(b), (c) and (d). We refer to the p_j q_j-edges as portals of the clause gadget, since their endpoints are the only points of connection with other parts of the full construction.
Figure 2: Clause gadget: (a) the gadget; (b)-(d) the three matchings, one per portal edge.

We complete the construction by adding edges from variable gadgets to appropriate clause gadget portals. Specifically, (i) if the k-th literal in clause D_j is V_i, then we add edges from v_i^{2j} to p_j^k and from q_j^k to w_i^{2j} (cf. Figure 3(a)), and (ii) if the k-th literal in clause D_j is ¬V_i, then we add edges from v_i^{2j} to p_j^k and from q_j^k to w_i^{2j−1} (cf. Figure 3(b)). These connector edges, shown dashed in Figures 3(a) and (b), are forbidden in any matching satisfying the constraints set out above, by the inclusion, for each such edge, of a pair of additional vertices and associated bridging path, as illustrated in Figure 3(c). (Observe that since the graph has the same number of black and white vertices, a matching that saturates all of the black vertices must also saturate all of the white vertices. Thus, for each connector edge, the middle edge of its bridging path is forced to belong to the matching; otherwise, the end edges of the bridging path must both be chosen, resulting in a clash.)

It follows that if the k-th literal in clause D_j is V_i, and the edge p_j^k q_j^k belongs to the constrained matching, then edge v_i^{2j} w_i^{2j} cannot belong. Similarly, if the k-th literal in clause D_j is ¬V_i, and the edge p_j^k q_j^k belongs to the constrained matching, then edge v_i^{2j} w_i^{2j−1} cannot belong.

Figure 3: Connector gadgets: (a) connection for a positive literal; (b) connection for a negated literal; (c) the bridging path that forbids a connector edge.

To complete the proof it remains to argue that the resulting graph B_D admits a matching M such that (i) M saturates all of the black vertices, and (ii) no two edges of M are part of a 4-cycle in B_D, if and only if the instance D is satisfiable. Suppose first that B_D admits such a matching M.
Since none of the connector edges are included in $M$, it follows (as argued above) that in every variable gadget the black vertices are saturated in one of the two ways illustrated in Figures 1(b) and 1(c). Similarly, in every clause gadget, the black vertices are saturated in one of the three ways illustrated in Figures 2(b), 2(c) and 2(d). Suppose that the portal edge $p_{j,k}q_{j,k}$ of the gadget associated with clause $D_j$ belongs to the matching $M$. Then, by our choice of connector edges, if the $k$-th literal in clause $D_j$ is $V_i$, it must be that edge $v_{i,j}w_{i,j}$ does not belong to $M$; that is, the matching on the variable gadget associated with $V_i$ has the associated truth assignment true. Similarly, if the $k$-th literal in clause $D_j$ is $\overline{V_i}$, it must be that edge $v_{i,j}w_{i,j-1}$ does not belong to $M$; that is, the matching on the variable gadget associated with $V_i$ has the associated truth assignment false. It follows that the truth assignment to the variables in $V$, associated with the matchings induced on the variable gadgets, satisfies all of the clauses in $D$.

On the other hand, suppose that $D$ is satisfiable, that is, there is an assignment of truth values to the variables in $V$ that satisfies all of the clauses in $D$. Then, if we (i) choose the matching on the variable gadget associated with $V_i$ to be the one corresponding to its truth assignment, (ii) choose any matching on the clause gadget associated with clause $D_j$ that includes a portal edge associated with one of the satisfied literals in $D_j$, and (iii) choose all of the edges added to prevent the choice of connector edges (the middle edges of the bridging paths), it is straightforward to confirm that the chosen edges form a matching $M$ in $B_D$ such that (i) $M$ saturates all of the black vertices, and (ii) no two edges of $M$ are part of a 4-cycle in $B_D$.

Appendix B. Complexity of Degree-bounded Instances of Non-clashing Bipartite Matching

The reduction described in the proof of Theorem 34 produces a bipartite graph whose vertices have degree at most five.
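The matching condition that the reduction targets can be sanity-checked on tiny gadgets by exhaustive search. The following brute-force Python sketch is our own illustration (names and encoding are ours; exponential time, so it is only meant for very small graphs):

```python
# Brute-force decision procedure (illustrative only, exponential time):
# does a bipartite graph admit a matching that saturates every black
# vertex and contains no two edges of a common 4-cycle?
from itertools import product

def has_non_clashing_saturating_matching(adj):
    """adj: dict mapping each black vertex to a set of white neighbours."""
    blacks = list(adj)
    # Try every way of picking one incident edge per black vertex.
    for choice in product(*(list(adj[b]) for b in blacks)):
        if len(set(choice)) < len(blacks):
            continue  # a white vertex is reused: not a matching
        ok = True
        for i in range(len(blacks)):
            for j in range(i + 1, len(blacks)):
                # the edges (blacks[i], choice[i]) and (blacks[j], choice[j])
                # lie on a common 4-cycle iff both cross edges exist
                if choice[j] in adj[blacks[i]] and choice[i] in adj[blacks[j]]:
                    ok = False
        if ok:
            return True
    return False
```

For instance, the complete bipartite graph $K_{2,2}$ admits no such matching (its two perfect matchings each consist of two edges of the single 4-cycle), whereas a path does.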
(Degree five is attained by the vertices $p_j$ and $q_j$ of the clause gadgets, both of which have three incident edges within the gadget and two from a bridged connector.) It is natural to ask whether the hardness result continues to hold for bipartite graphs all of whose vertices have degree strictly less than five. In the next subsection we describe a fairly simple modification of both our clause and connector structures that reduces the maximum degree to three. Following that, we show that if the maximum degree among vertices in either part of a given bipartite graph is at most two, there is a polynomial-time algorithm to decide whether the graph admits a non-clashing matching.

B.1 A modified reduction with maximum degree three
We begin by describing a new clause gadget, illustrated in Figure 4(a), with the same $p$-$q$ portal structure as before but with the additional property that all $p$- and $q$-vertices have degree two. It is straightforward to confirm that, up to symmetry, the matching illustrated in Figure 4(b) is the only matching that saturates all of the vertices using only edges internal to the gadget.

[Figure 4: New clause gadget. (a) The gadget for clause $D_j$, with vertices labelled $p_j$, $q_j$, $a_j$, $b_j$, $c_j$ and $d_j$; (b) the unique (up to symmetry) internal saturating matching.]

Next we describe a somewhat more complicated connector structure that is used to link vertices in the variable gadgets with portal vertices of the new clause gadget. Schematically, as illustrated in Figures 5(a) and (b), the connector structure plays exactly the same role as its counterpart (a pair of bridged edges) in the earlier construction. The new connector structure, illustrated in Figure 5(c), also contains edges, dashed as before, that cannot be part of any perfect non-clashing matching. Their role, as before, is simply to constrain the choice of other edges (in any perfect non-clashing matching).

It is easiest to argue first that neither of the dashed diagonals can be used. If both are used then edge r s must also be used, creating a clash. On the other hand, if just one, say r s, is used, then either r s must also be used or both r r and s s must be used, creating a clash in either case.

[Figure 5: New connector gadgets. (a), (b) The schematic role of the connector, linking $v_{i,j}$ and $w_{i,j}$ (or $w_{i,j-1}$) with $p_{j,k}$ and $q_{j,k}$; (c), (c′) the new connector structure, with internal vertices labelled $r$ and $s$.]

By parity, an even number of the horizontal dashed edges are used in any perfect matching. Since it is impossible to choose both wr and vs (or both r p and s q) in a non-clashing matching, it suffices to rule out the case where exactly one of wr and vs and exactly one of r p and s q belong to a perfect matching. Suppose r p (but not s q) is chosen.
Then the matching is forced to include r r and s s (in order to saturate r and s). This in turn forces the choice of r r and s s (in order to saturate r and s), creating a clash. By symmetry, it follows that none of the horizontal dashed edges can be used in a perfect non-clashing matching.

It remains to argue that (i) if a non-clashing matching contains edge $pq$ then edge $vw$ cannot belong (and vice versa); (ii) there is a non-clashing matching of the connector gadget that contains edge $pq$ but leaves both $v$ and $w$ exposed (and vice versa); and (iii) there is a non-clashing matching of the connector gadget that leaves all of $v$, $w$, $p$ and $q$ exposed. For (i), we observe that, by chained forcing as above, the inclusion of $pq$ forces the inclusion of r s (and, by symmetry, the inclusion of $vw$ forces the inclusion of r s). Properties (ii) and (iii) are illustrated in Figure 6.

B.2 An efficient algorithm for Non-Clashing Bipartite Matching, when the maximum degree on either part is at most two
Suppose we are given a bipartite graph $B$ whose vertices are either black or white, and all edges join a black vertex to a white vertex. We want to determine if $B$ admits a matching $M$ such that (i) $M$ saturates all of the black vertices, and (ii) no two edges of $M$ are part of a 4-cycle in $B$.

Suppose further that the vertices in one of the two parts of $B$ all have degree at most two. We can assume, without loss of generality, that they all have degree exactly two, since edges with an endpoint of degree one can be (incrementally) included in a maximum matching $M$ without risk of being part of a 4-cycle in $B$.

[Figure 6: Connector matchings, illustrating properties (ii) and (iii) above.]

We say that a pair of vertices in this degree-bounded part are twins if they have the same adjacent vertices. We can assume that $B$ has no twins since (i) twins cannot both be saturated without producing a forbidden 4-cycle
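The twin test itself is a simple grouping by neighbourhood. The following Python sketch is our own illustration (names and encoding are ours) of that step:

```python
# Illustrative sketch of twin detection (our own names, not the paper's).
from collections import defaultdict

def find_twins(adj):
    """adj maps each vertex of the degree-bounded part to the set of its
    (at most two) neighbours. Returns all groups of mutual twins, i.e.
    groups of vertices sharing exactly the same neighbourhood."""
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        groups[frozenset(nbrs)].append(v)
    return [group for group in groups.values() if len(group) > 1]
```

If two twins $u$ and $u'$ share the neighbours $x$ and $y$, then saturating both forces the disjoint edges $ux$ and $u'y$ (or $uy$ and $u'x$), which lie on the 4-cycle $u$-$x$-$u'$-$y$; this is the clash referred to above.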