Preference-based Teaching


Ziyuan Gao

Department of Computer Science, University of Regina

Christoph Ries

Department of Mathematics, Ruhr-University Bochum

Hans U. Simon

Department of Mathematics, Ruhr-University Bochum

Sandra Zilles

Department of Computer Science, University of Regina

Editor: Manfred Warmuth

Abstract

We introduce a new model of teaching named “preference-based teaching” and a corresponding complexity parameter, the preference-based teaching dimension (PBTD), representing the worst-case number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on finite classes, it is radically different on infinite ones: the RTD becomes infinite already for trivial infinite classes (such as half-intervals) whereas the PBTD evaluates to reasonably small values for a wide collection of infinite classes including classes consisting of so-called closed sets w.r.t. a given closure operator, including various classes related to linear sets over $\mathbb{N}_0$ (whose RTD had been studied quite recently) and including the class of Euclidean half-spaces. On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial flavor) which helps to derive bounds on the PBTD.

Keywords: teaching dimension, preference relation, recursive teaching dimension, learning half-spaces, linear sets

1. Introduction

The classical model of teaching (Shinohara and Miyano, 1991; Goldman and Kearns, 1995) formulates the following interaction protocol between a teacher and a student:

• Both of them agree on a “classification-rule system”, formally given by a concept class $\mathcal{L}$.

• In order to teach a specific concept $L \in \mathcal{L}$, the teacher presents to the student a teaching set, i.e., a set $T$ of labeled examples so that $L$ is the only concept in $\mathcal{L}$ that is consistent with $T$.

• The student determines $L$ as the unique concept in $\mathcal{L}$ that is consistent with $T$.

Goldman and Mathias (1996) pointed out that this model of teaching is not powerful enough, since the teacher is required to make any consistent learner successful. A challenge is to model powerful teacher/student interactions without enabling unfair “coding tricks”. Intuitively, the term “coding trick” refers to any form of undesirable collusion between teacher and learner, which would reduce the learning process to a mere decoding of a code the teacher sent to the learner. There is no generally accepted definition of what constitutes a coding trick, in part because teaching an exact learner could always be considered coding to some extent: the teacher presents a set of examples which the learner “decodes” into a concept.

In this paper, we adopt the notion of “valid teacher/learner pair” introduced by Goldman and Mathias (1996). They consider their model to be intuitively free of coding tricks while it provably allows for a much broader class of interaction protocols than the original teaching model. In particular, teaching may thus become more efficient in terms of the number of examples in the teaching sets. Further definitions of how to avoid unfair coding tricks have been suggested (Zilles et al., 2011), but they were less stringent than the one proposed by Goldman and Mathias. The latter simply requests that, if the learner hypothesizes concept $L$ upon seeing a sample set $S$ of labeled examples, then the learner will still hypothesize $L$ when presented with any sample set $S \cup S'$, where $S'$ contains only examples labeled consistently with $L$. A coding trick would then be any form of exchange between the teacher and the learner that does not satisfy this definition of validity.

The model of recursive teaching (Zilles et al., 2011; Mazadi et al., 2014), which is free of coding tricks according to the Goldman-Mathias definition, has recently gained attention because its complexity parameter, the recursive teaching dimension (RTD), has shown relations to the VC-dimension and to sample compression (Chen et al., 2016; Doliwa et al., 2014; Moran et al., 2015; Simon and Zilles, 2015), when focusing on finite concept classes. Below though we will give examples of rather simple infinite concept classes with infinite RTD, suggesting that the RTD is inadequate for addressing the complexity of teaching infinite classes.

In this paper, we introduce a model called preference-based teaching, in which the teacher and the student do not only agree on a classification-rule system $\mathcal{L}$ but also on a preference relation (a strict partial order) imposed on $\mathcal{L}$. If the labeled examples presented by the teacher allow for several consistent explanations (= consistent concepts) in $\mathcal{L}$, the student will choose a concept $L \in \mathcal{L}$ that she prefers most. This gives more flexibility to the teacher than the classical model: the set of labeled examples need not distinguish a target concept $L$ from any other concept in $\mathcal{L}$ but only from those concepts $L'$ over which $L$ is not preferred. At the same time, preference-based teaching yields valid teacher/learner pairs according to Goldman and Mathias’s definition. We will show that the new model, despite avoiding coding tricks, is quite powerful. Moreover, as we will see in the course of the paper, it often allows for a very natural design of teaching sets.

Assume teacher and student choose a preference relation that minimizes the worst-case number $M$ of examples required for teaching any concept in the class $\mathcal{L}$. This number $M$ is then called the preference-based teaching dimension (PBTD) of $\mathcal{L}$. In particular, we will show the following:

(i) Recursive teaching is a special case of preference-based teaching where the preference relation satisfies a so-called “finite-depth condition”. It is precisely this additional condition that renders recursive teaching useless for many natural and apparently simple infinite concept classes. Preference-based teaching successfully addresses these shortcomings of recursive teaching, see Section 3. For finite classes, PBTD and RTD are equal.

(ii) A wide collection of geometric and algebraic concept classes with infinite RTD can be taught very efficiently, i.e., with low PBTD. To establish such results, we show in Section 4 that spanning sets can be used as preference-based teaching sets with positive examples only — a result that is very simple to obtain but quite useful.

(iii) In the preference-based model, linear sets over $\mathbb{N}_0$ with origin 0 and at most $k$ generators can be taught with $k$ positive examples, while recursive teaching with a bounded number of positive examples was previously shown to be impossible and it is unknown whether recursive teaching with a bounded number of positive and negative examples is possible for $k \ge 4$. We also give some almost matching upper and lower bounds on the PBTD for other classes of linear sets, see Section 6.

(iv) The PBTD of halfspaces in $\mathbb{R}^d$ is upper-bounded by 6, independent of the dimensionality $d$ (see Section 7), while its RTD is infinite.

(v) We give full characterizations of concept classes that can be taught with only one example (or with only one example, which is positive) in the preference-based model (see Section 8).

Based on our results and the naturalness of the teaching sets and preference relations used in their proofs, we claim that preference-based teaching is far more suitable to the study of infinite concept classes than recursive teaching.

Parts of this paper were published in a previous conference version (Gao et al., 2016).

2. Basic Definitions and Facts

$\mathbb{N}_0$ denotes the set of all non-negative integers and $\mathbb{N}$ denotes the set of all positive integers. A concept class $\mathcal{L}$ is a family of subsets over a universe $\mathcal{X}$, i.e., $\mathcal{L} \subseteq 2^{\mathcal{X}}$ where $2^{\mathcal{X}}$ denotes the powerset of $\mathcal{X}$. The elements of $\mathcal{L}$ are called concepts. A labeled example is an element of $\mathcal{X} \times \{-,+\}$. We slightly deviate from this notation in Section 7, where our treatment of halfspaces makes it more convenient to use $\{-1,1\}$ instead of $\{-,+\}$, and in Section 8, where we perform Boolean operations on the labels and therefore use $\{0,1\}$ instead of $\{-,+\}$. Elements of $\mathcal{X}$ are called examples. Suppose that $T$ is a set of labeled examples. Let $T^+ = \{x \in \mathcal{X} : (x,+) \in T\}$ and $T^- = \{x \in \mathcal{X} : (x,-) \in T\}$. A set $L \subseteq \mathcal{X}$ is consistent with $T$ if it includes all examples in $T$ that are labeled “+” and excludes all examples in $T$ that are labeled “−”, i.e., if $T^+ \subseteq L$ and $T^- \cap L = \emptyset$. A set of labeled examples that is consistent with $L$ but not with $L'$ is said to distinguish $L$ from $L'$. The classical model of teaching is then defined as follows.

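Definition 1 (Shinohara and Miyano (1991); Goldman and Kearns (1995)) A teaching set for a concept $L \in \mathcal{L}$ w.r.t. $\mathcal{L}$ is a set $T$ of labeled examples such that $L$ is the only concept in $\mathcal{L}$ that is consistent with $T$, i.e., $T$ distinguishes $L$ from any other concept in $\mathcal{L}$. Define $\mathrm{TD}(L,\mathcal{L}) = \inf\{|T| : T \text{ is a teaching set for } L \text{ w.r.t. } \mathcal{L}\}$, i.e., $\mathrm{TD}(L,\mathcal{L})$ is the smallest possible size of a teaching set for $L$ w.r.t. $\mathcal{L}$. If $L$ has no finite teaching set w.r.t. $\mathcal{L}$, then $\mathrm{TD}(L,\mathcal{L}) = \infty$. The number $\mathrm{TD}(\mathcal{L}) = \sup_{L \in \mathcal{L}} \mathrm{TD}(L,\mathcal{L}) \in \mathbb{N}_0 \cup \{\infty\}$ is called the teaching dimension of $\mathcal{L}$.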
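For a finite class over a finite universe, Definition 1 can be evaluated by exhaustive search. The following Python sketch is only an illustration: the function names `consistent` and `teaching_dim` and the toy class of nested sets are our own assumptions and do not appear in the paper.

```python
from itertools import combinations

def consistent(concept, sample):
    """True iff `concept` contains every positive and no negative example of `sample`."""
    return all((x in concept) == label for x, label in sample)

def teaching_dim(target, concepts, universe):
    """Brute-force TD(target, concepts): size of a smallest teaching set, or None."""
    # Only examples labeled according to the target can occur in a teaching set.
    labeled = [(x, x in target) for x in universe]
    for size in range(len(labeled) + 1):
        for sample in combinations(labeled, size):
            if all(c == target or not consistent(c, sample) for c in concepts):
                return size
    return None

# Hypothetical toy class: a chain of nested sets over the universe {1, 2, 3}.
universe = [1, 2, 3]
concepts = [frozenset(s) for s in ([], [1], [1, 2], [1, 2, 3])]
td = {tuple(sorted(c)): teaching_dim(c, concepts, universe) for c in concepts}
td_max = max(td.values())   # teaching dimension TD of the toy class
td_min = min(td.values())   # TD_min of the toy class
```

On this chain the two extreme sets can be taught with a single example, whereas each inner set needs one positive and one negative example.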

For technical reasons, we will occasionally deal with the number $\mathrm{TD}_{\min}(\mathcal{L}) = \inf_{L \in \mathcal{L}} \mathrm{TD}(L,\mathcal{L})$, i.e., the number of examples needed to teach the concept from $\mathcal{L}$ that is easiest to teach. In this paper, we will examine a teaching model in which the teacher and the student do not only agree on a classification-rule system $\mathcal{L}$ but also on a preference relation, denoted as $\prec$, imposed on $\mathcal{L}$. We assume that $\prec$ is a strict partial order on $\mathcal{L}$, i.e., $\prec$ is asymmetric and transitive. The partial order that makes every pair $L \neq L' \in \mathcal{L}$ incomparable is denoted by $\prec_\emptyset$. For every $L \in \mathcal{L}$, let

$$\mathcal{L}_{\prec L} = \{L' \in \mathcal{L} : L' \prec L\} .$$

As already noted above, a teaching set $T$ of $L$ w.r.t. $\mathcal{L}$ distinguishes $L$ from any other concept in $\mathcal{L}$. If a preference relation comes into play, then $T$ will be exempted from the obligation to distinguish $L$ from the concepts in $\mathcal{L}_{\prec L}$ because $L$ is strictly preferred over them anyway.

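Definition 2 A teaching set for $L \subseteq \mathcal{X}$ w.r.t. $(\mathcal{L},\prec)$ is defined as a teaching set for $L$ w.r.t. $\mathcal{L} \setminus \mathcal{L}_{\prec L}$. Furthermore define

$$\mathrm{PBTD}(L,\mathcal{L},\prec) = \inf\{|T| : T \text{ is a teaching set for } L \text{ w.r.t. } (\mathcal{L},\prec)\} \in \mathbb{N}_0 \cup \{\infty\} .$$

The number $\mathrm{PBTD}(\mathcal{L},\prec) = \sup_{L \in \mathcal{L}} \mathrm{PBTD}(L,\mathcal{L},\prec) \in \mathbb{N}_0 \cup \{\infty\}$ is called the teaching dimension of $(\mathcal{L},\prec)$.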

Definition 2 implies that

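$$\mathrm{PBTD}(L,\mathcal{L},\prec) = \mathrm{TD}(L, \mathcal{L} \setminus \mathcal{L}_{\prec L}) . \tag{1}$$

Let $L \mapsto T(L)$ be a mapping that assigns a teaching set for $L$ w.r.t. $(\mathcal{L},\prec)$ to every $L \in \mathcal{L}$. It is obvious from Definition 2 that $T$ must be injective, i.e., $T(L) \neq T(L')$ if $L$ and $L'$ are distinct concepts from $\mathcal{L}$. The classical model of teaching is obtained from the model described in Definition 2 when we plug in the empty preference relation $\prec_\emptyset$ for $\prec$. In particular, $\mathrm{PBTD}(\mathcal{L},\prec_\emptyset) = \mathrm{TD}(\mathcal{L})$.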
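Equation (1) reduces preference-based teaching w.r.t. a fixed preference relation to classical teaching on a smaller class, which is easy to try out on finite classes. The sketch below is illustrative only: the helper names `td` and `pbtd`, the predicate `less_preferred` (with `less_preferred(a, b)` meaning $a \prec b$), and the toy chain class are assumptions of ours.

```python
from itertools import combinations

def consistent(concept, sample):
    return all((x in concept) == label for x, label in sample)

def td(target, concepts, universe):
    """Classical TD(target, concepts) by exhaustive search (finite case only)."""
    labeled = [(x, x in target) for x in universe]
    for size in range(len(labeled) + 1):
        for sample in combinations(labeled, size):
            if all(c == target or not consistent(c, sample) for c in concepts):
                return size
    return None

def pbtd(concepts, universe, less_preferred):
    """PBTD of (concepts, ≺), where less_preferred(a, b) is True iff a ≺ b."""
    worst = 0
    for target in concepts:
        # Equation (1): drop every concept over which the target is strictly preferred.
        rivals = [c for c in concepts if not less_preferred(c, target)]
        worst = max(worst, td(target, rivals, universe))
    return worst

universe = [1, 2, 3]
concepts = [frozenset(s) for s in ([], [1], [1, 2], [1, 2, 3])]

# Empty preference relation: recovers the classical teaching dimension.
classical = pbtd(concepts, universe, lambda a, b: False)
# Prefer smaller sets: a ≺ b iff a is a proper superset of b.
prefer_small = pbtd(concepts, universe, lambda a, b: a > b)
```

With the empty relation the sketch returns the classical teaching dimension of the chain (2), while preferring smaller sets lowers the value to 1, since one positive example then suffices for every non-empty concept.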

We are interested in finding the partial order that is optimal for the purpose of teaching and we aim at determining the corresponding teaching dimension. This motivates the following notion:

Definition 3 The preference-based teaching dimension of $\mathcal{L}$ is given by

$$\mathrm{PBTD}(\mathcal{L}) = \inf\{\mathrm{PBTD}(\mathcal{L},\prec) : \prec \text{ is a strict partial order on } \mathcal{L}\} .$$

A relation $R'$ on $\mathcal{L}$ is said to be an extension of a relation $R$ if $R \subseteq R'$. The order-extension principle states that any partial order has a linear extension (Jech, 1973). The following result (whose second assertion follows from the first one in combination with the order-extension principle) is pretty obvious:

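Lemma 4 1. Suppose that $\prec'$ extends $\prec$. If $T$ is a teaching set for $L$ w.r.t. $(\mathcal{L},\prec)$, then $T$ is a teaching set for $L$ w.r.t. $(\mathcal{L},\prec')$. Moreover $\mathrm{PBTD}(\mathcal{L},\prec') \le \mathrm{PBTD}(\mathcal{L},\prec)$.

2. $\mathrm{PBTD}(\mathcal{L}) = \inf\{\mathrm{PBTD}(\mathcal{L},\prec) : \prec \text{ is a strict linear order on } \mathcal{L}\}$.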

Preference-based teaching with positive examples only. Suppose that $\mathcal{L}$ contains two concepts $L, L'$ such that $L \subset L'$. In the classical teaching model, any teaching set for $L$ w.r.t. $\mathcal{L}$ has to employ a negative example in order to distinguish $L$ from $L'$. Symmetrically, any teaching set for $L'$ w.r.t. $\mathcal{L}$ has to employ a positive example. Thus classical teaching cannot be performed with one type of examples only unless $\mathcal{L}$ is an antichain w.r.t. inclusion. As for preference-based teaching, the restriction to one type of examples is much less severe, as our results below will show.

A teaching set $T$ for $L \in \mathcal{L}$ w.r.t. $(\mathcal{L},\prec)$ is said to be positive if it does not make use of negatively labeled examples, i.e., if $T^- = \emptyset$. In the sequel, we will occasionally identify a positive teaching set $T$ with $T^+$. A positive teaching set for $L$ w.r.t. $(\mathcal{L},\prec)$ can clearly not distinguish $L$ from a proper superset of $L$ in $\mathcal{L}$. Thus, the following holds:

Lemma 5 Suppose that $L \mapsto T^+(L)$ maps each $L \in \mathcal{L}$ to a positive teaching set for $L$ w.r.t. $(\mathcal{L},\prec)$. Then $\prec$ must be an extension of $\supset$ (so that proper subsets of a set $L$ are strictly preferred over $L$) and, for every $L \in \mathcal{L}$, the set $T^+(L)$ must distinguish $L$ from every proper subset of $L$ in $\mathcal{L}$.

Define

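$$\mathrm{PBTD}^+(L,\mathcal{L},\prec) = \inf\{|T| : T \text{ is a positive teaching set for } L \text{ w.r.t. } (\mathcal{L},\prec)\} . \tag{2}$$

The number $\mathrm{PBTD}^+(\mathcal{L},\prec) = \sup_{L \in \mathcal{L}} \mathrm{PBTD}^+(L,\mathcal{L},\prec)$ (possibly $\infty$) is called the positive teaching dimension of $(\mathcal{L},\prec)$. The positive preference-based teaching dimension of $\mathcal{L}$ is then given by

$$\mathrm{PBTD}^+(\mathcal{L}) = \inf\{\mathrm{PBTD}^+(\mathcal{L},\prec) : \prec \text{ is a strict partial order on } \mathcal{L}\} . \tag{3}$$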

Monotonicity. A complexity measure $K$ that assigns a number $K(\mathcal{L}) \in \mathbb{N}_0$ to a concept class $\mathcal{L}$ is said to be monotonic if $\mathcal{L}' \subseteq \mathcal{L}$ implies that $K(\mathcal{L}') \le K(\mathcal{L})$. It is well known (and trivial to see) that TD is monotonic. It is fairly obvious that PBTD is monotonic, too:

Lemma 6 PBTD and $\mathrm{PBTD}^+$ are monotonic.

As an application of monotonicity, we show the following result:

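Lemma 7 For every finite subclass $\mathcal{L}'$ of $\mathcal{L}$, we have $\mathrm{PBTD}(\mathcal{L}) \ge \mathrm{PBTD}(\mathcal{L}') \ge \mathrm{TD}_{\min}(\mathcal{L}')$.

Proof The first inequality holds because PBTD is monotonic. The second inequality follows from the fact that a finite partially ordered set must contain a minimal element. Thus, for any fixed choice of $\prec$, $\mathcal{L}'$ must contain a concept $L'$ such that $\mathcal{L}'_{\prec L'} = \emptyset$. Hence,

$$\mathrm{PBTD}(\mathcal{L}',\prec) \ge \mathrm{PBTD}(L',\mathcal{L}',\prec) \overset{(1)}{=} \mathrm{TD}(L', \mathcal{L}' \setminus \mathcal{L}'_{\prec L'}) = \mathrm{TD}(L',\mathcal{L}') \ge \mathrm{TD}_{\min}(\mathcal{L}') .$$

Since this holds for any choice of $\prec$, we get $\mathrm{PBTD}(\mathcal{L}') \ge \mathrm{TD}_{\min}(\mathcal{L}')$.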

3. Preference-based versus Recursive Teaching

The preference-based teaching dimension is a relative of the recursive teaching dimension. In fact, both notions coincide on finite classes, as we will see shortly. We first recall the definitions of the recursive teaching dimension and of some related notions (Zilles et al., 2011; Mazadi et al., 2014).

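A teaching sequence for $\mathcal{L}$ is a sequence of the form $S = (\mathcal{L}_i, d_i)_{i \ge 1}$ where $\mathcal{L}_1, \mathcal{L}_2, \mathcal{L}_3, \ldots$ form a partition of $\mathcal{L}$ into non-empty sub-classes and, for every $i \ge 1$, we have that

$$d_i = \sup_{L \in \mathcal{L}_i} \mathrm{TD}\left(L, \mathcal{L} \setminus \cup_{j=1}^{i-1} \mathcal{L}_j\right) . \tag{4}$$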

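If, for every $i \ge 1$, $d_i$ is the supremum over all $L \in \mathcal{L}_i$ of the smallest size of a positive teaching set for $L$ w.r.t. $\cup_{j \ge i} \mathcal{L}_j$ (and $d_i = \infty$ if some $L \in \mathcal{L}_i$ does not have a positive teaching set w.r.t. $\cup_{j \ge i} \mathcal{L}_j$), then $S$ is said to be a positive teaching sequence for $\mathcal{L}$. The order of a teaching sequence or a positive teaching sequence $S$ (possibly $\infty$) is defined as $\mathrm{ord}(S) = \sup_{i \ge 1} d_i$. The recursive teaching dimension of $\mathcal{L}$ (possibly $\infty$) is defined as the order of the teaching sequence of lowest order for $\mathcal{L}$. More formally, $\mathrm{RTD}(\mathcal{L}) = \inf_S \mathrm{ord}(S)$ where $S$ ranges over all teaching sequences for $\mathcal{L}$. Similarly, $\mathrm{RTD}^+(\mathcal{L}) = \inf_S \mathrm{ord}(S)$, where $S$ ranges over all positive teaching sequences for $\mathcal{L}$. Note that the following holds for every $\mathcal{L}' \subseteq \mathcal{L}$ and for every teaching sequence $S = (\mathcal{L}_i, d_i)_{i \ge 1}$ for $\mathcal{L}'$ such that $\mathrm{ord}(S) = \mathrm{RTD}(\mathcal{L}')$:

$$\mathrm{RTD}(\mathcal{L}) \ge \mathrm{RTD}(\mathcal{L}') = \mathrm{ord}(S) \ge d_1 = \sup_{L \in \mathcal{L}_1} \mathrm{TD}(L,\mathcal{L}') \ge \mathrm{TD}_{\min}(\mathcal{L}') . \tag{5}$$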

Note an important difference between PBTD and RTD: while $\mathrm{RTD}(\mathcal{L}) \ge \mathrm{TD}_{\min}(\mathcal{L}')$ for all $\mathcal{L}' \subseteq \mathcal{L}$, in general the same holds for PBTD only when restricted to finite $\mathcal{L}'$, cf. Lemma 7. This difference will become evident in the proof of Lemma 10.

The depth of $L \in \mathcal{L}$ w.r.t. a strict partial order $\prec$ imposed on $\mathcal{L}$ is defined as the length of the longest chain in $(\mathcal{L},\prec)$ that ends with the $\prec$-maximal element $L$ (resp. as $\infty$ if there is no bound on the length of these chains). The recursive teaching dimension is related to the preference-based teaching dimension as follows:

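Lemma 8 $\mathrm{RTD}(\mathcal{L}) = \inf_\prec \mathrm{PBTD}(\mathcal{L},\prec)$ and $\mathrm{RTD}^+(\mathcal{L}) = \inf_\prec \mathrm{PBTD}^+(\mathcal{L},\prec)$ where $\prec$ ranges over all strict partial orders on $\mathcal{L}$ that satisfy the following “finite-depth condition”: every $L \in \mathcal{L}$ has a finite depth w.r.t. $\prec$.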

The following is an immediate consequence of Lemma 8 and the trivial observation that the finite-depth condition is always satisfied ifLis finite:

Corollary 9 $\mathrm{PBTD}(\mathcal{L}) \le \mathrm{RTD}(\mathcal{L})$, with equality if $\mathcal{L}$ is finite.
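The equality for finite classes can be observed experimentally on very small examples. The following sketch is a sanity check under stated assumptions, not part of the paper: `rtd` uses the greedy (canonical) teaching sequence that repeatedly removes the concepts that are currently easiest to teach, which is known from the recursive teaching literature to attain the RTD of a finite class, and `pbtd` minimizes over all linear orders as permitted by Lemma 4. All function names and both toy classes are ours.

```python
from itertools import combinations, permutations

def consistent(concept, sample):
    return all((x in concept) == label for x, label in sample)

def td(target, concepts, universe):
    """Classical TD(target, concepts) by exhaustive search."""
    labeled = [(x, x in target) for x in universe]
    for size in range(len(labeled) + 1):
        for sample in combinations(labeled, size):
            if all(c == target or not consistent(c, sample) for c in concepts):
                return size
    return None

def rtd(concepts, universe):
    """Order of the greedy teaching sequence (assumed optimal for finite classes)."""
    remaining, order = list(concepts), 0
    while remaining:
        dims = {c: td(c, remaining, universe) for c in remaining}
        easiest = min(dims.values())
        order = max(order, easiest)
        remaining = [c for c in remaining if dims[c] > easiest]
    return order

def pbtd(concepts, universe):
    """Minimum over all linear orders; order[i] ≺ order[j] whenever i < j."""
    best = None
    for order in permutations(concepts):
        worst = max(td(c, order[i:], universe) for i, c in enumerate(order))
        best = worst if best is None else min(best, worst)
    return best

chain = [frozenset(s) for s in ([], [1], [1, 2], [1, 2, 3])]
cube = [frozenset(s) for s in ([], [1], [2], [1, 2])]
results = [(rtd(chain, [1, 2, 3]), pbtd(chain, [1, 2, 3])),
           (rtd(cube, [1, 2]), pbtd(cube, [1, 2]))]
```

The search over all permutations is exponential and is meant only for classes with a handful of concepts.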

While $\mathrm{PBTD}(\mathcal{L})$ and $\mathrm{RTD}(\mathcal{L})$ refer to the same finite number when $\mathcal{L}$ is finite, there are classes for which the RTD is infinite and yet the PBTD is finite, as Lemma 10 will show.


Proof Let $\mathcal{L}_\infty$ be the family of closed half-intervals over $[0,1)$, i.e., $\mathcal{L}_\infty = \{[0,a] : 0 \le a < 1\}$. We first prove that $\mathrm{PBTD}^+(\mathcal{L}_\infty) = 1$. Consider the preference relation given by $[0,b] \prec [0,a]$ iff $a < b$. Then, for each $0 \le a < 1$, we have

$$\mathrm{PBTD}([0,a],\mathcal{L}_\infty,\prec) \overset{(1)}{=} \mathrm{TD}([0,a], \{[0,b] : 0 \le b \le a\}) = 1$$

because the single example $(a,+)$ suffices for distinguishing $[0,a]$ from any interval $[0,b]$ with $b < a$.

It was already observed by Moran et al. (2015) that $\mathrm{RTD}(\mathcal{L}_\infty) = \infty$ because every teaching set for some $[0,a]$ must contain an infinite sequence of distinct reals that converges from above to $a$. Thus, using Equation (5) with $\mathcal{L}' = \mathcal{L}$, we have $\mathrm{RTD}(\mathcal{L}_\infty) \ge \mathrm{TD}_{\min}(\mathcal{L}_\infty) = \infty$.
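The teacher/learner pair behind the first part of this proof is simple enough to write down. In the sketch below, an interval $[0,a]$ is represented by its right endpoint; the function names `teach` and `learn` are hypothetical, and the learner is assumed to receive a sample that is consistent with at least one interval.

```python
def teach(a):
    """Positive teaching set for the interval [0, a]: the single example (a, +)."""
    return [(a, True)]

def learn(sample):
    """Right endpoint of the most preferred (i.e. shortest) consistent interval."""
    positives = [x for x, label in sample if label]
    # The shortest interval [0, b] containing all positive examples has b = max(positives).
    return max(positives, default=0.0)

taught = learn(teach(0.37))
# Adding further examples that are consistent with [0, 0.5] does not change the hypothesis.
stable = learn(teach(0.5) + [(0.2, True), (0.9, False)])
```

The second call illustrates the validity requirement of Goldman and Mathias discussed in the introduction: extra examples labeled consistently with the target leave the learner's choice unchanged.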

Remark 11 For every $k \ge 1$, there exists an infinite class $\mathcal{L}_k$ such that $\mathrm{PBTD}^+(\mathcal{L}_k) = 1$ and $\mathrm{RTD}(\mathcal{L}_k) = k$; see (Gao et al., 2017, Lemma 6).

4. Preference-based Teaching with Positive Examples Only

The main purpose of this section is to relate positive preference-based teaching to “spanning sets” and “closure operators”, which are well-studied concepts in the computational learning theory literature. Let $\mathcal{L}$ be a concept class over the universe $\mathcal{X}$. We say that $S \subseteq \mathcal{X}$ is a spanning set of $L \in \mathcal{L}$ w.r.t. $\mathcal{L}$ if $S \subseteq L$ and any set in $\mathcal{L}$ that contains $S$ must contain $L$ as well.$^2$ In other words, $L$ is the unique smallest concept in $\mathcal{L}$ that contains $S$. We say that $S \subseteq \mathcal{X}$ is a weak spanning set of $L \in \mathcal{L}$ w.r.t. $\mathcal{L}$ if $S \subseteq L$ and $S$ is not contained in any proper subset of $L$ in $\mathcal{L}$. We denote by $I(\mathcal{L})$ (resp. $I'(\mathcal{L})$) the smallest number $k$ such that every concept $L \in \mathcal{L}$ has a spanning set (resp. a weak spanning set) w.r.t. $\mathcal{L}$ of size at most $k$. Note that $S$ is a spanning set of $L$ w.r.t. $\mathcal{L}$ iff $S$ distinguishes $L$ from all concepts in $\mathcal{L}$ except for supersets of $L$, i.e., iff $S$ is a positive teaching set for $L$ w.r.t. $(\mathcal{L},\supset)$. Similarly, $S$ is a weak spanning set of $L$ w.r.t. $\mathcal{L}$ iff $S$ distinguishes $L$ from all its proper subsets in $\mathcal{L}$ (which is necessarily the case when $S$ is a positive teaching set). These observations can be summarized as follows:

$$I'(\mathcal{L}) \le \mathrm{PBTD}^+(\mathcal{L}) \le \mathrm{PBTD}^+(\mathcal{L},\supset) \le I(\mathcal{L}) . \tag{6}$$

The last two inequalities are straightforward. The inequality $I'(\mathcal{L}) \le \mathrm{PBTD}^+(\mathcal{L})$ follows from Lemma 5, which implies that no concept $L$ can have a preference-based teaching set $T$ smaller than its smallest weak spanning set. Such a set $T$ would be consistent with some proper subset of $L$, which is impossible by Lemma 5.

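Suppose $\mathcal{L}$ is intersection-closed. Then $\cap_{L \in \mathcal{L} : S \subseteq L} L$ is the unique smallest concept in $\mathcal{L}$ containing $S$. If $S \subseteq L'$ is a weak spanning set of $L' \in \mathcal{L}$, then $\cap_{L \in \mathcal{L} : S \subseteq L} L = L'$ because, on the one hand, $\cap_{L \in \mathcal{L} : S \subseteq L} L \subseteq L'$ and, on the other hand, no proper subset of $L'$ in $\mathcal{L}$ contains $S$. Thus the distinction between spanning sets and weak spanning sets is blurred for intersection-closed classes: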

2. This generalizes the classical definition of a spanning set (Helmbold et al., 1990), which is given w.r.t. intersection-closed classes only.


Lemma 12 Suppose that $\mathcal{L}$ is intersection-closed. Then $I'(\mathcal{L}) = \mathrm{PBTD}^+(\mathcal{L}) = I(\mathcal{L})$.

Example 1 Let $\mathcal{R}_d$ denote the class of $d$-dimensional axis-parallel hyper-rectangles (= $d$-dimensional boxes). This class is intersection-closed and clearly $I(\mathcal{R}_d) = 2$. Thus $\mathrm{PBTD}^+(\mathcal{R}_d) = 2$.
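Example 1 can be made concrete as follows. In this sketch a box is represented by its lower and upper corner; the names `teach_box` and `learn_box` are assumptions of ours, and the learner simply returns the smallest box containing all positive examples, which is the most preferred consistent concept when smaller sets are preferred.

```python
def teach_box(lower, upper):
    """Positive teaching set (spanning set) of a box: two opposite corners."""
    return [tuple(lower), tuple(upper)]

def learn_box(points):
    """Smallest axis-parallel box that contains all positive examples."""
    lower = tuple(min(coords) for coords in zip(*points))
    upper = tuple(max(coords) for coords in zip(*points))
    return lower, upper

target = ((0, 1, 2), (3, 4, 5))
learned = learn_box(teach_box(*target))
# An additional positive example from inside the box leaves the hypothesis unchanged.
learned_more = learn_box(teach_box(*target) + [(1, 2, 3)])
```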

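A mapping $\mathrm{cl} : 2^{\mathcal{X}} \to 2^{\mathcal{X}}$ is said to be a closure operator on the universe $\mathcal{X}$ if the following conditions hold for all sets $A, B \subseteq \mathcal{X}$:

$$A \subseteq B \Rightarrow \mathrm{cl}(A) \subseteq \mathrm{cl}(B) \quad \text{and} \quad A \subseteq \mathrm{cl}(A) = \mathrm{cl}(\mathrm{cl}(A)) .$$

The following notions refer to an arbitrary but fixed closure operator. The set $\mathrm{cl}(A)$ is called the closure of $A$. A set $C$ is said to be closed if $\mathrm{cl}(C) = C$. It follows that precisely the sets $\mathrm{cl}(A)$ with $A \subseteq \mathcal{X}$ are closed. With this notation, we observe the following lemma.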

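Lemma 13 Let $\mathcal{C}$ be the set of all closed subsets of $\mathcal{X}$ under some closure operator $\mathrm{cl}$, and let $L \in \mathcal{C}$. If $L = \mathrm{cl}(S)$, then $S$ is a spanning set of $L$ w.r.t. $\mathcal{C}$.

Proof Suppose $L' \in \mathcal{C}$ and $S \subseteq L'$. Then $L = \mathrm{cl}(S) \subseteq \mathrm{cl}(L') = L'$.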

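For every closed set $L \in \mathcal{L}$, let $s_{\mathrm{cl}}(L)$ denote the size (possibly $\infty$) of the smallest set $S \subseteq \mathcal{X}$ such that $\mathrm{cl}(S) = L$. With this notation, we get the following (trivial but useful) result:

Theorem 14 Given a closure operator, let $\mathcal{C}[m]$ be the class of all closed subsets $C \subseteq \mathcal{X}$ with $s_{\mathrm{cl}}(C) \le m$. Then $\mathrm{PBTD}^+(\mathcal{C}[m]) \le \mathrm{PBTD}^+(\mathcal{C}[m],\supset) \le m$. Moreover, this holds with equality provided that $\mathcal{C}[m] \setminus \mathcal{C}[m-1] \neq \emptyset$.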

Proof The inequality $\mathrm{PBTD}^+(\mathcal{C}[m],\supset) \le m$ follows directly from Equation (6) and Lemma 13. Pick a concept $C_0 \in \mathcal{C}[m]$ such that $s_{\mathrm{cl}}(C_0) = m$. Then any subset $S$ of $C_0$ of size less than $m$ spans only a proper subset of $C_0$, i.e., $\mathrm{cl}(S) \subset C_0$. Thus $S$ does not distinguish $C_0$ from $\mathrm{cl}(S)$. However, by Lemma 5, any preference-based learner must strictly prefer $\mathrm{cl}(S)$ over $C_0$. It follows that there is no positive teaching set of size less than $m$ for $C_0$ w.r.t. $\mathcal{C}[m]$.

Many natural classes can be cast as classes of the formC[m]by choosing the universe and the closure operator appropriately; the following examples illustrate the usefulness of Theorem 14 in that regard.

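Example 2 Let

$$\mathrm{LINSET}_k = \{\langle G \rangle : (G \subset \mathbb{N}) \wedge (1 \le |G| \le k)\}$$

where $\langle G \rangle = \left\{ \sum_{g \in G} a(g)\, g : a(g) \in \mathbb{N}_0 \right\}$. In other words, $\mathrm{LINSET}_k$ is the set of all non-empty linear subsets of $\mathbb{N}_0$ that are generated by at most $k$ generators. Note that the mapping $G \mapsto \langle G \rangle$ is a closure operator over the universe $\mathbb{N}_0$. Since obviously $\mathrm{LINSET}_k \setminus \mathrm{LINSET}_{k-1} \neq \emptyset$, we obtain $\mathrm{PBTD}^+(\mathrm{LINSET}_k) = k$.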
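To illustrate Example 2, the sketch below computes a finite window of $\langle G \rangle$ and a minimal set of generators, which can serve as a positive teaching set: the learner answers with $\langle S \rangle$ for the set $S$ of positive examples it has seen. The function names `span` and `minimal_generators` and the cut-off parameter `bound` are our own assumptions; the code works on truncated sets only.

```python
def span(generators, bound):
    """All elements of <G> that are at most `bound` (simple dynamic programming)."""
    reach = [True] + [False] * bound          # 0 is the empty sum
    for n in range(1, bound + 1):
        reach[n] = any(g <= n and reach[n - g] for g in generators)
    return {n for n in range(bound + 1) if reach[n]}

def minimal_generators(generators):
    """Drop every generator that is already a sum of smaller kept generators."""
    kept = []
    for g in sorted(set(generators)):
        if g not in span(kept, g):
            kept.append(g)
    return kept

teaching_set = minimal_generators([4, 6, 10])   # 10 = 4 + 6 is redundant
same_set = span(teaching_set, 40) == span([4, 6, 10], 40)
```

Here the linear set generated by 4, 6 and 10 is taught with the two positive examples 4 and 6.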

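Example 3 Let $\mathcal{X} = \mathbb{R}^2$ and let $\mathcal{C}_k$ be the class of convex polygons with at most $k$ vertices. Defining $\mathrm{cl}(S)$ to be the convex closure of $S$, we obtain $\mathcal{C}[k] = \mathcal{C}_k$ and thus $\mathrm{PBTD}^+(\mathcal{C}_k) = k$.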

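Example 4 Let $\mathcal{X} = \mathbb{R}^n$ and let $\mathcal{C}_k$ be the class of polyhedral cones that can be generated by $k$ (or less) vectors in $\mathbb{R}^n$. If we take $\mathrm{cl}(S)$ to be the conic closure of $S \subseteq \mathbb{R}^n$, then $\mathcal{C}[k] = \mathcal{C}_k$ and thus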

5. A Convenient Technique for Proving Upper Bounds

In this section, we give an alternative definition of the preference-based teaching dimension using the notion of an “admissible mapping”. Given a concept class $\mathcal{L}$ over a universe $\mathcal{X}$, let $T$ be a mapping $L \mapsto T(L) \subseteq \mathcal{X} \times \{-,+\}$ that assigns a set $T(L)$ of labeled examples to every set $L \in \mathcal{L}$ such that the labels in $T(L)$ are consistent with $L$. The order of $T$, denoted as $\mathrm{ord}(T)$, is defined as $\sup_{L \in \mathcal{L}} |T(L)| \in \mathbb{N} \cup \{\infty\}$. Define the mappings $T^+$ and $T^-$ by setting $T^+(L) = \{x : (x,+) \in T(L)\}$ and $T^-(L) = \{x : (x,-) \in T(L)\}$ for every $L \in \mathcal{L}$. We say that $T$ is positive if $T^-(L) = \emptyset$ for every $L \in \mathcal{L}$. In the sequel, we will occasionally identify a positive mapping $L \mapsto T(L)$ with the mapping $L \mapsto T^+(L)$. The symbol “+” as an upper index of $T$ will always indicate that the underlying mapping $T$ is positive.

The following relation will help to clarify under which conditions the sets $(T(L))_{L \in \mathcal{L}}$ are teaching sets w.r.t. a suitably chosen preference relation:

$$R_T = \{(L,L') \in \mathcal{L} \times \mathcal{L} : (L \neq L') \wedge (L \text{ is consistent with } T(L'))\} .$$

The transitive closure of $R_T$ is denoted as $\mathrm{trcl}(R_T)$ in the sequel. The following notion will play an important role in this paper:

Definition 15 A mapping $L \mapsto T(L)$ with $L$ ranging over all concepts in $\mathcal{L}$ is said to be admissible for $\mathcal{L}$ if the following holds:

1. For every $L \in \mathcal{L}$, $L$ is consistent with $T(L)$.

2. The relation $\mathrm{trcl}(R_T)$ is asymmetric (which clearly implies that $R_T$ is asymmetric too).

If $T$ is admissible, then $\mathrm{trcl}(R_T)$ is transitive and asymmetric, i.e., $\mathrm{trcl}(R_T)$ is a strict partial order on $\mathcal{L}$. We will therefore use the notation $\prec_T$ instead of $\mathrm{trcl}(R_T)$ whenever $T$ is known to be admissible.
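For a finite class, admissibility in the sense of Definition 15 can be checked mechanically: build $R_T$, take its transitive closure, and test that no concept is related to itself (which, for a transitive relation, is equivalent to asymmetry). The sketch below does this with a boolean Floyd-Warshall pass; the function name `is_admissible`, the dictionary encoding of $T$, and both example mappings are illustrative assumptions of ours.

```python
def consistent(concept, sample):
    return all((x in concept) == label for x, label in sample)

def is_admissible(concepts, T):
    """Check Definition 15 for a mapping T: concept -> list of (example, label)."""
    n = len(concepts)
    # Condition 1: every concept is consistent with its own sample.
    if not all(consistent(c, T[c]) for c in concepts):
        return False
    # rel[i][j] is True iff (concepts[i], concepts[j]) belongs to R_T.
    rel = [[i != j and consistent(concepts[i], T[concepts[j]]) for j in range(n)]
           for i in range(n)]
    # Transitive closure (boolean Floyd-Warshall).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                rel[i][j] = rel[i][j] or (rel[i][k] and rel[k][j])
    # Condition 2: the closure is asymmetric iff it has no pair (L, L).
    return not any(rel[i][i] for i in range(n))

concepts = [frozenset(s) for s in ([], [1], [1, 2], [1, 2, 3])]
# Positive mapping of order 1: show the largest element of each non-empty set.
T_good = {c: ([(max(c), True)] if c else []) for c in concepts}
# Empty samples everywhere: every concept is consistent with every sample.
T_bad = {c: [] for c in concepts}
good, bad = is_admissible(concepts, T_good), is_admissible(concepts, T_bad)
```

For `T_good` the relation $R_T$ coincides with $\supset$ on the chain, so the induced preference favors smaller sets; for `T_bad` the relation is symmetric and the mapping is rejected.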

Lemma 16 Suppose that $T^+$ is a positive admissible mapping for $\mathcal{L}$. Then the relation $\prec_{T^+}$ on $\mathcal{L}$ extends the relation $\supset$ on $\mathcal{L}$. More precisely, the following holds for all $L, L' \in \mathcal{L}$:

$$L' \subset L \Rightarrow (L,L') \in R_{T^+} \Rightarrow L \prec_{T^+} L' .$$

Proof If $T^+$ is admissible, then $L'$ is consistent with $T^+(L')$. Thus $T^+(L') \subseteq L' \subset L$ so that $L$ is consistent with $T^+(L')$ too. Therefore $(L,L') \in R_{T^+}$, i.e., $L \prec_{T^+} L'$.

The following result clarifies how admissible mappings are related to preference-based teaching:

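Lemma 17 For each concept class $\mathcal{L}$, the following holds:

$$\mathrm{PBTD}(\mathcal{L}) = \inf_T \mathrm{ord}(T) \quad \text{and} \quad \mathrm{PBTD}^+(\mathcal{L}) = \inf_{T^+} \mathrm{ord}(T^+)$$

where $T$ ranges over all mappings that are admissible for $\mathcal{L}$ and $T^+$ ranges over all positive mappings that are admissible for $\mathcal{L}$.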

Proof We restrict ourselves to the proof for $\mathrm{PBTD}(\mathcal{L}) = \inf_T \mathrm{ord}(T)$ because the equation $\mathrm{PBTD}^+(\mathcal{L}) = \inf_{T^+} \mathrm{ord}(T^+)$ can be obtained in a similar fashion. We first prove that $\mathrm{PBTD}(\mathcal{L}) \le \inf_T \mathrm{ord}(T)$. Let $T$ be an admissible mapping for $\mathcal{L}$. It suffices to show that, for every $L \in \mathcal{L}$, $T(L)$ is a teaching set for $L$ w.r.t. $(\mathcal{L},\prec_T)$. Suppose $L' \in \mathcal{L} \setminus \{L\}$ is consistent with $T(L)$. Then $(L',L) \in R_T$ and thus $L' \prec_T L$. It follows that $\prec_T$ prefers $L$ over all concepts $L' \in \mathcal{L} \setminus \{L\}$ that are consistent with $T(L)$. Thus $T(L)$ is a teaching set for $L$ w.r.t. $(\mathcal{L},\prec_T)$, as desired.

We now prove that $\inf_T \mathrm{ord}(T) \le \mathrm{PBTD}(\mathcal{L})$. Let $\prec$ be a strict partial order on $\mathcal{L}$ and let $T$ be a mapping such that, for every $L \in \mathcal{L}$, $T(L)$ is a teaching set for $L$ w.r.t. $(\mathcal{L},\prec)$. It suffices to show that $T$ is admissible for $\mathcal{L}$. Consider a pair $(L',L) \in R_T$. The definition of $R_T$ implies that $L' \neq L$ and that $L'$ is consistent with $T(L)$. Since $T(L)$ is a teaching set w.r.t. $(\mathcal{L},\prec)$, it follows that $L' \prec L$. Thus, $\prec$ is an extension of $R_T$. Since $\prec$ is transitive, it is even an extension of $\mathrm{trcl}(R_T)$. Because $\prec$ is asymmetric, $\mathrm{trcl}(R_T)$ must be asymmetric, too. It follows that $T$ is admissible.

6. Preference-based Teaching of Linear Sets

Some work in computational learning theory (Abe, 1989; Gao et al., 2015; Takada, 1992) is concerned with learning semi-linear sets, i.e., unions of linear subsets of $\mathbb{N}^k$ for some fixed $k \ge 1$, where each linear set consists of exactly those elements that can be written as the sum of some constant vector $c$ and a linear combination of the elements of some fixed set of generators, see Example 2. While semi-linear sets are of common interest in mathematics in general, they play a particularly important role in the theory of formal languages, due to Parikh’s theorem, by which the so-called Parikh vectors of strings in a context-free language always form a semi-linear set (Parikh, 1966).

A recent study (Gao et al., 2015) analyzed computational teaching of classes of linear subsets of $\mathbb{N}$ (where $k = 1$) and some variants thereof, as a substantially simpler yet still interesting special case of semi-linear sets. In this section, we extend that study to preference-based teaching.

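Within the scope of this section, all concept classes are formulated over the universe $\mathcal{X} = \mathbb{N}_0$.

Let $G = \{g_1, \ldots, g_k\}$ be a finite subset of $\mathbb{N}$. We denote by $\langle G \rangle$ resp. by $\langle G \rangle_+$ the following sets:

$$\langle G \rangle = \left\{ \sum_{i=1}^{k} a_i g_i : a_1, \ldots, a_k \in \mathbb{N}_0 \right\} \quad \text{and} \quad \langle G \rangle_+ = \left\{ \sum_{i=1}^{k} a_i g_i : a_1, \ldots, a_k \in \mathbb{N} \right\} .$$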

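We will determine (at least approximately) the preference-based teaching dimension of the following concept classes over $\mathbb{N}_0$:

$$\begin{aligned}
\text{LINSET}_k &= \{\langle G \rangle : (G \subset \mathbb{N}) \wedge (1 \le |G| \le k)\} . \\
\text{CF-LINSET}_k &= \{\langle G \rangle : (G \subset \mathbb{N}) \wedge (1 \le |G| \le k) \wedge (\gcd(G) = 1)\} . \\
\text{NE-LINSET}_k &= \{\langle G \rangle_+ : (G \subset \mathbb{N}) \wedge (1 \le |G| \le k)\} . \\
\text{NE-CF-LINSET}_k &= \{\langle G \rangle_+ : (G \subset \mathbb{N}) \wedge (1 \le |G| \le k) \wedge (\gcd(G) = 1)\} .
\end{aligned}$$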

and García-Sánchez, 2009). A zero coefficient $a_j = 0$ erases $g_j$ in the linear combination $\sum_{i=1}^{k} a_i g_i$. Coefficients from $\mathbb{N}$ are non-erasing in this sense. The letters “NE” in “NE-LINSET” mean “non-erasing”.

The shift-extension $\mathcal{L}'$ of a concept class $\mathcal{L}$ over the universe $\mathbb{N}_0$ is defined as follows:

$$\mathcal{L}' = \{c + L : (c \in \mathbb{N}_0) \wedge (L \in \mathcal{L})\} . \tag{7}$$

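The following bounds on RTD and $\mathrm{RTD}^+$ (for sufficiently large values of $k$)$^4$ are known from (Gao et al., 2015):

$$\begin{array}{l|cc}
 & \mathrm{RTD}^+ & \mathrm{RTD} \\ \hline
\text{LINSET}_k & = \infty & ? \\
\text{CF-LINSET}_k & = k & \in \{k-1, k\} \\
\text{NE-LINSET}'_k & = k+1 & \in \{k-1, k, k+1\}
\end{array}$$

Here $\text{NE-LINSET}'_k$ denotes the shift-extension of $\text{NE-LINSET}_k$.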

The following result shows the corresponding bounds with PBTD in place of RTD:

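Theorem 18 The bounds in the following table are valid:

$$\begin{array}{l|cc}
 & \mathrm{PBTD}^+ & \mathrm{PBTD} \\ \hline
\text{LINSET}_k & = k & \in \{k-1, k\} \\
\text{CF-LINSET}_k & = k & \in \{k-1, k\} \\
\text{NE-LINSET}_k & \in \left[\left\lfloor \frac{k-1}{2} \right\rfloor : k\right] & \in \left[\left\lfloor \frac{k-1}{2} \right\rfloor : k\right] \\
\text{NE-CF-LINSET}_k & \in \left[\left\lfloor \frac{k-1}{2} \right\rfloor : k\right] & \in \left[\left\lfloor \frac{k-1}{2} \right\rfloor : k\right]
\end{array}$$

Moreover

$$\mathrm{PBTD}^+(\mathcal{L}') = k+1 \;\wedge\; \mathrm{PBTD}(\mathcal{L}') \in \{k-1, k, k+1\} \tag{8}$$

holds for all $\mathcal{L} \in \{\text{LINSET}_k, \text{CF-LINSET}_k, \text{NE-LINSET}_k, \text{NE-CF-LINSET}_k\}$.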

Note that the equation $\mathrm{PBTD}^+(\text{LINSET}_k) = k$ was already proven in Example 2, using the fact that $G \mapsto \langle G \rangle$ is a closure operator. Since $G \mapsto \langle G \rangle_+$ is not a closure operator, we give a separate argument to prove an upper bound of $k$ on $\mathrm{PBTD}^+(\text{NE-LINSET}_k)$ (see Lemma 37 in Appendix A). All other upper bounds in Theorem 18 are then easy to derive. The lower bounds in Theorem 18 are much harder to obtain. A complete proof of Theorem 18 will be given in Appendix A.

Remark 19 The lower bound on $\mathrm{PBTD}^+(\text{NE-LINSET}_k)$ and $\mathrm{PBTD}^+(\text{NE-CF-LINSET}_k)$ may be improved to $k-1$; see (Gao et al., 2017, Theorem 2, Appendix A.3).

4. For instance, $\mathrm{RTD}^+(\text{LINSET}_k) = \infty$ holds for all $k \ge 2$ and $\mathrm{RTD}(\text{LINSET}_k) = {?}$ (where “?” means

7. Preference-based Teaching of Halfspaces

In this section, we study preference-based teaching of halfspaces. We will denote the all-zeros vector as $\vec{0}$. The vector with 1 in coordinate $i$ and with 0 in the remaining coordinates is denoted as $\vec{e}_i$. The dimension of the Euclidean space in which these vectors reside will always be clear from the context. The sign of a real number $x$ (with value 1 if $x > 0$, value $-1$ if $x < 0$, and value 0 if $x = 0$) is denoted by $\mathrm{sign}(x)$.

Suppose that $w \in \mathbb{R}^d \setminus \{\vec{0}\}$ and $b \in \mathbb{R}$. The (positive) halfspace induced by $w$ and $b$ is then given by

$$H_{w,b} = \{x \in \mathbb{R}^d : w^\top x + b \ge 0\} .$$

Instead of $H_{w,0}$, we simply write $H_w$. Let $\mathcal{H}_d$ denote the class of $d$-dimensional Euclidean halfspaces:

$$\mathcal{H}_d = \{H_{w,b} : w \in \mathbb{R}^d \setminus \{\vec{0}\} \wedge b \in \mathbb{R}\} .$$

Similarly, $\mathcal{H}^0_d$ denotes the class of $d$-dimensional homogeneous Euclidean halfspaces:

$$\mathcal{H}^0_d = \{H_w : w \in \mathbb{R}^d \setminus \{\vec{0}\}\} .$$

Let $S^{d-1}$ denote the $(d-1)$-dimensional unit sphere in $\mathbb{R}^d$. Moreover $S^{d-1}_+ = \{x \in S^{d-1} : x_d > 0\}$ denotes the “northern hemisphere”. If not stated explicitly otherwise, we will represent homogeneous halfspaces with normalized vectors residing on the unit sphere. We remind the reader of the following well-known fact:

Remark 20 The orthogonal group in dimension $d$ (i.e., the multiplicative group of orthogonal $(d \times d)$-matrices) acts transitively on $S^{d-1}$ and it conserves the inner product.

We now prove a helpful lemma, stating that each vector $w^*$ in the northern hemisphere may serve as a representative for some homogeneous halfspace $H_u$ in the sense that all other elements of $H_u$ in the northern hemisphere have a strictly smaller $d$-th component than $w^*$. This will later help to teach homogeneous halfspaces with a preference that orders vectors by the size of their last coordinate.

Lemma 21 Let $d \ge 2$, let $0 < h \le 1$ and let $R_{d,h} = \{w \in S^{d-1} : w_d = h\}$. With this notation the following holds. For every $w^* \in R_{d,h}$, there exists $u \in \mathbb{R}^d \setminus \{\vec{0}\}$ such that

$$(w^* \in H_u) \wedge (\forall w \in (S^{d-1}_+ \cap H_u) \setminus \{w^*\} : w_d < h) . \tag{9}$$

Proof For $h = 1$, the statement is trivial, since $R_{d,1} = \{\vec{e}_d\}$. So let $h < 1$.

Because of Remark 20, we may assume without loss of generality that the vector $w^* \in R_{d,h}$ equals $(0, \ldots, 0, \sqrt{1-h^2}, h)$. It suffices therefore to show that, with this choice of $w^*$, the vector $u = (0, \ldots, 0, w^*_d, -w^*_{d-1})$ satisfies (9). Note that $w \in H_u$ iff $\langle u, w \rangle = w^*_d w_{d-1} - w^*_{d-1} w_d \ge 0$. Since $\langle u, w^* \rangle = 0$, we have $w^* \in H_u$. Moreover, it follows that

$$S^{d-1}_+ \cap H_u = \left\{ w \in S^{d-1}_+ : \frac{w_{d-1}}{w_d} \ge \frac{w^*_{d-1}}{w^*_d} > 0 \right\} .$$

It is obvious that no vector $w \in S^{d-1}_+ \cap H_u$ can have a $d$-th component $w_d$ exceeding $w^*_d = h$ and that setting $w_d = h = w^*_d$ forces the settings $w_{d-1} = w^*_{d-1} = \sqrt{1-h^2}$ and $w_1 = \ldots = w_{d-2} = 0$. Consequently, (9) is satisfied, which concludes the proof.

With this lemma in hand, we can now prove an upper bound of 2 for the preference-based teaching dimension of the class of homogeneous halfspaces, independent of the underlying dimension $d$.

Theorem 22 $\mathrm{PBTD}(\mathcal{H}^0_1) = \mathrm{TD}(\mathcal{H}^0_1) = 1$ and, for every $d \ge 2$, we have $\mathrm{PBTD}(\mathcal{H}^0_d) \le 2$.

Proof Clearly, $\mathrm{PBTD}(\mathcal{H}^0_1) = \mathrm{TD}(\mathcal{H}^0_1) = 1$ since $\mathcal{H}^0_1$ consists of the two sets $\{x \in \mathbb{R} : x \ge 0\}$ and $\{x \in \mathbb{R} : x \le 0\}$.

Suppose now that $d \ge 2$. Let $w^*$ be the target weight vector (i.e., the weight vector that has to be taught). Under the following conditions, we may assume without loss of generality that $w^*_d \ne 0$:

• For any $0 < s_1 < s_2$, the student prefers any weight vector that ends with $s_2$ zero coordinates over any weight vector that ends with only $s_1$ zero coordinates.

• If the target vector ends with (exactly) $s$ zero coordinates, then the teacher presents only examples ending with (at least) $s$ zero coordinates.

In the sequel, we specify a student and a teacher such that these conditions hold, so that we will consider only target weight vectors $w^*$ with $w^*_d \ne 0$.

The student has the following preference relation:

• Among the weight vectors $w$ with $w_d \ne 0$, the student prefers vectors with larger values of $|w_d|$ over those with smaller values of $|w_d|$.

The teacher will use two examples. The first one is chosen as

$$\begin{cases} (-\vec{e}_d, -) & \text{if } w^*_d > 0 \\ (\vec{e}_d, -) & \text{if } w^*_d < 0 . \end{cases}$$

This example reveals whether the unknown weight vector $w^* \in S^{d-1}$ has a strictly positive or a strictly negative $d$-th component. For reasons of symmetry, we may assume that $w^*_d > 0$. We are now precisely in the situation that is described in Lemma 21. Given $w^*$ and $h = w^*_d$, the teacher picks as a second example $(u, +)$ where $u \in \mathbb{R}^d \setminus \{\vec{0}\}$ has the properties described in the lemma. It follows immediately that the student’s preferences will make her choose the weight vector $w^*$.
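The two-example protocol can be simulated on a finite pool of hypotheses. The following is a minimal sketch under stated assumptions, not the authors' implementation: the pool is a finite set of rational points of $S^2$ standing in for the whole sphere, the second example uses the assumed closed form $u = h \cdot w^* - \mathrm{sign}(w^*_d) \cdot \vec{e}_d$ with $h = |w^*_d|$ (any $u$ with the properties of Lemma 21 would do), and the names `teach`, `learn` and `sphere_point` are hypothetical. Labels are encoded as $+1$ and $-1$.

```python
from fractions import Fraction
from itertools import product

def sphere_point(t):
    """Exact rational point of S^{d-1} via inverse stereographic projection."""
    n2 = sum(x * x for x in t)
    return tuple(2 * x / (n2 + 1) for x in t) + ((n2 - 1) / (n2 + 1),)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def teach(w_star):
    """Teacher: two labeled examples for a unit vector w_star with w_star[-1] != 0."""
    d = len(w_star)
    s = 1 if w_star[-1] > 0 else -1
    h = abs(w_star[-1])
    e_d = (0,) * (d - 1) + (1,)
    first = (tuple(-s * x for x in e_d), -1)          # fixes the sign of w_d
    if h == 1:                                        # w_star = s * e_d
        u = tuple(s * x for x in e_d)
    else:                                             # assumed closed form for Lemma 21
        u = tuple(h * a - s * b for a, b in zip(w_star, e_d))
    return [first, (u, +1)]

def consistent(w, examples):
    """H_w labels x positively iff <w, x> >= 0."""
    return all((dot(w, x) >= 0) == (label > 0) for x, label in examples)

def learn(examples, hypotheses):
    """Student: among consistent hypotheses, return the one with the largest |w_d|."""
    ok = [w for w in hypotheses if consistent(w, examples)]
    return max(ok, key=lambda w: abs(w[-1]))

grid = [Fraction(k, 2) for k in range(-6, 7)]
hypotheses = [sphere_point(t) for t in product(grid, repeat=2)]
targets = [w for w in hypotheses if w[-1] != 0]
print(all(learn(teach(w), hypotheses) == w for w in targets))  # True
```

Targets with $w^*_d = 0$ are skipped here because the proof handles them through the preference for trailing zero coordinates, which this sketch does not model.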

The upper bound of 2 given in Theorem 22 is tight, as is stated in the following lemma.

Lemma 23 For every $d \ge 2$, we have $\mathrm{PBTD}(\mathcal{H}^0_d) \ge 2$.

Proof We verify this lemma via Lemma 7, by providing a finite subclass $\mathcal{F}$ of $\mathcal{H}^0_2$ such that $\mathrm{TD}_{\min}(\mathcal{F}) = 2$. Let $\mathcal{F} = \{H_w : \vec{0} \ne w \in \{-1, 0, 1\}^2\}$. It is easy to verify that each of the 8 halfspaces in $\mathcal{F}$ has a teaching dimension of 2 with respect to $\mathcal{F}$. This example can be extended to higher dimensions in the obvious way.
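The claim about $\mathcal{F}$ can be confirmed by brute force. The sketch below rests on one assumption that should be stated explicitly: the label that $H_w$ assigns to $x$ depends only on the sign of $\langle w, x \rangle$, so it is constant on each cell of the arrangement of the four lines $x_1 = 0$, $x_2 = 0$, $x_1 = x_2$, $x_1 = -x_2$. It therefore suffices to test one representative instance per cell (the origin, 8 rays, 8 open sectors). The names `INSTANCES`, `label` and `teaching_dim` are illustrative.

```python
from itertools import combinations, product

# The 8 weight vectors defining the class F.
F = [w for w in product((-1, 0, 1), repeat=2) if w != (0, 0)]

# One representative per cell of the arrangement: origin, 8 rays, 8 open sectors.
INSTANCES = [(0, 0),
             (1, 0), (2, 1), (1, 1), (1, 2), (0, 1), (-1, 2), (-1, 1), (-2, 1),
             (-1, 0), (-2, -1), (-1, -1), (-1, -2), (0, -1), (1, -2), (1, -1), (2, -1)]

def label(w, x):
    """True iff x belongs to the homogeneous halfspace H_w."""
    return w[0] * x[0] + w[1] * x[1] >= 0

def teaching_dim(w, concepts, instances):
    """Size of a smallest instance set whose labels under w rule out all other concepts."""
    others = [v for v in concepts if v != w]
    for k in range(len(instances) + 1):
        for xs in combinations(instances, k):
            if all(any(label(v, x) != label(w, x) for x in xs) for v in others):
                return k
    raise ValueError("no teaching set among the given instances")

print([teaching_dim(w, F, INSTANCES) for w in F])  # [2, 2, 2, 2, 2, 2, 2, 2]
```

For instance, $H_{(1,0)}$ is taught by the positive examples $(1, 2)$ and $(1, -2)$, while no single example isolates it.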


Corollary 24 For every $d \ge 2$, we have $\mathrm{PBTD}(\mathcal{H}^0_d) = 2$.

By contrast, we will show next that the recursive teaching dimension of the class of homogeneous halfspaces grows with the dimensionality.

Theorem 25 For any $d \ge 2$, $\mathrm{TD}(\mathcal{H}^0_d) = \mathrm{RTD}(\mathcal{H}^0_d) = d + 1$.

Proof Assume by normalization that the target weight vector has norm 1, i.e., it is taken from $S^{d-1}$. Remark 20 implies that all weight vectors in $S^{d-1}$ are equally hard to teach. It suffices therefore to show that $\mathrm{TD}(H_{\vec{e}_1}, \mathcal{H}^0_d) = d + 1$. We first show that $\mathrm{TD}(H_{\vec{e}_1}, \mathcal{H}^0_d) \le d + 1$. Define $u = -\sum_{i=2}^{d} \vec{e}_i$. We claim that $T = \{(\vec{e}_i, +) : 2 \le i \le d\} \cup \{(u, +), (\vec{e}_1, +)\}$ is a teaching set for $H_{\vec{e}_1}$ w.r.t. $\mathcal{H}^0_d$. Consider any $w \in S^{d-1}$ such that $H_w$ is consistent with $T$. Note that $w_i = \langle \vec{e}_i, w \rangle \ge 0$ for all $i \in \{2, \ldots, d\}$ and $\langle u, w \rangle = -\sum_{i=2}^{d} w_i \ge 0$ together imply that $w_i = 0$ for all $i \in \{2, \ldots, d\}$ and therefore $w = \pm\vec{e}_1$. Furthermore, $w_1 = \langle w, \vec{e}_1 \rangle \ge 0$, and so $w = \vec{e}_1$, as required.

Now we show that $\mathrm{TD}(H_{\vec{e}_1}, \mathcal{H}^0_d) \ge d + 1$ holds for all $d \ge 2$. It is easy to see that two examples do not suffice for distinguishing $\vec{e}_1 \in \mathbb{R}^2$ from all weight vectors in $S^1$. In other words, $\mathrm{TD}(H_{\vec{e}_1}, \mathcal{H}^0_2) \ge 3$. Suppose now that $d \ge 3$. It is furthermore easy to see that a teaching set $T$ which distinguishes $\vec{e}_1$ from all weight vectors in $S^{d-1}$ must contain at least one positive example $u$ that is orthogonal to $\vec{e}_1$. The inequality $\mathrm{TD}(H_{\vec{e}_1}, \mathcal{H}^0_d) \ge d + 1$ is now obtained inductively because the example $(u, +) \in T$ leaves open a problem that is not easier than teaching $\vec{e}_1$ w.r.t. the $(d-2)$-dimensional sphere $\{x \in S^{d-1} : x \perp u\}$.

We have thus established that the class of homogeneous halfspaces has a recursive teaching dimension growing linearly with $d$, while its preference-based teaching dimension is constant. In the case of general (i.e., not necessarily homogeneous) $d$-dimensional halfspaces, the difference between RTD and PBTD is even more extreme. On the one hand, by generalizing the proof of Lemma 10, it is easy to see that $\mathrm{RTD}(\mathcal{H}_d) = \infty$ for all $d \ge 1$. On the other hand, we will show in the remainder of this section that $\mathrm{PBTD}(\mathcal{H}_d) \le 6$, independent of the value of $d$.

We will assume in the sequel (by way of normalization) that an inhomogeneous halfspace has a bias $b \in \{\pm 1\}$. We start with the following result:

Lemma 26 Let $w^* \in \mathbb{R}^d$ be a vector with a non-trivial $d$-th component $w^*_d \ne 0$ and let $b^* \in \{\pm 1\}$ be a bias. Then there exist three examples labeled according to $H_{w^*, b^*}$ such that the following holds. Every weight-bias pair $(w, b)$ consistent with these examples satisfies $b = b^*$, $\mathrm{sign}(w_d) = \mathrm{sign}(w^*_d)$ and

$$\begin{cases} |w_d| \ge |w^*_d| & \text{if } b^* = -1 \\ |w_d| \le |w^*_d| & \text{if } b^* = +1 \end{cases} . \tag{10}$$

Proof Within the proof, we use the label “1” instead of “+” and the label “−1” instead of “−”. The pair $(w, b)$ denotes the student’s hypothesis for the target weight-bias pair $(w^*, b^*)$. The examples shown to the student will involve the unknown quantities $w^*$ and $b^*$. Each example will lead to a new constraint on $w$ and $b$. We will see that the collection of these constraints reveals the required information. We proceed in three stages:


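2. The next example is chosen as $\vec{a}_2 = -\frac{2b^*}{w^*_d} \cdot \vec{e}_d$ and labeled “$-b^*$”. Note that $\langle w^*, \vec{a}_2 \rangle + b^* = -b^*$. We obtain the following new constraint:

$$\langle w, \vec{a}_2 \rangle + b = \begin{cases} -\frac{2 w_d}{w^*_d} + \overbrace{b}^{\in \{0,1\}} < 0 & \text{if } b^* = 1 \\ +\frac{2 w_d}{w^*_d} + \underbrace{b}_{= -1} \ge 0 & \text{if } b^* = -1 . \end{cases}$$

The pair $(w, b)$ with $b = b^*$ if $b^* = -1$ and $b \in \{0, 1\}$ if $b^* = 1$ can satisfy the above constraint only if the sign of $w_d$ equals the sign of $w^*_d$.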

3. The third example is chosen as the example $\vec{a}_3 = -\frac{b^*}{w^*_d} \cdot \vec{e}_d$ with label “1”. Note that $\langle w^*, \vec{a}_3 \rangle + b^* = 0$. We obtain the following new constraint:

$$\langle w, \vec{a}_3 \rangle + b = -\frac{b^* w_d}{w^*_d} + b \ge 0 .$$

Given that $w$ is already constrained to weight vectors satisfying $\mathrm{sign}(w_d) = \mathrm{sign}(w^*_d)$, we can safely replace $w_d / w^*_d$ by $|w_d| / |w^*_d|$. This yields $|w_d| / |w^*_d| \le b$ if $b^* = 1$ and $|w_d| / |w^*_d| \ge -b$ if $b^* = -1$. Since $b$ is already constrained as described in stage 1 above, we obtain $|w_d| / |w^*_d| \le b \in \{0, 1\}$ if $b^* = 1$ and $|w_d| / |w^*_d| \ge -b = 1$ if $b^* = -1$. The weight-bias pair $(w, b)$ satisfies these constraints only if $b = b^*$ and if (10) is valid.

The assertion of the lemma is immediate from this discussion.
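The effect of stages 2 and 3 can be replayed on a grid of candidate values. The sketch below takes the outcome of stage 1 exactly as it is summarised in the proof ($b = b^*$ if $b^* = -1$, and $b \in \{0, 1\}$ if $b^* = 1$) and does not model the first example itself. Since $\vec{a}_2$ and $\vec{a}_3$ are multiples of $\vec{e}_d$, only the component $w_d$ matters; the function name `survivors` and the grid `GRID` are hypothetical.

```python
from fractions import Fraction

def survivors(wd_star, b_star, wd_grid):
    """Pairs (w_d, b) that pass the stage-2 and stage-3 constraints."""
    # Stage-1 outcome as summarised in the proof (taken as given here).
    biases = [-1] if b_star == -1 else [0, 1]
    a2 = -2 * b_star / wd_star      # d-th coordinate of the example a_2
    a3 = -b_star / wd_star          # d-th coordinate of the example a_3
    out = []
    for wd in wd_grid:
        for b in biases:
            stage2 = (wd * a2 + b >= 0) == (-b_star == 1)   # a_2 carries label -b*
            stage3 = wd * a3 + b >= 0                       # a_3 carries label 1
            if stage2 and stage3:
                out.append((wd, b))
    return out

GRID = [Fraction(k, 4) for k in range(-16, 17)]

for wd_star in (Fraction(3, 2), Fraction(-3, 2)):
    for b_star in (-1, 1):
        res = survivors(wd_star, b_star, GRID)
        same_bias = all(b == b_star for _, b in res)
        same_sign = all(wd * wd_star > 0 for wd, _ in res)
        print(wd_star, b_star, same_bias, same_sign, len(res))
```

In each of the four cases the surviving pairs have the correct bias and sign, and their $|w_d|$ lies on the side of $|w^*_d|$ prescribed by (10).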

Theorem 27 $\mathrm{PBTD}(\mathcal{H}_d) \le 6$.

Proof As in the proof of Lemma 26, we use the label “1” instead of “+” and the label “−1” instead of “−”. As in the proof of Theorem 22, we may assume without loss of generality that the target weight vector $w^* \in \mathbb{R}^d$ satisfies $w^*_d \ne 0$. The proof will proceed in stages. On the way, we specify six rules which determine the preference relation of the student.

Stage 1 is concerned with teaching homogeneous halfspaces given by $w^*$ (and $b^* = 0$). The student respects the following rules:

Rule 1: She prefers any pair $(w, 0)$ over any pair $(w', b)$ with $b \ne 0$. In other words, any homogeneous halfspace is preferred over any non-homogeneous halfspace.

Rule 2: Among homogeneous halfspaces, her preferences are the same as the ones that were used within the proof of Theorem 22 for teaching homogeneous halfspaces.

Thus, if $b^* = 0$, then we can simply apply the teaching protocol for homogeneous halfspaces. In this case, $w^*$ can be taught at the expense of only two examples.

Stage 1 reduces the problem to teaching inhomogeneous halfspaces given by $(w^*, b^*)$ with $b^* \ne 0$. We assume, by way of normalization, that $b^* \in \{\pm 1\}$, but note that $w^*$ can now not be assumed to be of unit (or any other fixed) length.

In stage 2, the teacher presents three examples in accordance with Lemma 26. It follows that the student will take into consideration only weight-bias pairs $(w, b)$ such that the constraints $b = b^*$, $\mathrm{sign}(w_d) = \mathrm{sign}(w^*_d)$ and (10) are satisfied.

Rule 3: Among the pairs $(w, b)$ such that $w_d \ne 0$ and $b \in \{\pm 1\}$, the student’s preferences are as follows. If $b = -1$ (resp. $b = 1$), then she prefers vectors $w$ with a smaller (resp. larger) value of $|w_d|$ over those with a larger (resp. smaller) value of $|w_d|$.

Thanks to Lemma 26 and thanks to Rule 3, we may from now on assume that $b = b^*$ and $w_d = w^*_d$. In the sequel, let $w^*$ be decomposed according to $w^* = (\vec{w}^*_{d-1}, w^*_d) \in \mathbb{R}^{d-1} \times \mathbb{R}$. We think of $\vec{w}_{d-1}$ as the student’s hypothesis for $\vec{w}^*_{d-1}$.

Stage 3 is concerned with the special case where $\vec{w}^*_{d-1} = \vec{0}$. The student will automatically set $\vec{w}_{d-1} = \vec{0}$ if we add the following to the student’s rule system:

Rule 4: Given that the values for $w_d$ and $b$ have been fixed already (and are distinct from 0), the student prefers weight-bias pairs with $\vec{w}_{d-1} = \vec{0}$ over any weight-bias pair with $\vec{w}_{d-1} \ne \vec{0}$.

Stage 3 reduces the problem to teaching $(w^*, b^*)$ with fixed non-zero values for $w_d$ and $b^*$ (known to the student) and with $\vec{w}^*_{d-1} \ne \vec{0}$. Thus, essentially, only $\vec{w}^*_{d-1}$ has still to be taught. In the next stage, we will argue that the problem of teaching $\vec{w}^*_{d-1}$ is equivalent to teaching a homogeneous halfspace.

In stage 4, the teacher will present only examples $a$ such that $a_d = -\frac{b^*}{w^*_d}$ so that the contribution of the $d$-th component to the inner product of $w^*$ and $a$ cancels with the bias $b^*$. Given this commitment for $a_d$, the first $d-1$ components of the examples can be chosen so as to teach the homogeneous halfspace $H_{\vec{w}^*_{d-1}}$. According to Theorem 22, this can be achieved at the expense of two more examples. Of course the student’s preferences must match with the preferences that were used in the proof of this theorem:

Rule 5: Suppose that the values of $w_d$ and $b$ have been fixed already (and are distinct from 0) and suppose that $\vec{w}_{d-1} \ne \vec{0}$. Then the preferences for the choice of $\vec{w}_{d-1}$ match with the preferences that were used in the protocol for teaching homogeneous halfspaces.

After stage 4, the student takes into consideration only weight-bias pairs $(w, b)$ such that $w_d = w^*_d$, $b = b^*$ and $H_{\vec{w}_{d-1}} = H_{\vec{w}^*_{d-1}}$. However, since we had normalized the bias and not the weight vector, this does not necessarily mean that $\vec{w}_{d-1} = \vec{w}^*_{d-1}$. On the other hand, the two weight vectors already coincide modulo a positive scaling factor, say

$$\vec{w}_{d-1} = s \cdot \vec{w}^*_{d-1} \text{ for some } s > 0 . \tag{11}$$

In order to complete the proof, it suffices to teach the $L_1$-norm of $\vec{w}^*_{d-1}$ to the student (because (11) and $\|\vec{w}_{d-1}\|_1 = \|\vec{w}^*_{d-1}\|_1$ imply that $\vec{w}_{d-1} = \vec{w}^*_{d-1}$). The next (and final) stage serves precisely this purpose.

As for stage 5, we first fix some notation. For $i = 1, \ldots, d-1$, let $\beta_i = \mathrm{sign}(w^*_i)$. Note that (11) implies that $\beta_i = \mathrm{sign}(w_i)$. Let $L = \|\vec{w}^*_{d-1}\|_1$ denote the $L_1$-norm of $\vec{w}^*_{d-1}$. The final example is chosen as $\vec{a}_6 = (\beta_1, \ldots, \beta_{d-1}, -(L + b^*)/w^*_d)$ and labeled “1”. Note that

$$\langle w^*, \vec{a}_6 \rangle + b^* = |w^*_1| + \ldots + |w^*_{d-1}| - L = 0 .$$

Given that $\beta_i = \mathrm{sign}(w_i)$, $w_d = w^*_d$ and $b = b^*$, the student can derive from $\vec{a}_6$ and its label the following constraint on $\vec{w}_{d-1}$:

$$\langle w, \vec{a}_6 \rangle + b = |w_1| + \ldots + |w_{d-1}| - L \ge 0 .$$

Rule 6: Suppose that the values of $w_d$ and $b$ have been fixed already (and are distinct from 0) and suppose that $H_{\vec{w}_{d-1}}$ has already been fixed. Then, among the vectors representing $H_{\vec{w}_{d-1}}$, the ones with a smaller $L_1$-norm are preferred over the ones with a larger $L_1$-norm.

An inspection of the five stages reveals that at most six examples altogether were shown to the student (three in stage 2, two in stage 4, and one in stage 5). This completes the proof of the theorem.
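Stage 5 can be illustrated in isolation. The sketch below assumes that stages 1 to 4 have already fixed $w_d = w^*_d$, $b = b^*$ and the direction of $\vec{w}_{d-1}$, so that the remaining hypotheses differ only in the scaling factor $s$ from (11). It builds the example $\vec{a}_6$, keeps the scalings consistent with its label, and applies Rule 6. The names `final_example` and `stage5`, the target vector and the list of scalings are hypothetical.

```python
from fractions import Fraction

def sign(x):
    return (x > 0) - (x < 0)

def final_example(w_star, b_star):
    """a_6 = (beta_1, ..., beta_{d-1}, -(L + b*) / w*_d); it is labeled 1."""
    head = w_star[:-1]
    L = sum(abs(x) for x in head)                     # L1-norm of the first d-1 coordinates
    return tuple(sign(x) for x in head) + (-(L + b_star) / w_star[-1],)

def stage5(w_star, b_star, scales):
    """Keep the scalings consistent with (a_6, 1), then prefer the smallest L1-norm (Rule 6)."""
    a6 = final_example(w_star, b_star)
    cands = [tuple(s * x for x in w_star[:-1]) + (w_star[-1],) for s in scales]
    ok = [w for w in cands if sum(x * y for x, y in zip(w, a6)) + b_star >= 0]
    return min(ok, key=lambda w: sum(abs(x) for x in w[:-1]))

w_star = (Fraction(2), Fraction(-3), Fraction(1, 2))  # illustrative target, d = 3
b_star = -1
scales = [Fraction(k, 4) for k in range(1, 13)]       # candidate scaling factors s

print(final_example(w_star, b_star))
print(stage5(w_star, b_star, scales) == w_star)       # True
```

Here the constraint reads $5s - 5 \ge 0$, so exactly the scalings $s \ge 1$ survive, and the preference for a small $L_1$-norm selects $s = 1$.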

Note that Theorems 22 and 27 remain valid when we allow $w$ to be the all-zero vector, which extends $\mathcal{H}^0_d$ by $\{\mathbb{R}^d\}$ and $\mathcal{H}_d$ by $\{\mathbb{R}^d, \emptyset\}$. $\mathbb{R}^d$ will be taught with a single positive example, and $\emptyset$ with a single negative example. The student will give the highest preference to $\mathbb{R}^d$, the second highest to $\emptyset$, and among the remaining halfspaces, the student’s preferences stay the same.

8. Classes with PBTD or PBTD$^+$ Equal to One

In this section, we will give complete characterizations of (i) the concept classes with a positive preference-based teaching dimension of 1, and (ii) the concept classes with a preference-based teaching dimension of 1. Throughout this section, we use the label “1” to indicate positive examples and the label “0” to indicate negative examples.

Let $I$ be a (possibly infinite) index set. We will consider a mapping $A : I \times I \to \{0, 1\}$ as a binary matrix $A \in \{0,1\}^{I \times I}$. $A$ is said to be lower-triangular if there exists a linear ordering $\prec$ on $I$ such that $A(i, i') = 0$ for every pair $(i, i')$ such that $i \prec i'$.

We will occasionally identify a set $L \subseteq \mathcal{X}$ with its indicator function by setting $L(x) = \mathbb{1}[x \in L]$.

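For each $M \subseteq \mathcal{X}$, we define

$$M \oplus L = (L \setminus M) \cup (M \setminus L)$$

and

$$M \oplus \mathcal{L} = \{M \oplus L : L \in \mathcal{L}\} .$$

For $T \subseteq \mathcal{X} \times \{0, 1\}$, we define similarly

$$M \oplus T = \{(x, \bar{y}) : (x, y) \in T \text{ and } x \in M\} \cup \{(x, y) \in T : x \notin M\} .$$

Moreover, given $M \subseteq \mathcal{X}$ and a linear ordering $\prec$ on $\mathcal{L}$, we define a linear ordering $\prec_M$ on $M \oplus \mathcal{L}$ as follows:

$$M \oplus L' \prec_M M \oplus L \iff \underbrace{M \oplus (M \oplus L')}_{= L'} \prec \underbrace{M \oplus (M \oplus L)}_{= L} .$$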

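Lemma 28 With this notation, the following holds. If the mapping $\mathcal{L} \ni L \mapsto T(L) \subseteq \mathcal{X} \times \{0, 1\}$ assigns a teaching set to $L$ w.r.t. $(\mathcal{L}, \prec)$, then the mapping $M \oplus \mathcal{L} \ni M \oplus L \mapsto M \oplus T(L) \subseteq \mathcal{X} \times \{0, 1\}$ assigns a teaching set to $M \oplus L$ w.r.t. $(M \oplus \mathcal{L}, \prec_M)$.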

Since this result is rather obvious, we skip its proof.
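For finite classes the statement of Lemma 28 is easy to replay in code. The following sketch is an illustration on a toy class (the singletons over $\{0, 1, 2\}$ together with $\emptyset$, with $\emptyset$ most preferred); the function names and the representation of a linear ordering as a list sorted from least to most preferred are assumptions of this example.

```python
def flip_concept(M, L):
    """M (+) L: symmetric difference of M and L."""
    return frozenset(L ^ M)

def flip_examples(M, T):
    """M (+) T: invert the label of every example whose instance lies in M."""
    return frozenset((x, 1 - y) if x in M else (x, y) for x, y in T)

def consistent(L, T):
    return all((x in L) == (y == 1) for x, y in T)

def learner(order, T):
    """Most preferred concept consistent with T; `order` lists concepts from
    least to most preferred."""
    ok = [L for L in order if consistent(L, T)]
    return ok[-1] if ok else None

fs = frozenset
order = [fs({0}), fs({1}), fs({2}), fs()]             # the empty set is most preferred
T = {fs(): fs(), fs({0}): fs({(0, 1)}), fs({1}): fs({(1, 1)}), fs({2}): fs({(2, 1)})}

M = fs({0, 1})
flipped_order = [flip_concept(M, L) for L in order]   # the ordering induced by M
for L in order:
    taught = learner(flipped_order, flip_examples(M, T[L]))
    print(sorted(flip_concept(M, L)), taught == flip_concept(M, L))
```

Each line reports that the flipped teaching set $M \oplus T(L)$ makes the learner return $M \oplus L$ under the induced ordering.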

We say that $\mathcal{L}$ and $\mathcal{L}'$ are equivalent if $\mathcal{L}' = M \oplus \mathcal{L}$ for some $M \subseteq \mathcal{X}$ (and this clearly is an equivalence relation). As an immediate consequence of Lemma 28, we obtain the following result:

The following lemma provides a necessary condition for a concept class to have a preference-based teaching dimension of one.

Lemma 30 Suppose that $\mathcal{L} \subseteq 2^{\mathcal{X}}$ is a concept class of PBTD 1. Pick a linear ordering $\prec$ on $\mathcal{L}$ and a mapping $\mathcal{L} \ni L \mapsto (x_L, y_L) \in \mathcal{X} \times \{0, 1\}$ such that, for every $L \in \mathcal{L}$, $\{(x_L, y_L)\}$ is a teaching set for $L$ w.r.t. $(\mathcal{L}, \prec)$. Then

• either every instance $x \in \mathcal{X}$ occurs at most once in $(x_L)_{L \in \mathcal{L}}$

• or there exists a concept $L^* \in \mathcal{L}$ that is preferred over all other concepts in $\mathcal{L}$ and $x_{L^*}$ is the only instance from $\mathcal{X}$ that occurs twice in $(x_L)_{L \in \mathcal{L}}$.

Proof Since the mapping $T$ must be injective, no instance can occur twice in $(x_L)_{L \in \mathcal{L}}$ with the same label. Suppose that there exists an instance $x \in \mathcal{X}$ and concepts $L \prec L^*$ such that $x = x_L = x_{L^*}$ and, w.l.o.g., $y_L = 1$ and $y_{L^*} = 0$. Since $\{(x, 1)\}$ is a teaching set for $L$ w.r.t. $(\mathcal{L}, \prec)$, every concept $L' \succ L$ (including the ones that are preferred over $L^*$) must satisfy $L'(x) = 0$. For analogous reasons, every concept $L' \succ L^*$ (if any) must satisfy $L'(x) = 1$. A concept $L' \in \mathcal{L}$ that is preferred over $L^*$ would have to satisfy $L'(x) = 0$ and $L'(x) = 1$, which is impossible. It follows that there can be no concept that is preferred over $L^*$.

The following result is a consequence of Lemmas 28 and 30.

Theorem 31 If $\mathrm{PBTD}(\mathcal{L}) = 1$, then there exists a concept class $\mathcal{L}'$ that is equivalent to $\mathcal{L}$ and satisfies $\mathrm{PBTD}(\mathcal{L}') = \mathrm{PBTD}^+(\mathcal{L}') = 1$.

Proof Pick a linear ordering $\prec$ on $\mathcal{L}$ and, for every $L \in \mathcal{L}$, a pair $(x_L, y_L) \in \mathcal{X} \times \{0, 1\}$ such that $T(L) = \{(x_L, y_L)\}$ is a teaching set for $L$ w.r.t. $(\mathcal{L}, \prec)$.

Case 1: Every instance $x \in \mathcal{X}$ occurs at most once in $(x_L)_{L \in \mathcal{L}}$.

Then choose $M = \{x_L : y_L = 0\}$ and apply Lemma 28.

Case 2: There exists a concept $L^* \in \mathcal{L}$ that is preferred over all other concepts in $\mathcal{L}$ and $x_{L^*}$ is the only instance from $\mathcal{X}$ that occurs twice in $(x_L)_{L \in \mathcal{L}}$.

Then choose $M = \{x_L : y_L = 0 \wedge L \ne L^*\}$ and apply Lemma 28. With this choice, we obtain $M \oplus T(L) = \{(x_L, 1)\}$ for every $L \in \mathcal{L} \setminus \{L^*\}$. Since $L^*$ is preferred over all other concepts in $\mathcal{L}$, we may teach $L^*$ w.r.t. $(\mathcal{L}, \prec)$ by the empty set (instead of employing a possibly 0-labeled example).

The discussion shows that there is a class $\mathcal{L}'$ that is equivalent to $\mathcal{L}$ and can be taught in the preference-based model with positive teaching sets of size 1 (or size 0 in case of $L^*$).

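We now have the tools required for characterizing the concept classes whose positive PBTD equals 1.

Theorem 32 $\mathrm{PBTD}^+(\mathcal{L}) = 1$ if and only if there exists a mapping $\mathcal{L} \ni L \mapsto x_L \in \mathcal{X}$ such that the matrix $A \in \{0,1\}^{(\mathcal{L} \setminus \{\emptyset\}) \times (\mathcal{L} \setminus \{\emptyset\})}$ given by $A(L, L') = L'(x_L)$ is lower-triangular.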

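Proof Suppose first that $\mathrm{PBTD}^+(\mathcal{L}) = 1$. Pick a linear ordering $\prec$ on $\mathcal{L}$ and, for every $L \in \mathcal{L} \setminus \{\emptyset\}$, pick $x_L \in \mathcal{X}$ such that $\{x_L\}$ is a positive teaching set for $L$ w.r.t. $(\mathcal{L}, \prec)$.$^5$ If $L \prec L'$ (so that $L'$ is preferred over $L$), we must have $L'(x_L) = 0$. It follows that the matrix $A$, as specified in the theorem, is lower-triangular.

Suppose conversely that there exists a mapping $\mathcal{L} \ni L \mapsto x_L \in \mathcal{X}$ such that the matrix $A \in \{0,1\}^{(\mathcal{L} \setminus \{\emptyset\}) \times (\mathcal{L} \setminus \{\emptyset\})}$ given by $A(L, L') = L'(x_L)$ is lower-triangular, say w.r.t. the linear ordering $\prec$ on $\mathcal{L} \setminus \{\emptyset\}$. Then, for every $L \in \mathcal{L} \setminus \{\emptyset\}$, the singleton $\{x_L\}$ is a positive teaching set for $L$ w.r.t. $(\mathcal{L}, \prec)$ because it distinguishes $L$ from $\emptyset$ (of course) and also from every concept $L' \in \mathcal{L} \setminus \{\emptyset\}$ such that $L' \succ L$. If $\emptyset \in \mathcal{L}$, then extend the linear ordering $\prec$ by preferring $\emptyset$ over every other concept from $\mathcal{L}$ (so that $\emptyset$ is a positive teaching set for $\emptyset$ w.r.t. $(\mathcal{L}, \prec)$).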
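For a finite class, the criterion of Theorem 32 can be tested mechanically. The sketch below makes two assumptions beyond the text: it only tries mappings with $x_L \in L$ (as is needed for $\{x_L\}$ to be a positive teaching set for $L$), and it decides lower-triangularity by observing that $A(L, L') = 1$ with $L \ne L'$ forces $L' \prec L$, so that a suitable linear ordering exists iff the resulting precedence graph is acyclic. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import product

def triangular_order(concepts, x_of):
    """Return an ordering (ascending w.r.t. the ordering on concepts) that makes
    A(L, L') = [x_L in L'] lower-triangular, or None if no such ordering exists."""
    # Every L' != L containing x_L must come before L.
    preds = {L: {Lp for Lp in concepts if Lp != L and x_of[L] in Lp} for L in concepts}
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None

def has_triangular_witness(concepts):
    """Search all mappings L -> x_L with x_L in L (brute force, small classes only)."""
    nonempty = [L for L in concepts if L]
    for choice in product(*[sorted(L) for L in nonempty]):
        if triangular_order(nonempty, dict(zip(nonempty, choice))) is not None:
            return True
    return False

fs = frozenset
print(has_triangular_witness([fs(), fs({0}), fs({1}), fs({2})]))      # True
print(has_triangular_witness([fs({0}), fs({0, 1}), fs({0, 1, 2})]))   # True
print(has_triangular_witness([fs({0}), fs({1}), fs({0, 1})]))         # False
```

In the last class, either choice of $x_{\{0,1\}}$ creates a cycle with one of the singletons, so no single positive example can teach $\{0, 1\}$.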

In view of Theorem 31, Theorem 32 characterizes every class L with PBTD(L) = 1 up to equivalence.

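Let $\mathrm{Sg}(\mathcal{X}) = \{\{x\} : x \in \mathcal{X}\}$ denote the class of singletons over $\mathcal{X}$ and suppose that $\mathrm{Sg}(\mathcal{X})$ is a sub-class of $\mathcal{L}$ and $\mathrm{PBTD}(\mathcal{L}) = 1$. We will show that only fairly trivial extensions of $\mathrm{Sg}(\mathcal{X})$ with a preference-based dimension of 1 are possible.

Lemma 33 Let $\mathcal{L} \subseteq 2^{\mathcal{X}}$ be a concept class of PBTD 1 that contains $\mathrm{Sg}(\mathcal{X})$. Let $T$ be an admissible mapping for $\mathcal{L}$ that assigns a labeled example $(x_L, y_L) \in \mathcal{X} \times \{0, 1\}$ to each $L \in \mathcal{L}$. For $b = 0, 1$, let $\mathcal{L}^b = \{L \in \mathcal{L} : y_L = b\}$. Similarly, let $\mathcal{X}^b = \{x \in \mathcal{X} : \{x\} \in \mathcal{L}^b\}$. With this notation, the following holds:

1. If $L \in \mathcal{L}^1$ and $L \subset L' \in \mathcal{L}$, then $L' \in \mathcal{L}^1$.

2. If $L' \in \mathcal{L}^0$ and $L' \supset L \in \mathcal{L}$, then $L \in \mathcal{L}^0$.

3. $|\mathcal{X}^0| \le 2$. Moreover if $|\mathcal{X}^0| = 2$, then there exist $q \ne q' \in \mathcal{X}$ such that $\mathcal{X}^0 = \{q, q'\}$ and $x_{\{q\}} = q'$.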

Proof Recall that $R_T = \{(L, L') \in \mathcal{L} \times \mathcal{L} : (L \ne L') \wedge (L \text{ is consistent with } T(L'))\}$ and that $R_T$ (and even the transitive closure of $R_T$) is asymmetric if $T$ is admissible.

1. If $L \in \mathcal{L}^1$ and $L \subset L'$, then $y_L = 1$ so that $L'$ is consistent with the example $(x_L, y_L)$. It follows that $(L', L) \in R_T$. $L' \in \mathcal{L}^0$ would similarly imply that $(L, L') \in R_T$ so that $R_T$ would not be asymmetric. This is in contradiction with the admissibility of $T$.

2. The second assertion in the lemma is a logically equivalent reformulation of the first assertion.

3. Suppose for the sake of contradiction that $\mathcal{X}^0$ contains three distinct points, say $q_1, q_2, q_3$. Since, for $i = 1, 2, 3$, $T$ assigns a 0-labeled example to $\{q_i\}$, at least one of the remaining two points is consistent with $T(\{q_i\})$. Let $G$ be the digraph with the nodes $q_1, q_2, q_3$ and with an edge from $q_j$ to $q_i$ iff $\{q_j\}$ is consistent with $T(\{q_i\})$. Then each of the three nodes has an indegree of at least 1. Digraphs of this form must contain a cycle so that $\mathrm{trcl}(R_T)$ is not asymmetric. This is in contradiction with the admissibility of $T$.

5. Such anxLalways exists, even if∅is a teaching set forL, because every superset of a teaching set forLthat is still


A similar argument holds if $\mathcal{X}^0$ contains only two distinct elements, say $q$ and $q'$. If neither $x_{\{q\}} = q'$ nor $x_{\{q'\}} = q$, then $(\{q'\}, \{q\}) \in R_T$ and $(\{q\}, \{q'\}) \in R_T$ so that $R_T$ is not asymmetric, again a contradiction to the admissibility of $T$.

We are now in the position to characterize those classes of PBTD one that contain all singletons.

Theorem 34 Suppose that $\mathcal{L} \subseteq 2^{\mathcal{X}}$ is a concept class that contains $\mathrm{Sg}(\mathcal{X})$. Then $\mathrm{PBTD}(\mathcal{L}) = 1$ if and only if the following holds. Either $\mathcal{L}$ coincides with $\mathrm{Sg}(\mathcal{X})$ or $\mathcal{L}$ contains precisely one additional concept, which is either the empty set or a set of size 2.

Proof We start with proving “⇐”. It is well known that $\mathrm{PBTD}^+(\mathcal{L}) = 1$ for $\mathcal{L} = \mathrm{Sg}(\mathcal{X}) \cup \{\emptyset\}$: prefer $\emptyset$ over any singleton set, set $T(\emptyset) = \emptyset$ and, for every $x \in \mathcal{X}$, set $T(\{x\}) = \{(x, 1)\}$. In a similar fashion, we can show that $\mathrm{PBTD}(\mathcal{L}) = 1$ for $\mathcal{L} = \mathrm{Sg}(\mathcal{X}) \cup \{\{q, q'\}\}$ for any choice of $q \ne q' \in \mathcal{X}$. Prefer $\{q, q'\}$ over $\{q\}$ and $\{q'\}$, respectively. Furthermore, prefer $\{q\}$ and $\{q'\}$ over all other singletons. Finally, set $T(\{q, q'\}) = \emptyset$, $T(\{q\}) = \{(q', 0)\}$, $T(\{q'\}) = \{(q, 0)\}$ and, for every $x \in \mathcal{X} \setminus \{q, q'\}$, set $T(\{x\}) = \{(x, 1)\}$.
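The construction for $\mathrm{Sg}(\mathcal{X}) \cup \{\{q, q'\}\}$ can be checked on a small finite universe. In the sketch below, preferences are encoded by an integer rank (higher means more preferred) instead of a full linear ordering; this is an assumption of the example that is harmless here because ties among the remaining singletons never have to be broken. The names `build` and `student` are hypothetical.

```python
def build(X, q, qp):
    """Ranks (higher = more preferred) and teaching sets for Sg(X) plus {q, q'}."""
    pair = frozenset({q, qp})
    rank = {pair: 2}
    T = {pair: frozenset()}
    for x in X:
        s = frozenset({x})
        rank[s] = 1 if x in pair else 0
        T[s] = frozenset({(x, 1)})
    T[frozenset({q})] = frozenset({(qp, 0)})   # {q} is taught by a negative example
    T[frozenset({qp})] = frozenset({(q, 0)})   # and so is {q'}
    return rank, T

def student(rank, examples):
    """Return the unique most preferred consistent concept, or None if it is not unique."""
    ok = [L for L in rank if all((x in L) == (y == 1) for x, y in examples)]
    top = max(rank[L] for L in ok)
    best = [L for L in ok if rank[L] == top]
    return best[0] if len(best) == 1 else None

rank, T = build(range(5), 0, 1)
print(all(student(rank, T[L]) == L for L in rank))  # True
```

Every teaching set has size at most 1, and the student recovers each of the six concepts over $\mathcal{X} = \{0, \ldots, 4\}$.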

As for the proof of “⇒”, we make use of the notions $T, x_L, y_L, \mathcal{L}^0, \mathcal{L}^1, \mathcal{X}^0, \mathcal{X}^1$ that had been introduced in Lemma 33 and we proceed by case analysis.

Case 1: $\mathcal{X}^0 = \emptyset$.

Since $\mathcal{X}^0 = \emptyset$, we have $\mathcal{X} = \mathcal{X}^1$. In combination with the first assertion in Lemma 33, it follows that $\mathcal{L} \setminus \{\emptyset\} = \mathcal{L}^1$. We claim that no concept in $\mathcal{L}$ contains two distinct elements. Assume for the sake of contradiction that there is a concept $L \in \mathcal{L}$ such that $|L| \ge 2$. It follows that, for every $q \in L$, $x_{\{q\}} = q$ and $y_{\{q\}} = 1$ so that $(L, \{q\}) \in R_T$. Moreover, there exists $q' \in L$ such that $x_L = q'$ and $y_L = 1$. It follows that $(\{q'\}, L) \in R_T$, which contradicts the fact that $R_T$ is asymmetric.

Case 2: $\mathcal{X}^0 = \{q\}$ for some $q \in \mathcal{X}$.

Set $q' = x_{\{q\}}$ and note that $y_{\{q\}} = 0$. Moreover, since $\mathcal{X}^1 = \mathcal{X} \setminus \{q\}$, we have $x_{\{p\}} = p$ and $y_{\{p\}} = 1$ for every $p \in \mathcal{X} \setminus \{q\}$. We claim that $\mathcal{L}$ cannot contain a concept $L$ of size at least 2 that contains an element of $\mathcal{X} \setminus \{q, q'\}$. Assume for the sake of contradiction, that there is a set $L$ such that $|L| \ge 2$ and $p \in L$ for some $p \in \mathcal{X} \setminus \{q, q'\}$. The first assertion in Lemma 33 implies that $y_L = 1$ (because $y_{\{p\}} = 1$ and $\{p\} \subseteq L$). Since all pairs $(x, 1)$ with $x \ne q$ are already in use for teaching the corresponding singletons, we may conclude that $q \in L$ and $T(L) = \{(q, 1)\}$. This contradicts the fact that $\mathrm{trcl}(R_T)$ is asymmetric, because our discussion implies that $(L, \{p\}), (\{p\}, \{q\}), (\{q\}, L) \in R_T$. We may therefore safely assume that there is no concept of size at least 2 in $\mathcal{L}$ that has a non-empty intersection with $\mathcal{X} \setminus \{q, q'\}$. Thus, except for the singletons, the only remaining sets that possibly belong to $\mathcal{L}$ are $\emptyset$ and $\{q, q'\}$. We still have to show that not both of them can belong to $\mathcal{L}$. Assume for the


is already in use for teaching $\{q'\}$. It is therefore necessary to set $T(L) = \{(q, 1)\}$. An inspection of the various teaching sets shows that $(\emptyset, \{q\}), (\{q\}, L), (L, \{q'\}), (\{q'\}, \emptyset) \in R_T$, which contradicts the fact that $\mathrm{trcl}(R_T)$ is asymmetric.

Case 3: $\mathcal{X}^0 = \{q, q'\}$ for some $q \ne q' \in \mathcal{X}$.

Note first that $y_{\{q\}} = y_{\{q'\}} = 0$ and $y_{\{p\}} = 1$ for every $p \in \mathcal{X} \setminus \{q, q'\}$. We claim that $\emptyset \notin \mathcal{L}$. Assume for the sake of contradiction that $\emptyset \in \mathcal{L}$. Then $(\emptyset, \{q\}), (\emptyset, \{q'\}) \in R_T$ since $\emptyset$ is consistent with the teaching sets for instances from $\mathcal{X}^0$. But then, no matter how $x$ in $T(\emptyset) = \{(x, 0)\}$ is chosen, at least one of the sets $\{q\}$ and $\{q'\}$ will be consistent with $T(\emptyset)$ so that at least one of the pairs $(\{q\}, \emptyset)$ and $(\{q'\}, \emptyset)$ belongs to $R_T$. This contradicts the fact that $R_T$ must be asymmetric. Thus $\emptyset \notin \mathcal{L}$, indeed. Now it suffices to show that $\mathcal{L}$ cannot contain a concept of size at least 2 that contains an element of $\mathcal{X} \setminus \{q, q'\}$. Assume for the sake of contradiction that there is a set $L \in \mathcal{L}$ such that $|L| \ge 2$ and $p \in L$ for some $p \in \mathcal{X} \setminus \{q, q'\}$. Observe that $(L, \{p\}) \in R_T$. Another application of the first assertion in Lemma 33 shows that $y_L = 1$ (because $y_{\{p\}} = 1$ and $p \in L$) and $x_L \in \{q, q'\}$ (because the other 1-labeled instances are already in use for teaching the corresponding singletons). It follows that one of the pairs $(\{q\}, L)$ and $(\{q'\}, L)$ belongs to $R_T$. The third assertion of Lemma 33 implies that $T(\{q\}) = \{(q', 0)\}$ or $T(\{q'\}) = \{(q, 0)\}$. For reasons of symmetry, we may assume that $T(\{q\}) = \{(q', 0)\}$. This implies that $(\{p\}, \{q\}) \in R_T$. Let $q''$ be given by $T(\{q'\}) = \{(q'', 0)\}$. Note that either $q'' = q$ or $q'' \in \mathcal{X} \setminus \{q, q'\}$. In the former case, we have that $(\{p\}, \{q'\}) \in R_T$ and in the latter case we have that $(\{q\}, \{q'\}) \in R_T$. Since $(\{p\}, \{q\}) \in R_T$ (which was observed above already), we conclude that in both cases, $(\{p\}, \{q\}), (\{p\}, \{q'\}) \in \mathrm{trcl}(R_T)$. Combining this with our observations above that $(L, \{p\}) \in R_T$ and that one of the pairs $(\{q\}, L)$ and $(\{q'\}, L)$ belongs to $R_T$, yields a contradiction to the fact that $\mathrm{trcl}(R_T)$ is asymmetric.

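Corollary 35 Let $\mathcal{L} \subseteq 2^{\mathcal{X}}$ be a concept class that contains $\mathrm{Sg}(\mathcal{X})$. If $\mathrm{PBTD}(\mathcal{L}) = 1$, then $\mathrm{RTD}(\mathcal{L}) = 1$.

Proof According to Theorem 34, either $\mathcal{L}$ coincides with $\mathrm{Sg}(\mathcal{X})$ or $\mathcal{L}$ contains precisely one additional concept that is $\emptyset$ or a set of size 2. The partial ordering $\prec$ on $\mathcal{L}$ that is used in the first part of the proof of Theorem 34 (proof direction “⇐”) is easily compiled into a recursive teaching plan of order 1 for $\mathcal{L}$.$^6$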

The characterizations proven above can be applied to certain geometric concept classes. Consider a class $\mathcal{L}$, consisting of bounded and topologically closed objects in the $d$-dimensional Euclidean space, that satisfies the following condition: for every pair of points $A, B \in \mathbb{R}^d$, there is exactly one object $L \in \mathcal{L}$, denoted as $L_{A,B}$ in the sequel, such that $A, B \in L$ and such that $\|A - B\|$ coincides with the diameter of $L$. This assumption implies that $|\mathcal{L} \setminus \mathrm{Sg}(\mathbb{R}^d)| = \infty$. By setting $A = B$, it furthermore implies $\mathrm{Sg}(\mathbb{R}^d) \subseteq \mathcal{L}$. Let us prefer objects with a smaller diameter over objects with a larger diameter. Then, obviously, $\{A, B\}$ is a positive teaching set for $L_{A,B}$. Because of $|\mathcal{L} \setminus \mathrm{Sg}(\mathbb{R}^d)| = \infty$, $\mathcal{L}$ does clearly not satisfy the condition in Theorem 34, which is necessary for $\mathcal{L}$ to have a PBTD of 1. We may therefore conclude that $\mathrm{PBTD}(\mathcal{L}) = \mathrm{PBTD}^+(\mathcal{L}) = 2$.

The family of classes with the required properties is rich and includes, for instance, the class of d-dimensional balls as well as the class ofd-dimensional axis-parallel rectangles.

9. Conclusions

Preference-based teaching uses the natural notion of preference relation to extend the classical teaching model. The resulting model (i) is more powerful than the classical one, (ii) resolves difficulties with the recursive teaching model in the case of infinite concept classes, and (iii) is at the same time free of coding tricks even according to the definition by Goldman and Mathias (1996). Our examples of algebraic and geometric concept classes demonstrate that preference-based teaching can be achieved very efficiently with naturally defined teaching sets and based on intuitive preference relations such as inclusion. We believe that further studies of the PBTD will provide insights into structural properties of concept classes that render them easy or hard to learn in a variety of formal learning models.

We have shown that spanning sets lead to a general-purpose construction for preference-based teaching sets of only positive examples. While this result is fairly obvious, it provides further justification of the model of preference-based teaching, since the teaching sets it yields are often intuitively exactly those a teacher would choose in the classroom (for instance, one would represent convex polygons by their vertices, as in Example 3). It should be noted, too, that it can sometimes be difficult to establish whether the upper bound on PBTD obtained this way is tight, or whether the use of negative examples or preference relations other than inclusion yield smaller teaching sets. Generally, the choice of preference relation provides a degree of freedom that increases the power of the teacher but also increases the difficulty of establishing lower bounds on the number of examples required for teaching.

Acknowledgements. Sandra Zilles was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), in the Discovery Grant and Canada Research Chairs programs. We thank the anonymous referees for their numerous thoughtful comments, which greatly helped to improve the presentation of the paper.

References

N. Abe. Polynomial learnability of semilinear sets. In Proceedings of the 2nd Annual Conference on Learning Theory (COLT), pages 25–40, 1989.

D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45(2):117–135, 1980.

X. Chen, Y. Cheng, and B. Tang. On the recursive teaching dimension of VC classes. In Advances in Neural Information Processing Systems 29 (NIPS 2016), pages 2164–2171, 2016.

T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension, and sample compression. Journal of Machine Learning Research, 15:3107–3131, 2014.

Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. In Proceedings of the 29th Conference on Learning Theory (COLT), pages 971–997, 2016.

Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. ArXiv e-prints, February 2017. URL https://arxiv.org/pdf/1702.02047.pdf.

S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20–31, 1995.

S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52:255–267, 1996.

D. Helmbold, R. Sloan, and M. K. Warmuth. Learning nested differences of intersection-closed concept classes. Machine Learning, pages 165–196, 1990.

T. J. Jech. The Axiom of Choice. North-Holland Pub. Co., Amsterdam, 1973.

Z. Mazadi, Z. Gao, and S. Zilles. Distinguishing pattern languages with membership examples. In Proceedings of the 8th International Conference on Language and Automata Theory and Applications (LATA), pages 528–540, 2014.

S. Moran, A. Shpilka, A. Wigderson, and A. Yehudayoff. Compressing and teaching for low VC-dimension. In Proceedings of the 56th Annual Symposium on the Foundations of Computer Science (FOCS), pages 40–51, 2015.

R. J. Parikh. On context-free languages. Journal of the ACM, 13(4):570–581, 1966.

J. G. Rosales and P. A. García-Sánchez. Numerical Semigroups. Springer, 2009.

A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4):337–347, 1991.

H. U. Simon and S. Zilles. Open problem: Recursive teaching dimension versus VC dimension. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), pages 1770–1772, 2015.

Y. Takada. Learning semilinear sets from examples and via queries.Theoretical Computer Science, 104(2):207–233, 1992.

S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349–384, 2011.

Appendix A. Proof of Theorem 18


A.1 The Shift Lemma

In this section, we assume that $\mathcal{L}$ is a concept class over a universe $\mathcal{X} \in \{\mathbb{N}_0, \mathbb{Q}^+_0, \mathbb{R}^+_0\}$. We furthermore assume that 0 is contained in every concept $L \in \mathcal{L}$. We can extend $\mathcal{L}$ to a larger class, namely the shift-extension $\mathcal{L}'$ of $\mathcal{L}$, by allowing each of its concepts to be shifted by some constant which is taken from $\mathcal{X}$:

$$\mathcal{L}' = \{c + L : (c \in \mathcal{X}) \wedge (L \in \mathcal{L})\} .$$

The next result states that this extension has only little effect on the complexity measures PBTD and PBTD$^+$:

Lemma 36 (Shift Lemma) With the above notation and assumptions, the following holds:

$$\mathrm{PBTD}(\mathcal{L}) \le \mathrm{PBTD}(\mathcal{L}') \le 1 + \mathrm{PBTD}(\mathcal{L}) \quad \text{and} \quad \mathrm{PBTD}^+(\mathcal{L}) \le \mathrm{PBTD}^+(\mathcal{L}') \le 1 + \mathrm{PBTD}^+(\mathcal{L}) .$$

Proof It suffices to verify the inequalities $\mathrm{PBTD}(\mathcal{L}') \le 1 + \mathrm{PBTD}(\mathcal{L})$ and $\mathrm{PBTD}^+(\mathcal{L}') \le 1 + \mathrm{PBTD}^+(\mathcal{L})$ because the other inequalities hold by virtue of monotonicity. Let $T$ be an admissible mapping for $\mathcal{L}$. It suffices to show that $T$ can be transformed into an admissible mapping $T'$ for $\mathcal{L}'$ such that $\mathrm{ord}(T') \le 1 + \mathrm{ord}(T)$ and such that $T'$ is positive provided that $T$ is positive. To this end, we define $T'$ as follows:

$$T'(c + L) = \{(c, +)\} \cup \{(c + x, b) : (x, b) \in T(L)\} .$$

Obviously $\mathrm{ord}(T') \le 1 + \mathrm{ord}(T)$. Note that $c \in c + L$ because of our assumption that 0 is contained in every concept in $\mathcal{L}$. Moreover, since the admissibility of $T$ implies that $L$ is consistent with $T(L)$, the above definition of $T'(c + L)$ makes sure that $c + L$ is consistent with $T'(c + L)$. It suffices therefore to show that the relation $\mathrm{trcl}(R_{T'})$ is asymmetric. Consider a pair $(c' + L', c + L) \in R_{T'}$. By the definition of $R_{T'}$, it follows that $c' + L'$ is consistent with $T'(c + L)$. Because of $(c, +) \in T'(c + L)$, we must have $c' \le c$. Suppose that $c' = c$. In this case, $L'$ must be consistent with $T(L)$. Thus $L' \prec_T L$. This reasoning implies that $(c' + L', c + L) \in R_{T'}$ can happen only if either $c' < c$ or $(c' = c) \wedge (L' \prec_T L)$. Since $\prec_T$ is asymmetric, we may now conclude that $\mathrm{trcl}(R_{T'})$ is asymmetric, as desired. Finally note that, according to our definition above, the mapping $T'$ is positive provided that $T$ is positive. This concludes the proof.

A.2 The Upper Bounds in Theorem 18

We remind the reader that the equality $\mathrm{PBTD}^+(\mathrm{LINSET}_k) = k$ was stated in Example 2. We will show in Lemma 37 that $\mathrm{PBTD}^+(\text{NE-LINSET}_k) \le k$. In combination with the Shift Lemma, this implies that $\mathrm{PBTD}^+(\mathrm{LINSET}'_k) \le k + 1$ and $\mathrm{PBTD}^+(\text{NE-LINSET}'_k) \le k + 1$. All remaining upper bounds in Theorem 18 follow now by virtue of monotonicity.

Lemma 37 $\mathrm{PBTD}^+(\text{NE-LINSET}_k) \le k$.

Proof We want to show that there is a preference relation for which $k$ positive examples suffice to teach any concept in $\text{NE-LINSET}_k$. To this end, let $G = \{g_1, \ldots, g_\ell\}$ be a generator set with $\ell \le k$ where $g_1 < \ldots < g_\ell$. We use $\mathrm{sum}(G) = g_1 + \ldots + g_\ell$ to denote the sum of all generators in $G$. We say that $g_i$ is a redundant generator in $G$ if $g_i \in \langle \{g_1, \ldots, g_{i-1}\} \rangle$. Let $G^* = \{g^*_1, \ldots, g^*_{\ell^*}\}$ with $g^*_1 < \ldots < g^*_{\ell^*}$ be the set of non-redundant generators in $G$ and let $\mathrm{tuple}(G) = (g^*_1, \ldots, g^*_{\ell^*})$ be the corresponding ordered sequence. Then $G^*$ is an independent subset of $G$ generating the same linear set as $G$ when allowing zero coefficients, i.e., we have $\langle G^* \rangle = \langle G \rangle$ (although $\langle G^* \rangle_+ \ne \langle G \rangle_+$ whenever $G^*$ is a proper subset of $G$).

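To define a suitable preference relation, let $G, \hat{G}$ be generator sets of size $k$ or less with $\mathrm{tuple}(G) = (g^*_1, \ldots, g^*_{\ell^*})$ and $\mathrm{tuple}(\hat{G}) = (\hat{g}^*_1, \ldots, \hat{g}^*_{\hat{\ell}^*})$. Let the student prefer $G$ over $\hat{G}$ if any of the following conditions is satisfied:

Condition 1: $\mathrm{sum}(G) > \mathrm{sum}(\hat{G})$.

Condition 2: $\mathrm{sum}(G) = \mathrm{sum}(\hat{G})$ and $\mathrm{tuple}(G)$ is lexicographically greater than $\mathrm{tuple}(\hat{G})$ without having $\mathrm{tuple}(\hat{G})$ as prefix.

Condition 3: $\mathrm{sum}(G) = \mathrm{sum}(\hat{G})$ and $\mathrm{tuple}(G)$ is a proper prefix of $\mathrm{tuple}(\hat{G})$.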

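To teach a concept $\langle G \rangle_+ \in \text{NE-LINSET}_k$ with $\mathrm{sum}(G) = g$ and $\mathrm{tuple}(G) = (g^*_1, \ldots, g^*_{\ell^*})$, one uses the teaching set

$$S = \{(g, +), (g + g^*_1, +), \ldots, (g + g^*_h, +)\}$$

where

$$h = \begin{cases} \ell^* - 1 & \text{if } G^* = G \\ \ell^* & \text{if } G^* \subset G . \end{cases} \tag{12}$$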
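The objects used in this construction are easy to compute for small generator sets. The sketch below is an illustration under the assumption that generators are positive integers and that a generator set is given as a Python set; the function names (`in_span`, `tuple_of`, `prefers`, `teaching_set`) are hypothetical. Membership in $\langle \cdot \rangle$ is decided by a simple dynamic program, `prefers` encodes Conditions 1 to 3, and `teaching_set` returns the instances of $S$ (all labeled “+”) with $h$ chosen as in (12).

```python
def in_span(x, gens):
    """True iff x is a sum of elements of gens with non-negative integer coefficients."""
    reach = [True] + [False] * x
    for v in range(1, x + 1):
        reach[v] = any(g <= v and reach[v - g] for g in gens)
    return reach[x]

def tuple_of(G):
    """tuple(G): the non-redundant generators of G in increasing order."""
    out = []
    for g in sorted(G):
        if not in_span(g, out):          # g is not generated by smaller generators
            out.append(g)
    return tuple(out)

def prefers(G, H):
    """True iff the student prefers G over H according to Conditions 1-3."""
    sG, sH, tG, tH = sum(G), sum(H), tuple_of(G), tuple_of(H)
    if sG != sH:
        return sG > sH                                   # Condition 1
    if len(tG) < len(tH) and tG == tH[:len(tG)]:
        return True                                      # Condition 3: proper prefix
    is_prefix = tH == tG[:len(tH)]
    return tG > tH and not is_prefix                     # Condition 2

def teaching_set(G):
    """Instances of S; every one of them is a positive example."""
    g, t = sum(G), tuple_of(G)
    h = len(t) - 1 if len(t) == len(G) else len(t)       # equation (12)
    return [g] + [g + gi for gi in t[:h]]

print(tuple_of({2, 3, 4}), teaching_set({2, 3, 4}))      # (2, 3) [9, 11, 12]
print(teaching_set({2, 3}))                              # [5, 7]
```

For $G = \{2, 3, 4\}$ the generator 4 is redundant, so $h = \ell^* = 2$ and $S$ has three examples; for $G = \{2, 3\}$ we get $h = \ell^* - 1 = 1$ and two examples. In both cases $|S| \le |G|$.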

Note that $S$ contains at most $|G| \le k$ examples. Let $\hat{G}$ with $\langle \hat{G} \rangle_+ \in \text{NE-LINSET}_k$ denote the generator set that is returned by the student. Clearly $\langle \hat{G} \rangle_+$ satisfies $\mathrm{sum}(\hat{G}) = g$ since

• concepts with larger generator sums are inconsistent with $(g, +)$, and

• concepts with smaller generator sums have a lower preference (compare with Condition 1 above).

It follows that $g + g^*_i \in \langle \hat{G} \rangle_+$ is equivalent to $g^*_i \in \langle \hat{G} \rangle = \langle \hat{G}^* \rangle$. We conclude that the smallest generator in $\mathrm{tuple}(\hat{G})$ equals $g^*_1$ since

• a smallest generator in $\mathrm{tuple}(\hat{G})$ that is greater than $g^*_1$ would cause an inconsistency with $(g + g^*_1, +)$, and

• a smallest generator in $\mathrm{tuple}(\hat{G})$ that is smaller than $g^*_1$ would have a lower preference (compare with Condition 2 above).

Assume inductively that the $i - 1$ smallest generators in $\mathrm{tuple}(\hat{G})$ are $g^*_1, \ldots, g^*_{i-1}$. Since $g^*_i \notin \langle \{g^*_1, \ldots, g^*_{i-1}\} \rangle$, we may apply a reasoning that is similar to the above reasoning concerning $g^*_1$ and conclude that the $i$’th smallest generator in $\mathrm{tuple}(\hat{G})$ equals $g^*_i$. The punchline of this discussion is that the sequence $\mathrm{tuple}(\hat{G})$ starts with $g^*_1, \ldots, g^*_h$ with $h$ given by (12). Let $G' = G \setminus G^*$ be the set of redundant generators in $G$ and note that

$$g - \sum_{i=1}^{h} g^*_i = \begin{cases} g^*_{\ell^*} & \text{if } G^* = G \\ \sum_{g' \in G'} g' & \text{if } G^* \subset G \end{cases} .$$