When can two unsupervised learners achieve PAC separation?

Original citation:

Goldberg, Paul W. (2000) When can two unsupervised learners achieve PAC separation? University of Warwick, Department of Computer Science Research Report CS-RR-377. (Unpublished)

Permanent WRAP URL: http://wrap.warwick.ac.uk/61134


When Can Two Unsupervised Learners Achieve PAC Separation?

Paul W. Goldberg
Dept. of Computer Science, University of Warwick,
Coventry CV4 7AL, U.K.
pwg@dcs.warwick.ac.uk

November 1, 2000

Abstract

In this paper we introduce a new framework for studying PAC learning problems, one that has practical as well as theoretical motivations. In our framework the training examples are divided into the two sets associated with the two possible output labels, and each set is sent to a separate (unsupervised) learner. The two learners must independently fit probability distributions to their examples, and afterwards these distributions are combined to form a hypothesis by labeling test data according to maximum likelihood. That is, the output label for any input would be the one associated with the unsupervised learner that gave that input the higher probability. This approach is motivated by the observation that PAC learning algorithms of this kind are extendable in a natural way to multi-class classification. It also turns out to introduce new algorithmic challenges, even for very simple concept classes. Learning may be viewed as a cooperative game between two players, for which we must design a strategy that they will both use.

1 Introduction

In a standard learning problem in the Probably Approximately Correct (PAC) setting, there is a source of data consisting of instances generated by a probability distribution D over a domain X, labeled using an unknown function f : X → {0,1}. Thus f divides members of X into two sets f⁻¹(0) and f⁻¹(1), and the learner's objective is to find good approximations to those sets. The main algorithmic challenge is usually to find any classification function h that agrees with the observed data, where h must be chosen from a class of functions whose diversity is limited. Thus learning is all about separating the two sets.

The above scenario is a special case of supervised learning, where the learner's task is to learn a function. Another learning task which arises frequently in practice is unsupervised learning, where the task is to estimate a probability distribution that generated a collection of observations, assumed to be sampled from that distribution. There are not so many papers on unsupervised learning in the learning theory literature; the topic was introduced in [13], see also [5, 7, 8, 6]. This paper is motivated by the simple observation that it may be possible to use unsupervised learning to do supervised learning, especially when the supervised learning problem is to learn a function whose range is a small set V = {v_1, ..., v_k}. The general approach is to fit probability distributions D_1, ..., D_k to the sets f⁻¹(v_1), ..., f⁻¹(v_k), and derive a classifier h which labels any x ∈ X by the value v_i, where i maximizes D_i(x) (i.e. maximizes the likelihood).
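To make the combination rule concrete, here is a minimal sketch (not from the original report) of the maximum-likelihood labeling step. It assumes each class has been handed to a separate learner that returns a fitted model exposing a hypothetical score(x) method (a likelihood, log-likelihood, or more generally a scoring function in the sense of section 1.2); ties are broken at random, as discussed later.

```python
import random

def combine_classifier(models, labels):
    """Combine per-class density estimates into a classifier.

    models: one fitted model per class, each exposing score(x)
            (likelihood, log-likelihood, or a generic scoring function).
    labels: the class label associated with each model.
    """
    def classify(x):
        scores = [m.score(x) for m in models]
        best = max(scores)
        # Break ties at random, e.g. when every model rejects x.
        return random.choice([lab for lab, s in zip(labels, scores) if s == best])
    return classify
```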

This approach to multi-class classification (for problems such as handwritten digit recognition) has several advantages over the separation approach:

1. All classes are treated in the same way. We avoid having to start by choosing some partition of the classes into two sets for the purpose of applying binary separation. Such a partition treats class labels lying on opposite sides differently from class labels lying on the same side.

2. For applications such as handwritten digit recognition, it is more natural to model the data generation process in terms of 10 separate probability distributions, than as a collection of thresholds between different digits.

3. The approach is robust to situations where class overlap occurs (as is usually the case in practice). Class overlap refers to a situation in a classification problem where an input may have more than one possible valid output (class label). Standard PAC algorithms do not address this problem (although there have been extensions such as "probabilistic concepts" [14] that do so). Moreover, if we have a model for how members of each separate class are generated, we can easily modify the derived classification function to deal with a variable misclassification penalty function, or changes in the class priors. This kind of modification cannot be applied to a simple dividing line between pairs of classes.

… example ensuring that all classes are treated the same way. In [15], the first test for any unlabeled input is to apply the separator that distinguishes 0 from 9. Thus 0 and 9 are being treated differently from other digits (which in turn are also treated differently from each other). See also [9] for a review of methods for using binary classifiers to construct a multi-class classifier.

Class overlap is the main practical reason why it is generally inappropriate, or just simply impossible, to find a separator that isolates inputs with one class label from those with another. If class overlap is really present but class separation is nevertheless achieved, it usually means the model has overfitted the data. In these situations, it is wrong to try to separate the different classes perfectly.

From a theoretical PAC learning perspective, we are addressing a natural question in asking to what extent it hampers the learner not to have simultaneous access to examples with different labels. The question raises novel algorithmic challenges, even for some of the simplest PAC learning problems. Indeed, a notable feature of this framework is the number of hard open problems that seem to arise. In contrast with all other studies of unsupervised learning, we are interested in its usefulness in fulfilling a given supervised learning objective, as opposed to approximating the probability distribution that generated the data. As we will see, hypothesis distributions need not be good approximators in order for their combination to achieve PAC learning.

The main theoretical question (which we leave open) is of course: are all PAC learnable problems also learnable in this framework, and if not, how does the set of learnable problems compare with other subsets of PAC learnable problems, for example Statistical Query (SQ) learnability [12]? (In the case of SQ learning, we find that parity functions are learnable in this framework, but from [12] we know they are not learnable using SQs.)

1.1 Formalizing the Learning Framework

We use the standard convention of referring to the two classes of inputs associated with the two class labels as the "positive examples" and the "negative examples". Each learner has access to a source of examples having one of the two class labels. More precisely, one learner may (in unit time) draw a sample from D restricted to the positive examples, and the other may draw a sample from D restricted to the negative examples. This formalism loses any information about the class priors, i.e. the relative frequency of positive and negative examples, but PAC learnability is equivalent to PAC learnability where the relative frequency of positive/negative examples is concealed from the learner. (Formally, this is the equivalence of the standard PAC framework with the "two-button" version, where the learner has access to a "positive example oracle" and a "negative example oracle" [10]. The two-button version conceals the class priors and only gives the learner access to the distribution as restricted to each class label.)

… is not learnable in our framework, any concept class that distinguishes PAC learnability from learnability with unsupervised learners that received labeled examples would have a closure under complementation that would distinguish PAC learnability from learnability with unsupervised learners that receive unlabeled examples. Hence we assume examples are unlabeled.

Note however that we find that for some concept classes that are not closed under complementation (notably monomials, section 3.2) we can (without much difficulty) find algorithms in our setting that use labeled examples, but we have so far not found any algorithm that works with unlabeled examples.

The class label which the pair of learners assign to input x is the one associated with the data sent to the learner that assigned x the higher likelihood. If x is given the same likelihood by both distributions generated by the learners, the tie is broken at random. (This kind of tie may occur when both distributions give a probability of 0 to x, a situation where "conservative" hypotheses are appropriate.)

1.2 Notation and Terminology

We refer to the two unsupervised learners as learner A and learner B, and we also assume by convention that learner A is the one receiving the positive examples; however, as we have noted above, learner A (and likewise learner B) is not told its identity.

Theorem 1 below justifies the design of algorithms in which, instead of insisting that the outputs of the unsupervised learners define probability distributions, we allow the outputs to be any real values. We say that each learner must construct a scoring function from the domain X to the real numbers R. The comparison of the two scores assigned to any x ∈ X is used to determine the class label that the hypothesis classifier assigns to x.

We will say that learner A (respectively B) "claims" an input x if it gives x a higher likelihood or score than learner B (respectively A). We say that a learner "rejects" an input if it assigns a likelihood of 0, or alternatively a score of minimal value (it is convenient to use −∞ to denote such a score). Thus if a learner rejects an input, it will be claimed by the other learner provided that the other learner does not also reject that input.

2 General Results

Theorem 1 If a concept class is PAC learnable in the two unsupervised learners framework where each unsupervised learner may assign any real number to an element of X, then there is a PAC algorithm in which the learners must choose numbers that integrate or sum to 1 over the domain (i.e. are a probability distribution).

Proof: Let 𝒜 be an algorithm that returns a scoring function. So in a problem instance, 𝒜 is applied twice, once to A's data and once to B's data, and we obtain functions f_A : X → R and f_B : X → R. (So for example any x ∈ X with f_A(x) > f_B(x) would be labeled as positive by the overall hypothesis, under our convention that A receives the positive examples.)

Our approach is to re-scale any function returned by the algorithm so that the outcome of any comparison is preserved, but the new functions sum or integrate to 1. In a case where, for example, Σ_{x∈X} f_A(x) = 1 and Σ_{x∈X} f_B(x) = 2, this initially appears problematic: f_B has to be scaled down, but then the new values of f_B may become less than f_A. Note however that we can modify f_A by choosing an arbitrary x̂ in A's data, and adding 1 to f_A(x̂). This can only improve the resulting classifier (it may cause x̂ to be put in A's class where previously it was put in B's class). Now the new f_A together with f_B can both be rescaled down by a factor of 2, and comparisons are clearly preserved.

Making the above idea general, suppose that algorithm 𝒜 takes a sample of data S and returns a function f : X → R. Modify 𝒜 as follows. Define g(x) = e^{f(x)}/(1 + e^{f(x)}), so the range of g is (0,1). Let P(X) be a probability distribution over X that does not vanish anywhere. Let s = Σ_{x∈X} g(x)P(x), or ∫_{x∈X} g(x)P(x) dx for continuous X. s is well-defined and lies in the range (0,1). Now for a discrete domain X, the probability distribution returned by the modified 𝒜 is D′(x) = g(x)P(x) for all x ∈ X except for some arbitrary x̂ ∈ S, where D′(x̂) = g(x̂)P(x̂) + 1 − s. For a continuous domain the probability distribution is the mixture of the continuous density D′(x) = g(x)P(x) with coefficient s and a point probability mass located at some x̂ ∈ S with probability 1 − s. □

As a consequence of the above result, we continue by giving algorithms for hypotheses that may output unrestricted real numbers.

Our other general result about the PAC learning framework we have defined provides some further information about how it compares with other variants of PAC learning in terms of which concept classes become learnable. First, note that the framework can be extended to a misclassification noise situation by letting each learner have examples that have been (correctly or incorrectly) assigned to the class label associated with that learner.

Theorem 2 Any concept class that is PAC-learnable in the presence of uniform misclassification noise can be learned by two unsupervised learners in the presence of uniform misclassification noise if the input distribution is known to both learners.

Proof: Let D be the known distribution over the input domain X. Let C be a concept class that is PAC-learnable with uniform misclassification noise. We may assume that C is closed under complementation (we have noted that if it is not closed under complementation we can take its closure under complementation, which should still be PAC learnable).

Each learner takes a set S of N examples, where N is chosen such that a standard PAC algorithm would have error bound ε². Let D⁺ denote D(T) and D⁻ denote D(X∖T), and let η < 1/2 denote the noise rate. Then with probability

(1−η)D⁺ / ((1−η)D⁺ + ηD⁻)

an example received by A belongs to target T; meanwhile with probability

(1−η)D⁻ / ((1−η)D⁻ + ηD⁺)

an example received by B comes from X∖T.

Each learner labels all its examples as positive, and then generates a set of N examples from D, each of which is labeled positive with some probability p < 1/2, otherwise negative. For learner A, the union of these two sets consists of a set of examples from a new probability distribution D′, labeled by the same target concept T. It may be verified that the examples from T have misclassification noise with noise rate

(1−p)((1−η)D⁺ + ηD⁻) / [(1−p)((1−η)D⁺ + ηD⁻) + 1 − η]

and the examples from X∖T have misclassification noise with noise rate

(η + p((1−η)D⁺ + ηD⁻)) / (η + ((1−η)D⁺ + ηD⁻)).

It may be verified from these expressions that there is a unique value of p in the range [0, 1/2] for which these two noise rates are equal, and for that value of p they are both strictly less than 1/2.

Hence for some value of p we obtain data with uniform misclassification noise. An appropriate p can be found by trying all values p = rε for r = 0, ..., 1/(2ε), and checking whether the hypothesis obtained (using a standard noise-tolerant PAC algorithm) is consistent with uniform misclassification noise.

The same reasoning applies to learner B using X∖T as the target concept.

Let H be the set of examples labeled positive by the resulting hypothesis. Each learner assigns scores as follows. If the observed value of D′(H) is at least 1 − ε, use a score of 1/2 for all elements of X. Otherwise use a score of 1 for elements of H and a score of 0 for elements of X∖H.

Let D′_A and D′_B be the D′'s for learners A and B.

D(T) is the probability that a random example from D belongs to T. Assuming D(T) < 1 − ε/2, we can say that error O(ε²) with respect to D′_A implies error O(ε) with respect to D. If alternatively D(T) ≥ 1 − ε/2, the hypothesis H found by A will have a probability D′(H) > 1 − ε as observed on the data. A learner finding such a hypothesis then gives all examples a score of 1/2, allowing B's scores to determine the overall classification. A similar argument applies for low values of D(T) (where we expect B to assign scores of 1/2). □

3 Algorithms for Particular Concept Classes

The algorithms in this section give an idea of the new technical challenges, and also distinguish the learning setting from various others. So for example, the learnability result of section 3.3 distinguishes it from learnability with uniform misclassification noise or learnability with a restricted focus of attention, and the result of section 3.4 distinguishes it from learnability with one-sided error or learnability from positive or negative examples only.

3.1 Boolean Functions over a Constant Number of Variables

Let the domain X be vectors of length n of binary values 0/1 (where we use 0 to represent false and 1 to represent true). We show how to learn a function f from the class of boolean functions having the property that only k of the variables are relevant (affect the output value), where k is a constant. There are fewer than (n choose k)·2^(2^k) functions in this class, so in the PAC framework, the brute-force approach of trying each one until a consistent hypothesis is found is a polynomial-time algorithm.

We identify the following method for the two unsupervised learners framework. Each unsupervised learner constructs a scoring function h : X → {0,1} using the following rule. Let S be the sample of vectors that one of the learners has at its disposal. For any vector v ∈ X, put h(v) = 1 iff for all sets of k attributes in v, the values they take in v equal the values they take in some element of S (and otherwise put h(v) = 0).

The above rule can be expressed as a polynomial-sized boolean formula, and it can be seen moreover that the formula is satisfied by elements of S but not by any examples that were sent to the other unsupervised learner. (That follows from the fact that a boolean formula over k of the variables is being used to distinguish the two sets of boolean vectors.) Let φ_A and φ_B be the formulae constructed by learners A and B respectively. φ_A is PAC (being a consistent hypothesis of polynomial description length) and φ_B is PAC with respect to the complement of the target concept. Hence with high probability for v sampled using D, φ_A(v) ≠ φ_B(v), and so a comparison of the values of φ_A(v) and φ_B(v) will usually (in the PAC sense) indicate correctly which learner to associate v with.
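As an illustration of the rule above (my own sketch, not part of the report), one learner can precompute, for every set of k attribute positions, the value patterns occurring in its sample, and score a query vector 1 exactly when all of its k-wise patterns have been seen:

```python
from itertools import combinations

def make_kwise_scorer(sample, k):
    """Scoring function h for one unsupervised learner (section 3.1).

    sample: list of observed 0/1 vectors (tuples), all of the same length n
    k:      the assumed constant number of relevant variables
    """
    n = len(sample[0])
    seen = {
        positions: {tuple(v[i] for i in positions) for v in sample}
        for positions in combinations(range(n), k)
    }

    def h(v):
        # h(v) = 1 iff, for every set of k attributes, v's values on those
        # attributes equal the values taken by some element of the sample.
        return int(all(tuple(v[i] for i in pos) in seen[pos]
                       for pos in combinations(range(n), k)))
    return h
```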

3.2 Monomials

Recall that a monomial is a boolean function consisting of the conjunction of a set of literals (where a literal is either a boolean attribute or its negation). Despite the simplicity of this class of functions, we have not resolved its learnability in the two unsupervised learners framework, even for monotone (i.e. negation-free) monomials. If the unsupervised learners are told which of them has the positive and which the negative examples, then the problem does have a simple solution (a property of any class of functions that is learnable from either positive examples only or else negative examples only). The "negative" unsupervised learner assigns a score of 1/2 …

Given a monomial f, let pos(f) denote its satisfying assignments. The problem that arises when the unsupervised learners are not told which one is receiving the positive examples is that the distribution over the negative examples could in fact produce boolean vectors that satisfy some monomial m that differs from target monomial t, and if D(pos(m)∖pos(t)) > ε this may give excessive error. This problem can of course be handled in the special case where the monomial is over a constant number k of literals, and the algorithm of section 3.1 can be used. What happens then is that the unsupervised learners become more "conservative", tending to assign zero score more often to an instance, and in particular will do so for instances belonging to subsets of the domain where the subset corresponds to satisfying assignments of some k-variable monomial, and furthermore no instance has been observed in that subset. For monomials over only k literals, this approach does not excessively cut down the set of instances that are assigned a score of 1.

3.2.2 Learnability of Monomials over Vectors of Attributes Generated by a Product Distribution

In view of the importance of monomials as a concept class, we consider whether they are learnable given that the input distribution D belongs to a given class of probability distributions. This situation is intermediate between knowing D exactly (in which case by theorem 2 the problem would be solved, since monomials are learnable in the presence of uniform misclassification noise) and the distribution-independent setting. We assume now that the class priors are known approximately, since we no longer have the equivalence of the one-button and two-button versions of PAC learnability. Formally, assume each learner can sample from D, but if the example drawn belongs to the other learner's class then the learner is told only that the example belonged to the other class, and no other information about it. Hence each learner has an "observed class prior", the observed probability that examples belong to one's own class.

Suppose that D is known to be a product distribution. Let x_1, ..., x_n be the boolean attributes in examples. Let d_i, i = 1, ..., n, be the probability that the i-th attribute equals 1. For an attribute x_i for which one of the literals x_i or x̄_i is in the target monomial t (we assume that they are not both in t), let p_i be the probability that the literal is satisfied, so p_i = d_i for an un-negated literal, otherwise p_i = 1 − d_i. We say that attribute x_i is "useful" if x_i or x̄_i is in t, and also p_i ∈ [ε, 1 − ε/n]. Note that if p_i < ε then the probability that any example is positive is also < ε, and if p_i > 1 − ε/n then only a very small fraction of examples can be negative due to their value of x_i.

The Algorithm

We use the fact that for D a product distribution, D restricted to the positive examples of a monomial is also a product distribution. We apply a test (step 3 of the algorithm) to see whether the observed data appears to come from a product distribution. The test identifies negative data when there is more than one useful attribute. Our scoring function separately handles the cases when fewer than 2 attributes are useful.

1. Draw a sample S of size O((n/ε)³ log(1/δ)).

… ε/n of elements of S, let S_l denote the elements of S which satisfy l, and check whether the relative frequency with which any other literal is satisfied within S_l differs from its frequency of being satisfied in S by at least ε/n². If so, the test "fails" and we assume that the negative examples are being seen, and give all examples a score of 1/2. Otherwise (the test is "passed"):

4. Let L be the set of literals satisfied by all elements of S.

(a) If an example satisfies all the literals in L and fails to satisfy all literals that are satisfied by a fraction < ε/2n of elements of S, give that example a score of 1.

(b) Otherwise, if the example still satisfies L, assign a score of 1/2.

(c) Otherwise assign a score of 0.

Note: step 3 is a test with "one-sided error" in the sense that we may reasonably expect all product distributions to pass the test, but there exist distributions other than product distributions that may also pass. However, we show below that when a product distribution restricted to the negative data of a monomial passes the test, then there is at most one useful attribute.
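A rough sketch of the step-3 test for one learner (not the report's code; the thresholds ε/n and ε/n² follow the reconstruction above, and eps plays the role of ε): condition on each frequently satisfied literal and check whether any other literal's conditional frequency moves noticeably.

```python
def literal_freq(sample, literal):
    """Fraction of 0/1 vectors satisfying a literal, given as (index, value)."""
    i, val = literal
    return sum(1 for v in sample if v[i] == val) / len(sample)

def looks_like_product_distribution(sample, eps):
    """Step-3 style test: True if the data could plausibly come from a
    product distribution, False if some conditional frequency shifts too much."""
    n = len(sample[0])
    literals = [(i, b) for i in range(n) for b in (0, 1)]
    for l in literals:
        if literal_freq(sample, l) < eps / n:
            continue                              # only condition on common literals
        S_l = [v for v in sample if v[l[0]] == l[1]]
        for m in literals:
            if m[0] == l[0]:
                continue
            if abs(literal_freq(S_l, m) - literal_freq(sample, m)) >= eps / n**2:
                return False                      # the test "fails"
    return True                                   # the test is "passed"
```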

3.2.3 Proving That the Algorithm is PAC

There must be at least one useful attribute in order for the frequency of both positive and negative examples to exceed ε. We consider two cases: first, when there is only one useful attribute; second, when there is more than one.

Case 1: In this case, we expect the distributions over both the positive and negative examples to be close to product distributions, so that the test of step 3 of the algorithm will be passed in both A's case and B's case. Learner A gives a score of 1 to examples which satisfy the useful literal l, with the exception of a small fraction of them due to the additional requirement in step 4a. Meanwhile, learner B assigns a score of 1/2 to all examples satisfying l, since l is not satisfied by any of B's data. Hence learner A claims all but a fraction < ε/2 of the positive data. By a similar argument, learner B claims all but a fraction < ε/2 of the negative data.

Case 2: When there are two useful attributes, the positive examples are still generated by a product distribution, so A's data pass the test of step 3. Meanwhile, with probability > 1 − δ, B's data fail this test, since when we choose literal l in target t that happens to be useful, and remove elements of S which satisfy l, then the conditional probability that any other useful literal is satisfied changes by > ε/n². (A Chernoff bound analysis assures that the change will be detected with probability 1 − δ/2.) All examples are then given scores of 1/2, and this allows A to claim positives and leave B the negatives.

3.3 Parity Functions

… only and from negative examples only. Similar to the algorithm of [11], each unsupervised learner finds the affine subspace of GF(2)ⁿ spanned by its examples, and assigns a score of 1 to elements of that subspace and a score of 0 to all other elements of the domain.

Clearly these affine subspaces do not intersect, and if the two subspaces do not cover the whole domain, then the region not covered is with high probability a low-probability region.
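A minimal sketch of one learner's scoring rule for parity functions (my own illustration, using numpy for arithmetic mod 2): translate the sample by one of its points, maintain a basis of the translated points over GF(2), and score a query 1 exactly when its translate lies in the span, so that the scored-1 region is exactly the affine subspace spanned by the sample.

```python
import numpy as np

class GF2Span:
    """A linear span over GF(2), kept as basis vectors indexed by pivot position."""
    def __init__(self, n):
        self.n = n
        self.pivots = {}                          # pivot index -> basis vector

    def reduce(self, v):
        v = np.array(v, dtype=int) % 2
        for i in range(self.n):
            if v[i] and i in self.pivots:
                v = (v + self.pivots[i]) % 2      # eliminate the 1 at position i
        return v

    def add(self, v):
        v = self.reduce(v)
        if v.any():
            self.pivots[int(np.argmax(v))] = v    # store with its leading-1 position

    def contains(self, v):
        return not self.reduce(v).any()

def make_parity_scorer(sample):
    """Section 3.3: score 1 iff a query lies in the affine subspace of GF(2)^n
    spanned by this learner's examples, and 0 otherwise."""
    sample = [np.array(v, dtype=int) % 2 for v in sample]
    origin = sample[0]
    span = GF2Span(len(origin))
    for v in sample[1:]:
        span.add((v - origin) % 2)
    return lambda x: 1 if span.contains((np.array(x, dtype=int) - origin) % 2) else 0
```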

3.4 Unions of Intervals

Let the domain be the real numbers R, and assume that the target concept is a union of k intervals in R. We show that this concept class is learnable by two unsupervised learners. This result shows that learnability with two unsupervised learners is distinct from learnability from positive examples only, or from negative examples only.

Each learner does the following. Let S be the set of real values that constitutes the data. Define scoring function f as

f(r) = −( min_{s∈S, s>r} s − max_{s∈S, s<r} s )   if r ∉ S,
f(r) = 1   if r ∈ S.

That is, a point outside the data is scored by the negated width of the gap between its nearest data points on either side.

This choice of scoring function ensures that when A's and B's scores are combined, the set of all points that are claimed by A consists of a union of at most k intervals, and this set contains A's data but not B's data. Hence we have a consistent hypothesis of V-C dimension no more than the target concept, so this method is PAC (with runtime polynomial in k, ε⁻¹ and δ⁻¹). Note that the value of k needs to be prior knowledge, in contrast with PAC learning in the standard setting, where an appropriate sample size can be identified using the standard on-line approach of comparing the number of examples seen so far with the complexity of the simplest consistent classifier, and continuing until the ratio is large enough.
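For concreteness, a small sketch of this scoring function for one learner (my own rendering of the definition above; points outside the range of the data are rejected here, one reasonable way to handle the otherwise undefined gap):

```python
import bisect

def make_interval_scorer(data):
    """Section 3.4 scoring function f for one learner, over a sample of reals."""
    S = sorted(set(data))

    def f(r):
        i = bisect.bisect_left(S, r)
        if i < len(S) and S[i] == r:
            return 1.0                            # r is a data point: maximal score
        if i == 0 or i == len(S):
            return float('-inf')                  # outside the data's range: reject
        return -(S[i] - S[i - 1])                 # negated width of the surrounding gap
    return f

# Usage sketch: score_A = make_interval_scorer(A_data), score_B = make_interval_scorer(B_data);
# a point x is claimed by A iff score_A(x) > score_B(x), ties broken at random.
```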

3.5 Rectangles in the Plane with Bounded Aspect Ratio

Let α denote the length of the target rectangle divided by the width; our PAC-learning algorithm is polynomial in α as well as the standard PAC parameters. This simple concept class turns out to require a relatively complex method, and is worth studying on its own, to indicate the problems involved. Possible extensions are discussed in the next section. We do not have a PAC algorithm that works without the bound on the aspect ratio.

The general idea is that each learner partitions the domain into rectangles containing equal numbers of its data points, and given a query point q, compares the coordinates of q with other points in the partition element within which q falls. A high score is given when there exist points in that partition element with similar coordinate values.

The Algorithm: Each learner does the following.

1. Generate a sample of size N = O(log(δ⁻¹)/ε⁹).

2. Build a partition P of the domain R² as follows.

(a) Partition the domain into 1/ε² pieces using lines normal to the y-axis, such that each piece contains the same number of data points. (Throughout we ignore rounding error in situations where, for example, an equal partition is impossible; such rounding will only change quantities by a constant.)

(b) Partition each element of the above partition into 1/ε² pieces using lines normal to the x-axis, such that each piece contains the same number of data points.

3. For a query point q ∈ R² the score assigned to q is computed as follows.

(a) Let P_q ∈ P be the rectangle in P containing q.

(b) Let S(P_q) be the sample points that lie inside P_q.

(c) Sort S(P_q) ∪ {q} by x-coordinate. If q is among the first 1/ε elements or among the last 1/ε elements, then reject q, i.e. assign q a score of −∞ and terminate.

(d) If q was not rejected, define the x-cost of q to be the difference between the x-coordinates of the two neighbors of q.

(e) Sort S(P_q) ∪ {q} by y-coordinate. If q is among the first 1/ε elements or among the last 1/ε elements, then reject q.

(f) If q was not rejected, define the y-cost of q to be the difference between the y-coordinates of the two neighbors of q.

(g) Finally, the score assigned to q is the negation of the sum of the x-cost and y-cost.
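A compact sketch of one learner's procedure (mine, not the report's code; eps plays the role of ε, and the grid sizes and rejection rank follow the reconstruction above):

```python
import numpy as np

def make_rectangle_scorer(points, eps):
    """Section 3.5 scorer for one learner; points is an array of shape (N, 2)."""
    pts = np.asarray(points, dtype=float)
    cols = max(1, int(round(1 / eps**2)))         # vertical slabs (step 2a)
    rows = max(1, int(round(1 / eps**2)))         # cells per slab (step 2b)
    rank = max(1, int(round(1 / eps)))            # rejection rank (steps 3c, 3e)

    by_x = pts[np.argsort(pts[:, 0])]
    slabs = np.array_split(by_x, cols)            # equal-count slabs by x-coordinate
    cells = [np.array_split(s[np.argsort(s[:, 1])], rows) for s in slabs]
    x_bounds = [s[-1, 0] for s in slabs[:-1]]     # slab boundaries taken from the data

    def score(q):
        qx, qy = q
        i = int(np.searchsorted(x_bounds, qx))    # locate q's slab, then its cell
        y_bounds = [c[-1, 1] for c in cells[i][:-1]]
        cell = cells[i][int(np.searchsorted(y_bounds, qy))]

        costs = []
        for axis, val in ((0, qx), (1, qy)):
            coords = np.sort(np.append(cell[:, axis], val))
            pos = int(np.searchsorted(coords, val))
            if pos < rank or pos >= len(coords) - rank:
                return float('-inf')              # q is near-extreme in its cell: reject
            costs.append(coords[pos + 1] - coords[pos - 1])   # gap between q's neighbours
        return -sum(costs)                        # negation of x-cost plus y-cost
    return score
```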

3.5.1 Proving That the Algorithm is PAC

Let P_A be the partition constructed by A and let P_B be the partition constructed by B. Let μ_A (respectively μ_B) be the measure on R² induced by the input distribution D restricted to target T (respectively T′, the complement of T) and re-normalised, so that we have μ_A(R²) = μ_B(R²) = 1. So, given a region R ⊆ R², μ_A(R) is the probability that a random input x lies within R conditioned on x being an element of T. Let μ̂_A and μ̂_B denote the measures μ_A and μ_B as observed by A and B respectively on the random examples used in the algorithm. The well-known V-C theory of [2] says that for a concept class C of V-C dimension v, given a sample of size O(v·log(δ⁻¹ε⁻¹)/ε), we have that with probability 1 − δ,

|μ̂(C) − μ(C)| ≤ ε for all C ∈ C,

where μ is a probability measure and μ̂ is the measure as observed on the sample. Noting that axis-aligned rectangles in R² have V-C dimension 4, we deduce that if learners A and B draw samples of size O(log(δ⁻¹ε⁻¹)/ε), then with probability 1 − δ,

|μ̂_A(R) − μ_A(R)| ≤ ε and |μ̂_B(R) − μ_B(R)| ≤ ε, for all rectangles R.

The following fact emerges automatically from the V-C bounds:

Fact 3 With probability 1 − O(δ) we have that for all rectangles R:

|μ̂_A(R) − μ_A(R)| ≤ ε⁹ and |μ̂_B(R) − μ_B(R)| ≤ ε⁹.

The following fact follows from the way the algorithm constructs the partition P:

Fact 4 P_A and P_B are each of size (1/ε)⁴, and elements of P_A (respectively P_B) contain (1/ε)⁵ of A's (respectively B's) data points.

From fact 4, given a rectangle R ∈ P_A, μ̂_A(R) = ε⁴, and consequently |μ_A(R) − ε⁴| ≤ ε⁹ with high probability, using fact 3. Clearly all rectangles in P_A intersect target rectangle T (similarly members of P_B intersect T′). Now consider the potential problem of rectangles in P_B that contain positive examples. We continue by upper-bounding the number of those rectangles, and upper-bounding the amount of damage each one can do (due to claiming data examples that should be claimed by A).

Let P*_A ⊆ P_A be the elements of P_A which intersect T′ (so are not proper subsets of T). Similarly let P*_B denote the elements of P_B which intersect T. We show that the number of elements of P*_A and P*_B is substantially smaller than the cardinalities of P_A and P_B.

Fact 5 Any axis-aligned line cuts (1/ε)² elements of P_A and similarly (1/ε)² elements of P_B.

Corollary 6 The boundary of T intersects at most O((1/ε)²) elements of P_A and similarly at most O((1/ε)²) elements of P_B.

In particular, it intersects at most 4·(1/ε)² elements of either partition.

So partition P_B has (1/ε)⁴ elements each containing (1/ε)⁵ data points, and only O((1/ε)²) of them intersect T. Now we consider how an element R ∈ P_B could intersect T. We divide the kinds of overlap into

1. An edge overlap, where one edge and no vertices of T are overlapped by R.

2. A two-edge overlap, where 2 opposite edges and no corner of T are overlapped by R.

3. Any overlap where R contains a corner of T.

We treat these as separate cases. Note that since there are at most 4 overlaps of type 3 we may obtain relatively high bounds on the error they introduce, by comparison with the edge and two-edge overlaps, of which there may be up to (1/ε)² in the worst case. Throughout we use the following notation. Let x be a point in target rectangle T which is being assigned scores using A's and B's partitions. Let x ∈ rectangle R_A ∈ P_A and x ∈ R_B ∈ P_B, so that R_A intersects T.

Case 1: (edge overlap) R_B has an edge overlap with T. Consider steps 3c and 3e of the algorithm. When x is being compared with the points in R_B it will have either an x-coordinate or a y-coordinate which is maximal or minimal for data points observed in R_B

, and so it will be rejected in step 3c or 3e. Meanwhile, x has a low probability (in particular O(ε)) of having a maximal or minimal coordinate value amongst points in R_A (since R_A contains (1/ε)⁵ data points and x is generated by the same distribution that generated those data points).

Case 2: (two-edge overlap) There are at most (1/ε)² two-edge overlaps possible. Suppose that in fact R_B overlaps the top and bottom edges of T (the following argument will apply also to the other sub-case). Hence all the two-edge overlaps do in fact overlap the top and bottom edges of T. Let x_T and y_T denote the lengths of T as measured in the x and y directions, so we have x_T/y_T ∈ [1/α, α]. Then the y-cost of R_B is at least y_T. Meanwhile, all but a fraction ε of boxes in P_A will give a y-cost of at most ε·y_T. Also, all but a fraction ε of boxes in P_A will have x-costs at most ε·x_T. Using our aspect ratio assumption, this is at most εα·y_T. Hence, for points in all but a fraction ε of boxes in P_A, the y-cost will dominate, and A's score will exceed the score assigned by B.

Case 3: (corner overlap) Suppose R_B overlaps a corner of T. We show that R_B introduces error O(ε), and since there are at most 4 such rectangles, this case is then satisfactory. For R_B to introduce error > ε, it must overlap a fraction Ω(ε) of rectangles in P_A, hence > (1/ε)³ rectangles in P_A. In this situation, R_B contains Ω((1/ε)³) rectangles of P_A in its interior. On average, both the x and y coordinates of sample points in these interior rectangles will be a factor Ω(1/ε) closer to each other than the points in R_B. This means that only an ε-fraction of points in these elements of P_A will have coordinates closer to points in R_B than to some other point in the same element of P_A. Hence all but an ε-fraction of these points will be claimed by A.

3.5.2 Discussion, possible extensions

Obviously we would like to know whether it is possible to have PAC learnability without the restriction on the aspect ratio of the target rectangle. The restriction is arguably benign from a practical point of view. Alternatively, various reasonable "well-behavedness" restrictions on the input distribution would probably allow the removal of the aspect ratio restriction, and also allow simpler algorithms.

The extension of this result to unions of k rectangles in the plane is fairly straightforward, assuming that the aspect ratio restriction is that both the target region and its complement are expressible as a union of k rectangles all with bound α on the aspect ratio.

The general idea being used is likely to be extendable to any constant dimension, but then the case analysis (on the different ways that a partition element may intersect the region with the opposite class label) may need to be extended. If so it should generalize to unions of boxes in fixed dimension (as studied in [4] in the setting of query learning; a generalization is studied in [3] in PAC learning). Finally, if boxes are PAC learnable with two unsupervised learners in time polynomial in the dimension, then this would imply learnability of monomials, considered previously.

Given a set S of points in the plane, it would be valid for an unsupervised learner to use a probability distribution whose domain is the convex hull of S, provided that only a "small" fraction of elements of S are actually vertices of that convex hull. For a general PAC algorithm we have to be able to handle the case when the convex hull has most or all of the points at its vertices, as can be expected to happen for an input distribution whose domain is the boundary of a circle, for example. Our general approach is to start out by computing the convex hull P and give maximal score to points inside P (which are guaranteed to have the same class label as the observed data). Then give an intermediate score to points in a polygon Q containing P, where Q has fewer edges. We argue that the way Q is chosen ensures that most points in Q are indeed claimed by the learner.

3.6.1 The Algorithm

The general idea is to choose a scoring function in such a way that we can show that the boundary between the classes is piecewise linear with O(√N) pieces, where N is the sample size. This sublinear growth ensures a PAC guarantee, since we have an "Occam" hypothesis: the V-C dimension of piecewise linear separators in the plane with O(√N) pieces is itself O(√N).

1. Draw a sample S of size N = O(log(δ⁻¹ε⁻¹)/ε²).

2. Let polygon P be the convex hull of S.

3. Let Q be a polygon having 2 + √N edges such that

(a) Every edge of Q contains an edge of P.

(b) Adjacent edges of Q contain edges of P that are √N apart in the adjacency sequence of P's edges.

4. Define scoring function h as follows.

(a) For points in P use a score of 1.

(b) For each region contained between P and 2 adjacent edges of Q, give points in that region a score of the negation of the area of that region.

(c) Reject all other points (not in Q).

Regarding step 3: Q can be found in polynomial time; we allow Q to have 2 + √N edges since P may have 2 acute vertices that force pairs of adjacent edges of Q to contain adjacent edges of P.

(Figure 1: the convex hulls P_A and P_B, the polygons Q_A and Q_B, a line l separating them; the shaded region is claimed by A.)

3.6.2 Proving That the Algorithm is PAC

Figure 1 illustrates the construction. Let P_A and P_B be the convex hulls initially found by learners A and B respectively. They define subsets of the regions claimed by A and B respectively. Let Q_A and Q_B be the polygons Q constructed by A and B respectively. Let l be a line separating P_A and P_B. Observe that Q_A and Q_B respectively can each only have at most two edges that cross l, and at most one vertex on the opposite side of l from P_A and P_B respectively, using the fact that each edge of Q_A contains an edge of P_A, and similarly for Q_B and P_B.

Hence only one of the regions enclosed between P_A and two adjacent edges of Q_A can cross line l, and potentially be used to claim part of the interior of Q_B. If this region contains more than one of the similar regions in Q_B, then it will not in fact claim those regions of Q_B, since the score assigned to its interior will be lower. Omitting the details, it is not hard to show using these observations that the region claimed by A is enclosed by a polygon with O(√N) edges, and similarly for B. N was chosen such that the V-C bound of section 3.5.1 ensures PAC-ness with parameters ε and δ.

4 Conclusion and Open Problems

… what assumptions need to be made about the distribution of inputs, in order for standard practical methods of unsupervised learning, such as kernel-based smoothing, to be applied.

The main open problem is the question of whether there is a concept class that is PAC-learnable but is not PAC-learnable in the two unsupervised learners framework. It would be remarkable if the two learning frameworks were equivalent, in view of the way the PAC criterion seems to impose a discipline of class separation on algorithms. Regarding specific concept classes, the most interesting one to get an answer for seems to be the class of monomials, which is a special case of nearly all boolean concept classes studied in the literature. We believe that it should be possible to extend the approach in section 3.2.2 to the assumption that D is a mixture of two product distributions, a class of distributions shown to be learnable in [5, 7].

Related open questions are: does there exist such a concept class for computationally unbounded learners (where the only issue is sufficiency of information contained in a polynomial-size sample)? Also, can it be shown that proper PAC-learnability holds for some concept class but not in the two unsupervised learners version? (So, we have given algorithms for various learning problems that are known to be properly PAC-learnable, but the hypotheses we construct do not generally belong to the concept classes.)

With regard to the main open problem, we mention an obvious idea that fails. Take any concept class that is not PAC learnable given standard computational complexity theoretic assumptions, e.g. boolean circuits. Augment the representations of positive and negative examples with "random" numbers such that the number from any positive example and the number from any negative example between them allow an encoding of the target concept to be derived by a simple secret-sharing scheme, but the individual numbers are of no use. It looks as if this class should be PAC learnable, but should not be learnable in our framework. The problem is that we have to say whether examples labeled with the "wrong" numbers (corresponding to an alternative representation) are positive or negative. Then some probability distribution would be able to generate these examples instead of the ones we want to use.

References

[1] M. Anthony and N. Biggs (1992). Computational Learning Theory, Cambridge Tracts in Theoretical Computer Science, Cambridge University Press, Cambridge.

[2] A. Blumer, A. Ehrenfeucht, D. Haussler and M.K. Warmuth (1989). Learnability and the Vapnik-Chervonenkis Dimension, J. ACM 36, pp. 929-965.

[3] N.H. Bshouty, S.A. Goldman, H.D. Mathias, S. Suri and H. Tamaki (1998). Noise-Tolerant Distribution-Free Learning of General Geometric Concepts. Journal of the ACM 45(5), pp. 863-890.

[4] N.H. Bshouty, P.W. Goldberg, S.A. Goldman and H.D. Mathias (1999). Exact learning of discretized geometric concepts. SIAM J. Comput. 28(2), pp. 674-699.

[5] …

[6] … Foundations of Computer Science.

[7] Y. Freund and Y. Mansour (1999). Estimating a mixture of two product distributions. Procs. of 12th COLT conference, pp. 53-62.

[8] A. Frieze, M. Jerrum and R. Kannan (1996). Learning Linear Transformations. 37th IEEE Symposium on Foundations of Computer Science, pp. 359-368.

[9] V. Guruswami and A. Sahai (1999). Multiclass Learning, Boosting, and Error-Correcting Codes. Procs. of 12th COLT conference, pp. 145-155.

[10] D. Haussler, M. Kearns, N. Littlestone and M.K. Warmuth (1991). Equivalence of Models for Polynomial Learnability. Information and Computation, 95(2), pp. 129-161.

[11] D. Helmbold, R. Sloan and M.K. Warmuth (1992). Learning Integer Lattices. SIAM Journal on Computing, 21(2), pp. 240-266.

[12] M.J. Kearns (1993). Efficient Noise-Tolerant Learning From Statistical Queries, Procs. of the 25th Annual ACM Symposium on the Theory of Computing, pp. 392-401.

[13] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R.E. Schapire and L. Sellie (1994). On the Learnability of Discrete Distributions, Proceedings of the 26th Annual ACM Symposium on the Theory of Computing, pp. 273-282.

[14] M.J. Kearns and R.E. Schapire (1994). Efficient Distribution-free Learning of Probabilistic Concepts, Journal of Computer and System Sciences, 48(3), pp. 464-497. (See also FOCS '90.)

[15] J.C. Platt, N. Cristianini and J. Shawe-Taylor (2000). Large Margin DAGs for Multiclass Classification, Procs. of 12th NIPS conference.

[16] L.G. Valiant (1984). A Theory of the Learnable. Commun. ACM 27(11), pp. 1134-1142.
