Learning fixed dimension linear thresholds from fragmented data

(1)

Original citation:

Goldberg, Paul W. (1999) Learning fixed-dimension linear thresholds from fragmented

data. University of Warwick. Department of Computer Science. (Computer Science

Research Report). (Unpublished) CS-RR-362

Permanent WRAP url:

http://wrap.warwick.ac.uk/61089

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the

University of Warwick available open access under the following conditions. Copyright ©

and all moral rights to the version of the paper presented here belong to the individual

author(s) and/or other copyright owners. To the extent reasonable and practicable the

material made available in WRAP has been checked for eligibility before being made

available.

Copies of full items can be used for personal research or study, educational, or

not-for-profit purposes without prior permission or charge. Provided that the authors, title and

full bibliographic details are credited, a hyperlink and/or URL is given for the original

metadata page and the content is not changed in any way.

A note on versions:

(2)

Fragmented Data

Paul W. Goldberg Dept. of Computer Siene,

University of Warwik, Coventry CV4 7AL, U.K.

pwgds.warwik.a.uk

September 10, 1999

Abstrat

We investigate PAC-learning in a situation in whih examples (onsisting of an inputvetorand0/1label)havesomeoftheomponentsoftheinputvetoronealed from the learner. This is a speial ase of Restrited Fous of Attention (RFA) learning. Ourinteresthereisin1-RFAlearning,whereonlyasingleomponentofan inputvetorisgiven,foreahexample. Wearguethat1-RFAlearningmeritsspeial onsiderationwithinthe widereld of RFA learning. It is themost restritiveform of RFA learning(so that positive resultsapply in general), and it modelsa typial \datafusion"senario,wherewehavesetsofobservationsfromanumberofseparate sensors,butthese sensors areunorrelated soures.

Within this setting we study the well-known lass of linear threshold funtions, the harateristi funtions of Eulidean half-spaes. The sample omplexity (i.e. sample-size requirement asafuntionof theparameters) of thislearningproblem is aeted by the input distribution. We show that the sample omplexity is always nite,foranygiveninputdistribution,butwealsoexhibitmethodsfordening\bad" input distributionsfor whih the sample omplexity an grow arbitrarily fast. We identify fairly general suÆient onditions for an input distribution to give rise to sampleomplexitythat ispolynomialinthePACparameters

1 and Æ

1

. Wegive analgorithm (usinganempirial -over) whosesampleomplexityis polynomialin theseparameters andthedimension(numberofinputs),forinputdistributionsthat satisfyouronditions. The runtimeis polynomialin

1 and Æ

1

providedthat the dimensionis anyonstant. We showhow to adaptthe algorithm to handleuniform mislassiationnoise.

(3)

The aim of supervised learning is to nd out as muh as possible about some unknown funtion (alled the target funtion) using observations of its input/output behavior. In this paperwe fous on linear threshold funtions. These map vetors of inputs to binary outputs aording to the rule that the output should equal 1 provided that some linear ombination of the inputs exeeds some threshold value, otherwise the output equals 0. Thus a linear threshold funtion an be desribed by a vetor of real oeÆients, one for eahinput, and a real-valuedthreshold.

Probably ApproximatelyCorret (PAC) learningisawell-known frameworkfor study-ing supervised learning problems in whih outputs of the funtions under onsideration may take one of two values (suh as 0 and 1), so that any funtion partitions the input domain into two sets. We give the basi denitions of PAC learningbelowin setion 1.2; see textbookssuh as [2,30℄for a detailedintrodution tothe theory.

The problemof learninglinear threshold funtionsinthe PAC framework has reeived a lot of attention in the literature, some of whih is desribed below. In this paper we onsider a natural variant of the problem in whih the algorithm has aess to examples of the target funtion in whih only a single input omponent (together with the output value, 0or1)are given. It isassumed thatforeahexampleofinput/outputbehavior,the hoie of whihinput has itsvalue given, is made uniformlyatrandom.

The paper is organized as follows. In this setion we give bakground, motivation for studying this variant in detail, a formal statement of the learning situation, and some preliminary results. In setion 2 we show how the joint distribution of the inputs may aet the number of examples needed to distinguish the target funtion from a single alternativelinearthresholdfuntion,havingsomegivenerror. Insetion3weuseageneral method identied in setion 2 to PAC-learn linear threshold funtions, for any onstant numberofinputs. Insetion4weonsider thespeialasewhereinputsarebinary-valued. In setion 5 we disuss the signiane of the results presented here, and mention open problems of partiular interest.

1.1 Bakground and Motivation

(4)

Most methods for learningfrom inompletedata use imputation, inwhihthe missing values in the data set are assigned values aording to some rule (for example [33℄ use mean imputation, where anunknown omponent value is given the average of the known values for that omponent). In general, imputation biases the data slightly, whih is at odds with the PAC riterion for suessful learning, being used here. Linear threshold funtions are anoversimplied modelfor the data, sine there is lass overlap (indeedthe data set ontains idential pairs of input vetors with distint reovery levels). However our algorithmis extendable to amore realisti \mislassiationnoise" model.

Our simplifyingassumption that eah example has only asingle input attribute value given has the following motivations:

1. It eliminates the strategy of disarding inomplete examples, whih is wasteful in pratie. The strategy of disarding inomplete examples may also bias the data if the missing data mehanism is more likely to oneal some values than others (i.e. anything other than what Little and Rubin [32℄ allmissing ompletely at random).

2. The restritionto aonstantnumberof values perexample isequivalenttoa simple stohasti missing-data mehanism, aswell as being aspeial ase of RFA learning. The statistial missing data literature usually assumes that there is a stohasti missing data mehanism,as opposed to RFA learning where unonealed values are seleted by the learner.

k-RFAlearningrefers toasettingwhere k omponentsof anyexampleare known to the learner; thus we fous on 1-RFA learning. The equivalene noted above an be seenbyobservingthatinoursettingalearnermaygatherpolynomial-sizedolletions of samplesfor eahset of k attributes, as easilyasit may gathera polynomial-sized sample,andheneeetivelyqueryanygiven setof k attributes. Weprefer theterm \fragmented data" over \missing data" in this situation, to emphasise that only a smallproportion of any data vetor is given.

3. The1-RFAsettingisthemoststringentorrestritivesituation,inthatpositiveresults for1-RFAlearningapply inothersettings. Italsomodelsthe\datafusion"problem, in whih olletions of examples are generated by a set of independent soures, and the aimistoombine (or \fuse") the informationderived fromthe separate soures.

Linear threshold funtions are anobvious hoie of funtion lass in the ontext intro-dued here, beause the output value generally depends on all the input values; it is not generally suÆient to knowjust asubset of them. Butinformationis stillonveyed by an example inwhih allbut one input value is onealed.

(5)

as produt distributions. It is known from this work that it is neessary to already have a lot of knowledge of the input distribution, in order to learn the funtion. We might reasonablyexpet tohavea parametrimodelforthe input distribution,and thenuse the EM algorithm [19℄ or subsequent related methods that have been devised for learning a distribution inthe presene of missingdata.

In setion 2 we fous on the question of whih distributions are helpful or unhelpful for 1-RFA learning. The sensitivity of the sample omplexity to the nature of the input distribution(partiularly whenwe donotrestritto produt distributions)isadistintive novel feature of this omputational learning problem, with a lot of theoretial interest. (By sample omplexity we mean the number of examples needed for PAC learning by a omputationally unbounded learner.) Experimental work in the data fusion literature suh as [12, 18℄ has shown the strong eet that varying assumptions about the input distribution may have on preditive performane. We aim to provide some theoretial explanationby identifyingfeaturesofaninputdistributionthatmakeit\helpful"andgive assoiated sample-sizebounds.

We mention relationships with other learning frameworks. The RFA setting is more benign than the \random attribute noise" [24, 36℄ senario. A data set with missing omponents an be onverted to one with random attribute noise by inserting random values for the missing omponents (although note that for k-RFA data, with small k, the assoiated noise rate would be quitehigh).

Finally,observethatthere isasimilaritytothe probabilisti oneptsframeworkof[29℄ in that, given a stohasti missing data mehanism, we have observations of a mapping fromaninput domainonsisting of partiallyobserved vetorstooutputs whosevalues are onditional distributions over f0;1g onditioned on the observed inputs. The dierene is that we do not just want to model the onditional distribution of outputs given any input, we also want anunderlying deterministi funtion to be well-approximated by our (deterministi)hypothesis. In this paperwe make use of the quadrati loss funtion of an observation and hypothesis, asdened in[29℄.

1.2 Formalization of the Learning Problem

We are interested in algorithms for probably approximately orret (PAC) learning as introdued by Valiant in [38, 39℄. Here we give the basi denitions and introdue some notation. An algorithm has aess to a soure of observations of a target funtion t : X ! f0;1g, in whih inputs are hosen aording to some xed probability distribution D overthe domain X,and the orret 0=1 output isgiven for eah input. It isgiven two parameters, a target auray and an unertainty bound Æ. The goal is to output (in timepolynomialin

1 and Æ

1

(6)

in[7℄). TheRFAliteraturegivesexamplesthatshowthatsomeknowledgeofDisneessary formost learningproblems,and itisoftenassumed that D isaprodut distribution(eah attribute hosen independently). In this paper we do not address the topi of partial knowledgeofD. Inthenext setionweshowthatsomeknowledgeisneessaryforlearning linear threshold funtions (the funtionlass of interest here).

Within the PAC framework, we are studying speially 1-RFA learnabilitywhere for eah example the learner an see one of the input values and the binary output value. Thus, for domain X = R

d

, an example is a member of R f1;:::;dg f0;1g, sine it ontains a real value, the identity of the oordinate taking that value, and the output label. As noted, the assumption that the oordinate's identity is hosen by the learner is equivalent(forPAClearning)tothe assumptionthat itishosen atrandom. Thisis more stringent than \missing ompletely at random" sine we have imposed an artiial limit (of 1) onthe number of observed input values. We have observed that this artiiallimit is important to disallow disarding some training examples and using others. Obviously PAC-learnabilityof 1-RFAdata implies PAC-learnabilityof k-RFA data forany larger k. Our aimistouse fragmented datatolearn linearthreshold funtions,that isfuntions mappingmembers of some unknown halfspaeof R

d

tothe output 0,and itsomplement to 1. These are funtions of the form f((x

1 ;:::;x

d

)) = 1 i P

i a

i x

i

> where a i

are unknown oeÆients and is a \threshold" value. Throughout, we use the unit ost modelof real numberrepresentation.

Our algorithm is (for a large lass of input distributions) polynomial in the PAC pa-rameters

1

and Æ 1

, provided that d is onstant. In investigating the behavior of the algorithmasafuntionofdimension d,weneedtoonsideritwithrespet toa paramater-ized lass D

d

of inputdistributions,where D d

isaprobability distributionover R d

. (This isdue tothe dependene we have noted ofsample omplexity oninput distribution.) The algorithm's runtime is typially exponential in d, but for two lasses D

d

of interest, the sample omplexity an be shown to be polynomial.

1.3 Related WorkonLinear ThresholdsandNoise-tolerant Learn-ing

The domain R d

(foronstant d)is a widely onsidered domain inthe learning theory lit-erature. Examples oflearningproblems overthis domaininludePAC-learningof boolean ombinations of halfspaes [15℄, query-based learning of unions of boxes [16℄, and unions of halfspaes[9, 4, 13℄. A tehnique of [9℄ generalized by [15℄ involves generating a set of funtions that realise alllinear partitionsof asample of input vetors. If m is the sample size then the set of partitions has size O(m

d

(7)

ature. We willnot reviewthe algorithmshere, but see Blumet al.[11℄ fora goodaount of the PAC learning results. It is well-known that in the basi PAC framework, linear threshold funtionsare learnable. Findingaonsistenthypothesis(a hyperplanethat sep-arates the given inputs with output 1 from those with output 0) an be solved by linear programming in polynomial time. The well-known results of Blumer et al. [9℄ show that any onsistenthypothesis ahievesPAC-ness, given a samplewhose size isproportionalto

1

, log(Æ 1

), and d. (This uses the fat that the Vapnik-Chervonenkis (V-C) dimension of halfspaes of R

d

is d+1, see [9℄ fordetails.)

As mentioned in the previous subsetion, we assume unit ost for representation and arithmeti operations on real values. The algorithm of [11℄ PAC-learns linear threshold funtions in the presene of random mislassiation noise, and requires the logarithmi ostmodelforreal valuerepresentation. Soalsodoesthe basiPACalgorithmof [9℄,sine known algorithms for linear programming that are polynomial in d assume logarithmi ost. (Forunit ostreal arithmeti,urrentlyitisknown howtodolinearprogrammingin polynomial time for logarithmi d, see Gartner and Welzl [25℄.) These observations raise the question of whether we an nd an algorithm that is polynomial in d as well as the PAC parameters, for logarithmi ost real arithmeti. In setion 4 where we disuss in moredetail thease whereinputsomefromthe disretebooleandomain,weexplainwhy this open problemis stilllikelyto be hard.

In this paper we show how to onvert our algorithm into a statistial query (SQ) al-gorithm(as introdued by Kearns[28℄), whih impliesthat it an be made noise-tolerant. (Overthe booleandomain f0;1g

d

amore generalresult ofthiskindalready exists,namely that learnability in the k-RFA implies SQ-learnability and hene learnability in the pres-ene of random lassiation noise, for k logarithmi in d [6℄.) An extension to RFA learnability of linear thresholds (in time polynomial in d) would then be a strengthening of the result of [11℄.

Note that if we had a method for determining a good approximation of the error of a hypothesis (using the fragmented data) then we ould PAC-learn, using a result of [7℄, whih says that PAC-learnability with a known distribution D in the standard setting is equivalent to PAC-learnability with a known distribution when instead of examples, the learning algorithm has a means of measuring the error of any hypothesis it hooses. However, wehavenot found anygeneralwayof approximatelymeasuringmislassiation rate of a hypothesis using RFA data, even for the kinds of input distributions that we identify as implyingpolynomial sampleomplexity.

1.4 Tehnial Preliminaries

(8)

that without some information about the input distribution it is often possible to dene pairs of senarios (a senario is the ombination of an input distribution and lassier) whih are substantially dierent but are indistinguishable to a RFA learner. We use the same method for linearthreshold funtions.

Given abinary-valuedfuntion C,dene pos(C) tobe the positiveexamples of C,i.e. fx:C(x)=1g and neg(C) tobethe negative examples,i.e. fx:C(x)=0g.

Fat 1 It isimpossible to learn linear thresholds over R 2

for a ompletelyunknown input distribution D, even fora omputationally unbounded learner.

x

y

0

1

2

3

4

5

6

1

2

3

4

5

0

1

2

3

4

5

6

1

2

3

4

5 neg(C)

pos(C)

neg(C’)

pos(C’)

C’

C

gure1

Dierent but indistinguishable senarios desribed in proof of fat 1:

Proof: Dene linear threshold funtions C, C 0

over the (x;y)-planeas follows.

pos(C)= f(x;y):y<1+x=2g pos(C

0

)= f(x;y):y<4 x=2g

Dene input distributions D, D 0

overthe (x;y)-planeas follows. D is uniform overthe 4 unit squares whose lower-left orners are at (0;0), (4;2), (1;2) and (5;4). D

0

is uniform overthe4unit squareswith lower-leftornersat(0;2), (4;0), (1;4) and (5;2). (These are the shaded regionsin gure 1.)

Consider 1-RFA data generated by either C in ombination with D, or C 0

in ombi-nation with D

0

. The marginal distributions (that is, the distributions of the separate x and y oordinates) are the same in both ases, as are the onditional distributions of the output labelgiven the input (sofor example, Pr(label=1 j x2[0;1℄)=1 in both ases, or Pr(label = 1 j y 2 [2;3℄) = 1=2 in both ases). But the two underlying funtions are

(9)

Sine the disrete boolean domain X = f0;1g is of speial interest, we give a similar onstrution insetion 4 for that speial ase, thus showing that some knowledge of D is stillrequired. (That onstrutionuses 4 input dimensions, ratherthan just 2.)

Theaboveonstrutiongivesindistinguishablesenariosforpairsofinputdistributions thatdierfromeahother. Weshowlaterthatforany knowninputdistribution,thereare no indistinguishable pairs of linear threshold funtions (in ontrast with funtion lasses ontaining, for example, exlusive-or and its negation, [5℄). But the following example shows how a known input distribution may aet sample omplexity. Observe rst that for pairwise omparison, the optimal strategy is to maximizethe likelihoodof the output labels given the input oordinate values. For an individual example in whih the input oordinate x

i

takes the value r 2 R and the output label is l 2 f0;1g, this likelihood is the probabilitythatpointsgeneratedby D onditionedon x

i

=r giveoutputvalue l. For a olletionof suh examples the likelihoodis the produt of the individual likelihoods.

Example 2 Suppose that D is uniformover two line segments in the (x;y)-plane, having (forsomesmallpositive)endpoints ((;0);(1;1 ))and ((0;);(1 ;1)). LetC(x;y)= 1 i y<x and let C

0

(x;y)=1 i y>x.

x

y

C

C’

0

1

0

1

1 pos(C’)

neg(C’)

neg(C)

pos(C)

gure2

C and C 0

asdened inexample 2; whih disagreeon allinputs (x;y):D is uniform over the two heavy line segments inthe square:

If the target funtion is C (respetively, C 0

), then a PAC algorithm should have a probability Æ of outputting C

0

(respetively, C), for any error bound < 1. But if either C or C

0

is the target funtion, then in order to have any evidene in favor of one over the other, it is neessary to see an example in whih the value assigned to the given input oordinate lies in the range [0;℄[[1 ;1℄. Examples of this kind our with probability 2, and all other points are uninformative (having equal likelihood for C and C

0

(10)

beomes the line segment with endpoints at (0;0) and (1;1)), the assoiated sample-size requirementsdonotbeomeinnite;insteadthe learningproblemredues toasimilarone in one dimension fewer.

2 Eet of Joint Distribution of Inputs on Sample

Complexity of Pairwise Comparisons

In this setionwe giveresults about the way the jointdistribution over input omponents may aet the sample-size requirements for a restrition of the learning problem. We assumethatonlytwoandidatefuntions C, C

0

aregiven,whihdisagreewith probability . One of them is the target funtion, and the aim is to determine whih one is the target funtion, with probability 1 Æ of orretness. Example 2 showed a lass of input distributionswhosemembersouldmakearbitrarilylargetheexpetednumberofexamples needed todistinguish a partiular pair of funtions. Note, however, that

1. No input distribution gave the requirement that any pair of positive values (;Æ) of target aurayand ondenerequired innite data.

2. The asymptoti behaviourof sample-sizerequirements is stillpolynomial. In parti-ular, we laim that given any pair of linear threshold funtions that disagree with probability , we need (max(

1 ;

1

)) examplesin orderto distinguish themwith some given probability of suess. This is stillpolynomial in , forany given >0.

Regardingpoint1above,weshowinsetion2.1(theorem4)thatthereisnoinput distribu-tion whosemarginaldistributionshavewell-dened means andvarianesthatallowssome pair ofdistintlinear thresholdfuntions thatdierby some >0 tobe indistinguishable in the limit of innite 1-RFA data. Moreover in orollary 5 we show that a nite upper boundonsamplesizeanbederivedfromD, and Æonly,andnotonthepartiularhoie of C and C

0

whih dier by . Regarding point 2, in setion 2.2 we give fairly general suÆientonditionsonaninputdistribution,forsampleomplexity tobepolynomial. We do however in setion 2.3 identify ertain \pathologial" distributions where the sample omplexity is not neessarily polynomialin

1

and Æ 1

.

2.1 Finiteness Results for Sample-size Requirements

Inwhatfollows,weassumethatallprobabilitydistributionshavewell-denedexpetations and varianes for omponents of input vetors. Regarding point 1 above, we show that for these probability distributions there is never an innite sample-size requirement one a distributionis given, despitethe fatthat distributions may be arbitrarilybad.

Lemma 3 Let D, D 0

be probabilitydistributionswithdomains R andR 0

respetively, both subsetsof R

d

. Suppose moreoverthat R and R 0

(11)

random variables x and x generated by D and D respetively, the expeted values E(x) and E(x

0

) are distint.

Proof: Sine the expeted value is a onvex ombination, we just note that E(x)2 R and E(x

0 )2R

0

, and sine R\R 0

=;, the expeted values are indeed distint. 2

C andC 0

asdenedinthestatementofthe followingtheoremareslightlymoregeneral than linear threshold funtions | we use the additional generality in the proof of orol-lary 5. For a funtion f : X ! f0;1g, let pos(f) denote fx 2 X : f(x)= 1g and let neg(f) denote fx2X : f(x)=0g.

Theorem 4 Let D be any probability distribution over R d

whose marginal distributions have well-dened means and varianes. Let C and C

0

be any pair of funtions from R d

to f0;1g suh that

1. pos(C), neg(C), pos(C 0

), neg(C 0

) are all onvex.

2. with probability 1, a point generated by D lies in pos(C)[neg(C).

3. with probability 1, a point generated by D lies in pos(C 0

)[neg(C 0

).

4. with probability , a point generated by D is givendierent labels by C and C 0

.

Then C and C 0

are distinguishable (using1-RFAdata) withprobability 1 Æ (for ;Æ>0) for some suÆiently large nite sample size (dependent on D;;Æ;C;C

0 ).

Proof: C and C 0

divide the domain R d

into 4onvex regions dened as follows.

R 00

=neg(C)\neg(C 0

) R 01

=neg(C)\pos(C 0

) R

10

=pos(C)\neg(C 0

) R 11

=pos(C)\pos(C 0

)

Let D(R ij

) be the probabilitythat apointgenerated by D liesin region R ij

. The region ofdisagreementof C and C

0 is R

01 [R

10

{by assumption4,D(R 01

[R 10

)=. Let(R ij

) denotethe expetationof pointsgeneratedby D,restritedtothe regionR

ij

|aslong as D(R

ij

)>0, (R ij

)iswell-dened byour assumptionthatomponentsofpointsgenerated by D have well-dened expetations and varianes.

The points (R 00 );(R 01 );(R 10 );(R 11

) are all distint from eah other (observing thatthe R

ij

areonvexanddisjoint,sowean uselemma3). Nextnotethat theexpeted value of negative examples of C is a weighted average of (R

00

) and (R 01

) (weighted by probabilities D(R

00

) and D(R 01

)). Similarly the expeted value of negative examples of C

0

is a weighted average of (R 00

) and (R 10

) (weighted by probabilities D(R 00

) and D(R

10 )).

We use the fat D(R 01

)+D(R 10

)= >0 todedue that the negative examples of C and C

0

have dierent expetations. Ifthe (distint) points (R 00

), (R 01

), (R 10

) donot lieonaone-dimensional line,this follows. Ifthey lieona line,the point (R

01

) annot be in the middle, sine that would ontraditonvexity of neg(C

0

). Similarly (R 10

(12)

00

of the averages are positive, the means (neg(C)) and (neg(C 0

)) must lie on opposite sides of (R

00

) onthe line.

Soweanhoose aomponentonwhihmeansofnegativeexamplesdier,and usethe observed mean of 0-labeled observations of that omponent toestimate the true expeted value. Given our assumption that the variane is well-dened (nite), there will be a suÆientlylarge samplesize suh thatwe an withhigh probabilitypreditwhihof C or C

0

islabeling the data. 2

Corollary 5 Given any input distribution D over R d

and any target values ;Æ > 0 of PACparameters, there existsasuÆiently large nitesample sizeforwhih any pair C;C

0

of linear threshold funtions an be distinguished with probability 1 Æ.

Proof: Suppose otherwise. Then for some D, , Æ there would exist a sequene of pairs (C

i ;C

0 i

), i 2 N where C i

diers from C 0 i

by , and as i inreases, the sample-size required to distinguish C

i

from C 0 i

inreases monotonially without limit. We prove by ontradition that suh asequene annotexist.

The general strategy is as follows. From the sequene (C i

;C 0 i

) extrat a subsequene (C

i ; C

0 i

)whih\onverges"inthesensethatasiinreases, theprobabilityofdisagreement between C

i

and C j

,foranyj >i,tendstozero,andlikewiseforC 0 i

and C 0 j

. Thesequenes

C i

and C 0 i

then onverge pointwise to binary lassiers C 1

and C 0 1

suh that pos(C 1 ), pos(C 0 1 ), neg(C 1

) and neg(C 0 1

) are onvex. 1

Theorem 4 says that C 1

and C 0 1

should be distinguishable with any PAC parameters ;Æ >0, for nite sample-size depending on , Æ. But this willbeontradited by the onvergene property of (C

i ;C

0 i

). Dene the C-dierene between (C

i ;C

0 i

) and (C j

;C 0 j

) (denote d((C i ;C 0 i );(C j ;C 0 j ))) to be theprobabilityPr(C

i

(x)6=C j

(x)) for x generatedby D. Wewillonstrut aninnite subsequene (C

i ;C

0 i

) suh that for j >i,

d((C i ;C 0 i );(C j ;C 0 j

))<2 1 i

:

FromaresultofPollard[35℄(see alsoHaussler[26℄),forany >0,thereisanite-over foranyolletionofsetshavingniteV-Cdimension(whihaswehavenotedinsetion1.3 is d+1 in this ase). (A -over of a metri spae is a set S of points suh that for all points x in the metrispae there is a member of S within distane of x.)

Construt C i

as follows. Let C 1

=C 1

. Now onstrut C i+1

from C i

maintaining the invariant that there are innitely many elements of the sequene (C

j ;C

0 j

) whih have C -dierene2

1 i with(C i ;C 0 i

). LetS i

beanite2 i 1

-overofthelassoflinearthreshold funtions,with respet toinputdistribution D. Let C

i

bethe (innitelymany) elements of (C

j

) that are 2 1 i

from C i

. S i

must have an element whose 2 i 1

-neighborhood ontains innitely many elements of C

i

. Let C i+1

be one of those elements, and then 1

These regionsarenotneessarilyopenorlosed halfspaesevenif pos(C 1

)[neg(C 1

) isallof R d

(13)

C i+1

iswithin 2 of innitely manyelements of C i

. Removeallother elements fromthe sequene (C

j

) and ontinue. Dene the C

0

-dierene between (C i

;C 0 i

) and (C j

;C 0 j

) (denote d 0 ((C i ;C 0 i );(C j ;C 0 j ))) to be the probability Pr(C

0 i

(x) 6= C 0 j

(x)) for x generated by D. We may use a similar

argumenttoextratfrom (C i

;C 0 i

)aninnitesubsequene (C i

;C 0 i

),forwhihwealsohave that for j >i,

d 0 ((C i ;C 0 i );(C j ;C 0 j

))<2 1 i

(aswellas d((C i ;C 0 i );(C j ;C 0 j

))<2 1 i

).

Consider the pointwise limit of this sequene, dened as follows. A point x 2 R d

generatedby D,with probability1has theproperty thatforsuÆiently large N, C i

(x) = C

j

(x) for all i;j >N and also C 0 i

(x) =C 0 j

(x) for all i;j >N. Let C 1

(x) (resp. C 0 1

(x)) denote the label assigned to x by C

i

(resp. C 0 i

) for all suÆiently large i. Let pos(C 1

) and neg(C

1

) denote the points whih get asymptoti labels 1 and 0 by C i

, with similar denitions for C

0 i

. Then pos(C 1 ), neg(C 1 ), pos(C 0 1 ), neg(C 0 1

) are all onvex (that is easily proved by noting that from the onstrution of say pos(C

1

), given any pair of points in pos(C

1

), any onvex ombination of those points must also be in pos(C 1

)). Moreover, with probability 1, a point generated by D lies in one of these sets. So they satisfy the onditions of theorem 4.

LetM <1denoteasamplesizesuÆienttodistinguishC 1

fromC 0 1

withprobability 1 Æ=2. Choose N suÆiently large suh that for random x generated by D,

Pr(C 1

(x)=C i

(x))>1 Æ=4M;

Pr(C 0 1

(x)=C 0 i

(x))>1 Æ=4M;

forall iN. Then withprobability >1 Æ=2, given M samples, C i

agreeswith C 1

and C

0 i

agreeswith C 0 1

onthose samples, forall iN. Then any methodthat ould distinguish C

1

from C 0 1

with unertainty Æ=2 using M samples an be onverted diretly to a method to distinguish C

i

from C 0 i

(for all i N) with unertainty at most Æ. (In partiular replae output of C

1

with C i

and replae outputof C

0 1

with C 0 i

.) This ontradits the assumptionof monotoni unlimitedinrease in sampleomplexityfor terms of the sequene (C

i ;C

0 i

). 2

2.2 Identifying Polynomial AsymptotiBehavior of Sample Com-plexity

(14)

examplesof pairsof funtionsthatdierby ,and theobserved mean anthen beused to distinguish the funtions, using poly(

1

) examples.

Denition 6 Giveninput distribution D, let V(D) denotethelargest varianeof individ-ual omponents of vetors generated by D (a quantitywhih is nite givenourassumption of well-dened means and varianes for the marginal distributions of D).

Now let S(D) be the smallest aÆne linear subspae suh thatwith probability 1, points generated by D lie in that subspae. For a 1-dimensional aÆne line l in S(D), we an projet points generated by D onto l by mapping them to their nearest point on l. Now if points on l are mapped isometrially onto R by xing an origin on l and a diretion of inrease, we have a density p

l

over R. Let M(D) denote the maximum (over lines l in S(D) and points in R) of the density p

l

. Note that M(D) is innite if D assigns a non-zero probability to any proper subspae of S(D) (by hoosing a line l S(D) normal to that subspae).

The measures M(D) and V(D) are motivated by theorem 10 and examples below of distributions for whih we give upper bounds on M and V. The following fat is useful later:

Observation 7 Given any real-valued ontinuous random variable with an upper bound M onits density, its varianeisminimized bymakingituniform overanintervalof length 1=M, and the variane is 1=12M

2

. From this we obtain V(D)1=12 p

dM 2

.

Example 8 Suppose D d

is uniform over an axis-aligned unit ube in R d

. Then by ob-servation 7, V(D

d

) = 1=12. To obtain an upper bound on M(D d

), suppose l is a line through the origin, and then points generated by D

d

projeted onto l an be generated as sums of random variables uniform over [0;l

i

℄ where l i

is the salar produt of a unit ve-tor on l with a unit vetor on the i-th axis. The largest of the l

i

is 1= p

d hene the density is

p

d, so M(D d

) p

d. More generally, other distributions D for whih the measures M(D) and V(D) are well-dened inlude for example, the uniform distribution over any polytope, inluding ones of dimension less than d (for whih S(D) would be a proper subspae of R

d ).

Example 9 If D d

isa normaldistributionwhoseovariane matrixistheidentitymatrix, thenV(D

d

)=1andM(D d

)=(2) 1=2

. Moregenerally, anymultivariatenormal distribu-tion D also has well-dened M(D) and V(D), even if its ovariane matrix does nothave full rank. (See forexample Von. Mises [34℄for standard results about multivariate normal distributions.) For multivariate normal distributions D, S(D) is the spae generated by taking themean of D and addinglinear ombinationsof the eigenvetors of theovariane matrix. M(D) isequal to ((2)

1=2 )

1

where 2

(15)

quired to distinguish any pair C;C 0

of linear threshold funtions that dier by (with probability 1 Æ) ispolynomial in

1

and Æ 1

, (ie the polynomial depends just on D, not on hoie of C;C

0

.) In partiular, the sample size is O(logÆ:M(D)V(D)d 3=2

= 2

).

Proof: We use the notationintrodued intheorem 4:

R 00

=neg(C)\neg(C 0

) R 01

=neg(C)\pos(C 0

) R

10

=pos(C)\neg(C 0

) R 11

=pos(C)\pos(C 0

)

The region of disagreement is R 01

[R 10

, and we are assuming that

D(R 01

)+D(R 10

)=:

Wemay assume that in additionwe have

D(R 01

)=4; D(R 10

)=4

sine otherwisefor C and C 0

there isadiereneofatleast =2 thata randomexample is positive,andC andC

0

ouldbedistinguishedwithpoly( 1

)examplesusingthatproperty. Asbeforelet(R

01

)and(R 10

)denotetheexpetationsofpointslyingintheseregions. The marginalvarianesof points generatedby D are upper-bounded by V(D), sogiven a suÆientdistanebetweenthemeansof R

01

andR 10

,weshouldbeabletousetheobserved meansofthepositive(ornegative)examplestodistinguishC fromC

0

withhighondene. We laim that there is a lower bound on the Eulidean distane j(R

01

) (R 10

)j whih depends on M(D) and V(D), but not C or C

0

, and is polynomialin 1

. Suppose for a ontradition that

j(R 01

) (R 10

)j<=16M(D):

Let l bea 1-dimensionalline that isnormal tothe hyperplane dening C. For R R

d

let l(R ) denote the set of points on l that are losest tosome point in R (the projetion of R onto l). Then l(R

01

)\l(R 10

)=;,but

jl(f(R 01

)g) l(f(R 10

)g)j<=16M(D):

By Markov's inequality,for random x2R 01

(x generated by D restrited to R 01 ), Pr jl(fxg) l(f(R 01

)g)j<=16M(D)

>1=2

(andsimilarlyforpointsinR 10

.) Henetheprobabilityofpointsintherange[l(f(R 01

)g) =16M(D);l(f(R

01

)g)+=16M(D)℄isatleast 1 2 :

4

i.e. thedensityisatleast 1 2 : 4 =(=8M(D)) >M(D), aontradition.

So we onlude that the Eulidean distane between the means of R 01

and R 10

(16)

least =16M(D) d. So the distane between the overall means of say the negative ex-amples of C and of C

0

is 2

=2 p

d16M(D) = 2

= p

d32M(D). The marginal varianes are allupper-bounded by V(D), sothe number of observations of that omponent'svalue needed to identify whihof the two alternativemeans is orret withprobability 1 Æ, is O(logÆ:V(D)M(D)

p d=

2

). Given that eah omponent is equally likely to be observed, the overall sample omplexity beomes O(logÆ:V(D)M(D)d

3=2 =

2

). 2

M(D) and V(D) are rude measures in that for distributions D for whih they are large, the atual sample size needed may not be orrespondingly large. We onsider the questionof whenasimilarresult shouldexistforprobabilitydistributions D whihdonot satisfythe ondition oftheorem 10. Forexample,nite unionsof pointprobabilitymasses are of interest,but automatially donot have nite M(D).

Corollary 11 Suppose D is

1. a nite union of point probability masses,or, more generally,

2. amixtureof aniteunionofpoint probabilitymassesandadistribution D 0

forwhih M(D

0

) and V(D 0

) are nite

then the sample size needed to distinguish C and C 0

(dened in the same way as in theo-rem 10) ispolynomial in the PACparameters, and independent of C, C

0 .

Proof: It is straightforward to prove the rst part of this result, it is in fat a slight generalizationofthe argumentofChow[17℄. Let >0 bethe smallestweightassignedto any of thepointprobabilitymasses. Clearly if C 6=C

0

then they must haveprobability at least of disagreement.

Sine there are only nitely many points in the domain of D, there are only nitely many pairs of distint linear threshold funtions. Hene there is a non-zero lower bound on the dierene between the means of positive examples of C, and of C

0

. This provides a sample omplexity that is polynomial in

1

and Æ 1

, and independent of any other features of C and C

0 .

For an extension to the seond part of this result, again let be the smallest weight of any of the point probability masses, and then for C and C

0

whih dier by < , theirbehavioronpointsgeneratedby D

0

willdistinguishthem(sinethey annotdisagree on any of the point probability masses). Sine M(D

0

) and V(D 0

) are nite, the sample omplexity is polynomial,by theorem 10.

For any > , put Æ = 1=4 and by orollary 5 there exists a nite positive sam-ple size m(;D) suÆient to distinguish any pair C, C

0

whih dier by . Let M = max

2[;1℄

m(;D), whih must be nite, sine otherwise we would have a positive for whih the sample omplexity is innite. Use a sample size of M for > . For smaller values of Æ we an obtain sample omplexity logarithmi in Æ

1

by taking the majority voteof alogarithmi(inÆ

1

)numberofhypotheseswhihhave ondeneparameter1=4. 2

(17)

Complexity

Informed by the suÆient onditions identied for polynomial behaviour, we next dene a distributionwhihdoes not give rise to polynomialbehaviour. That is, for any funtion f, we an onstrut rather artiialinput distributions that ause at least f(

1

) 1-RFA examples tobe needed todistinguish ertainpairs of linear threshold funtionsthat dier by , for all >0.

Theorem 12 Let f be some arbitrary inreasing funtion. There exists a bounded input distribution D(f) on R

3

suhthatforall thereexistlinear thresholdfuntionsC 0

and C 1 whih dier by and require at least f(

1

) samples to be distinguishable with ondene 1 Æ, for Æ <1=2.

Proof: The domain of D is restrited to a sequene of pairs of line segments (l i

;l 0 i ) dened as follows. All the linesegmentsare parallel tothe linegiven by x=y=z,are of unit length,and haveendpointsinthe planes given by x+y+z =0 and x+y+z=

p 3. Wedene their exat loationswith referene toa set of planes dened as follows.

Dene P to be a plane ontaining the line x=y=z, and let C bea linear threshold funtionwith threshold P. Let P

i

, i2N, denotea sequene ofplanesontaining x=y= z, suh that their angles with P onverge to 0 monotonially. (see gure 3. The pointof intersetion ofthe lines ingure3representsthe linex=y=z.) Thesequene P

i

denes a sequene of linear threshold funtions C

i

suh that the symmetri dierene of pos(C i

) and pos(C) stritlyontains thesymmetridiereneof pos(C

j

)and pos(C), forallj >i. The loationsof linesegments l

i , l

0 i

are speiedas follows.

l 0

lies inneg(C)\neg(C 0

): Fori1; l

i

liesin (neg(C)\neg(C i

))nneg(C i 1

): l

0 0

lies inpos(C)\pos(C 0

): Fori1; l

i

liesin (pos(C)\pos(C i

))npos(C i 1

):

Finally,thedistanesfroml i

and l 0 i

fromthelinex=y=z areonstrainedtobe1=2f(2 i

), where f isas dened in the statement of this theorem.

We omplete our denition of D by assigning probability 2 1 i to l i [l 0 i

, and that probability isuniformlydistributed over those two linesegments.

Given this denition of D, we now laim that for target error , we need to observe f(

1

) random 1-RFA examples from D in order to distinguish C from an alternative hypothesis C

i

hosen suh that i is as large as possible subjet to the onstraint that C disagrees with C

i

with probability at least . The regionof disagreement of C with C

i

is theunion [ 1 j=i+1 (l j [l 0 j

),so examplesfrom this set of line segments need to be used in order to distinguish C from C

i

. But we now observe that (by analogy with the onstrution of example 2) with high probability, any example generated from this region has the same onditional likelihood for C as for C

i . In partiular, for any point on l

j

(j > i) that is > 1=f( 1

) from an endpoint of l j

(18)

any value observed for one of its 3 oordinates, there exists a orresponding point on l j whihhas equallikelihoodof generating the same single-oordinateobservation. However points on l

j and l

0 j

should reeive opposite labels from C and from C i

, for j >i. Sowith probabilityatleast 1 1=f(

1

) D failstogenerate apointthatdistinguishes C fromC i

. 2

pos(C)

neg(C)

pos(C )

neg(C )

0

0 pos(C )

neg(C )

1

1 pos(C )

neg(C )

2

2 C

C

l’

l

0

1

2

3

0

1

2

0

1

2

gure3

onstrutionof theorem 12shown in ross setion using plane given by x+y+z=0

The \bad" input distribution dened above has marginal distributions on the input omponents x, y and z whihhave well-dened meansand varianes(this isobviousfrom the fat that the distribution is dened on a bounded region of the domain R

3

). If we dispense with the requirement of well-dened means and varianes, then we an dene similar \bad"distributions intwo dimensions,as follows.

The domainof D isrestrited tothe twolines y=x and y =x+1,forpositivevalues of x and y. As inthe statement of theorem 12, let f be an arbitrary inreasing funtion, and we dene a bad distribution D assoiated with f as follows. For i 2 N, let D be loally uniform over pairs of linesegments whose x-oordinates lie inthe range

R i

=[ i X

r=1 f(2

i );

i+1 X

r=1 f(2

i+1 )℄

Welet the probability that arandom examplelies in R i

be given by D(R i

)=2 i 1

. NowweandenetwolinearthresholdfuntionsC and C

0

(seegure4)whihdisagree on the intervals whose x-oordinates liein R

i

and agree elsewhere. We an now argue in a similar way to before that single-oordinate observations from these regions (the ones whihshouldallowustodistinguish C from C

0

)have(withprobabilityatleast 1 1=f()) equallikelihoodfor both funtions,where i is hosen to minimize2

i

subjet to 2 i

(19)

0

1 x

y

neg(C’)

pos(C’)

pos(C)

neg(C)

f(2 )

i+1

gure4

The domain isrestrited tothe twoheavy lines: C and C 0

disagree on pointsourring between the twovertial dottedlines: This region of disagreementhas probability 2

i :

3 A PAC Algorithm

In this setionwe givea PAC learningalgorithm whoseruntime is polynomial in 1

and Æ

1

provided Dhas nitemeasures M(D) and V(D),orsatisesorollary11. Moreoverif we have a lass of distributions D

d

over R d

, d=1;2;3;:::, for whih M(D d

) and V(D d

) are polynomial ind (forexample the sequenes of distributionsinexamples 8and 9)then the algorithm has sample omplexity polynomial in

1 , Æ

1

and d, but the runtime is exponential in d. We start by desribing the algorithm, then give results to justify the steps. The algorithmis initiallypresented inthe standard PAC setting. In setion 3.3 we show how to express it as a \statistial query" algorithm, as introdued by Kearns [28℄, whoshowed thatsuhalgorithmsarenoise-tolerant. Firstweneedthefollowingdenition.

Denition 13 Thequadratiloss[29℄of anexample (x;l) (with respettoa lassier C) where xistheinput andl isabinaryvaluedlabel,isthequantity(l Pr(label=1jx;C))

(20)

hosen) omponent of avetor x in the domain R d

, where x wasgenerated by D.

3.1 The Algorithm

1. GenerateasetS of(dlogÆ= 3

)(unlabeled)pointsinR d

fromtheinputdistribution D.

2. Generate a set H of andidate hypotheses using the standard method of [9℄ (see below), suh that for eah binary labeling of S onsistent with a linear threshold funtion,H ontainsexatlyonelinearthresholdfuntionthatinduesthatlabeling.

3. Generate a set of labeled 1-RFA data and for eah member H 2 H, use that data set toestimatetheexpetedquadratilossof 1-RFAdataw.r.t. H (the average over allexamples of their quadratilosses). We show that a suÆientsample omplexity for this step is

O(d 7

logÆM(D) 12

V(D) 6

= 24

):

4. Outputthe memberof H with the smallestquadratilossasobserved onthe 1-RFA data.

The method of [9℄ works as follows. Let S = fx 1

;:::;x m

g. The set of all sequenes of labels onsistent with the rst i elements of S is onstruted indutively from the set onsistentwiththersti 1elementsasfollows. Foreahsequeneoflabelsonsistentwith fx

1 ;:::;x

i 1

g, hek whether eah ofthe two possibleextensions of that labelsequene to asequene of i labels, isonsistentwith fx

1

;:::;x i

g. Ifso,add that labelsequene tothe olletionthat isonsistentwith the rst i elements. This methodjust requiresthat itbe possible to eÆiently test whether a funtion in the lass of interest is onsistent with a partiular set of labeled data, whih is of ourse possible for linear threshold funtions in xed dimension. Finally, for eah label sequene for the entire set S, return a onsistent funtion (inour ase, a linear threshold funtion).

Regardingstep3,inthestandardPACframeworkweanusetheempirialestimatefor thequadratiloss, andinsetion3.2weprovethatthe samplesize usedaboveissuÆient. In setion 3.3we show howstep 3 an bedone using statistialqueries, whih shows that the algorithman bemade robust toa uniform mislassiationnoise proess.

3.2 Justiation of the Algorithm

(21)

exist whose observed disagreement on the sample diers from their true disagreement by more than =2, isupper-bounded by

16

16:64

12dlog

2

(32:64em=(d)) e

2

m=(128:4) :

It anbeveried thatthis an beupper-bounded by Æ if m=(dlogÆ= 3

). (Notethat in the bounds of[3℄, d isthe value ofthe fat-shattering funtionwith parameter , whihfor binary lassiers is equalto the V-C dimension, forany .)

The next part of the algorithm nds the hypothesis with the smallest quadrati loss. Sine our set of andidate hypotheses is of polynomial size, we ould just nd an optimal one usingpairwiseomparisons. Our reasonsforpreferringtouse quadratilossare rstly thatwe havethe problemthatthe set H of andidatefuntionsdoesnotgenerallyontain thetargetfuntion;sofarourresultsforpairwiseomparisonhaveassumedthatoneofthe funtionsbeingomparedisthetarget. Theseondreasonisthatminimizingthequadrati lossseems potentially more amenable to heuristis for optimization over anexponentially large set of andidate hypotheses (eg. when d is not onstant).

We an use results of [29℄ to laim that minimizingquadrati loss is a good strategy. Forour purposes quadratilossis a good lossfuntion for the following two reasons.

1. Likethenegativeloglikelihoodlossfuntion,theexpetedquadratilossofa hypoth-esisisminimizedwhenhypothesisonditionalprobabilitiesequalthetrueonditional probabilities.

2. Unlike the negative log likelihood,quadratiloss is bounded (takes values in [0;1℄), soautomatiallywehaveaguarantee that(withhighprobability)observed expeted quadratiloss onverges quikly totrue expeted quadratiloss.

(The disadvantage of quadrati loss by omparison with negative loglikelihood is that it may onlybeused for2-lass lassiation, whih is whatwe have here.)

Notation: For a lassier C let QL(C) denote its expeted quadrati loss (on random examples assumed tobe labeled by some targetfuntion) and let

^

QL (C) denoteobserved expeted quadrati loss for some sample of labeled points. We have noted that

^ QL(C) onverges reasonablyquikly toQL(C),sine quadratilossisbounded (liesin [0;1℄). We alsoneed to beable tolaim that if C is any target funtion wehave:

1. IfC and C 0

dierby ,then QL(C 0

) QL(C) isupperbounded by somepolynomial in

2. If C and C 0

dier by , then QL(C 0

) QL(C) is lower bounded by some other polynomial in

These two properties will validate the approah of nding minimal quadrati loss over members of an -over. Regarding the rst, it is easy to see that QL(C

0

(22)

thatalthoughwedonot havethe exatvaluesofquadratilossformembersofthe -over, we an stillestimate them wellenough forour purposes inpolynomial time.

Denition 14 Thevariation distane between two probabilitydistributions D, D 0

overR is dened to be

var(D;D 0

)= Z

r2R

jD(r) D 0

(r)jdr

Our strategytoprovetheorem 18istorelateerror ofahypothesis C 0

(fortarget C)to the variation distane between the marginal distributions onsome input omponent x of its positive (respetively, negative) examples, and the marginal distributions on x of the positive (respetively, negative) examples of C (lemma 15). Then the variation distane is related to expeted quadrati loss using lemma 16 in onjuntion with lemma 17. We assume throughout that ontinuous densities D(r) and D

0

(r) are Lebesgue integrable, so that itfollows that jD(r) D

0

(r)j and maxf0;D 0

(r) D(r)g are alsoLebesgue integrable (and integrate to var(D;D

0 ) and

1 2

var(D;D 0

) respetively over R).

Lemma 15 Let D and D 0

betwo probabilitydistributionsover R, suhthatthe dierene between their means is and their varianes are both upper-bounded by

2

. Then their variation distane var(D;D

0

) is at least minf1;(=) 2

=8g.

Proof: We may assume that the mean of D is 0 and the mean of D 0

is 0. We obtain an upperbound on in terms of var(D;D

0

) and 2

, and onvert that result into a lowerbound on var(D;D

0

) interms of and 2

. Dene distribution D

00

as follows:

D 00

(r)=

2 var(D;D

0 )

maxf0;D 0

(r) D(r)g:

The oeÆient 2 var(D;D

0 )

normalizes D 00

|weare assumingof oursethat var(D;D 0

)>0. If var(D;D

0

)=0 then =0 and the result holds. The following proeduresamples from D

0 :

1. sample r 2R from D

2. if D(r)>D 0

(r),aept r with probability D 0

(r)=D(r), else rejet r.

3. If r was rejeted above,sample from D 00

.

Observe that the probability that r is rejeted in step 2 is 1 2

var(D;D 0

). Let s be the expeted value of rejeted points. The upper-bound on variane of D gives an upper bound onthe (absolute value of the) expeted value of rejeted pointsas follows:

2

s 2

: 1 2

var(D;D 0

(23)

jsj q

2=var(D;D 0

):

Now isequal to the rejetionprobability 1 2

var(D;D 0

),multipliedby the expeted value of points sampledfrom D

00

minusthe expeted value of rejeted points,i.e.

= 1 2 var(D;D 0 ) E(D 00 ) s :

Againusing the upper bound onvariane, this time variane of D 0 : E(D 00 ) q 2=var(D;D 0 ):

Combiningthe two expressions above we have

1 2

var(D;D 0 ) q 2=var(D;D 0

)+ s

Usingour upper bound for jsj (in partiular s q

2=var(D;D 0

)) and rearranging,

(2 var(D;D 0

))var(D;D 0 )2 q 2=var(D;D 0 ):

Rearranging the above,

2 3=2 [var(D;D 0 )℄ 1=2

2 var(D;D 0

)

Provided that var(D;D 0

)1 we have

2 3=2 [var(D;D 0 )℄ 1=2 Hene var(D;D 0 )(=) 2

=8 or var(D;D 0

)>1

2

Lemma 16 Let C be thetargetlinear thresholdfuntionand C 0

someother linear thresh-oldfuntion. Let D(pos(C))j

x

, D(neg(C))j x , D(pos(C 0 ))j x , D(neg(C 0 ))j x

, be the distribu-tions of the x omponent of positive and negative examples of C and C

0

. Suppose that we have var(D(pos(C))j x ;D(pos(C 0 ))j x ) > var(D(neg(C))j x ;D(neg(C 0 ))j x

) > :

Thenfor1-RFA dataforwhih x istheobserved omponent, wehave alowerbound of =8 on the expeted dierene between the onditional probabilities of output label 1 for C and C

0

(24)

distributed aording to Dj x

, the marginaldistribution of D on x, that

E

Pr(label=1j x=r;C) Pr(label =1 j x=r;C 0

)

<=8:

LetD(pos(C))denotetheprobabilitythatarandomvetorliesintheregionpos(C). Then we have

D(pos(C)) D(pos(C 0

))

<=8:

AssumethatD(pos(C))>1=4 andD(pos(C 0

))>1=4. (IfnotwewouldhaveD(neg(C))> 1=4 and D(neg(C

0

)) > 1=4, and that ase would be handled similarly to what follows.) Wehave assumed for ontradition that

Z

r2R

) Dj

x

(r)dr<=8

(where Dj x

is the marginaldistribution of D on x.) Observe that D(pos(C))j

x (r)=

Pr(l abel =1 jx=r;C) D(pos(C))

and similarlyfor C 0

. Hene the vari-ation distane var(D(pos(C))j

x

;D(pos(C 0

)j x

)) is equalto

Z

r2R

Pr(label=1j x=r;C) D(pos(C))

Pr(label=1 j x=r;C 0

) D(pos(C

0 ))

dr

4:=8+4 Z

r2R

Pr(label=1 jx=r;C) Pr(label=1j x=r;C 0

) dr

<4=8+4=8=;

whihontradits one of the assumptions made inthe lemma.

We have established the lowerbound of =8. 2

Lemma 17 Let C be the target funtion and C 0

some other funtion and suppose that is the expeted dierene between the onditional probabilities of output 1 for C and C

0 , over random inputs from input distribution D. Then

QL(C 0

) QL(C) 2

:

Proof: Let x be an input omponent, and suppose that for some 1-RFA input x =r, we have

Pr(label=1 jx=r;C) = p; Pr(label=1j x=r;C

0

) = p+:

Then the expeted quadrati lossof C for input x=r is

QL(C j x=r)=p(1 p) 2

(25)

For C we have

QL(C 0

j x=r) = p(1 p ) 2

+(1 p)(p+) 2 = p(1 p)

2

+(1 p)p 2

+ 2 = QL(C j x=r)+

2

By onvexity, the expeted quadrati loss of C 0

averaged over random input values is minimized by assuming that for all r 2 R, the dierene in onditional probabilities is uniform, sothat for any input x=r,

Pr(label =1 j x=r;C 0

) Pr(label =1 j x=r;C 0

) =:

So for inputs onsisting of observations of x, the dierene between expeted quadrati losses of C

0

and C is atleast 2

. 2

We nowuse all these lemmasin the following

Theorem 18 For the lass of linear threshold funtions over R d

, suppose that the input distribution D has nite values M(D) and V(D) as dened in denition 6, and that the target funtion has quadrati loss Q

. Then any funtion with error has quadrati loss at least Q

+p() for polynomial p where

p()=

8

2 26

d 2

:M(D) 4

V(D) 2

:

Proof: We onsider two ases:

1. for random x2R d

, jPr(C(x)=1) Pr(C 0

(x) =1)j>=2

2. for random x2R d

, jPr(C(x)=1) Pr(C 0

(x) =1)j=2

Case 1: for any input omponent x, Z

r2R

) :Dj

x

(r)dr>=2:

Hene by lemma17,

QL(C 0

) QL(C)> 2

=4:

Case 2: we use the notationintroduedin theorem 4:

R 00

=neg(C)\neg(C 0

) R 01

=neg(C)\pos(C 0

) R

10

=pos(C)\neg(C 0

) R 11

=pos(C)\pos(C 0

)

The region of disagreement is R 01

[R 10

, and by the assumption of the theorem,

D(R 01

)+D(R 10

(26)

D(R 01

)=4; D(R 10

)=4:

Weontinue bylower-bounding j(R 01

) (R 10

)j,upper-boundingthe marginalvarianes of pointsfrom R

01

and R 10

,thene gettingalowerbound for var(D(R 01 )j x ;D(R 10 )j x ) for some omponent x, then use lemmas 16and 17 toget the lower bound onquadrati loss.

From the proofof theorem 10we have

j(R 01

) (R 10

)j=16M(D):

Let (R )j x

and 2

(R )j x

denote the expetation and variane of x-oordinates of points generated by D that liein RR

d

. Forsome omponent x we have (R 01 )j x (R 10 )j x =16 p dM(D):

Wealso have

2 (R 01 )j x V(D)= 4 and 2 (R 10 )j x V(D)= 4

using the assumed upper bound on the marginal varianes of D and the probabilities of points lying in R

01

and R 10

. Hene using lemma 15 we have that the variation distane between the x-value of pointslying in R

01

and points lyingin R 10

isat least

min n 1; 2 =256d:M(D) 2 8V(D)=(=4) o =min n 1; 3 2 13 d:M(D) 2 V(D) o = 3 2 13 d:M(D) 2 V(D)

using observation 7 and the fat that 1. The variation distanes between 0-labeled examples of C and C

0

, and between 1-labeled examples of C and C 0

are at least times this amount, ie.

= 4 2 13 d:M(D) 2 V(D) :

Hene the expeted dierene between onditional probabilitiesof output 1 for C and C 0 is by lemma 16,atleast

4 2 16 d:M(D) 2 V(D) :

Finally,we use lemma 17to obtain

QL(C 0

) QL(C)

8 2 32 d 2 :M(D) 4 V(D) 2 :

The lower bound of ase 2 an be seen tobe stritly weaker than the lower bound for ase 1, sothe ombinationis just the lower bound for ase 2. 2

(27)

Theorem 19 For the lass of linear threshold funtions over R , suppose that the in-put distribution D satises the riteria of orollary 11, and that the target funtion has quadrati loss Q

. Then any funtion witherror has quadrati lossat least Q

+p() for some positive inreasing polynomial p.

Thisextensiontotheweakeronstraintsoftheorem11justinvolvesboundingthemeans of the regionsof disagreement away fromeah other (asdone in the proofs of theorem 10 and orollary 11) and then proeeding asinthe above proof.

We have now shown how the expeted quadrati loss of a hypothesis is polynomially related to its disagreement with the target funtion. The following result uses this rela-tionship to justify the strategy of nding a hypothesis of minimal quadrati loss (over a -over K that may not neessarily ontain the target funtion), as well as showing that the observed quadrati losses of elements of K are suÆiently good estimates of the true quadratilosses.

Theorem 20 Let C be a set of binary lassiers with V-C dimension d, and let QL be the quadrati loss funtion as dened earlier. Suppose that there are positive inreasing polynomials p, p

0

suh that if any C 2C has error , we have

Q

+p( ) QL(C)Q

+p 0

( )

(where Q

isthequadrati lossof thetarget funtion.) Thenthestrategyof minimizingthe observed quadrati loss over an empirial -over ahieves PAC-ness, for =p

0 1 (

1 2

p()) and sample size O(dlogÆ=

3 ).

Comment: Theresultwouldholdforanylossfuntionthathadtheassoiatedpolynomials p and p

0

. We have shown in theorem 18 that a suitable p exists for the quadrati loss funtion, and observed earlier that for quadratilosswe an put p

0

( )=. Proof: Let = p

0 1 (

1 2

p()), so 1

is polynomial in 1

. Let K be the -over. We have jKj=O((dlogÆ=

3 )

d

), andwe used O(dlogÆ= 3

) unlabeledexamples togenerate it. Let C2K have error . Then

QL(C)Q

+p 0

()=Q

+ 1 2

p()

Let C 0

2K have error >. Then

QL(C 0

)Q

+p()

(28)

any given elementof K has observed expetedquadrati losswithin =4 oftrue. Given m samples,the probabilitythat somememberof K hasobserved lossdieringfromtrue loss is (by Hoeding's inequality)upper bounded by exp ( 2m(=4)

2

) =exp ( m 2

=8): (Hoeding's inequality [27℄ is as follows: Let X

j

, 1 j m be independent random variables suh that aX

j

b, 1j m for some 1ab 1. Then

Pr 1 m m X i=1 [X i E(X i )℄ exp 2m 2 (b a) 2

where wehave a=0, b=1.) So weneed exp ( m

2

=8) Æ=jKj, i.e.

exp ( m 2

=8) Æ=O((dlogÆ= 3 ) d ) m 2

=8O

logÆ+dlog( 3

) dlog(dlogÆ)

The seond term dominates, soput

m=O d 2 log( 1 3 )

The overall sample size is

O

dlogÆ= 3 + d 2 log( 1 3 )

where the rst term isthe samples used to obtainthe -over and the seond term is the samples used to measure the expeted quadrati losses of members of the -over. <

so the rst term dominates. 2

Comment: The runtime is polynomial for onstant d. The omputational bottlenek is the generation of a potentially large -over K and the measurement of all its elements individually. Under some onditions there may bepotentialfor heuristieliminationfrom onsiderationof some elementsof K.

Putting it alltogether, we apply theorem 20 inonjution with theorem 18. Wehave

p 0

( )= ; p( )=

8 2 32 d 2 M(D) 4 V(D) 2

Hene = 1 2 p()= 8 =2 33 d 2 M(D) 4 V(D) 2

. The sample omplexity is thus

O

d 7

logÆM(D) 12 V(D) 6 24 :

This ispolynomial inÆ 1

and 1

(29)

ThestudyofPAC-learninginthepreseneofuniformmislassiationnoisewasintrodued in Angluin and Laird [1℄. The assumption is that with some xed probability <

1 2

, any example presented to the learnerhas had itslass label reversed. This is a more realisti model for the data set that motivated this work, in view of the known lass overlap. However the algorithm we have presented so far has assumed that the data are noise-free (so that the 1-RFA data ame from vetors that are linearly separable). In the presene of noise, the algorithm is not generally guaranteed to onverge to the target funtion. It is shown in [6℄ how toonvert k-RFA learning algorithmsto SQ learning algorithmsover the boolean domain f0;1g

d

, for k logarithmi in the dimension. Over the real domain not all learningalgorithmsare amenable to that onversion. We show how toonvert our algorithmfor linear threshold funtions.

Thestatistialquery(SQ)learningframeworkofKearns[28℄isarestritionof thePAC frameworkinwhih thelearnerhas aesstounlabeleddata, andmay makequeries ofthe followingform: Any query speies a prediate whih takes as input a labeledexample ( should beevaluatablein polynomial time), and an error tolerane . The response to the query is an estimate of the probability that a random labeled example satises | the estimateis aurate towithin additiveerror . The 'sused in the queries should be polynomial inthe target auray .

Queries of the above form an be answered using a labeled data set in the standard PAC setting. Kearns shows in [28℄ that they an moreover be answered using a data set with uniform mislassiationnoise as dened above. If

b

is a given upper bound on an unknown noiserate ,thenanSQ algorithmwould bepolynomialin1=(

1 2

b

),aswellas other parameters of interest (whih is how the denition of PAC learning extends to the denition of noise-tolerant PAC learning).

We show how step 3 an be re-ast in the SQ framework. That is, for a given linear threshold funtion H, estimateits expeted quadrati losswith smalladditiveerror .

Let 0

==4jKj,where K isthe -overonstrutedby thealgorithm. Allmembers H of K have their expeted quadrati losses estimated to withinadditive error

0

. For eah interval [0;1℄ of the form [k

0

;(k+1) 0

℄ where k is an integer, we make the statistial query: is the property that an example has quadrati loss (w.r.t. H) in the range [k

0

;(k + 1) 0

℄, and = 02

. Then the answers to these queries provide a histogram approximation to the true distribution of quadrati loss of labeled examples w.r.t. H. This histogramapproximatesaorresponding histogramof the truedistribution towithin variation distane

0

, sothe omputed mean is within 0

of the true mean.

4 The Disrete Boolean Domain

An important speial ase of the problem is when the input distribution has its domain of support restrited to the boolean domain f0;1g

d

(30)

1. ThesampleomplexityispolynomialinthePACparametersforanyxedd,sinethe distributionsatisesthe onditionsof orollary11. (Thatresultis known from[17℄.) It is unknown whether the sample omplexity is alsopolynomial in d.

2. There are only 4d dierent observations possible (an observation being the identity of one of the d oordinates together with two possible inputvaluesand twopossible outputvalues, 0or1),sotheprobabilityofallofthem maybelearnedwithadditive error,intimepolynomialind andthe reiproalofthe error,byastandard Cherno bound analysis.

3. Forxed d, thereis axed numberof distintlinear threshold funtions, sothere is noneed for disretization,e.g. viaan empirial-over.

We show that some knowledge of the input distribution D is still required in this restrited setting. Here weneed4dimensions toallowapair ofindistinguishablesenarios tobeonstruted.

Fat 21 Itisimpossibletolearn linearthresholdsoverthedisretebooleandomain f0;1g d

(for d4), if the input distribution is unknown.

Proof: Put d = 4, it is simple to extend to higher values of d. Let X be the domain f0;1g

4

. For i = 0;1;2;3;4, let X i

X be the set of binary vetors ontaining i ones. Dene

pos(C)= X 2

[X 3

[X 4 pos(C

0

)= X 3

[X 4 Alternatively, we ould say that for input (x

1 ;x

2 ;x

3 ;x

4

) 2 X, C and C 0

respetively have output value 1 i

P 4 i=1

x i

> 1:5 or respetively P

4 i=1

x i

> 2:5. These are two linear threshold (in fat boolean threshold) funtions, whih we laim are indistinguishable, for appropriate hoies of input distribution.

Dene distributions D and D 0

(input distributions over X) as follows. D assigns probability1=5toeahX

i

,withtherestritiontoX i

beinguniform. D 0

assignsprobability 0 to X

4

and X 1

, 3=5 to X 3

, 1=10 to X 2

, and 3=10 to X 0

, and is also uniform over eah X

i .

Given these denitions, it an be veried that D and D 0

have the same marginal distributions over eah input omponent x

i

(in both ases, Pr(x i

= 1) = Pr(x i

= 0) = 0:5). We alsolaimthat the onditional probabilities Pr(l jx

i

=j;C;D) and Pr(l j x i

= j;C

0 ;D

0

) where l isanbinary output label,are alsothe same. Inpartiular, a alulation shows that for i=1;2;3;4,

Pr(l=1j x i

=0) = 3=10; Pr(l=1j x

i

=1) = 9=10:

(31)

d, and inthe remainder of this setionwe onsider the problemfor general d.

It is unknown how to eÆiently learn pereptrons (linear threshold funtions where inputs ome from f0;1g

d

) under the uniform input distribution. This is an open problem whihpredates learningtheory,andisinfatthequestionof howtoapproximatelyreover a pereptron from approximations to its Chow parameters [17℄. (A pereptron is a linear threshold funtion over the boolean domain.) The Chow parameters (whih are the rst-orderFourieroeÆients,see[20℄)arethe setofonditionalprobabilitiesthatweseeinour 1-RFA setting, with D uniform over the boolean domain. It is known from [14, 17℄ that these parameters do determine the threshold funtion. As the sample size inreases, the 2n onditionalprobabilitieswillonverge totheirtrue values, and itshould bepossible to reonstrut the oeÆients of a suitable linear threshold funtion given these true values, althougheventhenwedonotknowhowtodosoinpolynomialtime. Inanyase,itdoesnot followthat it an bedone if the observed probabilitieshave smalladditive perturbations, as would happen with a nite-sized sample. Indeed it isapparently an open question [21℄ whether aomputationally unbounded learner an be sure tohave enough informationin a polynomial-sizedsample.

Indeed,some hypothesistestingproblemsare hardinthis setting. Suppose weonsider the uniform distribution over the unit hyperube f0;1g

n

. If we have exat data, then it is #P-hard to test whether a hypothesis is onsistent with it [22℄. (It is in fat open whether oneanapproximatethe numberofpositiveexamplesononeside ofahyperplane expressed in terms of oeÆients and threshold, with small relative error, see [22℄. The problem we have is in fat the 0/1 knapsak problem.) We an however test additively approximate onsisteny, by random sampling. Note also that our main problem here is ndinga (approximate) onsistent hypothesisas opposed totesting one.

Regarding the question of what sublasses of pereptrons are 1-RFA learnable, it is known that boolean threshold funtionsare 1-RFA learnable,for the uniforminput distri-bution. A booleanthreshold funtion isdened by aset of literalsand athreshold , and evaluates to 1 provided that at least of the literals are satised. This fat is a speial ase of the fat from [20℄ that k-TOP is k-RFA learnable. k-TOP is a lass of boolean funtionsinwhihinsteadofmonomialswehaveparityfuntionsover k oftheinputs(and then the outputs are input to a threshold gate as in the denition of boolean threshold funtions).

5 Conlusion and Open Problems