Ricardo Vilalta
IBM T.J. Watson Research Center
30 Saw Mill River Rd.,
Hawthorne, NY, 10532 U.S.A.

Youssef Drissi
IBM T.J. Watson Research Center
30 Saw Mill River Rd.,
Hawthorne, NY, 10532 U.S.A.

Abstract

Meta-learning offers the potential of extending the capabilities of current learning algorithms by making their mechanism flexible according to the domain or task under study. An impediment to moving forward in this direction is that no clear consensus exists on the exact meaning of the term meta-learning; different research groups hold different views. This paper proposes a perspective view of meta-learning in which the central goal is to build self-adaptive learners, namely learning algorithms that improve through experience by changing their bias dynamically. We propose a general framework addressing the problem of how to build self-adaptive learners, and show research directions that highlight the challenges lying in front of us to reach such a goal.

keywords: inductive learning, classification, meta-knowledge.

1 Introduction

Inductive learning, or classification, takes place when a learner or classifier (e.g., decision tree, neural network, support vector machine) is applied to some data to produce a hypothesis explaining a target concept; the search for a good hypothesis depends on the fixed bias [7] embedded by the learner. The algorithm is said to be able to learn because the quality of the hypothesis normally improves with an increasing number of examples. Nevertheless, since the bias is fixed, successive applications of the algorithm over the same data always produce the same hypothesis, independently of performance; no knowledge is commonly extracted across domains or tasks [8].

In contrast, meta-learning studies how the hypothesis output by a learner can improve through experience. The goal is to understand how learning itself can become flexible according to the domain or task under study. Meta-learning differs from base-learning in the scope of the level of adaptation: meta-learning studies how to choose the right bias dynamically, as opposed to base-learning, where the bias is fixed a priori or user-parameterized. The goal is to discover ways of dynamically searching for the best learning strategy as the number of tasks increases [11, 9]. Hence, meta-learning advocates the need for continuous adaptation of the learner. If a learner fails to perform efficiently, one would expect the learning mechanism itself to adapt in case the same task is presented again. Learning can then take place not only at the example (i.e., base) level, but also at the across-task (i.e., meta) level.

Despite the promising research direction offered by meta-learning, no apparent consensus exists on what is meant by the term. Examples of different views abound: building a meta-learner of base-learners [2], selecting inductive biases dynamically [3], building meta-rules matching task properties with algorithm performance [4], inductive transfer and learning to learn [8], learning classifier systems [5], etc. After addressing some common views of meta-learning, this paper proposes a perspective view where the main goal is to build a general framework to build self-adaptive learners, and delineates research directions that address the challenges lying in front of us to reach such a goal.

This paper is organized as follows. Section 2 gives definitions and background information on classification. Section 3 provides our own perspective view of the nature and potential avenues of research in meta-learning. Finally, Section 4 ends with a summary and conclusions.

2 Preliminaries

Our study is centered on the classification problem exclusively. The problem is to learn how to assign the correct class to each of a set of different objects (i.e., events, situations). A learning algorithm $L$ is first trained on a set of pre-classified examples $T_{train} : \{(\vec{X}_i, c_i)\}_{i=1}^{m}$. Each object $\vec{X}$ is a vector in an $n$-dimensional feature space, $\vec{X} = (X_1, X_2, \ldots, X_n)$. Each feature $X_k$ can take on a different number of values. $\vec{X}_i$ is labeled with class $c_i$ according to an unknown target function $F$, $F(\vec{X}_i) = c_i$ (we assume a deterministic target function, i.e., zero-Bayes risk). In classification, each $c_i$ takes one of a fixed number of categorical values. $T_{train}$ will consist of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution $\phi$ in the space of possible feature-vectors $\mathcal{X}$. The goal in classification is to produce a hypothesis $h$ that best approximates $F$, namely one that minimizes a loss function (e.g., zero-one loss) in the input-output space, $\mathcal{X} \times \mathcal{C}$, according to distribution $\phi$.

Classification begins when learning algorithm $L$ receives as input a training set $T_{train}$ and conducts a search over a hypothesis space $H_L$ until it finds a hypothesis $h_L \in H_L$ that approximates the true function $F$. Thus a learning algorithm $L$ maps a training set into a hypothesis, $L : \mathcal{T} \to H_L$, where $\mathcal{T}$ is the space of all training sets of size $m$. The selected hypothesis $h$ can then be used to guess the class of unseen examples.
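The definitions above can be made concrete with a small sketch. It assumes a toy Boolean feature space in which every point appears once in the training set; the names `target`, `zero_one_loss` and `learn_constant` are hypothetical and serve only to illustrate a learner as a mapping from a training set to a hypothesis in its own hypothesis space.

```python
from itertools import product

def target(x):
    """Hypothetical deterministic target function F: class 1 iff X_1 and X_2 are both 1."""
    return int(x[0] and x[1])

def zero_one_loss(h, examples):
    """Fraction of labeled examples (x, c) on which hypothesis h disagrees with c."""
    return sum(h(x) != c for x, c in examples) / len(examples)

def learn_constant(train):
    """A deliberately weak learner L: its hypothesis space H_L holds only the two
    constant functions, and it returns the one with the lowest training loss."""
    candidates = [lambda x: 0, lambda x: 1]
    return min(candidates, key=lambda h: zero_one_loss(h, train))

# T_train: every point of a 3-feature Boolean space, labeled by F.
train = [(x, target(x)) for x in product([0, 1], repeat=3)]
h = learn_constant(train)
loss = zero_one_loss(h, train)  # F is not in H_L, so the loss stays above zero
```

Because the target is not a constant function, no hypothesis available to this learner reaches zero loss, however many examples it sees.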

Learning algorithm $L$ embeds a set of assumptions, or bias, that affects the learning process in two ways: it restricts the nature and size of the hypothesis space $H_L$, and it imposes an ordering or ranking over all hypotheses in $H_L$. The bias of a learning algorithm $L_A$ is stronger than the bias of another learning algorithm $L_B$ if the size of the hypothesis space considered by $L_A$ is smaller than the size of the hypothesis space considered by $L_B$ (i.e., if $|H_{L_A}| < |H_{L_B}|$). In this case, the bias embedded by $L_A$ conveys more extra-evidential information [12] than the bias in $L_B$, which enables us to narrow down the number of candidate hypotheses estimating the true target concept $F$. We say the bias of a learning algorithm is correct if the target concept is contained in the hypothesis space (i.e., if $F \in H_L$). An incorrect bias precludes finding a perfect estimate of target concept $F$.
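To illustrate bias strength and correctness, the following sketch enumerates two hypothesis spaces over two Boolean features and checks their sizes and membership of two target concepts. The choice of spaces (conjunctions of literals versus all Boolean functions) and every name in the code are assumptions made for this example, not part of the framework itself.

```python
from itertools import product

N = 2
POINTS = list(product([0, 1], repeat=N))  # all feature vectors

def truth_table(h):
    """Represent a hypothesis by its outputs on every point, so spaces can be compared."""
    return tuple(h(x) for x in POINTS)

def conjunction(spec):
    """spec[k] is 1 (require X_k = 1), 0 (require X_k = 0) or None (ignore X_k)."""
    return lambda x: int(all(s is None or x[k] == s for k, s in enumerate(spec)))

# H_A: conjunctions of literals (stronger bias, smaller space).
H_A = {truth_table(conjunction(spec)) for spec in product([0, 1, None], repeat=N)}
# H_B: every Boolean function over the N features (weaker bias, larger space).
H_B = set(product([0, 1], repeat=len(POINTS)))

F_and = truth_table(lambda x: x[0] & x[1])
F_xor = truth_table(lambda x: x[0] ^ x[1])

stronger = len(H_A) < len(H_B)      # the bias of L_A is stronger
correct_for_and = F_and in H_A      # bias of L_A is correct for AND
correct_for_xor = F_xor in H_A      # but incorrect for exclusive-or
```

In this toy setting the stronger bias is correct for the conjunction target and incorrect for exclusive-or, while the weaker bias contains both targets.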

3 A Perspective View

In base-learning, the hypothesis space $H_L$ of a learning algorithm $L$ is fixed. Applying a decision tree, neural network, or support vector machine over some data produces a hypothesis that depends on the fixed bias embedded by the learner. If we represent the space of all possible learning tasks$^1$ as $S$, then algorithm $L$ can learn efficiently over a limited region $R_L$ in $S$ that favors the bias embedded in $L$; algorithm $L$ can never be made to learn efficiently over all tasks in $S$ as long as its bias remains fixed [10, 13]. One may rightly argue that the space of all tasks contains many random instances; failing to learn over those instances carries in fact no negative consequences. For this reason, we will assume $R_L$ belongs to a subset of structured tasks, $S_{struct} \subset S$, where each task is non-random and can be ascribed a low degree of complexity (e.g., Kolmogorov complexity [6]).

$^1$ Let a learning task be a 3-tuple, $(F, m, \phi)$, comprising a target concept $F$, a training-set size $m$, and a sample distribution $\phi$ from which the examples in the training set are drawn.

[Figure 1 (diagram): the space of structured concepts $S_{struct}$, containing regions $R_{L_A}$ and $R_{L_B}$ and tasks $T_1$, $T_2$, $T_3$, $T_4$.]

Figure 1: Each learning algorithm covers a region of (structured) tasks favored by its bias. Task $T_1$ is best learned by algorithm $L_A$, $T_2$ is best learned by algorithm $L_B$, whereas $T_3$ is best learned by both $L_A$ and $L_B$. Task $T_4$ lies outside the scope of $L_A$ and $L_B$.

One goal in meta-learning is to learn what causes $L$ to dominate in region $R_L$. The problem can be decomposed into two parts: 1) determine the properties of the tasks in $R_L$ that make $L$ suitable for such a region, and 2) determine the properties of $L$ (i.e., what the components embedded in algorithm $L$ are and how they interact with each other) that contribute to its dominance in $R_L$. A solution to the problem above would provide guidelines for choosing the right algorithm on a particular task. As illustrated in Figure 1, each task $T_i$ may lie inside or outside the region that favors the bias embedded by a learning algorithm $L$. In Figure 1, task $T_1$ is best learned by algorithm $L_A$ because it lies within the region $R_{L_A}$. Similarly, $T_2$ is best learned by algorithm $L_B$, whereas $T_3$ is best learned by both $L_A$ and $L_B$. A solution to the meta-learning problem can indicate how to match learning algorithms with task properties, in this way yielding a principled approach to the dynamic selection of learning algorithms.

In addition, meta-learning can solve the problem of learning tasks lying outside the scope of available learning algorithms. As shown in Figure 1, task $T_4$ lies outside the regions of both $L_A$ and $L_B$. If $L_A$ and $L_B$ are the only available algorithms at hand, task $T_4$ cannot be learned efficiently. One approach to solving the problem above is to use a meta-learner to combine the predictions of base-learners in order to shift the dominant region over the task under study. In Figure 1, the goal would be to embed the meta-learner with a bias favoring a region of tasks that includes $T_4$.
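One way to picture such a combination is a stacking-style sketch in which the meta-learner sees only the predictions of the base hypotheses. The exclusive-or task standing in for $T_4$, the two single-feature base hypotheses and the lookup-table meta-learner are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

def h_a(x):
    """Hypothetical base hypothesis from L_A: looks only at the first feature."""
    return x[0]

def h_b(x):
    """Hypothetical base hypothesis from L_B: looks only at the second feature."""
    return x[1]

def accuracy(h, examples):
    return sum(h(x) == c for x, c in examples) / len(examples)

def fit_meta(base, train):
    """Learn a lookup table from the tuple of base predictions to the majority class."""
    votes = defaultdict(Counter)
    for x, c in train:
        votes[tuple(h(x) for h in base)][c] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    # Unseen prediction tuples default to class 0 (an arbitrary choice).
    return lambda x: table.get(tuple(h(x) for h in base), 0)

# Stand-in for task T_4: exclusive-or of the two features.
train = [(x, x[0] ^ x[1]) for x in product([0, 1], repeat=2)]
meta = fit_meta([h_a, h_b], train)

acc_a = accuracy(h_a, train)      # each base hypothesis is right half the time
acc_b = accuracy(h_b, train)
acc_meta = accuracy(meta, train)  # the combination covers the task
```

The combined hypothesis lies in a space that neither base learner searches, which is the sense in which the dominant region is shifted. The meta-learner here still has a fixed bias of its own.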

3.1 Self-Adaptive Learners

The combination of base-learners by a meta-learner offers no guarantee of covering every possible (structured) task of interest. We claim a potential avenue of research in meta-learning is to provide the foundations to construct self-adaptive learning algorithms that change their internal mechanism according to the task under analysis. In Figure 1, this would mean enabling a learning algorithm to move along the space of structured concepts $S_{struct}$ until the algorithm learns to cover the task under study. We assume this can be achieved through the continuous accumulation of meta-knowledge indicating the most appropriate form of bias for each different task. Beginning with no experience, the learning algorithm would initially use a fixed form of bias to approximate the target concept. As more tasks are observed, however, the algorithm would be able to use the accumulated meta-knowledge to change its own bias according to the characteristics of each task.

[Figure 2 (diagram): components Training Set, Self-Adaptive Learner, Meta-Learner, Rules of Experience, Hypothesis, Performance Assessment, Meta-Feature Generator, and Performance Table.]

Figure 2: A flow diagram of a self-adaptive learner.

Figure 2 is a (hypothetical) flow diagram of a self-adaptive learner. The input and output components of the system are a training set and a hypothesis, respectively. Each time a hypothesis is produced, a performance assessment component evaluates its quality. The resulting information becomes a new entry in a performance table; an entry contains a vector of meta-features characterizing the training set, and the bias employed by the algorithm if the quality of the hypothesis exceeds some acceptable threshold. We assume the self-adaptive learner embeds a meta-learner that takes as input the performance table and generates a set of rules of experience (i.e., a meta-hypothesis) mapping any training set into a form of bias. The lack of rules of experience at the beginning of the learner's life would force the mechanism to use a fixed form of bias. But as more training sets are observed, we expect the expertise of the meta-learner to dominate in deciding which form of bias best suits the characteristics of the training set.
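The loop described above can be sketched in a few lines. This is only one possible reading of the diagram, under several assumptions that are not part of the proposal: quality is measured as training accuracy, the single meta-feature is the fraction of positive examples, the rules of experience are a nearest-neighbour lookup over the performance table, and the learner falls back to its other biases when the chosen one misses the threshold. All class and function names are hypothetical.

```python
from collections import Counter

def majority_learner(train):
    """Bias 1: hypothesis space of constant functions."""
    label = Counter(c for _, c in train).most_common(1)[0][0]
    return lambda x: label

def single_feature_learner(train):
    """Bias 2: hypothesis space {x -> x[k]}; pick the k that best matches the class."""
    n = len(train[0][0])
    best = max(range(n), key=lambda k: sum(x[k] == c for x, c in train))
    return lambda x: x[best]

def accuracy(h, examples):
    return sum(h(x) == c for x, c in examples) / len(examples)

def meta_features(train):
    """Meta-feature generator: here just the fraction of positive examples."""
    return (sum(c for _, c in train) / len(train),)

class SelfAdaptiveLearner:
    def __init__(self, biases, default, threshold=0.9):
        self.biases, self.default, self.threshold = biases, default, threshold
        self.table = []  # performance table: (meta-features, bias name)

    def choose_bias(self, mf):
        """Rules of experience: nearest entry in the performance table."""
        if not self.table:
            return self.default  # no experience yet: fixed form of bias
        def dist(entry):
            return sum((a - b) ** 2 for a, b in zip(entry[0], mf))
        return min(self.table, key=dist)[1]

    def learn(self, train):
        mf = meta_features(train)
        first = self.choose_bias(mf)
        # Assumption: if the chosen bias fails, try the remaining ones in turn.
        for name in [first] + [b for b in self.biases if b != first]:
            h = self.biases[name](train)
            if accuracy(h, train) >= self.threshold:  # performance assessment
                self.table.append((mf, name))
                return name, h
        return first, self.biases[first](train)

learner = SelfAdaptiveLearner(
    {"majority": majority_learner, "single_feature": single_feature_learner},
    default="majority",
)
task1 = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]  # class = first feature
task2 = [((0, 0), 0), ((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]  # class = second feature
before = learner.choose_bias(meta_features(task1))
used, _ = learner.learn(task1)
after = learner.choose_bias(meta_features(task2))
```

With an empty table the learner starts from its default bias; after the first task it records which bias met the threshold and proposes that bias first for a second task with similar meta-features.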

The self-adaptive learner described in Figure 2 poses major challenges to the meta-learning community. We provide our own view of possible research directions addressing these challenges.

3.1.1 The Quality of Bias

First, how can we assess the quality of a hypothesis, or the quality of the bias embedded by a learning algorithm? To answer these questions, assume the bias of learner $L$, $B_L$, is fully specified by a two-tuple $B_L = (H_L, \succ)$, where $H_L$ is the hypothesis space from which $L$ must select a hypothesis $h_L$, and $\succ$ specifies an ordering of all hypotheses in $H_L$ given a training set $T_{train}$. In base-learning, $B_L$ is fixed: learner $L$ is designed to choose the hypothesis $h_L \in H_L$ with the best ranking according to $\succ$ after looking at $T_{train}$ (a sub-optimal search strategy might fail to find the best-ranked hypothesis, as a trade-off for efficiency due to the size of $H_L$). A self-adaptive learner cannot assume $B_L$ fixed; the bias must be flexible according to the characteristics of the task under study. Hence, an adaptive learner must be able to search among different families of hypothesis spaces $\{H_i\}$ and different orderings $\{\succ_i\}$.
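A minimal sketch of the two-tuple view of bias follows. It assumes a one-dimensional input, a small space of threshold hypotheses, and an ordering expressed as a ranking key where lower is preferred; the `Bias` class and the helper names are hypothetical. The point is only that two biases sharing the same hypothesis space can still select different hypotheses when their orderings differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Bias:
    space: Sequence[Callable]  # H_L: candidate hypotheses
    rank: Callable             # ordering: lower rank(h, train) is preferred

def training_errors(h, train):
    return sum(h(x) != c for x, c in train)

def select(bias, train):
    """Return the best-ranked hypothesis h_L in H_L given T_train."""
    return min(bias.space, key=lambda h: bias.rank(h, train))

# Hypothesis space: h_t(x) = 1 if x >= t else 0, for t = 0..4.
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]
train = [(0, 0), (1, 0), (3, 1), (4, 1)]

# Same hypothesis space, two orderings: ties on training error are broken
# toward the lowest threshold or toward the highest threshold.
low = Bias(thresholds, lambda h, tr: (training_errors(h, tr), thresholds.index(h)))
high = Bias(thresholds, lambda h, tr: (training_errors(h, tr), -thresholds.index(h)))

h_low = select(low, train)
h_high = select(high, train)
```

Both selected hypotheses fit the training set perfectly, yet they disagree on the unseen point `x = 2`, which is exactly where the ordering component of the bias matters.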

Now, if we had a way to measure the distance between pairs of hypothesis spaces and pairs of hypothesis orderings, then the quality of the learning bias could be defined as a function of these distance measures. In other words, the quality of the output of an adaptive learner depends on the proximity of the hypothesis space and hypothesis ordering to the true target values. Currently, meta-learning lacks any theory pointing in this direction.

One last observation is necessary. Since the training set is sampled according to a distribution $\phi$, one may try to obtain the expected values of these distances by averaging over different training sets of size $m$ according to $\phi$. In [...] then one may try to average over different hypothesis spaces of the same size according to such a distribution [1].

To conclude, the challenge lies in defining the distance between pairs of hypothesis spaces and hypothesis orderings. Measures like predictive accuracy or ROC curves convey almost no information about these distances.

Relevant Meta-Features

Second, how can we characterize a domain in terms of relevant meta-features? Ultimately a task is well characterized by the probability distribution of class labels in the input-output space. Since we assume here a deterministic target function (i.e., zero-Bayes risk), each example is assumed to have a unique class with probability one. In this case it is the distribution $\phi$ from which the examples are drawn that dictates the distribution of class labels$^2$. For example, one distribution may produce dense clusters of positive and negative examples separated by regions of (almost) empty space; another distribution may produce clusters with a mixed proportion of classes, namely clusters of class-uniform examples that often intersect, which complicates learning.

The challenge lies in defining meta-features identifying the nature of $\phi$. We believe meta-features must be closely related to the mechanism of $L$, such that no universal way of defining $\phi$ exists. A characterization of $\phi$ relevant to the performance of a learning algorithm $L$ is intimately related to the mechanism of $L$, i.e., to the bias of $L$.

For example, assume a target concept $F$ in which the boundaries between regions of class-uniform examples are not linear. We know that applying a linear discriminant $L$ over training set $T_{train}$ is prone to produce a poor estimate of $F$. The question is how we can characterize $\phi$ so as to identify the training sets $T_{train}$ on which $L$ will fail to give good approximations to $F$. To answer this question, let us assume we have a means to measure the distance between two hypotheses. Let $h_L$ be the hypothesis output by $L$ in the space of linear-discriminant hypotheses. Let $d(h_L, F)$ be the distance between $h_L$ and target function $F$. Our goal is to find meta-features characterizing $\phi$ that show strong correlation with $d(h_L, F)$. We need to know how far or close $L$ is to learning $F$ efficiently. In our example of linear discriminants, a candidate meta-feature can be set to measure how far a decision boundary is from linearity. Such a meta-feature by itself may be difficult to define; we believe, however, that the major challenges lie in defining a distance metric between pairs of hypotheses, and in defining relevant meta-features that depend on $\phi$ as much as on $L$.

$^2$ A probabilistic estimation of the target function would be necessary if Bayes risk is not zero. In that case an example would be assigned a class with certain probability according to a fixed but unknown probability distribution.
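As a toy illustration of this idea, the sketch below uses a one-dimensional input, where a linear discriminant reduces to a threshold rule. It assumes that the distance between a hypothesis and the target is their disagreement rate over a sample, and that the candidate meta-feature is the number of class changes along the axis; both choices, and all names, are assumptions made for the example. The target is consulted directly here, which a real meta-feature could not do, since $F$ is unknown in practice.

```python
def disagreement(h, f, sample):
    """Assumed stand-in for d(h, F): fraction of sample points where h and F differ."""
    return sum(h(x) != f(x) for x in sample) / len(sample)

def best_threshold(f, sample):
    """Best one-dimensional linear discriminant: a threshold rule in either orientation."""
    rules = [
        lambda x, t=t, s=s: int((x >= t) == s)
        for t in sample
        for s in (True, False)
    ]
    return min(rules, key=lambda h: disagreement(h, f, sample))

def class_changes(f, sample):
    """Candidate meta-feature: number of label changes along the sorted axis.
    A single change means one threshold separates the classes."""
    labels = [f(x) for x in sorted(sample)]
    return sum(a != b for a, b in zip(labels, labels[1:]))

sample = list(range(12))

def f_linear(x):
    """Target with a single boundary at x = 6."""
    return int(x >= 6)

def f_bands(x):
    """Target made of alternating class bands of width 3."""
    return (x // 3) % 2

d_linear = disagreement(best_threshold(f_linear, sample), f_linear, sample)
d_bands = disagreement(best_threshold(f_bands, sample), f_bands, sample)
```

On these two targets the meta-feature grows together with the distance of the best threshold rule from the target, which is the kind of correlation the text asks for. Two data points do not establish a correlation in general.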

Flexibility at the Meta-Level

Finally, one must be aware of a problem related to the flexibility embedded by the self-adaptive learner of Figure 2: whereas the bias is now selected dynamically, the meta-learner is not self-adaptive and employs a fixed form of bias. The meta-learner in Figure 2 takes as input a performance table where each example contains a vector of meta-features and a label or class corresponding to the bias employed by the base algorithm. Clearly the meta-learner can be seen as a learning algorithm too, but one lacking the adaptability ascribed to the base learner. Ideally we would like the meta-learner to be self-adaptive, that is, able to improve through experience. One solution could be to continue in the same logical fashion as in Figure 2, and define a meta-meta-learner helping the meta-learner improve through experience. The problem, however, does not disappear because the meta-meta-learner would exhibit a fixed form of bias. The challenge lies in how to stop the apparently infinite chain of meta-learners needed to achieve complete flexibility in the selection of bias.

One possibility is to make two self-adaptive learners serve as meta-learners for each other. We envision a self-adaptive learner working in two modes: the normal mode, in which the learner improves through the accumulation of meta-knowledge as described in Section 3.1 (Figure 2), and a meta-learning mode, in which the learner plays the role of a meta-learner for the other learner. At a fixed point in time, the two self-adaptive learners work in different modes: while one works in normal mode, the other must work in meta-learning mode, helping the other improve through experience. Each learner would then exhibit full flexibility in the dynamic selection of bias.

Clearly the solution above overlooks an assortment of implementation details. Our goal is simply to point to promising research directions that can bring the construction of self-adaptive learners into reality. The sections described above provide interesting goals that we hope will stimulate the research community to contribute to the field of meta-learning.

4 Conclusions

Despite many different views currently active in meta-learning, no apparent consensus exists on what is meant by the term. This paper outlines a perspective view of meta-learning in which the goal is to build self-adaptive learners that improve their bias dynamically according to the characteristics of the domain under analysis. We believe such a view can unify current efforts, leading them in a promising direction.

The construction of self-adaptive learners poses major challenges to the research community. This paper highlights several of those challenges and provides suggestions on possible research avenues. For example, how can we assess whether the bias adopted by a learning algorithm suits the domain under analysis, or how can we construct relevant meta-features to characterize a domain? Our analysis unveils the importance of having a metric over the space of hypotheses, the space of families [...] metrics can tell us how far or close our final estimation is to the true target function. Unfortunately, little has been done in this area [14]; future work will investigate plausible models for this type of metric.

Finally, a major challenge in the construction of self-adaptive learners is how to incorporate a flexible bias in both the base-learner and the meta-learner (Section 3.1.1). Adding meta-learners on top of existing ones does not eliminate the problem as long as there is at least one meta-learner having a fixed form of bias. Future work will explore how to make two self-adaptive learners serve as meta-learners for each other, in this way ensuring full flexibility in the dynamic selection of bias.

Acknowledgments

This work was supported by IBM T.J. Watson Research Center.

References

[1] Baxter Jonathan. A Model Of Inductive Bias. Journal of Artificial Intelligence Research, 12, 149-198, 2000.

[2] Chan Philip and Stolfo S. On The Accuracy Of Meta-Learning For Scalable Data Mining. Journal of Intelligent Integration of Information, 1998.

[3] DesJardins Marie and Gordon Diana. Evaluation And Selection Of Biases In Machine Learning. Machine Learning, 20, 5-22, 1995.

[4] Gama J. and Brazdil P. Characterization Of Classification Algorithms. In 7th Portuguese Conference on Artificial Intelligence, EPIA, 189-200, 1995.

[5] Lanzi Pier Luca, Stolzmann Wolfgang and Wilson Stewart. Learning Classifier Systems. Lecture Notes in Artificial Intelligence, Springer-Verlag, New York, NY, 2000.

[6] Li Ming and Vitanyi Paul. An Introduction To Kolmogorov Complexity And Its Applications. Springer Verlag, New York, 1993.

[7] Mitchell Tom. The Need For Biases In Learning Generalizations. Tech. rep. CBM-TR-117, Computer Science Department, Rutgers University, New Brunswick, NJ 08903, 1980.

[8] Pratt Lorien and Thrun Sebastian. Second Special Issue On Inductive Transfer. Machine Learning, 28, 1997.

[9] Rendell Larry, Seshu Raj and Tcheng David. Layered Concept-Learning And Dynamically-Variable Bias Management. In Proceedings of Tenth International Joint Conference on Artificial Intelligence, 308-314, Milan, Italy, 1987.

[10] Schaffer Cullen. A Conservation Law For Generalization Performance. In Proceedings of the Eleventh International Conference on Machine Learning, 259-265, San Francisco: Morgan Kaufmann, 1994.

[11] Thrun Sebastian. Lifelong Learning Algorithms. In Thrun S. and Pratt L. (Eds.), Learning To Learn, Chap. 8: 181-209. Kluwer Academic Publishers, 1998.

[12] Watanabe Satosi. Knowing And Guessing. John Wiley and Sons, New York, 1969.

[13] Wolpert David. The Lack Of A Priori Distinctions Between Learning Algorithms And The Existence Of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8, 1341-1420, 1996.

[14] Wolpert David. Any Two Learning Algorithms Are (Almost) Exactly Identical. Unpublished manuscript. NASA Ames Research Center, 2001.
