Ricardo Vilalta
IBM T.J. Watson Research Center
30 Saw Mill River Rd.,
Hawthorne, NY, 10532 U.S.A.

Youssef Drissi
IBM T.J. Watson Research Center
30 Saw Mill River Rd.,
Hawthorne, NY, 10532 U.S.A.
Abstract
Meta-learning offers the potential of extending the capabilities of current learning algorithms by making their mechanism flexible according to the domain or task under study. An impediment to moving forward in this direction is that no clear consensus exists on the exact meaning of the term meta-learning; different research groups hold different views. This paper proposes a perspective view of meta-learning in which the central goal is to build self-adaptive learners, namely learning algorithms that improve through experience by changing their bias dynamically. We propose a general framework addressing the problem of how to build self-adaptive learners, and show research directions that highlight the challenges lying in front of us to reach such a goal.

Keywords: inductive learning, classification, meta-knowledge.
1 Introduction
Inductive learning, or classification, takes place when a learner or classifier (e.g., decision tree, neural network, support vector machine) is applied to some data to produce a hypothesis explaining a target concept; the search for a good hypothesis depends on the fixed bias [7] embedded by the learner. The algorithm is said to be able to learn because the quality of the hypothesis normally improves with an increasing number of examples. Nevertheless, since the bias is fixed, successive applications of the algorithm over the same data always produce the same hypothesis, independently of performance; no knowledge is commonly extracted across domains or tasks [8].
In contrast, meta-learning studies how the hypothesis output by a learner can improve through experience. The goal is to understand how learning itself can become flexible according to the domain or task under study. Meta-learning differs from base-learning in the scope of the level of adaptation: meta-learning studies how to choose the right bias dynamically, as opposed to base-learning, where the bias is fixed a priori or user-parameterized. The goal is to discover ways of dynamically searching for the best learning strategy as the number of tasks increases [11, 9]. Hence, meta-learning advocates the need for continuous adaptation of the learner. If a learner fails to perform efficiently, one would expect the learning mechanism itself to adapt in case the same task is presented again. Learning can then take place not only at the example (i.e., base) level, but also at the across-task (i.e., meta) level.
Despite the promising research direction offered by meta-learning, no apparent consensus exists on what is meant by the term. Examples of different views abound: building a meta-learner of base-learners [2], selecting inductive biases dynamically [3], building meta-rules matching task properties with algorithm performance [4], inductive transfer and learning to learn [8], learning classifier systems [5], etc. After addressing some common views of meta-learning, this paper proposes a perspective view where the main goal is to build a general framework for constructing self-adaptive learners, and delineates research directions that address the challenges lying in front of us to reach such a goal.
This paper is organized as follows. Section 2 gives definitions and background information on classification. Section 3 provides our own perspective view of the nature and potential avenues of research in meta-learning. Finally, Section 4 ends with a summary and conclusions.
2 Preliminaries
Our study is centered on the classification problem exclusively. The problem is to learn how to assign the correct class to each of a set of different objects (i.e., events, situations). A learning algorithm $L$ is first trained on a set of pre-classified examples $T_{train} : \{(\vec{X}_i, c_i)\}_{i=1}^{m}$. Each object $\vec{X}$ is a vector in an $n$-dimensional feature space, $\vec{X} = (X_1, X_2, \ldots, X_n)$. Each feature $X_k$ can take on a different number of values. $\vec{X}_i$ is labeled with class $c_i$ according to an unknown target function $F$, $F(\vec{X}_i) = c_i$ (we assume a deterministic target function, i.e., zero Bayes risk). In classification, each $c_i$ takes one of a fixed number of categorical values. $T_{train}$ consists of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution $\phi$ in the space of possible feature vectors $\mathcal{X}$. The goal in classification is to produce a hypothesis $h$ that best approximates $F$, namely one that minimizes a loss function (e.g., zero-one loss) in the input-output space, $\mathcal{X} \times \mathcal{C}$, according to distribution $\phi$.
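The objective stated above can be written out explicitly. The display below is a sketch under stated assumptions: $\phi$ denotes the sampling distribution, $\mathbf{1}[\cdot]$ the indicator function, and $R$, $\hat{R}$ are names introduced here for the expected and empirical zero-one loss; none of these names is fixed by the text. With a deterministic target, the expected zero-one loss of a hypothesis reduces to its probability of disagreeing with $F$.

```latex
R(h) \;=\; \mathbb{E}_{\vec{X} \sim \phi}\big[\mathbf{1}[\,h(\vec{X}) \neq F(\vec{X})\,]\big]
     \;=\; \Pr_{\vec{X} \sim \phi}\big[\,h(\vec{X}) \neq F(\vec{X})\,\big],
\qquad
\hat{R}(h) \;=\; \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[\,h(\vec{X}_i) \neq c_i\,].
```

Here $\hat{R}(h)$ is the empirical counterpart computed on the $m$ training examples, which is the only quantity a learner can actually observe.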
Classification begins when learning algorithm $L$ receives as input a training set $T_{train}$ and conducts a search over a hypothesis space $H_L$ until it finds a hypothesis $h_L \in H_L$ that approximates the true function $F$. Thus a learning algorithm $L$ maps a training set into a hypothesis, $L : \mathcal{T} \rightarrow H_L$, where $\mathcal{T}$ is the space of all training sets of size $m$. The selected hypothesis $h_L$ can then be used to guess the class of unseen examples.
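The view of a learner as a map from training sets to hypotheses can be illustrated with a small sketch. The learner below is hypothetical and chosen only for brevity: its hypothesis space contains rules that look up the class from the value of a single feature, and the names `stump_learner` and `zero_one_loss` are assumptions of this example, not part of the paper.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Example = Tuple[Sequence[Hashable], Hashable]   # (feature vector X_i, class c_i)
Hypothesis = Callable[[Sequence[Hashable]], Hashable]


def stump_learner(train: List[Example]) -> Hypothesis:
    """Hypothetical learner L: maps a training set to a one-feature lookup rule.

    The hypothesis space H_L holds every rule of the form
    "predict the class stored for the value of feature k".
    """
    default = Counter(c for _, c in train).most_common(1)[0][0]
    n_features = len(train[0][0])
    best_rule, best_errors = None, None
    for k in range(n_features):
        # For feature k, store the majority class of each observed value.
        votes = {}
        for x, c in train:
            votes.setdefault(x[k], Counter())[c] += 1
        table = {v: cnt.most_common(1)[0][0] for v, cnt in votes.items()}
        errors = sum(table[x[k]] != c for x, c in train)
        if best_errors is None or errors < best_errors:
            best_rule, best_errors = (k, table), errors
    k, table = best_rule
    # Unseen feature values fall back to the overall majority class.
    return lambda x: table.get(x[k], default)


def zero_one_loss(h: Hypothesis, examples: List[Example]) -> float:
    """Fraction of examples on which h disagrees with the given class."""
    return sum(h(x) != c for x, c in examples) / len(examples)


# Toy task: the class is determined by the second feature (deterministic F).
T_train = [((0, 0), "neg"), ((0, 1), "pos"), ((1, 0), "neg"), ((1, 1), "pos")]
h_L = stump_learner(T_train)
```

Calling `zero_one_loss(h_L, T_train)` measures the training error of the selected hypothesis, and `h_L((1, 1))` guesses the class of a single object. Running the learner twice on the same data returns the same rule, which is the fixed-bias behavior the paper contrasts with meta-learning.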
Learning algorithm $L$ embeds a set of assumptions, or bias, that affects the learning process in two ways: it restricts the nature and size of the hypothesis space $H_L$, and it imposes an ordering or ranking over all hypotheses in $H_L$. The bias of a learning algorithm $L_A$ is stronger than the bias of another learning algorithm $L_B$ if the size of the hypothesis space considered by $L_A$ is smaller than the size of the hypothesis space considered by $L_B$ (i.e., if $|H_{L_A}| < |H_{L_B}|$). In this case, the bias embedded by $L_A$ conveys more extra-evidential information [12] than the bias in $L_B$, which enables us to narrow down the number of candidate hypotheses estimating the true target concept $F$. We say the bias of a learning algorithm is correct if the target concept is contained in the hypothesis space (i.e., if $F \in H_L$). An incorrect bias precludes finding a perfect estimate of the target concept $F$.
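The two notions just defined, bias strength and bias correctness, can be checked by brute force on a tiny domain. The following is a minimal sketch assuming two Boolean features, with hypotheses represented by their truth tables; the spaces `H_A` and `H_B` and the helper names are hypothetical and serve only as an illustration.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=2))          # all vectors (X_1, X_2)


def truth_table(f):
    """Represent a hypothesis by its outputs on every possible input."""
    return tuple(f(x) for x in INPUTS)


# H_A: hypothetical strong bias, only constants and single-feature rules.
H_A = {
    truth_table(lambda x: 0),
    truth_table(lambda x: 1),
    truth_table(lambda x: x[0]),
    truth_table(lambda x: x[1]),
    truth_table(lambda x: 1 - x[0]),
    truth_table(lambda x: 1 - x[1]),
}

# H_B: weak bias, every Boolean function of two features (2**4 = 16 tables).
H_B = set(product([0, 1], repeat=len(INPUTS)))


def is_stronger(h_space_1, h_space_2):
    """Bias 1 is stronger than bias 2 if its hypothesis space is smaller."""
    return len(h_space_1) < len(h_space_2)


def is_correct(h_space, target):
    """A bias is correct if the target concept lies in the hypothesis space."""
    return truth_table(target) in h_space


def xor(x):
    return x[0] ^ x[1]


def copy_first(x):
    return x[0]
```

Under these assumptions `is_stronger(H_A, H_B)` holds, and the stronger bias is correct for `copy_first` but incorrect for `xor`, whereas the weaker bias `H_B` is correct for both. The example shows the trade-off in the text: a smaller space narrows the candidates but may exclude the target.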
3 A Perspective View
In base-learning, the hypothesis space $H_L$ of a learning algorithm $L$ is fixed. Applying a decision tree, neural network, or support vector machine over some data produces a hypothesis that depends on the fixed bias embedded by the learner. If we represent the space of all possible learning tasks¹ as $S$, then algorithm $L$ can learn efficiently over a limited region $R_L$ in $S$ that favors the bias embedded in $L$; algorithm $L$ can never be made to learn efficiently over all tasks in $S$ as long as its bias remains fixed [10, 13]. One may rightly argue that the space of all tasks contains many random instances; failing to learn over those instances carries in fact no negative consequences. For this reason, we will assume $R_L$ belongs to a subset of structured tasks, $S_{struct} \subset S$, where each task is non-random and can be ascribed a low degree of complexity (e.g., Kolmogorov complexity [6]).

¹ Let a learning task be a 3-tuple, $(F, m, \phi)$, comprising a target concept $F$, a training-set size $m$, and a sample distribution $\phi$ from which the examples in the training set are drawn.

Figure 1: Each learning algorithm covers a region of (structured) tasks favored by its bias. Task $T_1$ is best learned by algorithm $L_A$, $T_2$ is best learned by algorithm $L_B$, whereas $T_3$ is best learned by both $L_A$ and $L_B$. Task $T_4$ lies outside the scope of $L_A$ and $L_B$. [Diagram labels: structured concepts $S_{struct}$; regions $R_{L_A}$ and $R_{L_B}$; tasks $T_1$, $T_2$, $T_3$, $T_4$.]
One goal in meta-learning is to learn what causes $L$ to dominate in region $R_L$. The problem can be decomposed into two parts: 1) determine the properties of the tasks in $R_L$ that make $L$ suitable for such a region, and 2) determine the properties of $L$ (i.e., what the components embedded in algorithm $L$ are and how they interact with each other) that contribute to its dominance in $R_L$. A solution to the problem above would provide guidelines for choosing the right algorithm on a particular task. As illustrated in Figure 1, each task $T_i$ may lie inside or outside the region that favors the bias embedded by a learning algorithm $L$. In Figure 1, task $T_1$ is best learned by algorithm $L_A$ because it lies within the region $R_{L_A}$. Similarly, $T_2$ is best learned by algorithm $L_B$, whereas $T_3$ is best learned by both $L_A$ and $L_B$. A solution to the meta-learning problem can indicate how to match learning algorithms with task properties, in this way yielding a principled approach to the dynamic selection of learning algorithms.
In addition, meta-learning can solve the problem of learning tasks lying outside the scope of available learning algorithms. As shown in Figure 1, task $T_4$ lies outside the regions of both $L_A$ and $L_B$. If $L_A$ and $L_B$ are the only available algorithms at hand, task $T_4$ cannot be learned efficiently. One approach to solving the problem above is to use a meta-learner to combine the predictions of base-learners in order to shift the dominant region over the task under study. In Figure 1, the goal would be to embed the meta-learner with a bias favoring a region of tasks that includes $T_4$.
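A toy instance of combining base-learner predictions is sketched below. It is only one of many possible combiners and is not prescribed by the paper: the task (XOR of two Boolean features), the two single-feature base hypotheses standing in for the outputs of $L_A$ and $L_B$, and the lookup-table meta-learner `train_meta_learner` are all assumptions made for illustration. Accuracy is measured on the training examples for brevity.

```python
from collections import Counter

# Task T_4 (hypothetical): XOR of two Boolean features.
TASK = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def base_a(x):
    """Stand-in for a hypothesis from L_A: looks at the first feature only."""
    return x[0]


def base_b(x):
    """Stand-in for a hypothesis from L_B: looks at the second feature only."""
    return x[1]


def accuracy(h, examples):
    """Fraction of examples that h classifies correctly."""
    return sum(h(x) == c for x, c in examples) / len(examples)


def train_meta_learner(base_hypotheses, examples):
    """Learn a lookup table from the tuple of base predictions to a class."""
    votes = {}
    for x, c in examples:
        key = tuple(h(x) for h in base_hypotheses)
        votes.setdefault(key, Counter())[c] += 1
    table = {key: cnt.most_common(1)[0][0] for key, cnt in votes.items()}
    # Prediction tuples never seen in training get the majority class.
    fallback = Counter(c for _, c in examples).most_common(1)[0][0]
    return lambda x: table.get(tuple(h(x) for h in base_hypotheses), fallback)


combined = train_meta_learner([base_a, base_b], TASK)
```

Each base hypothesis alone is right on only half of the examples, while `combined` classifies all four correctly. In the terms of Figure 1, the combination covers a task that lies outside the region of either base bias, although the combiner itself still has a fixed bias of its own.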
3.1 Self-Adaptive Learners
The combination of base-learners by a meta-learner offers no guarantee of covering every possible (structured) task of interest. We claim a potential avenue of research in meta-learning is to provide the foundations to construct self-adaptive learning algorithms that change their internal mechanism according to the task under analysis. In Figure 1, this would mean enabling a learning algorithm to move along the space of structured concepts $S_{struct}$ until the algorithm learns to cover the task under study. We assume this can be achieved through the continuous accumulation of meta-knowledge indicating the most appropriate form of bias for each different task. Beginning with no experience, the learning algorithm would initially use a fixed form of bias to approximate the target concept. As more tasks are observed, however, the algorithm would be able to use the accumulated meta-knowledge to change its own bias according to the characteristics of each task.
Figure 2: A flow diagram of a self-adaptive learner. [Diagram components: Training Set, Self-Adaptive Learner, Hypothesis, Performance Assessment, Meta-Feature Generator, Performance Table, Meta-Learner, Rules of Experience.]
Figure 2 is a (hypothetical) flow diagram of a self-adaptive learner. The input and output components of the system are a training set and a hypothesis, respectively. Each time a hypothesis is produced, a performance assessment component evaluates its quality. The resulting information becomes a new entry in a performance table; an entry contains a vector of meta-features characterizing the training set, and the bias employed by the algorithm if the quality of the hypothesis exceeds some acceptable threshold. We assume the self-adaptive learner embeds a meta-learner that takes as input the performance table and generates a set of rules of experience (i.e., a meta-hypothesis) mapping any training set into a form of bias. The lack of rules of experience at the beginning of the learner's life would force the mechanism to use a fixed form of bias. But as more training sets are observed, we expect the expertise of the meta-learner to dominate in deciding which form of bias best suits the characteristics of the training set.
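The loop described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the two candidate biases, the single meta-feature (majority-class share), the nearest-neighbor form of the rules of experience, assessment on the training set itself, and the fallback that tries every bias when the chosen one falls below the threshold (the paper does not say what should happen in that case).

```python
from collections import Counter


def majority_bias(train):
    """Bias 1: hypothesis space of constant classifiers."""
    top = Counter(c for _, c in train).most_common(1)[0][0]
    return lambda x: top


def copy_feature_bias(train):
    """Bias 2: hypothesis space of rules 'class = value of one feature'."""
    n = len(train[0][0])
    best = max(range(n), key=lambda k: sum(x[k] == c for x, c in train))
    return lambda x: x[best]


BIASES = {"majority": majority_bias, "copy_feature": copy_feature_bias}


def meta_features(train):
    """Hypothetical meta-feature vector: (majority-class share,)."""
    counts = Counter(c for _, c in train)
    return (max(counts.values()) / len(train),)


def accuracy(h, examples):
    return sum(h(x) == c for x, c in examples) / len(examples)


class SelfAdaptiveLearner:
    def __init__(self, default_bias="majority", threshold=0.9):
        self.default_bias = default_bias
        self.threshold = threshold
        self.performance_table = []      # entries: (meta-features, bias name)

    def choose_bias(self, train):
        """Rules of experience: reuse the bias of the nearest recorded task."""
        if not self.performance_table:
            return self.default_bias     # no experience yet: fixed bias
        mf = meta_features(train)

        def dist(entry):
            return sum((a - b) ** 2 for a, b in zip(entry[0], mf))

        return min(self.performance_table, key=dist)[1]

    def learn(self, train):
        name = self.choose_bias(train)
        h = BIASES[name](train)
        # Performance assessment (on the training set, for brevity).
        if accuracy(h, train) < self.threshold:
            # Assumed fallback: try every bias and keep the best one.
            name = max(BIASES, key=lambda b: accuracy(BIASES[b](train), train))
            h = BIASES[name](train)
        if accuracy(h, train) >= self.threshold:
            self.performance_table.append((meta_features(train), name))
        return name, h


balanced = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1), ((0, 0), 0)]  # class = 2nd feature
skewed = [((0, 0), 1), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]    # always class 1

learner = SelfAdaptiveLearner()
first_bias, _ = learner.learn(balanced)
second_bias, _ = learner.learn(skewed)
third_bias, _ = learner.learn(balanced)
```

On the first two tasks the learner has no useful experience and must fall back to searching over biases; on the third task the performance table already contains a similar entry, so the suitable bias is selected directly. Note that `choose_bias` is itself a fixed procedure, which is the limitation of a non-adaptive meta-learner.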
The self-adaptive learner described in Figure 2 poses major challenges to the meta-learning community. We provide our own view of possible research directions addressing these challenges.
3.1.1 The Quality of Bias
First, how can we assess the quality of a hypothesis? Or, how can we assess the quality of the bias employed by a learning algorithm? To answer these questions, assume the bias of learner $L$, $B_L$, is fully specified by a two-tuple $B_L = (H_L, \prec)$, where $H_L$ is the hypothesis space from which $L$ must select a hypothesis $h_L$, and $\prec$ specifies an ordering of all hypotheses in $H_L$ given a training set $T_{train}$. In base-learning, $B_L$ is fixed: learner $L$ is designed to choose the hypothesis $h_L \in H_L$ with the best ranking according to $\prec$ after looking at $T_{train}$ (a sub-optimal search strategy might fail to find the hypothesis with the best ranking, as a trade-off for efficiency due to the size of $H_L$). A self-adaptive learner cannot assume $B_L$ fixed; the bias must be flexible according to the characteristics of the task under study. Hence, an adaptive learner must be able to search among different families of hypothesis spaces $\{H_i\}$ and different orderings $\{\prec_i\}$.
Now, if we had a way to measure the distance between pairs of hypothesis spaces and pairs of hypothesis orderings, then the quality of the learning bias could be defined as a function of these distance measures. In other words, the quality of the output of an adaptive learner depends on the proximity of the hypothesis space and hypothesis ordering to the true target values. Currently, meta-learning lacks any theory pointing in this direction.
One last observation is necessary. Since the training set is sampled according to a distribution $\phi$, one may try to obtain the expected values of these distances by averaging over different training sets of size $m$ according to $\phi$. In [...] then one may try to average over different hypothesis spaces of the same size according to such distribution [1].
To conclude, the challenge lies in defining the distance between pairs of hypothesis spaces and hypothesis orderings. Measures like predictive accuracy or ROC curves convey almost no information about these distances.
3.1.2 Relevant Meta-Features
Second, how can we characterize a domain in terms of relevant meta-features? Ultimately a task is well characterized by the probability distribution of class labels in the input-output space. Since we assume here a deterministic target function (i.e., zero Bayes risk), each example is assumed to have a unique class with probability one. In this case it is the distribution $\phi$ from which the examples are drawn that dictates the distribution of class labels². For example, one distribution may produce dense clusters of positive and negative examples separated by regions of (almost) empty space; another distribution may produce clusters with a mixed proportion of classes, namely clusters of class-uniform examples that often intersect, which complicates learning.
The challenge lies in defining meta-features identifying the nature of $\phi$. We believe meta-features must be closely related to the mechanism of $L$, such that no universal way of defining $\phi$ exists. A characterization of $\phi$ relevant to the performance of a learning algorithm $L$ is intimately related to the mechanism of $L$, i.e., to the bias of $L$.
For example, assume a target concept $F$ in which the boundaries between regions of class-uniform examples are not linear. We know applying a linear discriminant $L$ over training set $T_{train}$ is prone to produce a poor estimate of $F$. The question is: how can we characterize the training sets $T_{train}$ on which $L$ will fail to give good approximations to $F$? To answer this question, let us assume we have a means to measure the distance between two hypotheses. Let $h$ be the hypothesis output by $L$ in the space of linear-discriminant hypotheses. Let $d(h, F)$ be the distance between $h$ and target function $F$. Our goal is to find meta-features characterizing $\phi$ that show strong correlation with $d(h, F)$. We need to know how far or close $L$ is to learning $F$ efficiently. In our example of linear discriminants, a candidate meta-feature can be set to measure how far a decision boundary is from linearity. Such a meta-feature by itself may be difficult to define; we believe, however, that the major challenges lie in defining a distance metric between pairs of hypotheses, and in defining relevant meta-features that depend on $\phi$ as much as $L$.

² A probabilistic estimation of the target function would be necessary if Bayes risk is not zero. In that case an example would be assigned a class with certain probability according to a fixed but unknown probability distribution.
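The linear-discriminant example can be made concrete on a one-dimensional toy domain. The sketch below rests on several assumptions that are not taken from the paper: $d(h, F)$ is approximated by the disagreement rate on a fixed set of points (which requires a known, synthetic target), the linear discriminant is a single threshold rule, and the candidate meta-feature `label_switches` counts class changes along the sorted sample as a crude measure of departure from linearity.

```python
def best_threshold_rule(sample):
    """Hypothetical linear discriminant in one dimension.

    Searches rules 'predict s if x >= t else 1 - s' and returns the one
    with the fewest errors on the labeled sample.
    """
    xs = sorted(x for x, _ in sample)
    candidates = [xs[0] - 1.0] + xs
    best = None
    for t in candidates:
        for s in (0, 1):
            rule = lambda x, t=t, s=s: s if x >= t else 1 - s
            errors = sum(rule(x) != c for x, c in sample)
            if best is None or errors < best[0]:
                best = (errors, rule)
    return best[1]


def distance(h, target, points):
    """Assumed d(h, F): fraction of points on which h and F disagree."""
    return sum(h(x) != target(x) for x in points) / len(points)


def label_switches(sample):
    """Candidate meta-feature: class changes along the sorted sample.

    A value of 0 or 1 means a single threshold can separate the classes;
    larger values signal a boundary that is farther from linear.
    """
    labels = [c for _, c in sorted(sample)]
    return sum(a != b for a, b in zip(labels, labels[1:]))


points = [i / 10 for i in range(10)]            # 0.0, 0.1, ..., 0.9


def linear_target(x):
    return 1 if x >= 0.5 else 0


def interval_target(x):
    return 1 if 0.3 <= x < 0.7 else 0


results = {}
for name, target in [("linear", linear_target), ("interval", interval_target)]:
    sample = [(x, target(x)) for x in points]
    h = best_threshold_rule(sample)
    results[name] = (label_switches(sample), distance(h, target, points))
```

In this toy setting the meta-feature and the distance move together: the threshold target yields one label switch and zero distance, while the interval target yields two switches and a strictly positive distance. Two data points prove nothing about correlation in general; the sketch only shows the kind of relationship a relevant meta-feature is expected to exhibit.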
3.1.3 Flexibility at the Meta-Level
Finally, one must be aware of a problem related to the flexibility embedded by the self-adaptive learner of Figure 2: whereas the bias is now selected dynamically, the meta-learner is not self-adaptive and employs a fixed form of bias. The meta-learner in Figure 2 takes as input a performance table where each example contains a vector of meta-features and a label or class corresponding to the bias employed by the base algorithm. Clearly the meta-learner can be seen as a learning algorithm too, but lacking the adaptability ascribed to the base learner. Ideally we would like the meta-learner to be self-adaptive, that is, able to improve through experience. One solution could be to continue in the same logical fashion as in Figure 2, and define a meta-meta-learner helping the meta-learner improve through experience. The problem, however, does not disappear because the meta-meta-learner would exhibit a fixed form of bias. The challenge lies in how to stop the apparently infinite chain of meta-learners needed to achieve complete flexibility in the selection of bias.
One possibility is to have two self-adaptive learners serve as meta-learners for each other. We envision a self-adaptive learner working in two modes: the normal mode, in which the learner improves through the accumulation of meta-knowledge as described in Section 3.1 (Figure 2), and a meta-learning mode in which the learner plays the role of a meta-learner for the other learner. At a fixed point in time, the two self-adaptive learners work in different modes: while one works in normal mode, the other must work in meta-learning mode, helping the other improve through experience. Each learner would then exhibit full flexibility in the dynamic selection of bias.
Clearly the solution above overlooks an assortment of implementation details. Our goal is simply to point to promising research directions that can bring the construction of self-adaptive learners into reality. The sections described above provide interesting goals that we hope will stimulate the research community to contribute to the field of meta-learning.
4 Conclusions
Despite the many different views currently active in meta-learning, no apparent consensus exists on what is meant by the term. This paper outlines a perspective view of meta-learning in which the goal is to build self-adaptive learners that improve their bias dynamically according to the characteristics of the domain under analysis. We believe such a view can unify current efforts, leading them in a promising direction.
The construction of self-adaptive learners poses major challenges to the research community. This paper highlights several of those challenges and provides suggestions on possible research avenues. For example, how can we assess whether the bias adopted by a learning algorithm suits the domain under analysis? Or, how can we construct relevant meta-features to characterize a domain? Our analysis unveils the importance of having a metric over the space of hypotheses, the space of families [...]; such metrics can tell us how far or close our final estimate is to the true target function. Unfortunately, little has been done in this area [14]; future work will investigate plausible models for this type of metric.
Finally, a major challenge in the construction of self-adaptive learners is how to incorporate a flexible bias in both the base-learner and the meta-learner (Section 3.1). Adding meta-learners on top of existing ones does not eliminate the problem as long as there is at least one meta-learner having a fixed form of bias. Future work will explore how to make two self-adaptive learners serve as meta-learners for each other, in this way ensuring full flexibility in the dynamic selection of bias.
Acknowledgments
This work was supported by IBM T.J. Watson Research Center.
References
[1] Baxter Jonathan. A Model Of Inductive Bias. Journal of Artificial Intelligence Research, 12, 149-198, 2000.
[2] Chan Philip and Stolfo S. On The Accuracy Of Meta-Learning For Scalable Data Mining. Journal of Intelligent Integration of Information, 1998.
[3] DesJardins Marie and Gordon Diana. Evaluation And Selection Of Biases In Machine Learning. Machine Learning, 20, 5-22, 1995.
[4] Gama J. and Brazdil P. Characterization Of Classification Algorithms. In 7th Portuguese Conference on Artificial Intelligence, EPIA, 189-200, 1995.
[5] Lanzi Pier Luca, Stolzmann Wolfgang and Wilson Stewart. Learning Classifier Systems. Lecture Notes in Artificial Intelligence, Springer-Verlag, New York, NY, [...]
[6] [...] Introduction To Kolmogorov Complexity And Its Applications. Springer Verlag, New York, 1993.
[7] Mitchell Tom. The Need For Biases In Learning Generalizations. Tech. rep. CBM-TR-117, Computer Science Department, Rutgers University, New Brunswick, NJ 08903, 1980.
[8] Pratt Lorien and Thrun Sebastian. Second Special Issue On Inductive Transfer. Machine Learning, 28, 1997.
[9] Rendell Larry, Seshu Raj and Tcheng David. Layered Concept-Learning And Dynamically-Variable Bias Management. In Proceedings of Tenth International Joint Conference on Artificial Intelligence, 308-314, Milan, Italy, 1987.
[10] Schaffer Cullen. A Conservation Law For Generalization Performance. In Proceedings of the Eleventh International Conference on Machine Learning, 259-265, San Francisco: Morgan Kaufmann, 1994.
[11] Thrun Sebastian. Lifelong Learning Algorithms. In Thrun S. and Pratt L. (Eds.), Learning To Learn, Chap. 8: 181-209. Kluwer Academic Publishers, 1998.
[12] Watanabe Satosi. Knowing And Guessing. John Wiley and Sons, New York, 1969.
[13] Wolpert David. The Lack Of A Priori Distinctions Between Learning Algorithms And The Existence Of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8, 1341-1420, 1996.
[14] Wolpert David. Any Two Learning Algorithms Are (Almost) Exactly Identical. Unpublished manuscript. NASA Ames Research Center, 2001.