Institutional Knowledge at Singapore Management University
Research Collection School Of Information Systems
School of Information Systems
6-2017
Scalable transfer learning in heterogeneous,
dynamic environments
Trung Thanh Nguyen
Adobe Systems
Tomi Silander
Xerox Research Centre Europe
Zhuoru LI
National University of Singapore
Tze-Yun LEONG
Singapore Management University
DOI:
https://doi.org/10.1016/j.artint.2015.09.013
Citation
Nguyen, Trung Thanh; Silander, Tomi; LI, Zhuoru; and Tze-Yun LEONG. Scalable transfer learning in heterogeneous, dynamic environments. (2017). Artificial Intelligence. 247, 70-94. Research Collection School Of Information Systems.
Scalable transfer learning in heterogeneous, dynamic environments

Trung Thanh Nguyen a,∗,1, Tomi Silander b,1, Zhuoru Li c, Tze-Yun Leong d,c

a Adobe, 345 Park Avenue, San Jose, CA, USA
b Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France
c School of Computing, National University of Singapore, Singapore
d School of Information Systems, Singapore Management University, Singapore
Article info

Article history:
Received in revised form 17 September 2015
Accepted 29 September 2015
Available online 3 October 2015

Keywords:
Model-based reinforcement learning
Transfer learning
Online feature selection

Abstract
Reinforcement learning is a plausible theoretical basis for developing self-learning, autonomous agents or robots that can effectively represent the world dynamics and efficiently learn the problem features to perform different tasks in different environments. The computational costs and complexities involved, however, are often prohibitive for real-world applications. This study introduces a scalable methodology to learn and transfer knowledge of the transition (and reward) models for model-based reinforcement learning in a complex world. We propose a variant formulation of Markov decision processes that supports efficient online-learning of the relevant problem features to approximate the world dynamics. We apply the new feature selection and dynamics approximation techniques in heterogeneous transfer learning, where the agent automatically maintains and adapts multiple representations of the world to cope with the different environments it encounters during its lifetime. We prove regret bounds for our approach, and empirically demonstrate its capability to quickly converge to a near optimal policy in both real and simulated environments.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Next generation robotics aims at developing autonomous agents with integrative intelligence; these agents or groups of agents can automatically or semi-automatically collect sensor inputs, make decisions, learn new skills, and interact with humans or other agents to complete multiple, complex tasks in different real-world domains [1]. Emerging global research and development trends point to collaborative robots that can work and interact with people [2], dexterous robots with advanced manipulation skills and “soft touches” [3], and cognitive robots that are self-improving, robust, and flexible [4]. In Asia, for example, social and economic demands have led to intensified activities to develop assistive robots to complement a dwindling workforce and healthcare robots to help care for an ageing population [5].

A major challenge in designing intelligent robots is to equip them with the capabilities to effectively use past experiences and at the same time efficiently learn new skills to perform different tasks in different environments. Real-world dynamics is usually uncertain, unstructured, and non-stationary; the tasks, objectives, resources, and conditions of the robots may also change over time. The skills required for successful problem solving and decision making are not easily described as simple rules, guidelines, or processes. A self-learning agent that can select useful environmental features and learn to behave, to a certain degree, based on trial and error is a promising approach to building robots that act in the real world.

* Corresponding author. E-mail address: [email protected] (T.T. Nguyen).
1 The main part of the work was done while the author was at the National University of Singapore.
http://dx.doi.org/10.1016/j.artint.2015.09.013
0004-3702/© 2015 Elsevier B.V. All rights reserved.
Published in Artificial Intelligence, Volume 247, June 2017, Pages 70-94. http://doi.org/10.1016/j.artint.2015.09.013
1.1. Reinforcement learning in robotics

Reinforcement learning (RL) [6] trains an autonomous agent to behave intelligently based on the feedback it receives when interacting with the environment. An RL task is commonly represented as a Markov decision process (MDP) with a specific domain that includes the relevant actions and states with the defining features, and unknown dynamics (transition) and/or goal or feedback (reward) functions. The main challenges of applying RL in robotic domains include the multidimensional state and action spaces that incur high computational costs, the physical constraints that make acting in the real world much more difficult than in simulated worlds, the uncertainty due to partial observability of the physical environments and inherent noise in the sensor measurements, and the difficulty in tailoring the feedback or reward functions to guide intelligent behavior of the robots [7].

Despite the challenges, RL has recently been applied in complex tasks such as helicopter maneuvering [8], soccer robots [9], and robot navigation [10] with reasonable success. In these cases, careful selections of problem representations, prior experiences, and efficient approximations have rendered RL computationally feasible in the complex settings.

In this work, we design a representation of the world dynamics in model-based RL that allows efficient and scalable approximation of the agent’s action effects. At the core of our method is the ability to automatically select the relevant features of the environment that allow the agent to predict how the environment reacts to its actions. This ability to automatically learn to focus attention on the critical factors of the problem is one of the crucial elements needed to make intelligent behavior in complex environments computationally feasible.

The value of online feature extraction is further accentuated in situations where the agent encounters many different environments in its lifetime, and where transfer learning is essential. While manually designing a small set of important features is non-trivial for many real-world tasks, doing so for different environments, some of which may be unknown a priori, is even harder. For example, the Mars exploration rover, Opportunity, is now expected to travel tens of kilometers through different terrains, with possibly varying characteristics and dynamics, to collect scientific objects [11]. Most of the environmental features that it will encounter are unlikely to be specified beforehand.
1.2. Transfer learning in robotic reinforcement learning

In the original RL framework, the agent’s knowledge is task and domain specific. A small change in the task or its domain may render the agent’s accumulated knowledge useless; costly relearning from scratch is often needed. In transfer learning, an agent applies the knowledge or experience gained in previous (source) tasks to influence and improve the performance of new, related (target) tasks. Transfer in RL assumes that knowledge from one or more source task(s) is used to learn one or more target task(s) faster than if transfer was not used. In contrast to multi-task learning, which assumes that all the tasks are drawn from an underlying, possibly unknown distribution, transfer learning is more general in allowing transfer amongst tasks with heterogeneous domains, dynamics, and goals [12]. Transfer learning is hence most relevant in building lifelong learning agents or robots – agents that can learn, retain, and transfer knowledge acquired from multiple tasks over multiple domains, in a sequential manner, to develop more accurate solutions or policies in learning new tasks over their lifetimes [13].

Transfer in RL is different from another main category of transfer learning tasks that focus on classification, regression, and clustering in non-dynamic domains [14]. The main technical challenges involved, however, are similar in identifying the commonalities between the source tasks or domains and the target tasks or domains. Transfer in RL addresses the issues of what to transfer, how to transfer, and when to and when not to transfer to avoid or minimize negative effects on the learning performance in the new environment.

Many existing RL transfer techniques, however, assume the same state-action spaces, i.e., homogeneous domains, in different tasks. This assumption may not work well in real-life applications. For example, many environmental cues that help an agent navigate through a forest are simply missing when the agent tries to navigate at sea. While recent efforts have addressed inter-task transfer in heterogeneous domains or different state-action spaces [15–19], such mappings are hard to define when the agent operates in real-world environments with large state-action spaces and multiple goal states, with possibly different state feature distributions and world dynamics. A trade-off between computational complexity and sample efficiency is usually involved in automatically learning the mappings [20].
We propose an efficient, online system that tries to transfer old knowledge, but at the same time evaluates new options to see if they work better. The agent gathers experience during its lifetime and enters a new environment equipped with expectations on how different aspects of the world affect the outcomes of the agent’s actions. The main idea is to allow an agent to collect a library of world models or representations, called views, that it can consult to focus attention in a new task. In this paper, we concentrate on approximating the dynamics or transition model. The feedback or reward model library can be learned in an analogous fashion. Effective utilization of the library of world models allows the agent to capture the transition dynamics of the new environment quickly; this should lead to a jumpstart in learning and faster convergence to a near optimal policy. A main challenge is in learning to select a proper view for a new task in a new environment, without any predefined mapping strategies. The ability to learn new views is also critical in the complex and feature-rich real environments.

Through scoring the learned views and simultaneous evaluation without the views, our system mitigates potential negative effects by balancing the exploration-and-exploitation trade-off. This is in contrast to minimizing negative transfer effects through determining the underlying relationships between the experiences and the new tasks and domains [14]. Such relationships may not exist and are usually hard to determine in the heterogeneous, dynamic environments that we focus on.
1.3. A new transfer learning framework

We present a new transfer learning framework that supports all the essential steps of a fully autonomous RL transfer agent, as compared with most existing work that mainly focuses on one or two of these steps [12]:

1. Select an appropriate source task or set of source tasks to transfer to a target task;
2. Learn the relationships between the source and the target tasks; and
3. Transfer the relevant knowledge from the source task to the target task to improve learning performance.

Our system comprises the following components:

• Situation-calculus Markov decision process (CMDP), a new variant of factored MDP that compactly represents the relevant action effects in a dynamic environment;
• A collection of views, which summarize the experiences learned from constructing transition models in different environments based on multinomial logistic regression; the views are scored based on their relevance in the new environments;
• A view or transition model learning algorithm, multinomial Dual Averaging with Group Lasso (mDAGL), that supports online feature selection and relevance-focused updating;
• A new reinforcement learning algorithm, logistic regression RL (loreRL), that applies mDAGL on a CMDP to learn the transition models by automatically focusing on the relevant features in the environment; and
• A new transfer learning algorithm, Transfer ExpectationS (TES), that may utilize the previously learned transition models to improve reinforcement learning in new environments.

The rest of the paper is organized as follows. We will next formalize the problem and describe the method of online feature selection and collecting views into a library. We will then present an efficient implementation of the proposed transfer learning technique. After discussing related work, we will demonstrate the effectiveness of our system through a set of simulated experiments and a real robotic application. We will conclude by discussing the lessons learned and the implications for future robotic research.
2. Background
A task or a problem $M$ for an intelligent agent or robot to solve, e.g., navigate to the master bedroom or fetch a cup from the kitchen table, is typically modeled as a Markov decision process (MDP) defined by a tuple $M = (S, A, T, R)$, where $S$ is a set of states; $A$ is a set of actions; $T : S \times A \times S \to [0, 1]$ is a transition function or model, such that $T(s, a, s') = P(s' \mid s, a)$ indicates the probability of transiting to a state $s'$ upon taking an action $a$ in state $s$; $R : S \times A \to \mathbb{R}$ is a reward function or model indicating immediate expected reward after an action $a$ is taken in state $s$. In RL, the state-action space $S \times A$ defines the domain of the task, the transition model $T$ and the reward model $R$ define the objective of the task [21]; the transition model $T$ and (sometimes) the reward model $R$ are unknown. The main challenge of transfer learning in RL is in transferring across heterogeneous domains, i.e., where the state-action spaces, and possibly the feature spaces, are different.

For each task $M$, the task or problem solution is to find a policy $\pi$ that specifies an action to perform in each state so that the expected accumulated future reward (with possibly higher weights for more recent rewards) for each state is maximized [6]. The optimal policy $\pi$ is usually derived by a model-based or a model-free approach in RL; the latter does not consider the transition model $T$ in deriving the solution. The two RL approaches are based on different assumptions, problem characteristics, and constraints. We adopt the model-based approach as it is easier to incorporate domain or expert knowledge into the transition and reward models; knowledge transfer across different environments can also be facilitated through the models.

In model-based RL, the optimal policy is derived by first estimating the transition model $T$ (and the reward model $R$, if necessary) through interacting with the environment. This study focuses on learning the models, and remains relatively agnostic about the actual planning techniques. However, since the purpose is to build a model about world dynamics, planning systems that rely on simulators appear most natural.
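As a point of reference for what "estimating the transition model through interaction" can mean in the simplest case, the following minimal sketch builds a tabular relative-frequency estimate of $T$ from observed transitions. The count-based estimator, the function name `estimate_transition_model`, and the toy experience are illustrative assumptions and are not the method of this paper, which learns the model with sparse logistic regression (Section 4).

```python
from collections import Counter, defaultdict


def estimate_transition_model(transitions):
    """Estimate T(s, a, s') = P(s' | s, a) by relative frequencies.

    `transitions` is an iterable of (s, a, s_next) triples gathered while
    interacting with the environment (hypothetical data format).
    """
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    model = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {s_next: n / total
                               for s_next, n in next_counts.items()}
    return model


# Made-up experience: moving forward from the hall succeeds two times out of three.
experience = [("hall", "forward", "kitchen"),
              ("hall", "forward", "kitchen"),
              ("hall", "forward", "hall"),
              ("kitchen", "back", "hall")]
T_hat = estimate_transition_model(experience)
```

A table of this kind needs one entry per visited state-action pair, so it does not generalize across states; this is the limitation that a factored, feature-based model is meant to address.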
3. A multi-view transfer learning framework
A key idea of this work is that the agent can represent the world dynamics from its sensory state space in different ways. Such different views correspond to the agent’s decisions to focus attention on only some features of the state in order to quickly approximate the state transition function. In other words, a view represents an expectation of the agent about the transition dynamics resulting from one (or several) of its actions in the task environment. In a new task, the agent will select appropriate views to solve the task, and learn new views if the environment is novel. Fig. 1 illustrates the workflow of our life-long learning agent. Parts of this work have been published in [22–24].

Fig. 1. Our life-long learning agent.

Algorithm 1 Overview of the proposed learning framework.
Input: View library L.
Output: Updated view library L.
Initialize the system for a task
    L_H: historical record of how good each view was in previous tasks
    L′ ← L    /* A working copy of L to find the dynamics of the current environment */
    B ← initialize belief of how well each view can approximate the current environment dynamics based on L_H
/* while interacting with the environment */
for t = 0, 1, 2, . . . do
    Select views to approximate the world transition dynamics
        {W} ← select the most promising views from L′ based on B
    Interact with environment based on that approximate model
        π ← plan an action policy based on {W}
        s_t: current state in the environment
        a_t ← choose an action to perform in s_t according to action policy π
        Perform action a_t and observe feedback: action outcomes from the environment
    Score all views with the new feedback
        S ← score all views in L′ with the new feedback
        B ← update belief about the views in L′ based on the score S
    Adjust all views with new feedback
        L′ ← adapt all views toward this current environment based on the new feedback
    Break when the task ends
end for
Update view library L
    L ← update the view library, e.g.
        if a view W is different from existing views in L,
        which means the transition dynamics of the corresponding action is new,
        then add that view W to L;
        else replace the old view in L with the newly updated view W.
We sketch the overall behavior of our agent in Algorithm 1.

When the agent does not have any initial information about the transition dynamics of the environment in a new task, it selects “expectations” or views based on history that tells how well each view in the library has worked in previous tasks. We assume that the more frequently a view has worked in the past, the more likely it will work again in a new task.

The agent then operates according to the policy learned based on the transition model built from the selected views, assuming that the reward model is known. However, since the views have been selected without any reference to the actual characteristics of the new environment, it is highly likely that these views are inappropriate for the current task. In other words, the “expected” transition model may not correctly approximate the true transition dynamics of the current environment. The resulting policy may just perform poorly, leading to negative rewards.

To limit such negative transfer effects, our agent exploits the outcomes or feedback for each of its actions, in terms of state changes and rewards, to score all the views in the library. The score estimates the capability of each view in approximating the transition dynamics in the current environment; it is a primary criterion to re-select the views for subsequent decisions.

The new task and/or environment, however, may actually be very different from all the past experiences of the agent. In this case, none of the recorded views can capture or adapt to the new transition dynamics. Hence, environment feedback is also used to develop and incorporate new views into the library. These steps are repeated until a stopping criterion is met.

Fig. 2. Different ways to decompose a state transition function.

At the end of a task, the selected views will be recorded as new additions in the library if the transition dynamics captured is significantly different from the existing views. Otherwise, the agent would just update the existing views accordingly. Views that have not been used for a “long time” will be pruned to manage the library size.
We address three main technical challenges in this framework: First, the transition model $T(S, A, S')$ is task specific, which is probably a reason why there have not been many studies that transfer the transition model. Second, learning or updating a view or a transition model online in a complex and feature-rich environment is computationally expensive. Third, the view scoring method must be simple to be calculated online, based on feedback from the environment. The view library also needs to be efficiently updated.

3.1. Situation calculus MDP: CMDP
In a factored MDP, each state $s$ is represented by a vector of $n$ state-attributes $s_i$, $i = 1, \ldots, n$. Each state attribute is a random variable that can take on multiple values; the state space is the Cartesian product of the domains of the $n$ state attributes. For example, the binary state attributes Water (with values present and absent) and Martians (with values present and absent) will define all the states that describe the “interestingness” of a particular location on Mars: {(water, Martians), (water, no-Martians), (no-water, Martians), (no-water, no-Martians)}.
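The Mars example can be enumerated directly. The short sketch below builds the factored state space as a Cartesian product of attribute domains; the dictionary layout and variable names are assumptions made only for illustration.

```python
from itertools import product

# Domains of the two binary state attributes from the Mars example.
attribute_domains = {
    "Water": ("water", "no-water"),
    "Martians": ("Martians", "no-Martians"),
}

# Each state is one element of the Cartesian product of the attribute domains.
states = list(product(*attribute_domains.values()))
```

With $n$ binary attributes the product contains $2^n$ states, which is why adding merely informative attributes to the state definition quickly inflates the state space.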
The transition function or model for the factored states is commonly expressed in dynamic Bayesian networks (DBNs), a temporal variant of Bayesian networks. In the DBN representation of the transition model, $T(s, a, s') = \prod_{i=1}^{n} P(s'_i \mid \mathrm{Pa}^a_i(s), a)$, where $s'_i$ is a state-attribute, $\mathrm{Pa}^a_i$ indicates a subset of state-attributes in $s$ called the parents of $s'_i$ (Fig. 2a), i.e., the attributes of $s$ from which there are arcs to $s'_i$ in the DBN. Learning $T$ requires learning the subsets $\mathrm{Pa}^a_i$ and the parameters for conditional distributions, or the DBN local structures.

A critical issue in this MDP formulation is the ambiguous definitions of the factored attributes or variables. The state-attributes or features serve to both define the state space and capture information about the transition model, even if these two purposes can be very different. For example, two state-attributes, the (x, y)-coordinates, uniquely identify a state and compactly factorize the state space in a grid-world. A policy can be learned on this factored space. The state transition dynamics, however, may depend on other features of the state, such as the surface material at the location (state), the presence or absence of Martians, etc. Such features are often included in the state representations. While essential in formulating the transition or reward models, these features may complicate the planning or learning processes by increasing the size and complexity of the state space.
We present a variant of the factored MDP that defines a “compact but comprehensive” factorization of the transition function and supports efficient learning of the relevant features. We separate the state identifying state-attributes from the “merely” informative state-features in our representation (see Fig. 2b). This way, we can apply an efficient feature selection method on a large number of state features to capture the transition dynamics, while maintaining a compact state space.

Similar to the approach proposed by Konidaris and Barto [25], the state attributes could be defined in terms of the agent-space representation based on the capabilities afforded by the sensors or actuators of the robot, e.g., the (x, y)-coordinates of the robot location. The state features, on the other hand, could be defined in terms of the problem-space representation based on the domain or environmental characteristics of the task, e.g., the terrain or wetness of the floor surface, which may or may not be accurately detectable by the agent or robot.

Learning DBN structures of the transition function online, i.e., while the agent is interacting with the environment, is still computationally prohibitive in most domains. On the other hand, recent studies [26,27] have shown encouraging results in learning the structure of logistic regression models, which can effectively approximate the local structures in DBNs even in high dimensional spaces. While these regression models cannot fully capture the conditional distributions, their expressive power can be improved by augmenting low dimensional state representation with non-linear features of the state vectors. We will introduce an online sparse multinomial logistic regression method that supports efficient learning of the structured representation of the transition function in Section 4.
In addition, we predict the relative changes of states instead of directly specifying the next state-attributes in a transition (see Fig. 2c). In RL, an action will stochastically create an effect that determines how the current state changes to the next one [28,10,17]. Mediating state changes via action effects is a common strategy in the classic situation calculus [29]. Since the number of relative changes or action effects is usually much smaller than the size of the state space, or the size of state-attribute domains, the corresponding prediction task should be easier. The learning problem can then be expressed as a multi-class classification task of predicting the action effects.

Usually, the transition functions and reward functions are defined in terms of the state features – aspects of the state representation that help in predicting the action effects. In a specific task or environment, the same action in different states (e.g., different locations) with the same features tends to yield the same action effects (i.e., relative changes in the states).
Formally, a situation calculus MDP (CMDP) is defined by a tuple $M = (S, f, A, T, E, R, \gamma)$, where $S, A, T, R, \gamma$ have the same meaning as in a regular MDP. $S = \langle S_1, S_2, \ldots, S_n \rangle$ is the state space implicitly represented by vectors of $n$ state-attributes. The function $f : S \to \mathbb{R}^m$ extracts $m$ state-features from each state. $E$ is an action effect variable such that the transition function can be factored as

$$T(s, a, s') = P(s' \mid s, a) = \prod_{i=1}^{n} P(s'_i \mid s, f(s), a) = \prod_{i=1}^{n} \sum_{e \in E} P(s'_i \mid e, s)\, P(e \mid s, f(s), a). \qquad (1)$$

Fig. 2c shows an example of this decomposition. The agent uses the feature function $f$ to identify the relevant features, and then uses both state attributes and features to predict the action effects. We also assume that the effect $e$ and current state $s$ determine the next state $s'$, thus $P(s' \mid e, s)$ is either 0 or 1. This defines the semantic meaning of the effect, which is assumed to be known by the agent. The remaining task is to learn $P(e \mid s, a) = P(e \mid x(s), a)$, where $x(s) = (s, f(s))$ is a vector containing both the state attributes and the state features. Assuming the effect $e$ is discrete, learning $P(e \mid x(s), a)$ is a classification problem. In Section 4 we will show how to solve this problem by using multinomial logistic regression methods.
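To make the factorization in Equation (1) concrete, the sketch below evaluates it in a toy grid-world where states are (x, y) pairs and effects are relative moves. Everything specific here is a hypothetical illustration: the effect set, the single `slippery` feature, the made-up probabilities in `effect_probs` (a stand-in for the learned $P(e \mid s, f(s), a)$), and the convention that action names coincide with effect names.

```python
# Effects are relative state changes (hypothetical grid-world encoding).
EFFECTS = {"north": (0, 1), "south": (0, -1), "east": (1, 0),
           "west": (-1, 0), "stay": (0, 0)}


def apply_effect(state, effect):
    """Deterministic semantics of an effect: P(s' | e, s) is 0 or 1."""
    dx, dy = EFFECTS[effect]
    return (state[0] + dx, state[1] + dy)


def effect_probs(state, features, action):
    """Stand-in for the learned P(e | s, f(s), a); the numbers are made up."""
    p_success = 0.5 if features["slippery"] else 0.8
    probs = {"stay": 1.0 - p_success}
    probs[action] = probs.get(action, 0.0) + p_success
    return probs


def transition_prob(state, features, action, next_state):
    """Evaluate Equation (1): a product over state attributes i of
    sum_e P(s'_i | e, s) * P(e | s, f(s), a)."""
    p_effect = effect_probs(state, features, action)
    prob = 1.0
    for i in range(len(state)):
        prob *= sum(p for e, p in p_effect.items()
                    if apply_effect(state, e)[i] == next_state[i])
    return prob


p_dry = transition_prob((1, 1), {"slippery": False}, "north", (1, 2))
p_wet = transition_prob((1, 1), {"slippery": True}, "north", (1, 1))
```

Under these assumptions the agent only has to learn a distribution over five effects per action, however many grid cells there are; the cell coordinates enter only through the known, deterministic `apply_effect`.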
3.2. Transferring expectations: TES

In our framework, the knowledge gathered and transferred by the agent is collected into a library $\mathcal{T}$ of online effect predictors or views. A view consists of a structure component $\bar{f}$ that picks the features which should be focused on, and a quantitative component $\Theta$ that defines how these features should be combined to approximate the distribution of action effects. Formally, a view is defined as $\tau = (\bar{f}, \Theta)$, such that $P(E \mid S, a; \tau) = P(E \mid \bar{f}(S), a; \Theta) = \tau(S, a, E)$, in which $\bar{f}$ is an orthogonal projection of $x(s)$ to some subspace of $\mathbb{R}^m$, where $m$ is the dimension of the subspace. Each view $\tau$ is specialized in predicting the effects of one action $a(\tau) \in A$ and it yields a probability distribution for the effects of the action $a$ in any state. This prediction is based on the features of the state and the parameters $\Theta(\tau)$ of the view that may be adjusted based on the actual effects observed in the task environment.

We denote the subset of views that specify the effects for action $a$ by $\mathcal{T}^a \subset \mathcal{T}$. The main challenge is to build and maintain a comprehensive set of views that can be used in new environments likely resembling the old ones, but at the same time allow adaptation to new tasks with completely new transition dynamics and feature distributions.

At the beginning of every new task, the existing library is copied into a working library which is also augmented with fresh, uninformed views, one for each action, that are ready to be adapted to new tasks. We then select, for each action, a view with a good “track record”, i.e., it has been “used” or applied many times or with a high recency-weighted score. This view is used to estimate the optimal policy based on the transition model specified in Equation (1), and the policy is used to pick the first action $a$. The action effect is then used to score all the views in the working library and to adjust their parameters. In each round the selection of views is repeated based on their scores, and the new optimal policy is calculated based on the new selections. At the end of the task, the actual library is updated by possibly recruiting the views that have “performed well” and retiring those that have not. A more rigorous version of the procedure is described in Algorithm 2.

3.2.1. Scoring the views
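To assess the quality of a view $\tau$, we measure its predictive performance by a cumulative log-score. This is a proper score [30] that can be effectively calculated online.

Given a sequence $D^a = (d_1, d_2, \ldots, d_N)$ of observations $d_i = (s_i, a, e_i)$ in which action $a$ has resulted in effect $e_i$ in state $s_i$, the score for an $a$-specialized $\tau$ is

$$S(\tau, D^a) = \sum_{i=1}^{N} \log \tau(s_i, a, e_i; \theta_i(\tau)),$$

where $\tau(s_i, a, e_i; \theta_i(\tau))$ is the probability of effect $e_i$ given by the view or effect predictor $\tau$ based on the features of state $s_i$ and the parameters $\theta_i(\tau)$ that may have been adjusted using previous data $(d_1, d_2, \ldots, d_{i-1})$.

Algorithm 2 TES: Transferring Expectations using a library of views.
Input: $\mathcal{T} = \{\tau_1, \tau_2, \ldots\}$: view library; $\mathrm{CMDP}_j$: a new $j$th task; $\Phi$: view goodness evaluator
Let $\mathcal{T}_0$ be a set of fresh views – one for each action
$\mathcal{T}_{tmp} \leftarrow \mathcal{T} \cup \mathcal{T}_0$    /* the working library for the task */
for all $a \in A$ do $\hat{T}[a] \leftarrow \arg\max_{\tau \in \mathcal{T}^a} \Phi(\tau, j)$ end for    /* selecting views */
for t = 0, 1, 2, . . . do
    $a_t \leftarrow \hat{\pi}(s_t)$, where $\hat{\pi}$ is obtained by solving MDP using transition model $\hat{T}$
    Perform action $a_t$ and observe effect $e_t$
    for all $\tau \in \mathcal{T}^{a_t}_{tmp} \cup \mathcal{T}^{a_t}$ do Score[$\tau$] ← Score[$\tau$] + $\log \tau(s_t, a_t, e_t)$ end for
    for all $\tau \in \mathcal{T}^{a_t}_{tmp}$ do Update view $\tau$ based on $(f(s_t), a_t, e_t)$ end for
    $\hat{T}[a_t] \leftarrow \arg\max_{\tau \in \mathcal{T}^{a_t}_{tmp}}$ Score[$\tau$]    /* selecting views */
end for
for all $a \in A$ do
    $\tau^* \leftarrow \arg\max_{\tau \in \mathcal{T}^{a}_{tmp}}$ Score[$\tau$];
    $\mathcal{T}^a \leftarrow$ growLibrary($\mathcal{T}^a$, $\tau^*$, Score, $j$)    /* updating library */
end for
if $|\mathcal{T}| > M$ then $\mathcal{T} \leftarrow \mathcal{T} - \{\arg\min_{\tau \in \mathcal{T}} \Phi(\tau, j)\}$ end if    /* pruning library */

Algorithm 3 Grow sub-library $\mathcal{T}^a$.
Input: $\mathcal{T}^a$, $\tau^*$, Score, $j$: task index; $c$: constant; $H_{\tau^*} = \{\}$: empty history record
Output: updated library subset $\mathcal{T}^a$ and winning histories $H_{\tau^*}$
case $\tau^* \in \mathcal{T}^a_0$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$    /* add newbie to library */
otherwise do
    Let $\bar{\tau} \in \mathcal{T}$ be the original, not adapted version of $\tau^*$
    case Score[$\tau^*$] − Score[$\bar{\tau}$] > $c$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$
    otherwise do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\} - \{\bar{\tau}\}$
        $H_{\tau^*} \leftarrow H_{\bar{\tau}}$    /* inherit history */
$H_{\tau^*} \leftarrow H_{\tau^*} \cup \{j\}$

3.2.2. Growing the library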
After completing a task, the highest scoring views for each action are considered for recruiting into the actual library. The winning “newbies” or new entries are automatically accepted. In this case, the data has most probably come from a distribution that is far from any of the current models, otherwise one of the current models would have had an advantage to adapt and win. In other words, this means that the existing views have generated low or negative rewards in the process of deriving the optimal policy in RL; they have not been used further and a new set of world dynamics has been learned. Instead of trying to directly identify the underlying commonalities between the old and new tasks and/or environments, our framework mitigates potential negative transfer effects by learning the new views or transition dynamics in parallel to exploiting and examining if the existing views can improve learning. This ability is important in real domains where the underlying system dynamics may not be easily determined.
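The ranking of views that decides which ones are considered for recruiting relies on the cumulative log-score of Section 3.2.1. A minimal sketch of that score is given below. The `ConstantView` class with fixed probabilities, the probability floor, and the toy observations are assumptions for illustration only; a real view would also adjust its parameters after each observation, so that every prediction uses only earlier data.

```python
import math


class ConstantView:
    """Toy effect predictor with fixed probabilities (hypothetical stand-in
    for a learned view tau)."""

    def __init__(self, probs):
        self.probs = probs

    def predict(self, state, action, effect):
        # A small floor avoids log(0) for effects the view rules out.
        return self.probs.get(effect, 1e-12)


def cumulative_log_score(view, observations):
    """S(tau, D^a): sum of log tau(s_i, a, e_i) over observations (s, a, e)."""
    return sum(math.log(view.predict(s, a, e)) for s, a, e in observations)


observations = [((0, 0), "north", "north"),
                ((0, 1), "north", "north"),
                ((0, 2), "north", "stay")]
views = {"reliable": ConstantView({"north": 0.8, "stay": 0.2}),
         "slippery": ConstantView({"north": 0.5, "stay": 0.5})}
scores = {name: cumulative_log_score(view, observations)
          for name, view in views.items()}
best = max(scores, key=scores.get)
```

Because the score is a running sum, it can be updated with one logarithm per view after every action, which is what makes it cheap enough to maintain for the whole working library.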
The winners $\tau^*$ that are adjusted versions of old views $\bar{\tau}$ are accepted as new members if they score significantly higher than their original versions, based on the logarithm of the prequential likelihood ratio [31] $\Lambda(\tau^*, \bar{\tau}) = S(\tau^*, D^a) - S(\bar{\tau}, D^a)$. Otherwise, the original versions $\bar{\tau}$ get their parameters updated to the new values. This procedure is just a heuristic and other inclusion and updating criteria may well be considered. The policy is detailed in Algorithm 3.

3.2.3. Pruning the library
To keep the library relatively compact, a plausible policy is to remove views that have not performed well for a long time, possibly because there are better predictors or they have become obsolete in the new tasks or environments. To implement such a retiring scheme, each view $\tau$ maintains a list $H_\tau$ of task indices that indicates the tasks for which the view has been the best scoring predictor for its specialty action $a(\tau)$. We can then calculate the recency weighted track record for each view. In practice, we have adopted the procedure by Zhu et al. [32] that introduces the recency weighted score at time $T$ as

$$\Phi(\tau, T) = \sum_{t \in H_\tau} e^{-\mu (T - t)},$$

where $\mu$ controls the speed of decay of past success. Other decay functions could naturally also be used. The pruning can then be done by introducing a threshold for recency weighted score or always maintaining the top $m$ views.

4. A view learning algorithm
In TES, a view can be implemented by any probabilistic classification model that can be quickly learned online. This requirement excludes most of the batch or offline classification methods. In this study, we introduce a scalable online sparse multinomial logistic regression algorithm to incrementally learn a view. The proposed algorithm optimizes an objective function similar to the group lasso [33] which has been recently suggested for efficient feature selection among a very large set of features [27].

Algorithm 4 The mDAGL algorithm.
Input: $\lambda$, $\alpha$
Let $h(W)$ be a strongly convex function with modulus 1
Let $W_1 = W_0 = \arg\min_W h(W)$
Let $\bar{G}_0 = \bar{0}$
for t = 1, 2, 3, . . . do
    $(y_t, x_t)$ ← observe data
    $(W_{t+1}, \bar{G}_t)$ ← mDAGL-update($t$, $y_t$, $x_t$, $W_t$, $\bar{G}_{t-1}$, $\lambda$, $\alpha$)
end for
Multinomial logistic regression is a simple yet effective classification method. Assuming $K$ classes of $d$-dimensional vectors $x \in \mathbb{R}^d$, we represent each class $k$ with a $d$-dimensional prototype vector $W_k$. Classification of an input vector $x$ is based on how “similar” it is to the prototype vectors. Similarity is measured with the inner product $\langle W_k, x \rangle = \sum_{i=1}^{d} W_{ki} x_i$, where $x_i$ denotes feature $i$. The probability of a class is defined by $P(y = k \mid x; W_k) \propto e^{\langle W_k, x \rangle}$. The parameter vectors of the model form the rows of a matrix $W = (W_1, \ldots, W_K)^T$. Let $l_t(W^t) = -\log P(y^t \mid x^t; W^t)$ denote the item-wise log-loss of a model with coefficient matrix $W^t$ predicting a data point $(y^t, x^t)$ observed at time $t$. A typical objective of an online learning system is to minimize the total loss by updating its $W^t$ over time. However, the resulting model will often be very complicated and over-fitting. To achieve a parsimonious model, we express our a priori belief that most features are irrelevant or superfluous by introducing a regularization term $\Psi(W) = \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W_{\cdot i}\|_2$, where $\|W_{\cdot i}\|_2$ denotes the 2-norm of the $i$th column of $W$, and $\lambda$ is a positive constant. This regularization is similar to that of group lasso. It communicates the idea that it is likely that a whole column of $W$ has zero values (especially for large $\lambda$). A column of all zeros suggests that the corresponding feature is not necessary for classification. The objective function can now be written as

$$L(T) = \sum_{t=1}^{T} \left( l_t(W^t) + \Psi(W^t) \right) = \sum_{t=1}^{T} \left( -\log \frac{e^{\langle W^t_{y^t}, x^t \rangle}}{\sum_{k} e^{\langle W^t_k, x^t \rangle}} + \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W^t_{\cdot i}\|_2 \right),$$
where $W^t$ is the coefficient matrix learned using the $t - 1$ previously observed data items. The quality of a sequence of parameter matrices $W^t$, $t \in (1, \ldots, T)$, with respect to a fixed parameter matrix $W$ can be measured by the amount of extra loss, or regret

$$R_T(W) = L(T) - L_W(T) = \sum_{t=1}^{T} \left( l_t(W^t) + \Psi(W^t) \right) - \sum_{t=1}^{T} \left( l_t(W) + \Psi(W) \right).$$

We want to learn a series of parameters $W^t$ to achieve small regret with respect to a good model $W$ that has a small loss $L_W(T)$.

4.1. Online learning of multinomial regularized logistic regression with group lasso: mDAGL
Recently, Xiao et al. [26] introduced a dual averaging method for solving lasso online, and Yang et al. [27] extended the work for solving group lasso. The methods are simple, efficient, and scalable for learning the regularized regression models. Following the same approach, we introduce mDAGL (Algorithm 4), a new algorithm to incrementally learn $W$ to optimize the specified objective function in the sparse multinomial logistic regression. Typically, group lasso is used to regularize groups of coefficients where each coefficient corresponds to a particular feature. In our case of multinomial logistic regression, a group comprises the coefficients of a feature; in other words, a group is a column in the coefficient matrix $W$.
In the standard online stochastic gradient descent method, after observing the data vector $(y^t, x^t)$, we adjust the parameters of the model toward the directions that maximize the likelihood of the observation (or equivalently, minimize the item-wise log-loss $l_t$). The dual averaging method is a version of stochastic gradient descent, thus we again have to compute the gradient $G^t$ of the log-loss function for each observation $(y^t, x^t)$. But instead of moving the parameters based on these directions $G^t$, we use $G^t$ to update the average of gradients, $\bar{G}^t$, for observations encountered thus far, and move the parameters away from those average directions (away, since we are minimizing). We will next describe the method more formally.
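Algorithm 5 mDAGL-update.
    Input: $t, y^t, x^t, W^t, \bar{G}^{t-1}, \lambda, \alpha$
    $G^t \leftarrow$ use equation (2) with $(y^t, x^t)$, $W^t$
    $\bar{G}^t \leftarrow \frac{t-1}{t} \bar{G}^{t-1} + \frac{1}{t} G^t$
    $W^{t+1} \leftarrow$ use equation (A.1) with $\bar{G}^t$, $\beta_t = \alpha\sqrt{t}$, $\lambda$
    return $(W^{t+1}, \bar{G}^t)$

Algorithm 6 The loreRL algorithm.
    Input: mDAGL regularization parameters $\lambda$, $\alpha$, CMDP variables $S, f, A, E, R, \gamma$, exploration $\epsilon$.
    Let $W = (W_1, W_2, \ldots, W_{|A|}) = (W^0, W^0, \ldots, W^0)$
    Let $\bar{G} = (\bar{G}_1, \bar{G}_2, \ldots, \bar{G}_{|A|}) = (0, 0, \ldots, 0)$
    $s_0 \leftarrow$ random initial state
    for $t = 1, 2, 3, \ldots$ do
        $\pi \leftarrow$ Solve MDP using transition model $T(\bar{W})$
        $a \leftarrow \pi(s_t, \epsilon)$    # $\epsilon$-greedy action selection
        Take action $a$ yielding effect $e$, next state $s_{t+1}$
        $(W_a, \bar{G}_a) \leftarrow$ mDAGL($t, e, x(s_t), W_a, \bar{G}_a, \lambda, \alpha$)
    end for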
We initialize the parameters $W$ to a $K \times d$ matrix of all zeros. Let $G^t_{ki}$ be the partial derivatives of function $l_t(W)$ with respect to $W_{ki}$ at $W^t$ ($G^t_{ki} = \frac{\partial l_t}{\partial W_{ki}}(W^t)$). We define $\bar{G}^t$ to be a matrix of average partial derivatives, i.e., $\bar{G}^t_{ki} = \frac{1}{t} \sum_{\tau=1}^{t} G^\tau_{ki}$, where taking the gradient of the loss function $l_\tau$ with respect to the parameter $W_{ki}$ gives us the formula

$$G^\tau_{ki} = -x^\tau_i \left( I(y^\tau = k) - P(k \mid x^\tau; W^\tau) \right). \qquad (2)$$

We notice that this partial derivative points either away from or towards the observed feature $x^\tau_i$ depending on whether the observation belongs to the class $k$ or not.
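As an illustration of equation (2), the following minimal NumPy sketch computes the class probabilities and the gradient of the item-wise log-loss for a single observation. The function names and the toy sizes ($K = 3$, $d = 4$) are assumptions made for this example, not part of the original method description.

```python
import numpy as np

def class_probabilities(W, x):
    """P(y = k | x; W) for all k: a softmax over the inner products <W_k, x>."""
    scores = W @ x
    scores = scores - scores.max()   # shift for numerical stability; softmax is unchanged
    expo = np.exp(scores)
    return expo / expo.sum()

def log_loss_gradient(W, x, y):
    """Gradient of -log P(y | x; W) with respect to W, following equation (2):
    G[k, i] = -x[i] * (I(y == k) - P(k | x; W))."""
    p = class_probabilities(W, x)
    indicator = np.zeros(W.shape[0])
    indicator[y] = 1.0
    return -np.outer(indicator - p, x)

# Toy usage with hypothetical sizes: K = 3 classes, d = 4 features.
W = np.zeros((3, 4))
x = np.array([1.0, 0.0, 2.0, 0.0])
G = log_loss_gradient(W, x, y=1)
```

With all-zero weights every class has probability 1/3, so the row of the observed class is $-\frac{2}{3}x$ and the other rows are $\frac{1}{3}x$; columns of features that are zero in $x$ receive no gradient.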
For any data observed at time $t$, we use the average gradient $\bar{G}^t$ to update the coefficient matrix $W$ via

$$W^{t+1} = \arg\min_{W} \left\{ \langle \bar{G}^t, W \rangle + \Psi(W) + \frac{\beta_t}{t} h(W) \right\}, \qquad (3)$$

where $\beta_t$ is a non-negative, non-decreasing sequence of real numbers, and $\langle \cdot, \cdot \rangle$ denotes an inner product between two matrices; $\langle \bar{G}^t, W \rangle = \sum_{k,i} \bar{G}^t_{ki} W_{ki}$. The first term is minimized by $W$ that points to the direction $-\bar{G}^t$. While the first term prefers long vectors, the regularization term $\Psi(W)$ balances this out. The third term introduces an extra regularization in terms of a strongly convex function $h(W)$ that is needed for convergence and sparsity. In practice we use the squared Frobenius norm $h(W) = \sum_{k,i} W_{ki}^2$ and $\beta_t = \sqrt{t}$.
Solving the minimization problem above leads to an update rule (Algorithm 5) in which each column of $W^{t+1}$ is a scaled version of the corresponding column of $\bar{G}^t$. Furthermore, if the length of the average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This amounts to feature selection. The full definition and proof of the rule are detailed in Appendix A.
A regret analysis confirms that the solution will converge and that the average maximal regret asymptotically approaches zero with rate $O(1/\sqrt{t})$. The full analysis is detailed in Appendix B.

4.2. Model-based RL with multinomial logistic regression: loreRL

Our main task is to turn transition model learning into the learning of conditional distributions $P(E \mid s, f(s), a)$ using multinomial logistic regression, for which attention to relevant features can be efficiently implemented online via mDAGL. The key steps of our method, called loreRL (RL with regularized logistic regression), are presented in Algorithm 6. Inputs to loreRL are the CMDP components (except the transition function), the regularization parameters $\lambda$ and $\alpha$ of the mDAGL algorithm, and the $\epsilon$ that determines the probability of taking a random action. We first initialize the logistic regression parameters $W_a$ and the average gradient matrices $\bar{G}_a$ for each action $a \in A$. We also randomly select a starting state $s_0$. At each time step, a random action $a$ is chosen with a small probability $\epsilon$, but otherwise we calculate the optimal policy $\pi$ for an MDP whose transition model $T(W)$ is based on the current effect predictors. While we have used value iteration (like in Rmax) for finding the optimal policy, any other model-based RL technique can be used as well. We do not focus on the planning part of RL here, but search heuristics such as those used in Dyna-Q [34] or Prioritized Sweeping [35] can be deployed for a more scalable algorithm. After performing an action $a$ in state $s_t$ and observing its effect $e$, the experience $(e, s_t, f(s_t))$ will be presented to the mDAGL algorithm, which updates the parameter matrix $W_a$ and the gradient matrix $\bar{G}_a$.

As we just do $\epsilon$-greedy random sampling, it is impossible to guarantee PAC convergence to an optimal policy. Assuming that observed data is i.i.d., we can prove that the difference in optimal value functions of two CMDPs with different logistic regression based transition functions is bounded by the difference in their parameters. This leads to a corollary for convergence to a near optimal policy.
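The planning half of the loreRL loop described in this subsection can be sketched as follows: effect predictors are turned into a transition model, value iteration yields a policy, and the action is chosen $\epsilon$-greedily. This is only a toy sketch under stated assumptions: the one-dimensional corridor, the three-element effect set, the bias-only features, and all function names are hypothetical, and the mDAGL learning step is omitted.

```python
import numpy as np

EFFECTS = (-1, 0, +1)  # hypothetical effect set: moved left, did not move, moved right

def effect_probs(W_a, x):
    """Softmax effect distribution P(e | x; W_a); W_a has one row per effect."""
    z = W_a @ x
    p = np.exp(z - z.max())
    return p / p.sum()

def build_transition_model(W, features):
    """T[a, s, s'] obtained by applying every predicted effect to state s."""
    n_states = len(features)
    T = np.zeros((len(W), n_states, n_states))
    for a, W_a in enumerate(W):
        for s in range(n_states):
            for e, p in zip(EFFECTS, effect_probs(W_a, features[s])):
                s_next = min(max(s + e, 0), n_states - 1)  # corridor ends act as walls
                T[a, s, s_next] += p
    return T

def value_iteration(T, R, gamma, terminal, n_iter=500):
    """Greedy policy and values for rewards R[s'] received on entering s'."""
    V = np.zeros(T.shape[1])
    for _ in range(n_iter):
        Q = T @ (R + gamma * V)   # Q[a, s]
        Q[:, terminal] = 0.0      # no further reward after a terminal state
        V = Q.max(axis=0)
    return Q.argmax(axis=0), V

def epsilon_greedy(policy, s, epsilon, n_actions, rng):
    """With probability epsilon pick a uniformly random action, else follow the policy."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(policy[s])

# Toy usage: 5-state corridor, bias-only features, exit at the right end.
features = np.ones((5, 1))
W = [np.array([[5.0], [0.0], [0.0]]),   # action 0 mostly yields effect -1
     np.array([[0.0], [0.0], [5.0]])]   # action 1 mostly yields effect +1
R = np.array([-0.01, -0.01, -0.01, -0.01, 1.0])
terminal = np.array([False, False, False, False, True])
policy, V = value_iteration(build_transition_model(W, features), R, 0.95, terminal)
action = epsilon_greedy(policy, 0, 0.05, len(W), np.random.default_rng(0))
```

In a full loop, the observed effect of `action` would then be passed, together with the features of the current state, to the mDAGL update of that action's predictor before replanning.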
Theorem 1 (Difference in value function). Let $M_1 = (S, f, A, T(W^{M_1}), E, R, \gamma)$ and $M_2 = (S, f, A, T(W^{M_2}), E, R, \gamma)$ be two CMDPs with optimal policies $\pi_1$ and $\pi_2$ respectively. Let us denote by $V^{M}_{\pi}$ the value function for policy $\pi$ in CMDP $M$. Let

$$\Delta_1 = 2 \max_{a \in A,\, e \in E} \left\| W^{(a),M_1}_e - W^{(a),M_2}_e \right\|_1 \sup_{s} \| x(s) \|_1,$$

then

$$\max_{s \in S} \left| V^{M_2}_{\pi_1}(s) - V^{M_2}_{\pi_2}(s) \right| \le \frac{2 \gamma V_{\max} \Delta_1}{1 - \gamma},$$

where $W^{(a),M_1}_e$ and $W^{(a),M_2}_e$ refer to the vector of coefficients corresponding to class $E = e$ under action $a$ in model $M_1$ and $M_2$ respectively, $\| \cdot \|_1$ is the 1-norm of a vector, and $V_{\max}$ is the maximum value of any state for any policy in either of the CMDPs.

By taking $M_2$ to be a CMDP based on the optimal $W^*$ and $M_1$ an estimated CMDP based on mDAGL, we can derive a vanishing bound for the value difference of policies. In case the true transition model is representable by a sparse $W^*$, we would most probably converge to a near optimal policy. The full proof for the theorem is in Appendix C.
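To get a feel for the size of the bound in Theorem 1, the sketch below evaluates it numerically, reading the parameter-difference quantity (written $\Delta_1$ here) as $2 \max_{a,e} \|W^{(a),M_1}_e - W^{(a),M_2}_e\|_1 \sup_s \|x(s)\|_1$ and the bound as $2\gamma V_{\max}\Delta_1 / (1-\gamma)$. The array layout, the function name, and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def value_difference_bound(W1, W2, X, gamma, v_max):
    """Evaluate Delta_1 and the Theorem 1 bound for two parameter sets.

    W1, W2: arrays of shape (|A|, |E|, d), one coefficient vector per (action, effect).
    X: array of shape (n_states, d) holding the feature vectors x(s).
    """
    max_param_diff = np.abs(W1 - W2).sum(axis=-1).max()   # max over a, e of the 1-norm
    max_feature_norm = np.abs(X).sum(axis=-1).max()       # sup over s of ||x(s)||_1
    delta = 2.0 * max_param_diff * max_feature_norm
    return delta, 2.0 * gamma * v_max * delta / (1.0 - gamma)

# Toy usage: 2 actions, 3 effects, 4 features; the two models differ in one vector.
W1 = np.zeros((2, 3, 4))
W2 = W1.copy()
W2[1, 2] = [0.0005, -0.0005, 0.0, 0.0]
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0, 0.0]])
delta, bound = value_difference_bound(W1, W2, X, gamma=0.95, v_max=1.0)
```

The factor $1/(1-\gamma)$ makes the bound sensitive to the discount: at $\gamma = 0.95$ the parameter difference is amplified by a factor of 38 (for $V_{\max} = 1$), so the estimated parameters must be quite close to the reference ones before the bound becomes informative.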
When we cannot express the true transition dynamics as logistic regression based on available state features, it is hard to give guarantees of performance. However, we can still have some confidence in doing well. The logistic regression model $P^*_l$ closest (in Kullback–Leibler distance) to the true model $P_{true}$ (possibly not a logistic regression model) is the one² that has the smallest expected log-loss. While our optimality criterion is the expected regularized log-loss, we expect the regularized log-loss optimal model $P^*$ to be close to $P^*_l$, thus almost as close to $P_{true}$ as we can get. This relatively small KL-distance can be converted to relatively small distances in actual transition probabilities, which can then further be converted to a relatively small bound on value differences by the same arguments used in proving Theorem 1 (in Appendix C). Therefore, since our model would very likely converge close to $P^*$, we can expect to do almost as well as $P^*$.
5. Experiments

We examine the performance of our expectation transfer algorithm TES, which transfers views to speed up the learning process across different complex, feature-rich, heterogeneous, and dynamic environments. We show that TES can efficiently:

1. learn the appropriate views online;
2. select views using the proposed scoring metric;
3. achieve a good jumpstart; and
4. perform well in the long run.

We first evaluate our approach in a simulated navigation domain where the assumptions hold. We then conduct a case study in a real robotic domain to see if the theoretical results are useful in practice.

5.1. Simulated robot navigation
Environment. We consider a grid-based robot navigation problem in which each grid cell has the surface of either sand, soil, water, brick, or fire. In addition, there may be walls between cells. The surfaces and walls determine the stochastic dynamics of the world. However, the agent also observes numerous other features in the environment. The agent has to learn to focus on the important features to learn the environment dynamics model, and consequently to achieve its goal.

Action, states, and rewards. The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it or leave it in its current state. Effects of the actions are captured in five outcomes (moved up, left, down, right, did not move). The states are defined by the (x,y)-coordinates of the agent. The agent spends 0.01 units of energy to perform an action. It loses 1 unit if falling into a state of fire, but gains 1 unit when successfully reaching an exit door.

Goal. The goal is to reach any exit door in the world consuming as little energy as possible. A task ends when the agent reaches a terminal state, i.e., any exit door or state with fire.

² Such model may not always exist since the parameter set is open. However, for our argument, any model with almost infimum distance to the true

Fig. 3. Accumulated rewards in a 900 state CMDP for various model-based RL-methods.

Table 1
Average running time per episode in 800 episodes when acting in an environment with 210 features. (Slow RL-DT, LSE-Rmax could only be run with 10 features.) Run on Intel Xeon CPU 2.13 GHz, 32 GB RAM.

Algorithm     fRmax   fEpsG   RL-DT   LSE-R.   bloreRL   loreRL
Time (sec.)   0.26    0.25    9.09    67.53    4.3       0.55

Tasks. We design fifteen tasks with grid sizes ranging from 20 × 20 to 30 × 30. Each task has a different state space and different terminal states. Each state (cell) is characterized by its surface materials and the walls around it, and 200 additional irrelevant, random binary features. The tasks may involve different dynamics as well as different distributions of the surface materials. In our experiments, the environment transition dynamics is generated using three different sets of multinomial logistic regression models; each combination of cell surfaces and walls around the cell will lead to different transition dynamics at the cell. The probability of going through a wall is rounded to zero and the freed probability mass is evenly distributed to other effects. The agent’s starting position is randomly picked in each episode.
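The wall handling in the simulated dynamics (zeroing the probability of moving through a wall and spreading the freed mass evenly over the remaining effects) can be sketched as below. The function name, the effect ordering (moved up, left, down, right, did not move), and the example probabilities are assumptions for illustration; the sketch also assumes that "did not move" is never blocked, so at least one effect always remains.

```python
import numpy as np

def block_walls(effect_probs, blocked):
    """Zero the effects that would cross a wall and redistribute their mass evenly.

    effect_probs: probabilities over (up, left, down, right, no-move).
    blocked: boolean mask of the same length, True where a wall blocks the effect.
    """
    p = np.asarray(effect_probs, dtype=float).copy()
    blocked = np.asarray(blocked, dtype=bool)
    freed = p[blocked].sum()
    p[blocked] = 0.0
    p[~blocked] += freed / (~blocked).sum()
    return p

# Toy usage: the agent tries to move up, but there is a wall above the cell.
p = block_walls([0.6, 0.1, 0.1, 0.1, 0.1], [True, False, False, False, False])
```

Here the 0.6 mass of "moved up" is split into four equal shares of 0.15, so each remaining effect ends up with probability 0.25 and the distribution still sums to one.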
Transfer learning. The maximum size $M$ of the view library, initially empty, is set to be 20; threshold $c = \log 300$. In a new environment, the TES-agent mainly relies on its transferred knowledge. However, we allow some $\epsilon$-greedy exploration with $\epsilon = 0.05$. The parameters for the view learning algorithm are $\lambda = 0.05$, $\alpha = 1.5$. We conduct a leave-one-out cross-validation experiment with fifteen different tasks. In each scenario the agent is first allowed to experience fourteen tasks, over 100 episodes in each, and it is then tested on the remaining task. No recency weighting is used to calculate the goodness of the views in the library. We next discuss experimental results averaged over 20 runs showing 95% confidence intervals for some representative tasks.

5.1.1. Online view learning in feature-rich environments
We show the empirical evaluation results of loreRL in a 900 cell/state world. We aim to demonstrate that the “single expectation” model-based RL, loreRL, can a) learn views that generalize and approximate the transition model to achieve fast convergence to a near optimal policy, and b) with feature selection, perform well in complex, feature-rich environments. We also want to see if the theoretical promises derived under the assumption of i.i.d. sampling can be realized in practice. We compare accumulated rewards of loreRL with factored Rmax (fRmax), in which the network structures of transition models are known [36], and with factored $\epsilon$-greedy (fEpsG), in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy. We also compare our method with RL-DT [37] and LSE-Rmax [38], which are the state of the art model-based RL algorithms for learning transition models. For these tests, we run loreRL with $\alpha = 1.5$, $\lambda = 0.05$, $\gamma = 0.95$, exploration $\epsilon = 0.05$, parameter $m = 10$ for fRmax, $m = 5$ for Rmax ($m = 5$ is small for Rmax, but increasing it did not yield better results), fixed $m = 10$, $\sigma = 0.99$ for LSE-Rmax (values originally used in [38]).

Generalization and convergence. We first show that when the feature space is small, loreRL performs as efficiently as the state of the art methods. RL-DT employs a decision tree to generalize transition dynamics knowledge over states, but it is implemented with an $\epsilon$-greedy exploration strategy. LSE-Rmax appears to be the best structure learning method for ergodic factored MDPs [38]. fRmax and fEpsG have correct DBN structures provided by an oracle. All the methods are implemented with our customized DBN to incorporate domain knowledge. Rmax is included as a reference to show the effect of knowledge generalization.
As seen in Fig. 3a, loreRL can approximate the world dynamics using samples in all the states, thus it converges as fast as fEpsG and RL-DT to a near optimal policy. Although fRmax is provided with the correct DBN structure, its accumulated rewards are lower due to aggressive exploration to find the optimal model. After exploration the policy is guaranteed to be near optimal, but it may still take a long time (or forever) to catch up with loreRL. While LSE-Rmax follows the Rmax scheme, it starts with a simple model and explores a bit less aggressively than fRmax, gaining some advantage in early episodes. However, LSE-Rmax appears to require much more data to choose a more complex model. Its accumulated reward drops below fRmax after 150 episodes, and the angle of the curve suggests that its DBN structure is still not correct. We do not run LSE-Rmax for more episodes, as the algorithm is computationally very demanding (Table 1).

When the feature set includes many irrelevant features (Fig. 3b), loreRL is able to learn the relevant ones and still gain nearly as high accumulated rewards as fEpsG, which has relevant features provided by an oracle. loreRL’s running time is also not much longer than fRmax’s or fEpsG’s (Table 1). Other methods are too slow to run in this high-dimensional environment. These results also suggest that with $\epsilon$-greedy exploration and random restarts, a near optimal policy can be found even without i.i.d. data sampling.
Fig. 4. Performance difference to TES in early trials in a) same dynamics, b) heterogeneous environments. c) Convergence.

Table 2
Cumulative reward after first episodes. For example, in Task 1 TES could save (0.616 − 0.113)/0.01 = 50.3 actions compared to LWT.

Task   loreRL    LWT       TES
1      −0.681    0.113     0.616
2      −0.826    −0.966    −0.369
3      −0.814    −0.300    0.230
4      −1.068    0.024     −0.044
5      −0.575    −1.205    −0.541
6      −0.810    −0.345    −0.784
7      −0.529    −1.104    −0.265
8      −0.398    −1.98     0.255
9      −0.653    −0.057    0.001
10     −0.518    −0.664    −0.298
11     −0.528    −0.230    −1.184
12     −0.244    −1.228    −0.077
13     −0.173    0.034     0.209
14     −1.176    0.244     0.389
15     −0.692    −0.564    −0.407
Feature selection. To understand the role of feature selection, we compare loreRL with a bloreRL that is based on multinomial logistic regression without feature selection (without the regularization term). fEpsG and fRmax are baselines.

Fig. 3b shows the accumulated rewards when the environment has 200 irrelevant binary features. As seen, loreRL is still able to quickly converge to the optimal policy, and outperforms fRmax and bloreRL. Fig. 3c shows performances of loreRL and bloreRL after 800 episodes as a function of the number of irrelevant features. Only minimally affected by the actual number of irrelevant features, loreRL can quickly select the relevant features and outperform bloreRL. loreRL does not lose much to fEpsG either. While fRmax may find an optimal policy before loreRL due to aggressive exploration, its accumulated rewards are still lower than loreRL’s. We also observe that loreRL, through selecting a small set of features, runs much faster than bloreRL (Table 1).
5.1.2. View selection and multi-view transfer in complex environments

Transferring expectations between same dynamics tasks. To ensure that TES is capable of basic model transfer, we first evaluate it on a simple task. We train and test TES on two environments which have the same dynamics and 200 irrelevant binary features that challenge the agent’s ability to learn a compact model for transfer. Fig. 4a shows how much the other methods lose to TES in terms of accumulated rewards in the test task. loreRL is an implementation of TES equipped with the view learning algorithm that does not transfer knowledge. fRmax is the factored Rmax in which the network structures of transition models are provided by an oracle [36]; its parameter $m$ is set to be 10 in all the experiments. fEpsG is a heuristic in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy ($\epsilon = 0.1$). The results show that these oracle methods still have to spend time to learn the model parameters, so they gain lower accumulated rewards than TES. This also suggests that the transferred view of TES is likely not only compact but also accurate. Fig. 4a further shows that loreRL and fEpsG are more effective than fRmax in early episodes.

View selection vs. random views. Fig. 4b shows how different views lead to different policies and accumulated rewards over the first 50 episodes in a given task. The Rands curves show the accumulated reward differences with respect to TES when the agent follows some random combinations of views from the library. For clarity we show only 5 such random combinations. For all these curves, the differences quickly turn negative in the beginning, indicating less reward in early episodes. We conclude that our view selection criterion outperforms random selection.

Multiple views vs. single view, and non-transfer. We compare the multi-view learning TES agent with a non-transfer agent loreRL, and an LWT [39] agent that tries to learn only one good model for transfer. We also compare with the oracle method fEpsG. As seen in Fig. 4b, TES outperforms LWT which, due to differences in the tasks, also performs worse than loreRL. When the earlier training tasks are similar to the test task, the LWT agent performs well. However, the TES agent also quickly picks the correct views, thus we never lose much but often gain a lot. We also notice that TES achieves higher accumulated rewards than loreRL and fEpsG, which are bound to make uninformed decisions in the beginning.

We also notice that due to its fast capability of capturing the world dynamics, TES running time is just slightly longer than LWT’s and loreRL’s, which do not perform extra work for view switching but need more time and data to learn the dynamics models.
5.1.3. Jumpstart
Table 2 shows the average cumulative reward after the first episode (the jumpstart effect) for each test task in the leave-one-out cross-validation. We observe that TES usually outperforms both the non-transfer and the LWT approach.
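The arithmetic in the caption of Table 2 can be repeated for every task. The short sketch below copies the Table 2 values, converts the reward gap between TES and LWT into a number of saved actions (each action costs 0.01 units of energy in the simulated domain), and lists the tasks in which TES has the highest first-episode reward. The variable names are chosen for this example.

```python
# Cumulative reward after the first episode, copied from Table 2 (tasks 1..15).
lore_rl = [-0.681, -0.826, -0.814, -1.068, -0.575, -0.810, -0.529, -0.398,
           -0.653, -0.518, -0.528, -0.244, -0.173, -1.176, -0.692]
lwt = [0.113, -0.966, -0.300, 0.024, -1.205, -0.345, -1.104, -1.98,
       -0.057, -0.664, -0.230, -1.228, 0.034, 0.244, -0.564]
tes = [0.616, -0.369, 0.230, -0.044, -0.541, -0.784, -0.265, 0.255,
       0.001, -0.298, -1.184, -0.077, 0.209, 0.389, -0.407]

STEP_COST = 0.01  # energy spent per action in the simulated domain

# Reward gap to LWT expressed as a number of actions, as in the Table 2 caption.
saved_vs_lwt = [(t - l) / STEP_COST for t, l in zip(tes, lwt)]

# Tasks (1-based) in which TES beats both the non-transfer agent and LWT.
tes_best = [i + 1 for i, (t, l, n) in enumerate(zip(tes, lwt, lore_rl))
            if t > l and t > n]

print(round(saved_vs_lwt[0], 1))
print(tes_best)
```

On these numbers TES has the best jumpstart in 12 of the 15 tasks; the exceptions are Tasks 4, 6, and 11, where LWT (and in Task 11 also loreRL) starts with a higher reward.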
Fig. 5. Three different environments. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

5.1.4. Convergence

To study the asymptotic performance of TES, we compare with the oracle method fRmax, which is known to converge to a (near) optimal policy. Notice that in this feature-rich domain, fRmax without the pre-defined DBN structure is just similar to Rmax. Therefore, we also compare with Rmax. For Rmax, the number of visits to any state before it is considered “known” is set to 5, and the exploration probability $\epsilon$ for known states starts to decrease from value 0.1. Fig. 4c shows the accumulated rewards and their statistical dispersions over episodes. Average performance is reflected by the angles of the curves. As seen, TES can achieve a (near) optimal policy very fast and sustain its good performance over the long run. It is only gradually caught up by fRmax and Rmax. This suggests that TES can successfully learn a good library of views in heterogeneous environments and efficiently utilize those views in novel tasks.
5.2. Real robot navigation

The theoretical analyses presented above have shown the advantages of loreRL and TES over the state of the art model-based RL algorithms. We have also demonstrated the efficiency of our methods through a set of empirical evaluations in simulated domains. These results, though valuable, are obtained under assumptions that are favorable for our approach. In this section, we aim to further evaluate the proposed methods in a real robotic domain where we cannot expect the effects of actions to follow a logistic regression model.

5.2.1. Experiment set-up

Environments. Fig. 5 shows the three environments used in our case studies. They are designed so that the robot’s actions would have different effects at different locations, and the environment surfaces are the main factors affecting the action effects. The surfaces are made of various materials such as beans, soil, hay, leaves, shells, paperboard, and nylon Berber carpet. These materials have different physical effects on the objects moving on them. The slopes and obstacles on the surfaces also contribute to the different effects of the actions. In some areas, the surfaces may change because of the robot’s actions. We restore the surfaces to the original conditions after every episode.

For a robot to efficiently plan its path in these environments, only a small set of features based on the slopes, the obstacles, and the materials of the surfaces in different areas of the environments is relevant. These features, however, are very hard to define. It is preferable to leave the robot to automatically select the relevant features from a large set of simple features. To test our approach, therefore, we simply draw green and blue marks on the surfaces. The robot is marked with two red marks. There are also a blue ball and several death-marked spots in the environment. Numerous features can be derived from these artifacts. The robot will need to select a few features that may serve as proxies to the true factors that affect its action effects.

Environment 1 and Environment 3 are deliberately designed so that the robot should learn its views (transition model) based on the blue marks. The transition dynamics in these two environments are very similar. However, the two environments are also different (in terms of irrelevant features): the blue balls, the death places, and the green marks are at different locations. We will explain the features in more detail shortly. Environment 2 is very different from Environment 1 and Environment 3. It is designed so that the robot should learn its views based on the green marks instead of the blue ones.
We treat the environments as discrete MDPs. We discretize Environment 2 into a state space consisting of 8 × 8 (x,y)-locations and 8 different orientations of a robot, which yields a state space of 512 states. Environments 1 and 3 are larger, so we discretize them into a state space consisting of 10 × 10 (x,y)-locations and 8 different orientations of a robot. The environments are in two different sizes, 5 × 5 feet and 6 × 6 feet (Fig. 5).

Fig. 6. The robot. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 7. The system architecture.
Robot. We use the LEGO Mindstorms NXT v1.1 kit to build a three-wheel robot as depicted in Fig. 6. Two front wheels of the robot are attached to two separate motors; the back wheel is free rolling. The track width is 11.2 cm. The robot carries a white panel on top with a big and a small red mark for positioning and orientation detection.

The robot system comprises three main components: a central processor, an observatory system, and a command controller (see Fig. 7). The positioning system is a sub-component of the observatory system. Information about the environment and the robot’s position is captured by a webcam and sent from the observatory system to the central processor to update the robot’s knowledge base as well as to plan the next action. The action command is then transmitted via Bluetooth to the command controller embedded in the robot for execution. We implement the controller in leJOS.³
Actions. The robot is programmed to rotate its left and right wheels in three different ways, corresponding to three actions. For the first action, the robot rotates both its left and right wheels 246 degrees. For the second action, the robot rotates its left wheel 90 and right wheel −90 degrees at the same time. For the third action, the robot rotates its left wheel −90 and right wheel 90 degrees. As the robot may be still moving after each action, we let the system idle for 200 milliseconds after an action, waiting for the robot to stop completely. These actions, under ideal situations, correspond to the actions of move-forward, turn-left, and turn-right, respectively.

Action effects. Due to inaccurate robot motors, sensors, and various real world factors such as the surface materials, slopes of the surface, and obstacles, the actions may change the robot’s relative location in four different ways: moved forward one cell, moved diagonally forward to the cell on the robot’s left, moved diagonally forward to the cell on the robot’s right, and did not move. The robot’s orientation can also be changed in five different ways: turned to the next orientation on left, the second next orientation on left, the next orientation on right, the second next orientation on right, and did not turn. That would result in a total of 20 different effects.
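The 20 effect classes are simply the Cartesian product of the four location changes and the five orientation changes. A minimal sketch of how such an effect set could be enumerated and indexed for use as classification labels is given below; the string labels and the index mapping are hypothetical and only serve the illustration.

```python
from itertools import product

# Four possible changes of relative location and five possible changes of orientation.
LOCATION_EFFECTS = ("forward", "forward-left", "forward-right", "no-move")
ORIENTATION_EFFECTS = ("left-1", "left-2", "right-1", "right-2", "no-turn")

# Every (location change, orientation change) pair is one effect class.
EFFECTS = list(product(LOCATION_EFFECTS, ORIENTATION_EFFECTS))

# Map each effect to a class index usable as a label for the effect predictor.
effect_index = {effect: k for k, effect in enumerate(EFFECTS)}
```

Each action's effect predictor is then a classifier over these 20 classes.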
Sensors. We mainly process information from the webcam in the observatory system. The webcam is attached to the ceiling above the area. The robot, therefore, can fully observe the environment. However, the robot can only capture the big and small red marks on the top of the robot itself, and the information of the locations of the green, blue, red marks, and the ball in the environment. As the features are simple, we use the basic algorithms in the OpenCV library⁴ to detect them. The result is nearly perfect.

Fig. 8. Accumulated rewards by various methods.

Table 3
Average running time per episode in 50 episodes. Run on Intel Centrino Duo T2400 (1.83 GHz), 2 × 512 MB RAM.

Algorithm     fRmax   fEpsG   man-loreRL   loreRL
Time (sec.)   13.44   12.77   9.35         10.81
State attributes and features. As mentioned, an environment is discretized to $n$ rows × $m$ columns × $o$ orientations, so the full environment state space can be identified or factorized by those three state attributes. However, these attributes alone do not contain enough information for predicting action effects or transition dynamics. Therefore, it is critical that each state is also defined with a long vector of binary state features. The “green” binary indicator $f_i^G(s)$ of a state $s$ is set to 1 iff there is a green mark that is further than $i$ units but closer than $i + 1$ units from the (x,y)-center of the state $s$ ($i \in \{0, \ldots, 99\}$). A unit equals the width of the environment divided by 100. Similar features are defined for blue marks and for the blue target ball, yielding 300 binary features. Eight indicators for different robot orientations are also included in the feature base together with four intentionally redundant “there is/is-not a green/blue mark in a state” bits. Altogether these yield 312 binary features per state. The intuition behind these features is that they serve as proxies for surface materials, slopes on the surfaces, obstacles, etc., which appear to be important factors in determining the dynamics in the environments but which the robot’s sensors cannot capture. Although only a few among these 312 features are important for modeling the robot’s actions, the robot does not know which ones actually matter. The robot has to learn to select the relevant features based on feedback while interacting with the environment.

Task. The robot is assumed to know the reward model before any start. The robot’s task is to travel in the environments from a random starting point to reach the blue ball, which will earn it a reward of 2 points. The robot will receive −1 point if it falls out of the area or into “death spots” marked with orange rectangles, and −0.05 points for reaching any other states. An episode ends if the robot reaches a terminal state or gets stuck, i.e., could not move, for four consecutive actions. In other words, the robot aims for the highest cumulative reward in each episode. It tries to reach the blue ball as fast as possible, but it will avoid visiting the death spots or moving out of the map. In case it is very costly or impossible to reach the ball, the robot could give up by running into a death spot or moving out of the map.

5.2.2. Online view learning for model-based RL
The robot battery does not allow us to compare our algorithm with the slow RL-DT and LSE-Rmax algorithms, thus we will only compare loreRL with the fine-tuned algorithms, including fRmax, fEpsG, and man-loreRL, in which we manually select important features and specify the DBN structures for the transition models. man-loreRL is based on multinomial logistic regression models with the 12 manually selected features, including eight indicators for different robot orientations, and four indicators telling if there is/is-not a green/blue mark in a state. We run the experiments with loreRL and man-loreRL setting $\alpha = 0.5$, $\lambda = 0.05$, $\gamma$