Scalable transfer learning in heterogeneous, dynamic environments


Institutional Knowledge at Singapore Management University

Research Collection School Of Information Systems

School of Information Systems

6-2017

Trung Thanh Nguyen, Adobe Systems
Tomi Silander, Xerox Research Centre Europe
Zhuoru Li, National University of Singapore
Tze-Yun Leong, Singapore Management University, [email protected]

DOI: https://doi.org/10.1016/j.artint.2015.09.013


Citation

Nguyen, Trung Thanh; Silander, Tomi; Li, Zhuoru; and Leong, Tze-Yun. Scalable transfer learning in heterogeneous, dynamic environments. (2017). Artificial Intelligence. 247, 70-94. Research Collection School Of Information Systems.

Trung Thanh Nguyen a,*,1, Tomi Silander b,1, Zhuoru Li c, Tze-Yun Leong d,c

a Adobe, 345 Park Avenue, San Jose, CA, USA
b Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France
c School of Computing, National University of Singapore, Singapore
d School of Information Systems, Singapore Management University, Singapore

Article info

Article history:
Received in revised form 17 September 2015
Accepted 29 September 2015
Available online 3 October 2015

Keywords:
Model-based reinforcement learning
Transfer learning
Online feature selection

Abstract

Reinforcement learning is a plausible theoretical basis for developing self-learning, autonomous agents or robots that can effectively represent the world dynamics and efficiently learn the problem features to perform different tasks in different environments. The computational costs and complexities involved, however, are often prohibitive for real-world applications. This study introduces a scalable methodology to learn and transfer knowledge of the transition (and reward) models for model-based reinforcement learning in a complex world. We propose a variant formulation of Markov decision processes that supports efficient online-learning of the relevant problem features to approximate the world dynamics. We apply the new feature selection and dynamics approximation techniques in heterogeneous transfer learning, where the agent automatically maintains and adapts multiple representations of the world to cope with the different environments it encounters during its lifetime. We prove regret bounds for our approach, and empirically demonstrate its capability to quickly converge to a near optimal policy in both real and simulated environments.

1. Introduction

Next generation robotics aims at developing autonomous agents with integrative intelligence; these agents or groups of agents can automatically or semi-automatically collect sensor inputs, make decisions, learn new skills, and interact with humans or other agents to complete multiple, complex tasks in different real-world domains [1]. Emerging global research and development trends point to collaborative robots that can work and interact with people [2], dexterous robots with advanced manipulation skills and “soft touches” [3], and cognitive robots that are self-improving, robust, and flexible [4]. In Asia, for example, social and economic demands have led to intensified activities to develop assistive robots to complement a dwindling workforce and healthcare robots to help care for an ageing population [5].

A major challenge in designing intelligent robots is to equip them with the capabilities to effectively use past experiences and at the same time efficiently learn new skills to perform different tasks in different environments. Real-world dynamics is usually uncertain, unstructured, and non-stationary; the tasks, objectives, resources, and conditions of the robots may also change over time. The skills required for successful problem solving and decision making are not easily described as simple rules, guidelines, or processes. A self-learning agent that can select useful environmental features and learn to behave, to a certain degree, based on trial and error is a promising approach to building robots that act in the real world.

* Corresponding author.
E-mail address: [email protected] (T.T. Nguyen).
1 The main part of the work was done while the author was at the National University of Singapore.
http://dx.doi.org/10.1016/j.artint.2015.09.013
Published in Artificial Intelligence, Volume 247, June 2017, Pages 70-94. http://doi.org/10.1016/j.artint.2015.09.013

1.1. Reinforcement learning in robotics

Reinforcement learning (RL) [6] trains an autonomous agent to behave intelligently based on the feedback it receives when interacting with the environment. An RL task is commonly represented as a Markov decision process (MDP) with a specific domain that includes the relevant actions and states with the defining features, and unknown dynamics (transition) and/or goal or feedback (reward) functions. The main challenges of applying RL in robotic domains include the multidimensional state and action spaces that incur high computational costs, the physical constraints that make acting in the real world much more difficult than in simulated worlds, the uncertainty due to partial observability of the physical environments and inherent noise in the sensor measurements, and the difficulty in tailoring the feedback or reward functions to guide intelligent behavior of the robots [7].

Despite the challenges, RL has recently been applied in complex tasks such as helicopter maneuvering [8], soccer robots [9], and robot navigation [10] with reasonable success. In these cases, careful selections of problem representations, prior experiences, and efficient approximations have rendered RL computationally feasible in the complex settings.

In this work, we design a representation of the world dynamics in model-based RL that allows efficient and scalable approximation of the agent's action effects. At the core of our method is the ability to automatically select the relevant features of the environment that allow the agent to predict how the environment reacts to its actions. This ability to automatically learn to focus attention on the critical factors of the problem is one of the crucial elements needed to make intelligent behavior in complex environments computationally feasible.

The value of online feature extraction is further accentuated in situations where the agent encounters many different environments in its lifetime, and where transfer learning is essential. While manually designing a small set of important features is non-trivial for many real-world tasks, doing so for different environments, some of which may be unknown a priori, is even harder. For example, the Mars exploration rover, Opportunity, is now expected to travel tens of kilometers through different terrains, with possibly varying characteristics and dynamics, to collect scientific objects [11]. Most of the environmental features that it will encounter are unlikely to be specified beforehand.

1.2. Transfer learning in robotic reinforcement learning

In the original RL framework, the agent's knowledge is task and domain specific. A small change in the task or its domain may render the agent's accumulated knowledge useless; costly relearning from scratch is often needed. In transfer learning, an agent applies the knowledge or experience gained in previous (source) tasks to influence and improve the performance of new, related (target) tasks. Transfer in RL assumes that knowledge from one or more source task(s) is used to learn one or more target task(s) faster than if transfer was not used. In contrast to multi-task learning, which assumes that all the tasks are drawn from an underlying, possibly unknown distribution, transfer learning is more general in allowing transfer amongst tasks with heterogeneous domains, dynamics, and goals [12]. Transfer learning is hence most relevant in building lifelong learning agents or robots: agents that can learn, retain, and transfer knowledge acquired from multiple tasks over multiple domains, in a sequential manner, to develop more accurate solutions or policies in learning new tasks over their lifetime [13].

Transfer in RL is different from another main category of transfer learning tasks that focus on classification, regression, and clustering in non-dynamic domains [14]. The main technical challenges involved, however, are similar in identifying the commonalities between the source tasks or domains and the target tasks or domains. Transfer in RL addresses the issues of what to transfer, how to transfer, and when to and when not to transfer to avoid or minimize negative effects on the learning performance in the new environment.

Many existing RL transfer techniques, however, assume the same state-action spaces, i.e., homogeneous domains in different tasks. This assumption may not work well in real-life applications. For example, many environmental cues that help an agent navigate through a forest are simply missing when the agent tries to navigate at sea. While recent efforts have addressed inter-task transfer in heterogeneous domains or different state-action spaces [15–19], such mappings are hard to define when the agent operates in real-world environments with large state-action spaces and multiple goal states, with possibly different state feature distributions and world dynamics. A trade-off between computational complexity and sample efficiency is usually involved in automatically learning the mappings [20].

We propose an efficient, online system that tries to transfer old knowledge, but at the same time evaluates new options to see if they work better. The agent gathers experience during its lifetime and enters a new environment equipped with expectations on how different aspects of the world affect the outcomes of the agent's actions. The main idea is to allow an agent to collect a library of world models or representations, called views, that it can consult to focus attention in a new task. In this paper, we concentrate on approximating the dynamics or transition model. The feedback or reward model library can be learned in an analogous fashion. Effective utilization of the library of world models allows the agent to capture the transition dynamics of the new environment quickly; this should lead to a jumpstart in learning and faster convergence to a near optimal policy. A main challenge is in learning to select a proper view for a new task in a new environment, without any predefined mapping strategies. The ability to learn new views is also critical in complex, feature-rich real environments.

Through scoring the learned views and simultaneous evaluation without the views, our system mitigates potential negative effects by balancing the exploration-and-exploitation trade-off. This is in contrast to minimizing negative transfer effects through determining the underlying relationships between the experiences and the new tasks and domains [14]. Such relationships may not exist and are usually hard to determine in the heterogeneous, dynamic environments that we focus on.

1.3. A new transfer learning framework

We present a new transfer learning framework that supports all the essential steps of a fully autonomous RL transfer agent, as compared with most existing work that mainly focuses on one or two of these steps [12]:

1. Select an appropriate source task or set of source tasks to transfer to a target task;
2. Learn the relationships between the source and the target tasks; and
3. Transfer the relevant knowledge from the source task to the target task to improve learning performance.

Our system comprises the following components:

• Situation-calculus Markov decision process (CMDP), a new variant of factored MDP that compactly represents the relevant action effects in a dynamic environment;
• A collection of views, which summarize the experiences learned from constructing transition models in different environments based on multinomial logistic regression; the views are scored based on their relevance in the new environments;
• A view or transition model learning algorithm, multinomial Dual Averaging with Group Lasso (mDAGL), that supports online feature selection and relevance-focused updating;
• A new reinforcement learning algorithm, logistic regression RL (loreRL), that applies mDAGL on a CMDP to learn the transition models by automatically focusing on the relevant features in the environment; and
• A new transfer learning algorithm, Transfer ExpectationS (TES), that may utilize the previously learned transition models to improve reinforcement learning in new environments.

The rest of the paper is organized as follows. We will next formalize the problem and describe the method of online feature selection and collecting views into a library. We will then present an efficient implementation of the proposed transfer learning technique. After discussing related work, we will demonstrate the effectiveness of our system through a set of simulated experiments and a real robotic application. We will conclude by discussing the lessons learned and the implications for future robotic research.

2. Background

A task or a problem $M$ for an intelligent agent or robot to solve, e.g., navigate to the master bedroom or fetch a cup from the kitchen table, is typically modeled as a Markov decision process (MDP) defined by a tuple $M = (S, A, T, R)$, where $S$ is a set of states; $A$ is a set of actions; $T : S \times A \times S \to [0, 1]$ is a transition function or model, such that $T(s, a, s') = P(s' \mid s, a)$ indicates the probability of transiting to a state $s'$ upon taking an action $a$ in state $s$; $R : S \times A \to \mathbb{R}$ is a reward function or model indicating immediate expected reward after an action $a$ is taken in state $s$. In RL, the state-action space $S \times A$ defines the domain of the task, the transition model $T$ and the reward model $R$ define the objective of the task [21]; the transition model $T$ and (sometimes) the reward model $R$ are unknown. The main challenge of transfer learning in RL is in transferring across heterogeneous domains, i.e., where the state-action spaces, and possibly the feature spaces, are different.

For each task $M$, the task or problem solution is to find a policy $\pi$ that specifies an action to perform in each state so that the expected accumulated future reward (with possibly higher weights for more recent rewards) for each state is maximized [6]. The optimal policy $\pi$ is usually derived by a model-based or a model-free approach in RL; the latter does not consider the transition model $T$ in deriving the solution. The two RL approaches are based on different assumptions, problem characteristics, and constraints. We adopt the model-based approach as it is easier to incorporate domain or expert knowledge into the transition and reward models; knowledge transfer across different environments can also be facilitated through the models.

In model-based RL, the optimal policy is derived by first estimating the transition model $T$ (and the reward model $R$, if necessary) through interacting with the environment. This study focuses on learning the models, and remains relatively agnostic about the actual planning techniques. However, since the purpose is to build a model about world dynamics, planning systems that rely on simulators appear most natural.
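The estimate-then-plan pattern of model-based RL can be illustrated with a minimal tabular sketch. It is not the method of this paper: it assumes a small discrete state space, a known reward function, frequency counts for $T$, and value iteration as the planner. The function names, the discount 0.9, and the two-state example are all hypothetical.

```python
from collections import defaultdict


def estimate_transitions(samples):
    """Maximum-likelihood estimate of T(s, a, s') from observed (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(float))
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1.0
    return {
        key: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
        for key, nxt in counts.items()
    }


def value_iteration(states, actions, T, R, gamma=0.9, sweeps=200):
    """Plan on the estimated model; unseen (s, a) pairs are treated as self-loops."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        dist = T.get((s, a), {s: 1.0})
        return R(s, a) + gamma * sum(p * V[s2] for s2, p in dist.items())

    for _ in range(sweeps):
        V = {s: max(q(s, a) for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    return V, policy


# Hypothetical two-state task: reward 1 is collected while in state 1.
samples = [(0, "go", 1)] * 3 + [(0, "go", 0), (0, "stay", 0), (1, "stay", 1), (1, "go", 0)]
T_hat = estimate_transitions(samples)
V, policy = value_iteration([0, 1], ["stay", "go"], T_hat,
                            lambda s, a: 1.0 if s == 1 else 0.0)
```

In this toy task the planner prefers "go" in state 0 and "stay" in state 1, because the estimated model says "go" reaches the rewarding state three times out of four.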

3. A multi-view transfer learning framework

A key idea of this work is that the agent can represent the world dynamics from its sensory state space in different ways. Such different views correspond to the agent's decisions to focus attention on only some features of the state in order to quickly approximate the state transition function. In other words, a view represents an expectation of the agent about the transition dynamics resulting from one (or several) of its actions in the task environment. In a new task, the agent will select appropriate views to solve the task, and to learn new views if the environment is novel. Fig. 1 illustrates the workflow of our life-long learning agent. Parts of this work have been published in [22–24].

Fig. 1. Our life-long learning agent.

Algorithm 1 Overview of the proposed learning framework.

Input: View library L.
Output: Updated view library L.

Initialize the system for a task
    L_H: historical record of how good each view was in previous tasks
    L ← L    /* A working copy of L to find the dynamics of the current environment */
    B ← initialize belief of how well each view can approximate the current environment dynamics based on L_H.
/* while interacting with the environment */
for t = 0, 1, 2, ... do
    Select views to approximate the world transition dynamics
        {W} ← select the most promising views from L based on B
    Interact with environment based on that approximate model
        π ← plan an action policy based on {W}
        s_t: current state in the environment
        a_t ← choose an action to perform in s_t according to action policy π
        Perform action a_t and observe feedback: action outcomes from the environment
    Score all views with the new feedback
        S ← score all views in L with the new feedback
        B ← update belief about the views in L based on the score S
    Adjust all views with new feedback
        L ← Adapt all views toward this current environment based on the new feedback
    Break when the task ends
end for
Update view library L
    L ← Update the view library, e.g.
        If a view W is different from existing views in L,
        which means the transition dynamics of the corresponding action is new,
        then add that view W to L;
        else replace the old view in L with the newly updated view W.

We sketch the overall behavior of our agent in Algorithm 1.

When the agent does not have any initial information about the transition dynamics of the environment in a new task, it selects “expectations” or views based on history that tells how well each view in the library has worked in previous tasks. We assume that the more frequently a view has worked in the past, the more likely it will work again in a new task.

The agent then operates according to the policy learned based on the transition model built from the selected views, assuming that the reward model is known. However, since the views have been selected without any reference to the actual characteristics of the new environment, it is highly likely that these views are inappropriate for the current task. In other words, the “expected” transition model may not correctly approximate the true transition dynamics of the current environment. The resulting policy may just perform poorly, leading to negative rewards.

To limit such negative transfer effects, our agent exploits the outcomes or feedback for each of its actions, in terms of state changes and rewards, to score all the views in the library. The score estimates the capability of each view in approximating the transition dynamics in the current environment; it is a primary criterion to re-select the views for subsequent decisions.

The new task and/or environment, however, may actually be very different from all the past experiences of the agent. In this case, none of the recorded views can capture or adapt to the new transition dynamics. Hence, environment feedback is also used to develop and incorporate new views into the library. These steps are repeated until a stopping criterion is met.

Fig. 2. Different ways to decompose a state transition function.

At the end of a task, the selected views will be recorded as new additions in the library if the transition dynamics captured is significantly different from the existing views. Otherwise, the agent would just update the existing views accordingly. Views that have not been used for a “long time” will be pruned to manage the library size.

We address three main technical challenges in this framework: First, the transition model $T(S, A, S)$ is task specific, which is probably a reason why there have not been many studies that transfer the transition model. Second, learning or updating a view or a transition model online in a complex and feature-rich environment is computationally expensive. Third, the view scoring method must be simple to be calculated online, based on feedback from the environment. The view library also needs to be efficiently updated.

3.1. Situation calculus MDP: CMDP

In a factored MDP, each state $s$ is represented by a vector of $n$ state-attributes $s_i$, $i = 1, \ldots, n$. Each state attribute is a random variable that can take on multiple values; the set of states is defined by the Cartesian product of the domains of the $n$ state attributes. For example, the binary state attributes Water (with values present and absent) and Martians (with values present and absent) will define all the states that describe the “interestingness” of a particular location on Mars: {(water, Martians), (water, no-Martians), (no-water, Martians), (no-water, no-Martians)}.
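As a small illustration of the Cartesian-product construction, the four Mars states can be enumerated directly. The dictionary layout and the variable names are only an assumed encoding for this example.

```python
from itertools import product

# Each state attribute maps to its domain of values (hypothetical encoding).
attributes = {
    "Water": ("water", "no-water"),
    "Martians": ("Martians", "no-Martians"),
}

# The state space is the Cartesian product of the attribute domains.
states = list(product(*attributes.values()))
```

With $n$ binary attributes the same construction yields $2^n$ states, which is why informative features are kept out of the state-identifying attributes in the representation that follows.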

The transition function or model for the factored states is commonly expressed in dynamic Bayesian networks (DBNs), a temporal variant of Bayesian networks. In the DBN representation of the transition model, $T(s, a, s') = \prod_{i=1}^{n} P(s'_i \mid \mathrm{Pa}^a_i(s), a)$, where $s'_i$ is a state-attribute, $\mathrm{Pa}^a_i$ indicates a subset of state-attributes in $s$ called the parents of $s'_i$ (Fig. 2a), i.e., the attributes of $s$ from which there are arcs to $s'_i$ in the DBN. Learning $T$ requires learning the subsets $\mathrm{Pa}^a_i$ and the parameters for conditional distributions, or the DBN local structures. A critical issue in this MDP formulation is the ambiguous definitions of the factored attributes or variables. The state-attributes or features serve to both define the state space and capture information about the transition model, even if these two purposes can be very different. For example, two state-attributes, the (x, y)-coordinates, uniquely identify a state and compactly factorize the state space in a grid-world. A policy can be learned on this factored space. The state transition dynamics, however, may depend on other features of the state, such as the surface material at the location (state), the presence or absence of Martians, etc. Such features are often included in the state representations. While essential in formulating the transition or reward models, these features may complicate the planning or learning processes by increasing the size and complexity of the state space.

We present a variant of the factored MDP that defines a “compact but comprehensive” factorization of the transition function and supports efficient learning of the relevant features. We separate the state identifying state-attributes from the “merely” informative state-features in our representation (see Fig. 2b). This way, we can apply an efficient feature selection method on a large number of state features to capture the transition dynamics, while maintaining a compact state space.

Similar to the approach proposed by Konidaris and Barto [25], the state attributes could be defined in terms of the agent-space representation based on the capabilities afforded by the sensors or actuators of the robot, e.g., the (x, y)-coordinates of the robot location. The state features, on the other hand, could be defined in terms of the problem-space representation based on the domain or environmental characteristics of the task, e.g., the terrain or wetness of the floor surface, which may or may not be accurately detectable by the agent or robot.

Learning DBN structures of the transition function online, i.e., while the agent is interacting with the environment, is still computationally prohibitive in most domains. On the other hand, recent studies [26,27] have shown encouraging results in learning the structure of logistic regression models, which can effectively approximate the local structures in DBNs even in high dimensional spaces. While these regression models cannot fully capture the conditional distributions, their expressive power can be improved by augmenting the low dimensional state representation with non-linear features of the state vectors. We will introduce an online sparse multinomial logistic regression method that supports efficient learning of the structured representation of the transition function in Section 4.

In addition, we predict the relative changes of states instead of directly specifying the next state-attributes in a transition (see Fig. 2c). In RL, an action will stochastically create an effect that determines how the current state changes to the next one [28,10,17]. Mediating state changes via action effects is a common strategy in the classic situation calculus [29]. Since the number of relative changes or action effects is usually much smaller than the size of the state space, or the size of state-attribute domains, the corresponding prediction task should be easier. The learning problem can then be expressed as a multi-class classification task of predicting the action effects.

Usually, the transition functions and reward functions are defined in terms of the state features, i.e., aspects of the state representation that help in predicting the action effects. In a specific task or environment, the same action in different states (e.g., different locations) with the same features tends to yield the same action effects (i.e., relative changes in the states).

Formally, a situation calculus MDP (CMDP) is defined by a tuple $M = (S, f, A, T, E, R, \gamma)$, where $S, A, T, R, \gamma$ have the same meaning as in a regular MDP. $S = \langle S_1, S_2, \ldots, S_n \rangle$ is the state space implicitly represented by vectors of $n$ state-attributes. The function $f : S \to \mathbb{R}^m$ extracts $m$ state-features from each state. $E$ is an action effect variable such that the transition function can be factored as

$$T(s, a, s') = P(s' \mid s, a) = \prod_{i=1}^{n} P(s'_i \mid s, f(s), a) = \prod_{i=1}^{n} \sum_{e \in E} P(s'_i \mid e, s)\, P(e \mid s, f(s), a). \tag{1}$$

Fig. 2c shows an example of this decomposition. The agent uses the feature function $f$ to identify the relevant features, and then uses both state attributes and features to predict the action effects. We also assume that the effect $e$ and current state $s$ determine the next state $s'$, thus $P(s' \mid e, s)$ is either 0 or 1. This defines the semantic meaning of the effect which is assumed to be known by the agent. The remaining task is to learn $P(e \mid s, a) = P(e \mid x(s), a)$, where $x(s) = (s, f(s))$ is a vector containing both the state attributes and the state features. Assuming the effect $e$ is discrete, learning $P(e \mid x(s), a)$ is a classification problem. In Section 4 we will show how to solve this problem by using multinomial logistic regression methods.
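The role of the deterministic term $P(s' \mid e, s)$ can be sketched as follows. The example simplifies Equation (1) by treating an effect as acting on the whole state at once rather than on each attribute, and it assumes a hypothetical 3 x 3 grid where effects are relative moves clipped at the border; the effect probabilities are made up for illustration.

```python
def apply_effect(state, effect, width=3, height=3):
    """Deterministic P(s' | e, s): move by the relative change, clipped to the grid."""
    x, y = state
    dx, dy = effect
    return (min(max(x + dx, 0), width - 1), min(max(y + dy, 0), height - 1))


def next_state_distribution(state, effect_probs):
    """Sum P(e | x(s), a) over all effects e that lead to the same next state."""
    dist = {}
    for effect, p in effect_probs.items():
        s_next = apply_effect(state, effect)
        dist[s_next] = dist.get(s_next, 0.0) + p
    return dist


# Hypothetical effect distribution for a "move east" action on a slippery surface.
effect_probs = {(1, 0): 0.7, (0, 0): 0.2, (0, 1): 0.1}
dist = next_state_distribution((2, 0), effect_probs)
```

At the eastern border the effects "move east" and "no change" lead to the same next state, so their probabilities add up. Only the effect distribution has to be learned; the mapping from effects to next states is assumed to be known.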

3.2. Transferring expectations: TES

In our framework, the knowledge gathered and transferred by the agent is collected into a library $\mathcal{T}$ of online effect predictors or views. A view consists of a structure component $\bar{f}$ that picks the features which should be focused on, and a quantitative component $\theta$ that defines how these features should be combined to approximate the distribution of action effects. Formally, a view is defined as $\tau = (\bar{f}, \theta)$, such that $P(E \mid S, a; \tau) = P(E \mid \bar{f}(S), a; \theta) = \tau(S, a, E)$, in which $\bar{f}$ is an orthogonal projection of $x(s)$ to some subspace of $\mathbb{R}^m$, where $m$ is the dimension of the subspace. Each view $\tau$ is specialized in predicting the effects of one action $a(\tau) \in A$ and it yields a probability distribution for the effects of the action $a$ in any state. This prediction is based on the features of the state and the parameters $\theta(\tau)$ of the view that may be adjusted based on the actual effects observed in the task environment.

We denote the subset of views that specify the effects for action $a$ by $\mathcal{T}^a \subset \mathcal{T}$. The main challenge is to build and maintain a comprehensive set of views that can be used in new environments likely resembling the old ones, but at the same time allow adaptation to new tasks with completely new transition dynamics and feature distributions.

At the beginning of every new task, the existing library is copied into a working library which is also augmented with fresh, uninformed views, one for each action, that are ready to be adapted to new tasks. We then select, for each action, a view with a good “track record”, i.e., it has been “used” or applied many times or with a high recency-weighted score. This view is used to estimate the optimal policy based on the transition model specified in Equation (1), and the policy is used to pick the first action $a$. The action effect is then used to score all the views in the working library and to adjust their parameters. In each round the selection of views is repeated based on their scores, and the new optimal policy is calculated based on the new selections. At the end of the task, the actual library is updated by possibly recruiting the views that have “performed well” and retiring those that have not. A more rigorous version of the procedure is described in Algorithm 2.

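3.2.1. Scoring the views

To assess the quality of a view $\tau$, we measure its predictive performance by a cumulative log-score. This is a proper score [30] that can be effectively calculated online.

Given a sequence $D^a = (d_1, d_2, \ldots, d_N)$ of observations $d_i = (s_i, a, e_i)$ in which action $a$ has resulted in effect $e_i$ in state $s_i$, the score for an $a$-specialized $\tau$ is

$$S(\tau, D^a) = \sum_{i=1}^{N} \log \tau(s_i, a, e_i; \theta_i(\tau)),$$

where $\tau(s_i, a, e_i; \theta_i(\tau))$ is the probability of effect $e_i$ given by the view or effect predictor $\tau$ based on the features of state $s_i$ and the parameters $\theta_i(\tau)$ that may have been adjusted using previous data $(d_1, d_2, \ldots, d_{i-1})$.

Algorithm 2 TES: Transferring Expectations using a library of views.

Input: $\mathcal{T} = \{\tau_1, \tau_2, \ldots\}$: view library; $\mathrm{CMDP}_j$: a new $j$th task; $\Phi$: view goodness evaluator
Let $\mathcal{T}_0$ be a set of fresh views, one for each action
$\mathcal{T}_{tmp} \leftarrow \mathcal{T} \cup \mathcal{T}_0$    /* the working library for the task */
for all $a \in A$ do $\hat{T}[a] \leftarrow \arg\max_{\tau \in \mathcal{T}^a} \Phi(\tau, j)$ end for    /* selecting views */
for t = 0, 1, 2, ... do
    $a_t \leftarrow \hat{\pi}(s_t)$, where $\hat{\pi}$ is obtained by solving MDP using transition model $\hat{T}$
    Perform action $a_t$ and observe effect $e_t$
    for all $\tau \in \mathcal{T}^{a_t}_{tmp} \cup \mathcal{T}^{a_t}$ do $\mathrm{Score}[\tau] \leftarrow \mathrm{Score}[\tau] + \log \tau(s_t, a_t, e_t)$ end for
    for all $\tau \in \mathcal{T}^{a_t}_{tmp}$ do Update view $\tau$ based on $(f(s_t), a_t, e_t)$ end for
    $\hat{T}[a_t] \leftarrow \arg\max_{\tau \in \mathcal{T}^{a_t}_{tmp}} \mathrm{Score}[\tau]$    /* selecting views */
end for
for all $a \in A$ do
    $\tau^* \leftarrow \arg\max_{\tau \in \mathcal{T}^a_{tmp}} \mathrm{Score}[\tau]$;
    $\mathcal{T}^a \leftarrow$ growLibrary($\mathcal{T}^a$, $\tau^*$, Score, $j$)    /* updating library */
end for
if $|\mathcal{T}| > M$ then $\mathcal{T} \leftarrow \mathcal{T} - \{\arg\min_{\tau \in \mathcal{T}} \Phi(\tau, j)\}$ end if    /* pruning library */

Algorithm 3 Grow sub-library $\mathcal{T}^a$.

Input: $\mathcal{T}^a$, $\tau^*$, Score, $j$: task index; $c$: constant; $H_{\tau^*} = \{\}$: empty history record
Output: updated library subset $\mathcal{T}^a$ and winning histories $H_{\tau^*}$
case $\tau^* \in \mathcal{T}^a_0$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$    /* add newbie to library */
otherwise do
    Let $\bar{\tau} \in \mathcal{T}$ be the original, not adapted version of $\tau^*$
    case $\mathrm{Score}[\tau^*] - \mathrm{Score}[\bar{\tau}] > c$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$
    otherwise do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\} - \{\bar{\tau}\}$
    $H_{\tau^*} \leftarrow H_{\bar{\tau}}$    /* inherit history */
$H_{\tau^*} \leftarrow H_{\tau^*} \cup \{j\}$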
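The online scoring and re-selection of views can be sketched with a deliberately simple stand-in for a view. The class `CountingView` below is a hypothetical smoothed frequency table that ignores state features; in the paper a view is a sparse multinomial logistic regression model. The point of the sketch is the order of operations: each view is scored on an observation before it is updated with it, so the cumulative log-score is prequential.

```python
import math


class CountingView:
    """Toy view: a smoothed frequency table over effects, ignoring state features."""

    def __init__(self, effects, prior=1.0):
        self.counts = {e: prior for e in effects}

    def prob(self, state, effect):
        return self.counts[effect] / sum(self.counts.values())

    def update(self, state, effect):
        self.counts[effect] += 1.0


def score_and_select(views, observations):
    """Prequential log-score: predict first, then update, and keep the running sum."""
    scores = {name: 0.0 for name in views}
    for state, effect in observations:
        for name, view in views.items():
            scores[name] += math.log(view.prob(state, effect))
            view.update(state, effect)
    best = max(scores, key=scores.get)
    return scores, best


# An "old" view that expects the agent to stay put, and a fresh uninformed view.
old = CountingView(["move", "stay"])
old.counts["stay"] += 9.0
fresh = CountingView(["move", "stay"])
scores, best = score_and_select({"old": old, "fresh": fresh}, [("s0", "move")] * 5)
```

Here the environment always produces the effect "move", so the fresh view accumulates the higher score and would be selected for planning, which is the mechanism that limits negative transfer.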

3.2.2. Growing the library

After completing a task, the highest scoring views for each action are considered for recruiting into the actual library. The winning “newbies” or new entries are automatically accepted. In this case, the data has most probably come from a distribution that is far from any of the current models; otherwise one of the current models would have had an advantage to adapt and win. In other words, this means that the existing views have generated low or negative rewards in the process of deriving the optimal policy in RL; they have not been used further and a new set of world dynamics has been learned. Instead of trying to directly identify the underlying commonalities between the old and new tasks and/or environments, our framework mitigates potential negative transfer effects by learning the new views or transition dynamics in parallel to exploiting and examining if the existing views can improve learning. This ability is important in real domains where the underlying system dynamics may not be easily determined.

The winners $\tau^*$ that are adjusted versions of old views $\bar{\tau}$ are accepted as new members if they score significantly higher than their original versions, based on the logarithm of the prequential likelihood ratio [31] $\Lambda(\tau^*, \bar{\tau}) = S(\tau^*, D^a) - S(\bar{\tau}, D^a)$. Otherwise, the original versions $\bar{\tau}$ get their parameters updated to the new values. This procedure is just a heuristic and other inclusion and updating criteria may well be considered. The policy is detailed in Algorithm 3.
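A minimal sketch of this recruiting rule for one action is given below. It assumes views are identified by names held in a Python set, that `original` is `None` when the winner started as a fresh view, and that scores are the cumulative log-scores of the task; the bookkeeping of winning histories is omitted. All names are hypothetical.

```python
def grow_library(library, winner, original, score, c):
    """Return the updated set of views for one action.

    library:  set of view names currently stored for this action
    winner:   best-scoring view in the working library
    original: unadapted library view the winner was adapted from, or None
              if the winner started as a fresh view
    score:    dict mapping view name to cumulative log-score
    c:        threshold on the log prequential likelihood ratio
    """
    updated = set(library)
    if original is None:
        updated.add(winner)          # a fresh view won: always recruit it
    elif score[winner] - score[original] > c:
        updated.add(winner)          # clearly better: keep both versions
    else:
        updated.discard(original)    # otherwise the adapted copy replaces
        updated.add(winner)          # the original (a parameter update)
    return updated


library = {"v1"}
scores = {"v1_adapted": -5.0, "v1": -20.0}
kept_both = grow_library(library, "v1_adapted", "v1", scores, c=10.0)
replaced = grow_library(library, "v1_adapted", "v1", scores, c=20.0)
recruited = grow_library(library, "fresh", None, {}, c=10.0)
```

The threshold `c` trades library growth against fidelity: a small value stores many near-duplicate views, while a large value mostly overwrites old views in place.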

3.2.3. Pruning the library

To keep the library relatively compact, a plausible policy is to remove views that have not performed well for a long time, possibly because there are better predictors or they have become obsolete in the new tasks or environments. To implement such a retiring scheme, each view $\tau$ maintains a list $H_\tau$ of task indices that indicates the tasks for which the view has been the best scoring predictor for its specialty action $a(\tau)$. We can then calculate the recency weighted track record for each view. In practice, we have adopted the procedure by Zhu et al. [32] that introduces the recency weighted score at time $T$ as

$$\Phi(\tau, T) = \sum_{t \in H_\tau} e^{-\mu (T - t)},$$

where $\mu$ controls the speed of decay of past success. Other decay functions could naturally also be used. The pruning can then be done by introducing a threshold for recency weighted score or always maintaining the top $m$ views.
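The recency weighted score and the "keep the top views" pruning rule can be computed as in the following sketch. The function names, the decay rate, and the example histories are assumptions made for illustration.

```python
import math


def recency_score(history, now, mu):
    """Sum of exp(-mu * (now - t)) over the task indices t where the view won."""
    return sum(math.exp(-mu * (now - t)) for t in history)


def prune(histories, now, mu, max_views):
    """Keep the max_views views with the highest recency weighted score."""
    ranked = sorted(histories,
                    key=lambda v: recency_score(histories[v], now, mu),
                    reverse=True)
    return ranked[:max_views]


# Hypothetical winning histories: view "a" won early tasks, "b" won recent ones.
histories = {"a": [1, 2, 3], "b": [9, 10], "c": []}
kept = prune(histories, now=10, mu=0.5, max_views=2)
```

With this decay rate, two recent wins outweigh three old ones, so view "b" ranks above view "a", and the view that never won is the one dropped.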

4. A view learning algorithm

In TES, a view can be implemented by any probabilistic classification model that can be quickly learned online. This requirement excludes most of the batch or offline classification methods. In this study, we introduce a scalable online sparse multinomial logistic regression algorithm to incrementally learn a view. The proposed algorithm optimizes an objective function similar to the group lasso [33] which has been recently suggested for efficient feature selection among a very large set of features [27].

Algorithm 4 The mDAGL algorithm.

Input: $\lambda$, $\alpha$
Let $h(W)$ be a strongly convex function with modulus 1
Let $W_1 = W_0 = \arg\min_W h(W)$
Let $\bar{G}_0 = \bar{0}$
for t = 1, 2, 3, ... do
    $(y_t, x_t) \leftarrow$ observe data
    $(W_{t+1}, \bar{G}_t) \leftarrow$ mDAGL-update($t$, $y_t$, $x_t$, $W_t$, $\bar{G}_{t-1}$, $\lambda$, $\alpha$)
end for

Multinomial logistic regression is a simple yet effective classification method. Assuming $K$ classes of $d$-dimensional vectors $x \in \mathbb{R}^d$, we represent each class $k$ with a $d$-dimensional prototype vector $W_k$. Classification of an input vector $x$ is based on how "similar" it is to the prototype vectors. Similarity is measured with the inner product $\langle W_k, x \rangle = \sum_{i=1}^{d} W_{ki} x_i$, where $x_i$ denotes feature $i$. The probability of a class is defined by $P(y = k \mid x; W) \propto e^{\langle W_k, x \rangle}$. The parameter vectors of the model form the rows of a matrix $W = (W_1, \ldots, W_K)^T$.
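As a small illustration of this definition, the class probabilities can be computed by normalizing the exponentiated inner products. The sketch below uses plain Python lists; the function name is an assumption for the example, and subtracting the maximum score is a standard numerical safeguard rather than something stated in the text.

```python
import math

def class_probabilities(W, x):
    """Return P(y = k | x; W) for each row W[k] of the K x d coefficient matrix W."""
    scores = [sum(w_ki * x_i for w_ki, x_i in zip(row, x)) for row in W]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Three classes, two features; only the first feature is active in x.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = class_probabilities(W, [2.0, 0.0])
uniform = class_probabilities([[0.0, 0.0]] * 3, [1.0, 1.0])
```

With an all-zero coefficient matrix every class receives the same probability, which is the starting point of the online learner described later.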

Let $l_t(W^t) = -\log P(y^t \mid x^t; W^t)$ denote the item-wise log-loss of a model with coefficient matrix $W^t$ predicting a data point $(y^t, x^t)$ observed at time $t$. A typical objective of an online learning system is to minimize the total loss by updating its $W^t$ over time. However, the resulting model will often be overly complex and prone to over-fitting. To achieve a parsimonious model, we express our a priori belief that most features are irrelevant or superfluous by introducing a regularization term $\Psi(W) = \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W_{\cdot i}\|_2$, where $\|W_{\cdot i}\|_2$ denotes the 2-norm of the $i$th column of $W$, and $\lambda$ is a positive constant. This regularization is similar to that of group lasso. It communicates the idea that it is likely that a whole column of $W$ has zero values (especially for large $\lambda$). A column of all zeros suggests that the corresponding feature is not necessary for classification.
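The regularization term can be evaluated directly from the column norms of the coefficient matrix. The following is a minimal sketch; the function name is hypothetical and the matrix is stored as a list of K rows.

```python
import math

def group_lasso_penalty(W, lam):
    """lam * sum over features i of sqrt(K) * ||W[:, i]||_2 for a K x d matrix W."""
    K = len(W)
    d = len(W[0])
    col_norms = [math.sqrt(sum(W[k][i] ** 2 for k in range(K))) for i in range(d)]
    return lam * math.sqrt(K) * sum(col_norms)

# The second feature has an all-zero column and contributes nothing to the penalty.
W = [[3.0, 0.0], [4.0, 0.0]]
penalty = group_lasso_penalty(W, lam=0.05)
```

Because the penalty grows with the norm of a whole column rather than with individual entries, it is cheapest to reduce by switching off entire features.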

The objective function can now be written as

$$L^{(T)} = \sum_{t=1}^{T} \left( l_t(W^t) + \Psi(W^t) \right) = \sum_{t=1}^{T} \left( -\log \frac{e^{\langle W^t_{y^t}, x^t \rangle}}{\sum_{k} e^{\langle W^t_k, x^t \rangle}} + \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W^t_{\cdot i}\|_2 \right),$$

where $W^t$ is the coefficient matrix learned using $t - 1$ previously observed data items. The quality of a sequence of parameter matrices $W^t$, $t \in (1, \ldots, T)$, with respect to a fixed parameter matrix $W$ can be measured by the amount of extra loss, or regret

$$R_T(W) = L^{(T)} - L_W^{(T)} = \sum_{t=1}^{T} \left( l_t(W^t) + \Psi(W^t) \right) - \sum_{t=1}^{T} \left( l_t(W) + \Psi(W) \right).$$

We want to learn a series of parameters $W^t$ to achieve small regret with respect to a good model $W$ that has a small loss $L_W^{(T)}$.

4.1. Online learning of multinomial regularized logistic regression with group lasso: mDAGL

Recently, Xiao et al. [26] introduced a dual averaging method for solving lasso online, and Yang et al. [27] extended the work for solving group lasso. The methods are simple, efficient, and scalable for learning the regularized regression models. Following the same approach, we introduce mDAGL (Algorithm 4), a new algorithm to incrementally learn $W$ to optimize the specified objective function in the sparse multinomial logistic regression. Typically, group lasso is used to regularize groups of coefficients where each coefficient corresponds to a particular feature. In our case of multinomial logistic regression, a group comprises the coefficients of a feature; in other words, a group is a column in coefficient matrix $W$.

In the standard online stochastic gradient descent method, after observing the data vector $(y^t, x^t)$, we adjust the parameters of the model toward the directions that maximize the likelihood of the observation (or equivalently, minimize the item-wise log-loss $l_t$). The dual averaging method is a version of the stochastic gradient descent, thus we again have to compute the gradient $G^t$ of the log-loss function for each observation $(y^t, x^t)$. But instead of moving the parameters based on these directions $G^t$, we use $G^t$ to update the average of gradients, $\bar{G}^t$, for observations encountered this far, and move the parameters away from those average directions (away, since we are minimizing). We will next describe the method more formally.

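Algorithm 5 mDAGL-update.
Input: $t$, $y^t$, $x^t$, $W^t$, $\bar{G}^{t-1}$, $\lambda$, $\alpha$
$G^t \leftarrow$ use equation (2) with $(y^t, x^t)$, $W^t$
$\bar{G}^t \leftarrow \frac{t-1}{t} \bar{G}^{t-1} + \frac{1}{t} G^t$
$W^{t+1} \leftarrow$ use equation (A.1) with $\bar{G}^t$, $\beta_t = \alpha \sqrt{t}$, $\lambda$
return $(W^{t+1}, \bar{G}^t)$

Algorithm 6 The loreRL algorithm.
Input: mDAGL regularization parameters $\lambda$, $\alpha$, CMDP variables $S$, $f$, $A$, $E$, $R$, $\gamma$, exploration $\epsilon$.
Let $W = (W_1, W_2, \ldots, W_{|A|}) = (W^0, W^0, \ldots, W^0)$
Let $\bar{G} = (\bar{G}_1, \bar{G}_2, \ldots, \bar{G}_{|A|}) = (0, 0, \ldots, 0)$
$s_0 \leftarrow$ random initial state
for $t = 1, 2, 3, \ldots$ do
    $\pi \leftarrow$ Solve MDP using transition model $T(W)$
    $a \leftarrow \pi(s_t, \epsilon)$    # $\epsilon$-greedy action selection
    Take action $a$ yielding effect $e$, next state $s_{t+1}$
    $(W_a, \bar{G}_a) \leftarrow$ mDAGL($t$, $e$, $x(s_t)$, $W_a$, $\bar{G}_a$, $\lambda$, $\alpha$)
end for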

We initialize the parameters $W$ to a $K \times d$ matrix of all zeros. Let $G^t_{ki}$ be the partial derivatives of function $l_t(W)$ with respect to $W_{ki}$ at $W^t$ ($G^t_{ki} = \frac{\partial l_t}{\partial W_{ki}}(W^t)$). We define $\bar{G}^t$ to be a matrix of average partial derivatives, i.e., $\bar{G}^t_{ki} = \frac{1}{t} \sum_{\tau=1}^{t} G^\tau_{ki}$, where taking the gradient of the loss function $l_\tau$ with respect to the parameter $W_{ki}$ gives us the formula

$$G^\tau_{ki} = -x^\tau_i \left( I(y^\tau = k) - P(k \mid x^\tau; W^\tau) \right). \tag{2}$$

We notice that this partial derivative points either away from or towards the observed feature $x^\tau_i$ depending on whether the observation belongs to the class $k$ or not.
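Equation (2) and the running average of gradients used in Algorithm 5 can be sketched as below. This is an illustrative fragment under the assumption that matrices are stored as lists of rows and classes are indexed from 0; the helper names are not from the paper.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_loss_gradient(W, x, y):
    """G[k][i] = -x[i] * (1{y == k} - P(k | x; W)), following equation (2)."""
    probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
    return [[-xi * ((1.0 if k == y else 0.0) - probs[k]) for xi in x]
            for k in range(len(W))]

def update_average(G_bar, G, t):
    """Running mean: G_bar_t = ((t - 1) / t) * G_bar_{t-1} + (1 / t) * G_t."""
    return [[(t - 1) / t * gb + g / t for gb, g in zip(row_b, row_g)]
            for row_b, row_g in zip(G_bar, G)]

# First observation with all-zero parameters: both classes are equally likely.
W0 = [[0.0, 0.0], [0.0, 0.0]]
G1 = log_loss_gradient(W0, x=[1.0, 0.0], y=0)
G_bar1 = update_average([[0.0, 0.0], [0.0, 0.0]], G1, t=1)
```

The gradient entries for an inactive feature (here the second one) are zero, so features that rarely fire accumulate only a small average gradient.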

For any data observed at time $t$, we use the average gradient $\bar{G}^t$ to update the coefficient matrix $W$ via

$$W^{t+1} = \arg\min_{W} \left\{ \langle \bar{G}^t, W \rangle + \Psi(W) + \frac{\beta_t}{t} h(W) \right\}, \tag{3}$$

where $\beta_t$ is a non-negative, non-decreasing sequence of real numbers, and $\langle \cdot, \cdot \rangle$ denotes an inner product between two matrices; $\langle \bar{G}^t, W \rangle = \sum_{k,i} \bar{G}^t_{ki} W_{ki}$. The first term is minimized by $W$ that points to the direction $-\bar{G}^t$. While the first term prefers long vectors, the regularization term $\Psi(W)$ balances this out. The third term introduces an extra regularization in terms of a strongly convex function $h(W)$ that is needed for convergence and sparsity. In practice we use the squared Frobenius norm $h(W) = \sum_{k,i} W_{ki}^2$ and let $\beta_t$ grow as $\sqrt{t}$ ($\beta_t = \alpha \sqrt{t}$ in Algorithm 5).

Solving the minimization problem above leads to an update rule (Algorithm 5) in which each column of $W^{t+1}$ is a scaled version of the corresponding column of $\bar{G}^t$. Furthermore, if the length of the average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This amounts to feature selection. The full definition and proof of the rule are detailed in Appendix A.
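To make the "scale or truncate" behaviour concrete, the sketch below solves the column-wise minimization in closed form. It is derived here under two assumptions, h(W) equal to the sum of squared entries and beta_t = alpha * sqrt(t); the exact rule used by the authors is equation (A.1) in Appendix A, which is not reproduced in this section, so the constants may differ. The function name is hypothetical.

```python
import math

def mdagl_column_update(G_bar, t, lam, alpha):
    """Column-wise minimizer of <G_bar, W> + lam * sqrt(K) * sum_i ||W[:, i]|| + (beta_t / t) * sum W^2."""
    K, d = len(G_bar), len(G_bar[0])
    gamma = alpha / math.sqrt(t)          # beta_t / t with beta_t = alpha * sqrt(t)
    threshold = lam * math.sqrt(K)
    W = [[0.0] * d for _ in range(K)]
    for i in range(d):
        norm = math.sqrt(sum(G_bar[k][i] ** 2 for k in range(K)))
        if norm <= threshold:
            continue                      # column stays at zero: feature i is dropped
        scale = -(1.0 - threshold / norm) / (2.0 * gamma)
        for k in range(K):
            W[k][i] = scale * G_bar[k][i]
    return W

# Feature 0 has a large average gradient, feature 1 a negligible one.
G_bar = [[-0.5, 0.01], [0.5, -0.01]]
W_next = mdagl_column_update(G_bar, t=4, lam=0.05, alpha=1.5)
```

In this example the first column is kept and points opposite to the average gradient, while the second column is truncated to zero because its norm falls below `lam * sqrt(K)`.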

A regret analysis confirms that the solution will converge and that the average maximal regret asymptotically approaches zero with rate $O(\frac{1}{\sqrt{t}})$. The full analysis is detailed in Appendix B.

4.2. Model-based RL with multinomial logistic regression: loreRL

Our main task is to turn transition model learning into the learning of conditional distributions $P(E \mid s, f(s), a)$ using multinomial logistic regression, for which attention to relevant features can be efficiently implemented online via mDAGL.

The key steps of our method, called loreRL (RL with regularized logistic regression), are presented in Algorithm 6. Inputs to loreRL are the CMDP components (except the transition function), the regularization parameters $\lambda$ and $\alpha$ of the mDAGL algorithm, and the $\epsilon$ that determines the probability of taking a random action. We first initialize the logistic regression parameters $W_a$ and the average gradient matrices $\bar{G}_a$ for each action $a \in A$. We also randomly select a starting state $s_0$.

At each time step, a random action $a$ is chosen with a small probability $\epsilon$, but otherwise we calculate the optimal policy $\pi$ for an MDP with the transition model $T(W)$ that is based on the current effect predictors. While we have used value iteration (like in Rmax) for finding the optimal policy, any other model-based RL technique can be used as well. We do not focus on the planning part of RL here, but search heuristics such as those used in Dyna-Q [34] or Prioritized Sweeping [35] can be deployed for a more scalable algorithm. After performing an action $a$ in state $s_t$ and observing its effect $e$, the experience $(e, s_t, f(s_t))$ will be presented to the mDAGL algorithm that updates the parameter matrix $W_a$ and the gradient matrix $\bar{G}_a$.

As we just do $\epsilon$-greedy random sampling, it is impossible to guarantee PAC convergence to an optimal policy. Assuming that observed data is i.i.d., we can prove that the difference in optimal value functions of two CMDPs with different logistic regression based transition functions is bounded by the difference in their parameters. This leads to a corollary for convergence to a near optimal policy.

Theorem 1 (Difference in value function). Let $M_1 = (S, f, A, T(W^{M_1}), E, R, \gamma)$ and $M_2 = (S, f, A, T(W^{M_2}), E, R, \gamma)$ be two CMDPs with optimal policies $\pi_1$ and $\pi_2$ respectively. Let us denote by $V^M_\pi$ the value function for policy $\pi$ in CMDP $M$. Let

$$\epsilon_1 = 2 \max_{a \in A,\, e \in E} \left\| W_e^{(a),M_1} - W_e^{(a),M_2} \right\|_1 \sup_{s} \|x(s)\|_1,$$

then

$$\max_{s \in S} \left| V^{M_2}_{\pi_1}(s) - V^{M_2}_{\pi_2}(s) \right| \le \frac{2 \gamma V_{\max}\, \epsilon_1}{1 - \gamma},$$

where $W_e^{(a),M_1}$ and $W_e^{(a),M_2}$ refer to the vector of coefficients corresponding to class $E = e$ under action $a$ in model $M_1$ and $M_2$ respectively, $\|\cdot\|_1$ is the 1-norm of a vector, and $V_{\max}$ is the maximum value of any state for any policy in either of the CMDPs.
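For intuition, the right-hand side of the bound in Theorem 1 can be evaluated numerically for two small models. The sketch below is only an illustration: the dictionary layout keyed by (action, effect), the variable name `param_term` for the parameter-difference quantity defined in the theorem, and the replacement of the supremum over states by a maximum over a finite list of feature vectors are all assumptions of this example.

```python
def value_difference_bound(W1, W2, feature_vectors, gamma, v_max):
    """Evaluate 2 * gamma * V_max * param_term / (1 - gamma) for two coefficient sets.

    W1 and W2 map (action, effect) pairs to coefficient lists of equal length.
    """
    max_coef_diff = max(
        sum(abs(a - b) for a, b in zip(W1[key], W2[key])) for key in W1
    )
    max_feature_norm = max(sum(abs(v) for v in x) for x in feature_vectors)
    param_term = 2.0 * max_coef_diff * max_feature_norm
    return 2.0 * gamma * v_max * param_term / (1.0 - gamma)

W1 = {("up", "moved"): [1.0, 0.0]}
W2 = {("up", "moved"): [0.9, 0.0]}
bound = value_difference_bound(W1, W2, [[1.0, 1.0]], gamma=0.5, v_max=1.0)
same = value_difference_bound(W1, W1, [[1.0, 1.0]], gamma=0.5, v_max=1.0)
```

The bound shrinks linearly with the largest coefficient difference and vanishes when the two models share the same parameters.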

By taking $M_2$ to be a CMDP based on the optimal $W^*$ and $M_1$ an estimated CMDP based on mDAGL, we can derive a vanishing bound for the value difference of policies. In case the true transition model is representable by a sparse $W^*$, we would most probably converge to a near optimal policy. The full proof for the theorem is in Appendix C.

When we cannot express the true transition dynamics as logistic regression based on available state features, it is hard to give guarantees of performance. However, we can still have some confidence in doing well. The logistic regression model $P_l^*$ closest (in Kullback–Leibler distance) to the true model $P_{\mathrm{true}}$ (possibly not a logistic regression model) is the one² that has the smallest expected log-loss. While our optimality criterion is the expected regularized log-loss, we expect the regularized log-loss optimal model $P^*$ to be close to $P_l^*$, thus almost as close to $P_{\mathrm{true}}$ as we can get. This relatively small KL-distance can be converted to relatively small distances in actual transition probabilities, which can then further be converted to a relatively small bound on value differences by the same arguments used in proving Theorem 1 (in Appendix C). Therefore, since our model would very likely converge close to $P^*$, we can expect to do almost as well as $P^*$.

5. Experiments

We examine the performance of our expectation transfer algorithm TES that transfers views to speed up the learning process across different complex, feature-rich, heterogeneous, and dynamic environments. We show that TES can efficiently:

1. learn the appropriate views online;

2. select views using the proposed scoring metric;

3. achieve a good jumpstart; and

4. perform well in the long run.

We first evaluate our approach in a simulated navigation domain where the assumptions hold. We then conduct a case study in a real robotic domain to see if the theoretical results are useful in practice.

5.1. Simulated robot navigation

Environment. We consider a grid-based robot navigation problem in which each grid-cell has the surface of either sand, soil, water, brick, or fire. In addition, there may be walls between cells. The surfaces and walls determine the stochastic dynamics of the world. However, the agent also observes numerous other features in the environment. The agent has to learn to focus on the important features to learn the environment dynamics model, and consequently to achieve its goal.

Action, states, and rewards. The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it or leave it in its current state. Effects of the actions are captured in five outcomes (moved up, left, down, right, did not move). The states are defined by the (x, y)-coordinates of the agent. The agent spends 0.01 units of energy to perform an action. It loses 1 unit if falling into a state of fire, but gains 1 unit when successfully reaching an exit door.

Goal. The goal is to reach any exit door in the world consuming as little energy as possible. A task ends when the agent reaches a terminal state, i.e., any exit door or state with fire.

Tasks. We design fifteen tasks with grid sizes ranging from 20 × 20 to 30 × 30. Each task has a different state space and different terminal states. Each state (cell) is characterized by its surface materials and the walls around it, and 200 additional irrelevant, random binary features. The tasks may involve different dynamics as well as different distributions of the surface materials. In our experiments, the environment transition dynamics is generated using three different sets of multinomial logistic regression models; each combination of cell surfaces and walls around the cell will lead to different transition dynamics at the cell. The probability of going through a wall is rounded to zero and the freed probability mass is evenly distributed to other effects. The agent's starting position is randomly picked in each episode.

² Such model may not always exist since the parameter set is open. However, for our argument, any model with almost infimum distance to the true

Fig. 3. Accumulated rewards in a 900 state CMDP for various model-based RL-methods.

Table 1
Average running time per episode in 800 episodes when acting in an environment with 210 features. (Slow RL-DT, LSE-Rmax could only be run with 10 features.) Run on Intel Xeon CPU 2.13 GHz, 32 GB RAM.

Algorithm     fRmax   fEpsG   RL-DT   LSE-R.   bloreRL   loreRL
Time (sec.)   0.26    0.25    9.09    67.53    4.3       0.55

Transfer learning. The maximum size $M$ of the view library, initially empty, is set to be 20; threshold $c = \log 300$. In a new environment, the TES-agent mainly relies on its transferred knowledge. However, we allow some $\epsilon$-greedy exploration with $\epsilon = 0.05$. The parameters for the view learning algorithm are $\lambda = 0.05$, $\alpha = 1.5$. We conduct a leave-one-out cross-validation experiment with fifteen different tasks. In each scenario the agent is first allowed to experience fourteen tasks, over 100 episodes in each, and it is then tested on the remaining task. No recency weighting is used to calculate the goodness of the views in the library. We next discuss experimental results averaged over 20 runs showing 95% confidence intervals for some representative tasks.

5.1.1. Online view learning in feature-rich environments

We show the empirical evaluation results of loreRL in a 900 cell/state world. We aim to demonstrate that the "single expectation" model-based RL, loreRL, can a) learn views that generalize and approximate the transition model to achieve fast convergence to a near optimal policy, and b) with feature selection, perform well in complex, feature-rich environments. We also want to see if the theoretical promises derived under the assumption of i.i.d. sampling can be realized in practice. We compare accumulated rewards of loreRL with factored Rmax (fRmax), in which the network structures of transition models are known [36], and with factored $\epsilon$-greedy (fEpsG), in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy. We also compare our method with RL-DT [37] and LSE-Rmax [38], which are the state of the art model-based RL algorithms for learning transition models. For these tests, we run loreRL with $\alpha = 1.5$, $\lambda = 0.05$, $\gamma = 0.95$, exploration $\epsilon = 0.05$, parameter $m = 10$ for fRmax, $m = 5$ for Rmax ($m = 5$ is small for Rmax, but increasing it did not yield better results), fixed $m = 10$, $\sigma = 0.99$ for LSE-Rmax (values originally used in [38]).

Generalization and convergence. We first show that when the feature space is small, loreRL performs as efficiently as the state of the art methods. RL-DT employs a decision tree to generalize transition dynamics knowledge over states, but it is implemented with an $\epsilon$-greedy exploration strategy. LSE-Rmax appears to be the best structure learning method for ergodic factored MDPs [38]. fRmax and fEpsG have correct DBN structures provided by an oracle. All the methods are implemented with our customized DBN to incorporate domain knowledge. Rmax is included as a reference to show the effect of knowledge generalization.

As seen in Fig. 3a, loreRL can approximate the world dynamics using samples in all the states, thus it converges as fast as fEpsG and RL-DT to a near optimal policy. Although fRmax is provided with the correct DBN structure, its accumulated rewards are lower due to aggressive exploration to find the optimal model. After exploration the policy is guaranteed to be near optimal, but it may still take a long time (or forever) to catch up with loreRL. While LSE-Rmax follows the Rmax scheme, it starts with a simple model and explores a bit less aggressively than fRmax, gaining some advantage in early episodes. However, LSE-Rmax appears to require much more data to choose a more complex model. Its accumulated reward drops below fRmax after 150 episodes, and the angle of the curve suggests that its DBN structure is still not correct. We do not run LSE-Rmax for more episodes, as the algorithm is computationally very demanding (Table 1).

When the feature set includes many irrelevant features (Fig. 3b), loreRL is able to learn the relevant ones and still gain nearly as high accumulated rewards as fEpsG, which has relevant features provided by an oracle. loreRL's running time is also not much longer than fRmax's or fEpsG's (Table 1). Other methods are too slow to run in this high-dimensional environment. These results also suggest that with $\epsilon$-greedy exploration and random restarts, a near optimal policy can be found even without i.i.d. data sampling.

Fig. 4. Performance difference to TES in early trials in a) same dynamics, b) heterogeneous environments. c) Convergence.

Table 2
Cumulative reward after first episodes. For example, in Task 1 TES could save (0.616 − 0.113)/0.01 = 50.3 actions compared to LWT.

Methods  Tasks
         1       2       3       4       5       6       7       8       9       10      11      12      13      14      15
loreRL   −0.681  −0.826  −0.814  −1.068  −0.575  −0.810  −0.529  −0.398  −0.653  −0.518  −0.528  −0.244  −0.173  −1.176  −0.692
LWT      0.113   −0.966  −0.300  0.024   −1.205  −0.345  −1.104  −1.98   −0.057  −0.664  −0.230  −1.228  0.034   0.244   −0.564
TES      0.616   −0.369  0.230   −0.044  −0.541  −0.784  −0.265  0.255   0.001   −0.298  −1.184  −0.077  0.209   0.389   −0.407

Feature selection. To understand the role of feature selection, we compare loreRL with a bloreRL that is based on multinomial logistic regression without feature selection (without the regularization term). fEpsG and fRmax are baselines.

Fig. 3b shows the accumulated rewards when the environment has 200 irrelevant binary features. As seen, loreRL is still able to quickly converge to the optimal policy, and outperforms fRmax and bloreRL. Fig. 3c shows performances of loreRL and bloreRL after 800 episodes as a function of the number of irrelevant features. Only minimally affected by the actual number of irrelevant features, loreRL can quickly select the relevant features and outperform bloreRL. loreRL does not lose much to fEpsG either. While fRmax may find an optimal policy before loreRL due to aggressive exploration, its accumulated rewards are still lower than loreRL's. We also observe that loreRL, through selecting a small set of features, runs much faster than bloreRL (Table 1).

5.1.2. View selection and multi-view transfer in complex environments

Transferring expectations between same dynamics tasks. To ensure that TES is capable of basic model transfer, we first evaluate it on a simple task. We train and test TES on two environments which have the same dynamics and 200 irrelevant binary features that challenge the agent's ability to learn a compact model for transfer. Fig. 4a shows how much the other methods lose to TES in terms of accumulated rewards in the test task. loreRL is an implementation of TES equipped with the view learning algorithm that does not transfer knowledge. fRmax is the factored Rmax in which the network structures of transition models are provided by an oracle [36]; its parameter $m$ is set to be 10 in all the experiments. fEpsG is a heuristic in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy ($\epsilon = 0.1$). The results show that these oracle methods still have to spend time to learn the model parameters, so they gain lower accumulated rewards than TES. This also suggests that the transferred view of TES is likely not only compact but also accurate. Fig. 4a further shows that loreRL and fEpsG are more effective than fRmax in early episodes.

View selection vs. random views. Fig. 4b shows how different views lead to different policies and accumulated rewards over the first 50 episodes in a given task. The Rands curves show the accumulated reward differences with respect to TES when the agent follows some random combinations of views from the library. For clarity we show only 5 such random combinations. For all these curves, the differences quickly turn negative in the beginning, indicating less reward in early episodes. We conclude that our view selection criterion outperforms random selection.

Multiple views vs. single view, and non-transfer. We compare the multi-view learning TES agent with a non-transfer agent loreRL, and an LWT [39] agent that tries to learn only one good model for transfer. We also compare with the oracle method fEpsG. As seen in Fig. 4b, TES outperforms LWT which, due to differences in the tasks, also performs worse than loreRL. When the earlier training tasks are similar to the test task, the LWT agent performs well. However, the TES agent also quickly picks the correct views, thus we never lose much but often gain a lot. We also notice that TES achieves higher accumulated rewards than loreRL and fEpsG, which are bound to make uninformed decisions in the beginning.

We also notice that due to its fast capability of capturing the world dynamics, TES running time is just slightly longer than LWT's and loreRL's, which do not perform extra work for view switching but need more time and data to learn the dynamics models.

5.1.3. Jumpstart

Table 2 shows the average cumulative reward after the first episode (the jumpstart effect) for each test task in the leave-one-out cross-validation. We observe that TES usually outperforms both the non-transfer and the LWT approach.
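The conversion from reward differences to saved actions used in the caption of Table 2 can be reproduced with a few lines of arithmetic. The sketch below copies the TES and LWT rows of Table 2 and uses the per-action energy cost of 0.01 from the simulated domain; the function and variable names are chosen for this example only.

```python
ACTION_COST = 0.01  # energy spent per action in the simulated domain

# First-episode cumulative rewards for tasks 1..15, copied from Table 2.
TES = [0.616, -0.369, 0.230, -0.044, -0.541, -0.784, -0.265, 0.255,
       0.001, -0.298, -1.184, -0.077, 0.209, 0.389, -0.407]
LWT = [0.113, -0.966, -0.300, 0.024, -1.205, -0.345, -1.104, -1.98,
       -0.057, -0.664, -0.230, -1.228, 0.034, 0.244, -0.564]

def actions_saved(reward_a, reward_b, action_cost=ACTION_COST):
    """Reward difference expressed as a number of elementary actions."""
    return (reward_a - reward_b) / action_cost

saved_task1 = actions_saved(TES[0], LWT[0])
tes_wins = sum(1 for tes, lwt in zip(TES, LWT) if tes > lwt)
```

Counting the tasks in which the TES entry exceeds the LWT entry gives a quick check of the "usually outperforms" statement for this pair of methods.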

Fig. 5. Three different environments. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

5.1.4. Convergence

To study the asymptotic performance of TES, we compare with the oracle method fRmax, which is known to converge to a (near) optimal policy. Notice that in this feature-rich domain, fRmax without the pre-defined DBN structure is just similar to Rmax. Therefore, we also compare with Rmax. For Rmax, the number of visits to any state before it is considered "known" is set to 5, and the exploration probability $\epsilon$ for known states starts to decrease from value 0.1.

Fig. 4c shows the accumulated rewards and their statistical dispersions over episodes. Average performance is reflected by the angles of the curves. As seen, TES can achieve a (near) optimal policy very fast and sustain its good performance over the long run. It is only gradually caught up by fRmax and Rmax. This suggests that TES can successfully learn a good library of views in heterogeneous environments and efficiently utilize those views in novel tasks.

5.2. Real robot navigation

The theoretical analyses presented above have shown the advantages of loreRL and TES over the state of the art model-based RL algorithms. We have also demonstrated the efficiency of our methods through a set of empirical evaluations in simulated domains. These results, though valuable, are obtained under assumptions that are favorable for our approach. In this section, we aim to further evaluate the proposed methods in a real robotic domain where we cannot expect the effects of actions to follow a logistic regression model.

5.2.1. Experiment set-up

Environments. Fig. 5 shows the three environments used in our case studies. They are designed so that the robot's actions would have different effects at different locations, and the environment surfaces are the main factors affecting the action effects. The surfaces are made of various materials such as beans, soil, hay, leaves, shells, paperboard, and nylon Berber carpet. These materials have different physical effects on the objects moving on them. The slopes and obstacles on the surfaces also contribute to the different effects of the actions. In some areas, the surfaces may change because of the robot's actions. We restore the surfaces to the original conditions after every episode.

For a robot to efficiently plan its path in these environments, only a small set of features based on the slopes, the obstacles, and the materials of the surfaces in different areas of the environments are relevant. These features, however, are very hard to define. It is preferable to leave the robot to automatically select the relevant features from a large set of simple features. To test our approach, therefore, we simply draw green and blue marks on the surfaces. The robot is marked with two red marks. There are also a blue ball and several death-marked spots in the environment. Numerous features can be derived from these artifacts. The robot will need to select a few features that may serve as proxies to the true factors that affect its action effects.

Environment 1 and Environment 3 are deliberately designed so that the robot should learn its views (transition model) based on the blue marks. The transition dynamics in these two environments are very similar. However, the two environments are also different (in terms of irrelevant features): the blue balls, the death places, and the green marks are at different locations. We will explain the features in more detail shortly. Environment 2 is very different from Environment 1 and Environment 3. It is designed so that the robot should learn its views based on the green marks instead of the blue ones.

We treat the environments as discrete MDPs. We discretize Environment 2 into a state space consisting of 8 × 8 (x, y)-locations and 8 different orientations of a robot, which yields a state space of 512 states. Environments 1 and 3 are larger, so we discretize them into a state space consisting of 10 × 10 (x, y)-locations and 8 different orientations of a robot. The environments are in two different sizes, 5 × 5 feet and 6 × 6 feet (Fig. 5).

Fig. 6. The robot. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 7.The system architecture.

Robot. We use the LEGO Mindstorms NXT v1.1 kit to build a three-wheel robot as depicted in Fig. 6. Two front wheels of the robot are attached to two separate motors; the back wheel is free rolling. The track width is 11.2 cm. The robot carries a white panel on top with a big and a small red mark for positioning and orientation detection.

The robot system comprises three main components: a central processor, an observatory system, and a command controller (see Fig. 7). The positioning system is a sub-component of the observatory system. Information about the environment and the robot's position is captured by a webcam and sent from the observatory system to the central processor to update the robot's knowledge-base as well as to plan the next action. The action command is then transmitted via Bluetooth to the command controller embedded in the robot for execution. We implement the controller in leJOS.³

Actions. The robot is programmed to rotate its left and right wheels in three different ways, corresponding to three actions. For the first action, the robot rotates both its left and right wheels 246 degrees. For the second action, the robot rotates its left wheel 90 and right wheel −90 degrees at the same time. For the third action, the robot rotates its left wheel −90 and right wheel 90 degrees. As the robot may be still moving after each action, we let the system idle for 200 milliseconds after an action, waiting for the robot to stop completely. These actions, under ideal situations, correspond to the actions of move-forward, turn-left, and turn-right, respectively.

Action effects. Due to inaccurate robot motors, sensors, and various real world factors such as the surface materials, slopes of the surface, and obstacles, the actions may change the robot's relative location in four different ways: moved forward one cell, moved diagonally forward to the cell on the robot's left, moved diagonally forward to the cell on the robot's right, and did not move. The robot's orientation can also be changed in five different ways: turned to the next orientation on the left, the second next orientation on the left, the next orientation on the right, the second next orientation on the right, and did not turn. That results in a total of 20 different effects.
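The effect space described above is the Cartesian product of the location changes and the orientation changes. A minimal sketch of enumerating it follows; the string labels are hypothetical names chosen for this example.

```python
from itertools import product

# Four possible changes of relative location and five possible changes of orientation.
LOCATION_EFFECTS = ["forward", "diagonal_left", "diagonal_right", "no_move"]
ORIENTATION_EFFECTS = ["left_1", "left_2", "right_1", "right_2", "no_turn"]

# Each effect class of the multinomial model is one (location, orientation) pair.
EFFECTS = list(product(LOCATION_EFFECTS, ORIENTATION_EFFECTS))
EFFECT_INDEX = {effect: k for k, effect in enumerate(EFFECTS)}
```

Indexing the pairs in this way gives the class labels that a per-action effect predictor would have to distinguish.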

Sensors. We mainly process information from the web-cam in the observatory system. The web-cam is attached to the ceiling above the area. The robot, therefore, can fully observe the environment. However, the robot can only capture the big and small red marks on the top of the robot itself, and the information of the locations of the green, blue, red marks, and the ball in the environment. As the features are simple, we use the basic algorithms in the OpenCV library⁴ to detect them. The result is nearly perfect.

Fig. 8. Accumulated rewards by various methods.

Table 3
Average running time per episode in 50 episodes. Run on Intel Centrino Duo T2400 (1.83 GHz), 2 × 512 MB RAM.

Algorithm     fRmax   fEpsG   man-loreRL   loreRL
Time (sec.)   13.44   12.77   9.35         10.81

State attributes and features. As mentioned, an environment is discretized to $n$ rows × $m$ columns × $o$ orientations, so the full environment state space can be identified or factorized by those three state attributes. However, these attributes alone do not contain enough information for predicting action effects or transition dynamics. Therefore, it is critical that each state is also defined with a long vector of binary state features. The "green" binary indicator $f_i^G(s)$ of a state $s$ is set to 1 iff there is a green mark that is further than $i$ units but closer than $i + 1$ units from the (x, y)-center of the state $s$ ($i \in \{0, \ldots, 99\}$). A unit equals the width of the environment divided by 100. Similar features are defined for blue marks and for the blue target ball, yielding 300 binary features. Eight indicators for different robot orientations are also included in the feature-base together with four intentionally redundant "there is/is-not a green/blue mark in a state"-bits. All together these yield 312 binary features per state. The intuition behind these features is that they serve as proxies to surface materials, slopes on the surfaces, obstacles, etc., which appear to be important factors in determining the dynamics in the environments, but which the robot's sensors cannot capture. Although only few among these 312 features are important for modeling the robot's actions, the robot does not know which ones actually matter. The robot has to learn to select the relevant features based on feedback while interacting with the environment.
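The construction of the 312-dimensional binary feature vector can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names, the ordering of the feature blocks, and the encoding of the four redundant bits as (green present, green absent, blue present, blue absent) are assumptions.

```python
import math

N_RINGS = 100       # distance bins i = 0, ..., 99
ORIENTATIONS = 8

def ring_indicators(center, marks, unit):
    """bits[i] = 1 iff some mark lies further than i but closer than i + 1 units away."""
    bits = [0] * N_RINGS
    for mx, my in marks:
        dist = math.hypot(mx - center[0], my - center[1]) / unit
        i = math.floor(dist)
        if i < N_RINGS and dist > i:
            bits[i] = 1
    return bits

def state_features(center, orientation, green, blue, ball, unit,
                   green_in_state, blue_in_state):
    features = []
    features += ring_indicators(center, green, unit)    # 100 "green" bits
    features += ring_indicators(center, blue, unit)     # 100 "blue" bits
    features += ring_indicators(center, [ball], unit)   # 100 target-ball bits
    features += [1 if o == orientation else 0 for o in range(ORIENTATIONS)]
    features += [int(green_in_state), int(not green_in_state),
                 int(blue_in_state), int(not blue_in_state)]
    return features

x = state_features(center=(50.0, 50.0), orientation=3,
                   green=[(52.5, 50.0)], blue=[], ball=(50.0, 10.5),
                   unit=1.0, green_in_state=False, blue_in_state=False)
```

Here `unit` would be the environment width divided by 100; the example uses 1.0 so that coordinates are already expressed in units.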

Task. The robot is assumed to know the reward model before any start. The robot's task is to travel in the environments from a random starting point to reach the blue ball, which will earn it a reward of 2 points. The robot will receive −1 point if it falls out of the area or into "death spots" marked with orange rectangles, and −0.05 points for reaching any other states. An episode ends if the robot reaches a terminal state or gets stuck, i.e., could not move, for four consecutive actions. In other words, the robot aims for the highest cumulative reward in each episode. It tries to reach the blue ball as fast as possible, but it will avoid visiting the death spots or moving out of the map. In case it is very costly or impossible to reach the ball, the robot could give up by running into a death spot or moving out of the map.

5.2.2. Online view learning for model-based RL

The robot battery does not allow us to compare our algorithm with the slow RL-DT and LSE-Rmax algorithms, thus we will only compare loreRL with the fine-tuned algorithms, including fRmax, fEpsG, and man-loreRL, in which we manually select important features and specify the DBN-structures for the transition models. man-loreRL is based on multinomial logistic regression models with the 12 manually selected features, including eight indicators for different robot orientations, and four indicators telling if there is/is-not a green/blue mark in a state. We run the experiments with loreRL and man-loreRL setting $\alpha = 0.5$, $\lambda = 0.05$, $\gamma$