
Scalable transfer learning in heterogeneous, dynamic environments


Institutional Knowledge at Singapore Management University

Research Collection School Of Information Systems
School of Information Systems

6-2017

Scalable transfer learning in heterogeneous, dynamic environments

Trung Thanh Nguyen, Adobe Systems
Tomi Silander, Xerox Research Centre Europe
Zhuoru LI, National University of Singapore
Tze-Yun LEONG, Singapore Management University, [email protected]

DOI: https://doi.org/10.1016/j.artint.2015.09.013

Follow this and additional works at: https://ink.library.smu.edu.sg/sis_research
Part of the Artificial Intelligence and Robotics Commons

This Journal Article is brought to you for free and open access by the School of Information Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in Research Collection School Of Information Systems by an authorized administrator of Institutional Knowledge at Singapore Management University. For more information, please [email protected].

Citation
Nguyen, Trung Thanh; Silander, Tomi; LI, Zhuoru; and Tze-Yun LEONG. Scalable transfer learning in heterogeneous, dynamic environments. (2017). Artificial Intelligence. 247, 70-94. Research Collection School Of Information Systems.


Contents lists available at ScienceDirect

Artificial Intelligence

www.elsevier.com/locate/artint

Trung Thanh Nguyen a,*,1, Tomi Silander b,1, Zhuoru Li c, Tze-Yun Leong d,c

a Adobe, 345 Park Avenue, San Jose, CA, USA
b Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France
c School of Computing, National University of Singapore, Singapore
d School of Information Systems, Singapore Management University, Singapore

Article info

Article history:
Received in revised form 17 September 2015
Accepted 29 September 2015
Available online 3 October 2015

Keywords:
Model-based reinforcement learning
Transfer learning
Online feature selection

Abstract

Reinforcement learning is a plausible theoretical basis for developing self-learning, autonomous agents or robots that can effectively represent the world dynamics and efficiently learn the problem features to perform different tasks in different environments. The computational costs and complexities involved, however, are often prohibitive for real-world applications. This study introduces a scalable methodology to learn and transfer knowledge of the transition (and reward) models for model-based reinforcement learning in a complex world. We propose a variant formulation of Markov decision processes that supports efficient online-learning of the relevant problem features to approximate the world dynamics. We apply the new feature selection and dynamics approximation techniques in heterogeneous transfer learning, where the agent automatically maintains and adapts multiple representations of the world to cope with the different environments it encounters during its lifetime. We prove regret bounds for our approach, and empirically demonstrate its capability to quickly converge to a near optimal policy in both real and simulated environments.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Next generation robotics aims at developing autonomous agents with integrative intelligence; these agents or groups of agents can automatically or semi-automatically collect sensor inputs, make decisions, learn new skills, and interact with humans or other agents to complete multiple, complex tasks in different real-world domains [1]. Emerging global research and development trends point to collaborative robots that can work and interact with people [2], dexterous robots with advanced manipulation skills and "soft touches" [3], and cognitive robots that are self-improving, robust, and flexible [4]. In Asia, for example, social and economic demands have led to intensified activities to develop assistive robots to complement a dwindling workforce and healthcare robots to help care for an ageing population [5].

A major challenge in designing intelligent robots is to equip them with the capabilities to effectively use past experiences and at the same time efficiently learn new skills to perform different tasks in different environments. Real-world dynamics are usually uncertain, unstructured, and non-stationary; the tasks, objectives, resources, and conditions of the robots may also change over time. The skills required for successful problem solving and decision making are not easily described as simple rules, guidelines, or processes. A self-learning agent that can select useful environmental features and learn to behave, to a certain degree, based on trial and error is a promising approach to building robots that act in the real world.

* Corresponding author.
E-mail address: [email protected] (T.T. Nguyen).
1 The main part of the work was done while the author was at the National University of Singapore.
http://dx.doi.org/10.1016/j.artint.2015.09.013
0004-3702/© 2015 Elsevier B.V. All rights reserved.
Published in Artificial Intelligence, Volume 247, June 2017, Pages 70-94. http://doi.org/10.1016/j.artint.2015.09.013

1.1. Reinforcement learning in robotics

Reinforcement learning (RL) [6] trains an autonomous agent to behave intelligently based on the feedback it receives when interacting with the environment. An RL task is commonly represented as a Markov decision process (MDP) with a specific domain that includes the relevant actions and states with the defining features, and unknown dynamics (transition) and/or goal or feedback (reward) functions. The main challenges of applying RL in robotic domains include the multidimensional state and action spaces that incur high computational costs, the physical constraints that make acting in the real world much more difficult than in simulated worlds, the uncertainty due to partial observability of the physical environments and inherent noise in the sensor measurements, and the difficulty in tailoring the feedback or reward functions to guide intelligent behavior of the robots [7].

Despite the challenges, RL has recently been applied in complex tasks such as helicopter maneuvering [8], soccer robots [9], and robot navigation [10] with reasonable success. In these cases, careful selections of problem representations, prior experiences, and efficient approximations have rendered RL computationally feasible in the complex settings.

In this work, we design a representation of the world dynamics in model-based RL that allows efficient and scalable approximation of the agent's action effects. At the core of our method is the ability to automatically select the relevant features of the environment that allow the agent to predict how the environment reacts to its actions. This ability to automatically learn to focus attention on the critical factors of the problem is one of the crucial elements needed to make intelligent behavior in complex environments computationally feasible.

The value of online feature extraction is further accentuated in situations where the agent encounters many different environments in its lifetime, and where transfer learning is essential. While manually designing a small set of important features is non-trivial for many real-world tasks, doing so for different environments, some of which may be unknown a priori, is even harder. For example, the Mars exploration rover, Opportunity, is now expected to travel tens of kilometers through different terrains, with possibly varying characteristics and dynamics, to collect scientific objects [11]. Most of the environmental features that it will encounter are unlikely to be specified beforehand.

1.2. Transfer learning in robotic reinforcement learning

In the original RL framework, the agent's knowledge is task and domain specific. A small change in the task or its domain may render the agent's accumulated knowledge useless; costly relearning from scratch is often needed. In transfer learning, an agent applies the knowledge or experience gained in previous (source) tasks to influence and improve the performance of new, related (target) tasks. Transfer in RL assumes that knowledge from one or more source task(s) is used to learn one or more target task(s) faster than if transfer was not used. In contrast to multi-task learning, which assumes that all the tasks are drawn from an underlying, possibly unknown distribution, transfer learning is more general in allowing transfer amongst tasks with heterogeneous domains, dynamics, and goals [12]. Transfer learning is hence most relevant in building lifelong learning agents or robots – agents that can learn, retain, and transfer knowledge acquired from multiple tasks over multiple domains, in a sequential manner, to develop more accurate solutions or policies in learning new tasks over their lifetime [13].

Transfer in RL is different from another main category of transfer learning tasks that focus on classification, regression, and clustering in non-dynamic domains [14]. The main technical challenges involved, however, are similar in identifying the commonalities between the source tasks or domains and the target tasks or domains. Transfer in RL addresses the issues of what to transfer, how to transfer, and when to and when not to transfer to avoid or minimize negative effects on the learning performance in the new environment.

Many existing RL transfer techniques, however, assume the same state-action spaces, i.e., homogeneous domains, in different tasks. This assumption may not work well in real-life applications. For example, many environmental cues that help an agent navigate through a forest are simply missing when the agent tries to navigate at sea. While recent efforts have addressed inter-task transfer in heterogeneous domains or different state-action spaces [15–19], such mappings are hard to define when the agent operates in real-world environments with large state-action spaces and multiple goal states, with possibly different state feature distributions and world dynamics. A trade-off between computational complexity and sample efficiency is usually involved in automatically learning the mappings [20].

We propose an efficient, online system that tries to transfer old knowledge, but at the same time evaluates new options to see if they work better. The agent gathers experience during its lifetime and enters a new environment equipped with expectations on how different aspects of the world affect the outcomes of the agent's actions. The main idea is to allow an agent to collect a library of world models or representations, called views, that it can consult to focus attention in a new task. In this paper, we concentrate on approximating the dynamics or transition model. The feedback or reward model library can be learned in an analogous fashion. Effective utilization of the library of world models allows the agent to capture the transition dynamics of the new environment quickly; this should lead to a jumpstart in learning and faster convergence to a near optimal policy. A main challenge is in learning to select a proper view for a new task in a new environment, without any predefined mapping strategies. The ability to learn new views is also critical in the complex and feature-rich real environments.

Through scoring the learned views and simultaneous evaluation without the views, our system mitigates potential negative effects by balancing the exploration-and-exploitation trade-off. This is in contrast to minimizing negative transfer effects through determining the underlying relationships between the experiences and the new tasks and domains [14]. Such relationships may not exist and are usually hard to determine in the heterogeneous, dynamic environments that we focus on.

1.3. A new transfer learning framework

We present a new transfer learning framework that supports all the essential steps of a fully autonomous RL transfer agent, as compared with most existing work that mainly focuses on one or two of these steps [12]:

1. Select an appropriate source task or set of source tasks to transfer to a target task;
2. Learn the relationships between the source and the target tasks; and
3. Transfer the relevant knowledge from the source task to the target task to improve learning performance.

Our system comprises the following components:

• Situation-calculus Markov decision process (CMDP), a new variant of factored MDP that compactly represents the relevant action effects in a dynamic environment;

• A collection of views, which summarize the experiences learned from constructing transition models in different environments based on multinomial logistic regression; the views are scored based on their relevance in the new environments;

• A view or transition model learning algorithm, multinomial Dual Averaging with Group Lasso (mDAGL), that supports online feature selection and relevance-focused updating;

• A new reinforcement learning algorithm, logistic regression RL (loreRL), that applies mDAGL on a CMDP to learn the transition models by automatically focusing on the relevant features in the environment; and

• A new transfer learning algorithm, Transfer ExpectationS (TES), that may utilize the previously learned transition models to improve reinforcement learning in new environments.

The rest of the paper is organized as follows. We will next formalize the problem and describe the method of online feature selection and collecting views into a library. We will then present an efficient implementation of the proposed transfer learning technique. After discussing related work, we will demonstrate the effectiveness of our system through a set of simulated experiments and a real robotic application. We will conclude by discussing the lessons learned and the implications for future robotic research.

2. Background

A task or a problem $M$ for an intelligent agent or robot to solve, e.g., navigate to the master bedroom or fetch a cup from the kitchen table, is typically modeled as a Markov decision process (MDP) defined by a tuple $M = (S, A, T, R)$, where $S$ is a set of states; $A$ is a set of actions; $T : S \times A \times S \to [0, 1]$ is a transition function or model, such that $T(s, a, s') = P(s' \mid s, a)$ indicates the probability of transiting to a state $s'$ upon taking an action $a$ in state $s$; $R : S \times A \to \mathbb{R}$ is a reward function or model indicating immediate expected reward after an action $a$ is taken in state $s$. In RL, the state-action space $S \times A$ defines the domain of the task, the transition model $T$ and the reward model $R$ define the objective of the task [21]; the transition model $T$ and (sometimes) the reward model $R$ are unknown. The main challenge of transfer learning in RL is in transferring across heterogeneous domains, i.e., where the state-action spaces, and possibly the feature spaces, are different.

For each task $M$, the task or problem solution is to find a policy $\pi$ that specifies an action to perform in each state so that the expected accumulated future reward (with possibly higher weights for more recent rewards) for each state is maximized [6]. The optimal policy $\pi^*$ is usually derived by a model-based or a model-free approach in RL; the latter does not consider the transition model $T$ in deriving the solution. The two RL approaches are based on different assumptions, problem characteristics, and constraints. We adopt the model-based approach as it is easier to incorporate domain or expert knowledge into the transition and reward models; knowledge transfer across different environments can also be facilitated through the models.

In model-based RL, the optimal policy is derived by first estimating the transition model $T$ (and the reward model $R$, if necessary) through interacting with the environment. This study focuses on learning the models, and remains relatively agnostic about the actual planning techniques. However, since the purpose is to build a model about world dynamics, planning systems that rely on simulators appear most natural.
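To make the model-based setting concrete, the following is a minimal sketch of planning once $T$ and $R$ are available. It uses plain value iteration on a hypothetical two-state MDP; the states, actions, probabilities, rewards, and discount factor are all illustrative assumptions, and the paper itself stays agnostic about the planner.

```python
# Hypothetical two-state, two-action MDP, used only for illustration.
STATES = [0, 1]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor (assumed)

# T[s][a][s2] = P(s2 | s, a); every inner distribution sums to 1.
T = {
    0: {"stay": {0: 1.0, 1: 0.0}, "move": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "move": {0: 0.8, 1: 0.2}},
}
# R[s][a] = immediate expected reward for taking action a in state s.
R = {0: {"stay": 0.0, "move": 0.0}, 1: {"stay": 1.0, "move": 0.0}}


def q_value(s, a, values):
    """One-step lookahead value of action a in state s under the model."""
    return R[s][a] + GAMMA * sum(p * values[s2] for s2, p in T[s][a].items())


def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality update until values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values


def greedy_policy(values):
    """Pick, in each state, the action with the highest lookahead value."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}


values = value_iteration()
policy = greedy_policy(values)
```

In a model-based learner the table `T` would be replaced by the current estimate of the transition model, and planning would be repeated as that estimate improves.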

3. A multi-view transfer learning framework

A key idea of this work is that the agent can represent the world dynamics from its sensory state space in different ways. Such different views correspond to the agent's decisions to focus attention on only some features of the state in order to quickly approximate the state transition function. In other words, a view represents an expectation of the agent about the transition dynamics resulting from one (or several) of its actions in the task environment. In a new task, the agent will select appropriate views to solve the task, and learn new views if the environment is novel. Fig. 1 illustrates the workflow of our life-long learning agent. Parts of this work have been published in [22–24].

Fig. 1. Our life-long learning agent.

We sketch the overall behavior of our agent in Algorithm 1.

Algorithm 1 Overview of the proposed learning framework.

Input: View library $L$.
Output: Updated view library $L$.

Initialize the system for a task
  $L_H$: historical record of how good each view was in previous tasks
  $L' \leftarrow L$  /* A working copy of $L$ to find the dynamics of the current environment */
  $B \leftarrow$ initialize belief of how well each view can approximate the current environment dynamics based on $L_H$.

/* while interacting with the environment */
for $t = 0, 1, 2, \ldots$ do
  Select views to approximate the world transition dynamics
    $\{W\} \leftarrow$ select the most promising views from $L$ based on $B$
  Interact with environment based on that approximate model
    $\pi \leftarrow$ plan an action policy based on $\{W\}$
    $s_t$: current state in the environment
    $a_t \leftarrow$ choose an action to perform in $s_t$ according to action policy $\pi$
    Perform action $a_t$ and observe feedback: action outcomes from the environment
  Score all views with the new feedback
    $S \leftarrow$ score all views in $L$ with the new feedback
    $B \leftarrow$ update belief about the views in $L$ based on the score $S$
  Adjust all views with new feedback
    $L \leftarrow$ adapt all views toward this current environment based on the new feedback
  Break when the task ends
end for

Update view library $L$
  $L \leftarrow$ update the view library, e.g.
    if a view $W$ is different from existing views in $L$,
    which means the transition dynamics of the corresponding action is new,
    then add that view $W$ to $L$;
    else replace the old view in $L$ with the newly updated view $W$.

When the agent does not have any initial information about the transition dynamics of the environment in a new task, it selects "expectations" or views based on history that tells how well each view in the library has worked in previous tasks. We assume that the more frequently a view has worked in the past, the more likely it will work again in a new task.

The agent then operates according to the policy learned based on the transition model built from the selected views, assuming that the reward model is known. However, since the views have been selected without any reference to the actual characteristics of the new environment, it is highly likely that these views are inappropriate for the current task. In other words, the "expected" transition model may not correctly approximate the true transition dynamics of the current environment. The resulting policy may just perform poorly, leading to negative rewards.

To limit such negative transfer effects, our agent exploits the outcomes or feedback for each of its actions, in terms of state changes and rewards, to score all the views in the library. The score estimates the capability of each view in approximating the transition dynamics in the current environment; it is a primary criterion to re-select the views for subsequent decisions.

The new task and/or environment, however, may actually be very different from all the past experiences of the agent. In this case, none of the recorded views can capture or adapt to the new transition dynamics. Hence, environment feedback is also used to develop and incorporate new views into the library. These steps are repeated until a stopping criterion is met.

Fig. 2. Different ways to decompose a state transition function.

At the end of a task, the selected views will be recorded as new additions in the library if the transition dynamics captured are significantly different from the existing views. Otherwise, the agent would just update the existing views accordingly. Views that have not been used for a "long time" will be pruned to manage the library size.

We address three main technical challenges in this framework: First, the transition model $T(S, A, S')$ is task specific, which is probably a reason why there have not been many studies that transfer the transition model. Second, learning or updating a view or a transition model online in a complex and feature-rich environment is computationally expensive. Third, the view scoring method must be simple enough to be calculated online, based on feedback from the environment. The view library also needs to be efficiently updated.

3.1. Situation calculus MDP: CMDP

In a factored MDP, each state $s$ is represented by a vector of $n$ state-attributes $s_i$, $i = 1, \ldots, n$. Each state attribute is a random variable that can take on multiple values; each state $s$ is defined by the Cartesian product of the $n$ state attributes. For example, the binary state attributes Water (with values present and absent) and Martians (with values present and absent) will define all the states that describe the "interestingness" of a particular location on Mars: {(water, Martians), (water, no-Martians), (no-water, Martians), (no-water, no-Martians)}.
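As a small illustration of how a factored state space arises from its attributes, the sketch below enumerates the Mars example as a Cartesian product. The dictionary layout and variable names are assumptions made for this example only.

```python
from itertools import product

# Each state attribute with its possible values (the Mars example from the text).
attributes = {
    "Water": ["water", "no-water"],
    "Martians": ["Martians", "no-Martians"],
}

# The state space is the Cartesian product of the attribute domains.
states = list(product(*attributes.values()))

# The number of states grows multiplicatively with each added attribute.
num_states = len(states)
```

Adding a third binary attribute would double `num_states`, which is why the text later separates state-identifying attributes from merely informative features.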

The transition function or model for the factored states is commonly expressed in dynamic Bayesian networks (DBNs), a temporal variant of Bayesian networks. In the DBN representation of the transition model, $T(s, a, s') = \prod_{i=1}^{n} P(s'_i \mid \mathrm{Pa}^a_i(s), a)$, where $s'_i$ is a state-attribute, $\mathrm{Pa}^a_i$ indicates a subset of state-attributes in $s$ called the parents of $s'_i$ (Fig. 2a), i.e., the attributes of $s$ from which there are arcs to $s'_i$ in the DBN. Learning $T$ requires learning the subsets $\mathrm{Pa}^a_i$ and the parameters for conditional distributions, or the DBN local structures. A critical issue in this MDP formulation is the ambiguous definitions of the factored attributes or variables. The state-attributes or features serve to both define the state space and capture information about the transition model, even if these two purposes can be very different. For example, two state-attributes, the (x, y)-coordinates, uniquely identify a state and compactly factorize the state space in a grid-world. A policy can be learned on this factored space. The state transition dynamics, however, may depend on other features of the state, such as the surface material at the location (state), the presence or absence of Martians, etc. Such features are often included in the state representations. While essential in formulating the transition or reward models, these features may complicate the planning or learning processes by increasing the size and complexity of the state space.

We present a variant of the factored MDP that defines a "compact but comprehensive" factorization of the transition function and supports efficient learning of the relevant features. We separate the state-identifying state-attributes from the "merely" informative state-features in our representation (see Fig. 2b). This way, we can apply an efficient feature selection method on a large number of state features to capture the transition dynamics, while maintaining a compact state space.

Similar to the approach proposed by Konidaris and Barto [25], the state attributes could be defined in terms of the agent-space representation based on the capabilities afforded by the sensors or actuators of the robot, e.g., the (x, y)-coordinates of the robot location. The state features, on the other hand, could be defined in terms of the problem-space representation based on the domain or environmental characteristics of the task, e.g., the terrain or wetness of the floor surface, which may or may not be accurately detectable by the agent or robot.

Learning DBN structures of the transition function online, i.e., while the agent is interacting with the environment, is still computationally prohibitive in most domains. On the other hand, recent studies [26,27] have shown encouraging results in learning the structure of logistic regression models, which can effectively approximate the local structures in DBNs even in high dimensional spaces. While these regression models cannot fully capture the conditional distributions, their expressive power can be improved by augmenting the low dimensional state representation with non-linear features of the state vectors. We will introduce an online sparse multinomial logistic regression method that supports efficient learning of the structured representation of the transition function in Section 4.

In addition, we predict the relative changes of states instead of directly specifying the next state-attributes in a transition (see Fig. 2c). In RL, an action will stochastically create an effect that determines how the current state changes to the next one [28,10,17]. Mediating state changes via action effects is a common strategy in the classic situation calculus [29]. Since the number of relative changes or action effects is usually much smaller than the size of the state space, or the size of state-attribute domains, the corresponding prediction task should be easier. The learning problem can then be expressed as a multi-class classification task of predicting the action effects.

Usually, the transition functions and reward functions are defined in terms of the state features – aspects of the state representation that help in predicting the action effects. In a specific task or environment, the same action in different states (e.g., different locations) with the same features tends to yield the same action effects (i.e., relative changes in the states).

Formally, a situation calculus MDP (CMDP) is defined by a tuple $M = (S, f, A, T, E, R, \gamma)$, where $S, A, T, R, \gamma$ have the same meaning as in a regular MDP. $S = \langle S_1, S_2, \ldots, S_n \rangle$ is the state space implicitly represented by vectors of $n$ state-attributes. The function $f : S \to \mathbb{R}^m$ extracts $m$ state-features from each state. $E$ is an action effect variable such that the transition function can be factored as

$$T(s, a, s') = P(s' \mid s, a) = \prod_{i=1}^{n} P(s'_i \mid s, f(s), a) = \prod_{i=1}^{n} \sum_{e \in E} P(s'_i \mid e, s)\, P(e \mid s, f(s), a). \tag{1}$$

Fig. 2c shows an example of this decomposition. The agent uses the feature function $f$ to identify the relevant features, and then uses both state attributes and features to predict the action effects. We also assume that the effect $e$ and current state $s$ determine the next state $s'$, thus $P(s' \mid e, s)$ is either 0 or 1. This defines the semantic meaning of the effect, which is assumed to be known by the agent. The remaining task is to learn $P(e \mid s, a) = P(e \mid x(s), a)$, where $x(s) = (s, f(s))$ is a vector containing both the state attributes and the state features. Assuming the effect $e$ is discrete, learning $P(e \mid x(s), a)$ is a classification problem. In Section 4 we will show how to solve this problem by using multinomial logistic regression methods.
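The role of action effects can be illustrated with a small grid-world sketch. Everything here is hypothetical: the effect set, the `features` function standing in for $f$, and the hard-coded effect probabilities standing in for a learned $P(e \mid x(s), a)$. For brevity the sketch treats the state as a whole rather than attribute by attribute, so it is a simplification of Equation (1), not a literal implementation.

```python
# Hypothetical effects: relative changes (dx, dy) of the agent's position.
EFFECTS = {"forward": (0, 1), "stay": (0, 0), "slip_left": (-1, 0)}


def features(state):
    """Stand-in for f(s): a single made-up feature, whether the cell is slippery."""
    x, y = state
    return {"slippery": x == y}


def effect_distribution(state, action):
    """Stand-in for a learned P(e | x(s), a); the numbers are illustrative."""
    if features(state)["slippery"]:
        return {"forward": 0.5, "stay": 0.2, "slip_left": 0.3}
    return {"forward": 0.9, "stay": 0.1, "slip_left": 0.0}


def apply_effect(state, effect):
    """Deterministic semantics of an effect: P(s' | e, s) is either 0 or 1."""
    dx, dy = EFFECTS[effect]
    return (state[0] + dx, state[1] + dy)


def transition_prob(state, action, next_state):
    """Sum the probabilities of all effects that lead from state to next_state."""
    dist = effect_distribution(state, action)
    return sum(p for e, p in dist.items() if apply_effect(state, e) == next_state)


p_slippery = transition_prob((1, 1), "go", (1, 2))
p_dry = transition_prob((0, 1), "go", (0, 2))
```

Note that the effect distribution depends on the state only through its features, so two different locations with the same features share the same predictor.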

3.2. Transferring expectations: TES

In our framework, the knowledge gathered and transferred by the agent is collected into a library $\mathcal{T}$ of online effect predictors or views. A view consists of a structure component $\bar{f}$ that picks the features which should be focused on, and a quantitative component $\Theta$ that defines how these features should be combined to approximate the distribution of action effects. Formally, a view is defined as $\tau = (\bar{f}, \Theta)$, such that $P(E \mid S, a; \tau) = P(E \mid \bar{f}(S), a; \Theta) = \tau(S, a, E)$, in which $\bar{f}$ is an orthogonal projection of $x(s)$ to some subspace of $\mathbb{R}^m$, where $m$ is the dimension of the subspace. Each view $\tau$ is specialized in predicting the effects of one action $a(\tau) \in A$ and it yields a probability distribution for the effects of the action $a$ in any state. This prediction is based on the features of the state and the parameters $\Theta(\tau)$ of the view that may be adjusted based on the actual effects observed in the task environment.

We denote the subset of views that specify the effects for action $a$ by $\mathcal{T}^a \subset \mathcal{T}$. The main challenge is to build and maintain a comprehensive set of views that can be used in new environments likely resembling the old ones, but at the same time allow adaptation to new tasks with completely new transition dynamics and feature distributions.

At the beginning of every new task, the existing library is copied into a working library which is also augmented with fresh, uninformed views, one for each action, that are ready to be adapted to new tasks. We then select, for each action, a view with a good "track record", i.e., it has been "used" or applied many times or with a high recency-weighted score. This view is used to estimate the optimal policy based on the transition model specified in Equation (1), and the policy is used to pick the first action $a$. The action effect is then used to score all the views in the working library and to adjust their parameters. In each round the selection of views is repeated based on their scores, and the new optimal policy is calculated based on the new selections. At the end of the task, the actual library is updated by possibly recruiting the views that have "performed well" and retiring those that have not. A more rigorous version of the procedure is described in Algorithm 2.

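3.2.1. Scoring the views

To assess the quality of a view $\tau$, we measure its predictive performance by a cumulative log-score. This is a proper score [30] that can be effectively calculated online.

Given a sequence $D^a = (d_1, d_2, \ldots, d_N)$ of observations $d_i = (s_i, a, e_i)$ in which action $a$ has resulted in effect $e_i$ in state $s_i$, the score for an $a$-specialized $\tau$ is

$$S(\tau, D^a) = \sum_{i=1}^{N} \log \tau(s_i, a, e_i; \theta_i(\tau)),$$

where $\tau(s_i, a, e_i; \theta_i(\tau))$ is the probability of effect $e_i$ given by the view or effect predictor $\tau$ based on the features of state $s_i$ and the parameters $\theta_i(\tau)$ that may have been adjusted using previous data $(d_1, d_2, \ldots, d_{i-1})$.

Algorithm 2 TES: Transferring Expectations using a library of views.

Input: $\mathcal{T} = \{\tau_1, \tau_2, \ldots\}$: view library; $\mathrm{CMDP}_j$: a new $j$th task; $\Phi$: view goodness evaluator
Let $\mathcal{T}_0$ be a set of fresh views – one for each action
$\mathcal{T}_{tmp} \leftarrow \mathcal{T} \cup \mathcal{T}_0$  /* the working library for the task */
for all $a \in A$ do $\hat{T}[a] \leftarrow \arg\max_{\tau \in \mathcal{T}^a} \Phi(\tau, j)$ end for  /* selecting views */
for $t = 0, 1, 2, \ldots$ do
  $a_t \leftarrow \hat{\pi}(s_t)$, where $\hat{\pi}$ is obtained by solving MDP using transition model $\hat{T}$
  Perform action $a_t$ and observe effect $e_t$
  for all $\tau \in \mathcal{T}^{a_t}_{tmp}$ do $\mathrm{Score}[\tau] \leftarrow \mathrm{Score}[\tau] + \log \tau(s_t, a_t, e_t)$ end for
  for all $\tau \in \mathcal{T}^{a_t}_{tmp}$ do Update view $\tau$ based on $(f(s_t), a_t, e_t)$ end for
  $\hat{T}[a_t] \leftarrow \arg\max_{\tau \in \mathcal{T}^{a_t}_{tmp}} \mathrm{Score}[\tau]$  /* selecting views */
end for
for all $a \in A$ do
  $\tau^* \leftarrow \arg\max_{\tau \in \mathcal{T}^a_{tmp}} \mathrm{Score}[\tau]$;
  $\mathcal{T}^a \leftarrow \mathrm{growLibrary}(\mathcal{T}^a, \tau^*, \mathrm{Score}, j)$  /* updating library */
end for
if $|\mathcal{T}| > M$ then $\mathcal{T} \leftarrow \mathcal{T} - \{\arg\min_{\tau \in \mathcal{T}} \Phi(\tau, j)\}$ end if  /* pruning library */

Algorithm 3 Grow sub-library $\mathcal{T}^a$.

Input: $\mathcal{T}^a$, $\tau^*$, Score, $j$: task index; $c$: constant; $H_{\tau^*} = \{\}$: empty history record
Output: updated library subset $\mathcal{T}^a$ and winning histories $H_\tau$
case $\tau^* \in \mathcal{T}^a_0$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$  /* add newbie to library */
otherwise do
  Let $\bar{\tau} \in \mathcal{T}$ be the original, not adapted version of $\tau^*$
  case $\mathrm{Score}[\tau^*] - \mathrm{Score}[\bar{\tau}] > c$ do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\}$
  otherwise do $\mathcal{T}^a \leftarrow \mathcal{T}^a \cup \{\tau^*\} - \{\bar{\tau}\}$
    $H_{\tau^*} \leftarrow H_{\bar{\tau}}$  /* inherit history */
$H_{\tau^*} \leftarrow H_{\tau^*} \cup \{j\}$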
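The cumulative log-score and the score-based selection of a view can be sketched as follows. The `ConstantView` class, the view names, and the observations are hypothetical; in particular, these toy views ignore the state and never adapt their parameters, whereas the views in the text are updated after every observation.

```python
import math


class ConstantView:
    """Hypothetical view: a fixed effect distribution that ignores the state."""

    def __init__(self, probs):
        self.probs = probs

    def predict(self, state, action, effect):
        """Probability the view assigns to the observed effect."""
        return self.probs[effect]


def cumulative_log_score(view, data):
    """Sum of log-probabilities assigned to the observed effects."""
    return sum(math.log(view.predict(s, a, e)) for s, a, e in data)


# Observations (state, action, effect) for a single action "go".
data = [(0, "go", "forward"), (1, "go", "forward"), (2, "go", "stay")]

views = {
    "optimistic": ConstantView({"forward": 0.9, "stay": 0.1}),
    "cautious": ConstantView({"forward": 0.5, "stay": 0.5}),
}

scores = {name: cumulative_log_score(view, data) for name, view in views.items()}
best = max(scores, key=scores.get)  # the view that would be selected next
```

Because the score is a sum, it can be maintained incrementally by adding one log-probability per observed effect, which is what makes it cheap to compute online.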

3.2.2. Growing the library

After completing a task, the highest scoring views for each action are considered for recruiting into the actual library. The winning "newbies" or new entries are automatically accepted. In this case, the data has most probably come from a distribution that is far from any of the current models; otherwise one of the current models would have had an advantage to adapt and win. In other words, this means that the existing views have generated low or negative rewards in the process of deriving the optimal policy in RL; they have not been used further and a new set of world dynamics has been learned. Instead of trying to directly identify the underlying commonalities between the old and new tasks and/or environments, our framework mitigates potential negative transfer effects by learning the new views or transition dynamics in parallel to exploiting and examining if the existing views can improve learning. This ability is important in real domains where the underlying system dynamics may not be easily determined.

The winners $\tau^*$ that are adjusted versions of old views $\bar{\tau}$ are accepted as new members if they score significantly higher than their original versions, based on the logarithm of the prequential likelihood ratio [31] $\Lambda(\tau^*, \bar{\tau}) = S(\tau^*, D^a) - S(\bar{\tau}, D^a)$. Otherwise, the original versions $\bar{\tau}$ get their parameters updated to the new values. This procedure is just a heuristic and other inclusion and updating criteria may well be considered. The policy is detailed in Algorithm 3.
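The recruiting rule can be sketched as a small function over sets of view names. The function name, its arguments, and the use of strings as view identifiers are assumptions for illustration; history bookkeeping is left out.

```python
def grow_library(sub_library, winner, original, scores, c):
    """Return the updated set of views for one action.

    `original` is None when the winner is a fresh view; otherwise it names
    the library view that the winner was adapted from.
    """
    updated = set(sub_library)
    if original is None:
        updated.add(winner)  # a winning newbie is always accepted
    elif scores[winner] - scores[original] > c:
        updated.add(winner)  # significantly better: keep both versions
    else:
        updated.discard(original)  # otherwise the adapted version replaces the old one
        updated.add(winner)
    return updated


scores = {"v1_adapted": -2.0, "v1": -5.0}
kept_both = grow_library({"v1"}, "v1_adapted", "v1", scores, c=1.0)
replaced = grow_library({"v1"}, "v1_adapted", "v1", scores, c=10.0)
with_newbie = grow_library({"v1"}, "fresh", None, {}, c=1.0)
```

The threshold `c` controls how readily the library grows: a small value keeps many near-duplicate views, while a large value mostly overwrites existing ones.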

3.2.3. Pruning the library

To keep the library relatively compact, a plausible policy is to remove views that have not performed well for a long time, possibly because there are better predictors or they have become obsolete in the new tasks or environments. To implement such a retiring scheme, each view $\tau$ maintains a list $H_\tau$ of task indices that indicates the tasks for which the view has been the best scoring predictor for its specialty action $a(\tau)$. We can then calculate the recency weighted track record for each view. In practice, we have adopted the procedure by Zhu et al. [32] that introduces the recency weighted score at time $T$ as

$$\Phi(\tau, T) = \sum_{t \in H_\tau} e^{-\mu (T - t)},$$

where $\mu$ controls the speed of decay of past success. Other decay functions could naturally also be used. The pruning can then be done by introducing a threshold for the recency weighted score or always maintaining the top $M$ views.
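A minimal sketch of the recency weighted score and a top-$M$ pruning rule follows, assuming the exponential decay form $\sum_t e^{-\mu(T - t)}$ over the tasks a view has won. The view names, histories, and parameter values are made up for the example.

```python
import math


def recency_weighted_score(won_tasks, current_task, mu):
    """Sum of exponentially decayed credits, one per task the view has won."""
    return sum(math.exp(-mu * (current_task - t)) for t in won_tasks)


def prune(histories, current_task, mu, max_views):
    """Keep the max_views views with the highest recency weighted score."""
    ranked = sorted(
        histories,
        key=lambda v: recency_weighted_score(histories[v], current_task, mu),
        reverse=True,
    )
    return ranked[:max_views]


# Hypothetical winning histories: view name -> indices of tasks it won.
histories = {"old": [1, 2, 3], "recent": [9, 10], "never": []}
kept = prune(histories, current_task=10, mu=0.5, max_views=2)
```

With these numbers the view that won two recent tasks outranks the one that won three tasks long ago, which is the intended effect of the decay.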

4. A view learning algorithm

In TES, a view can be implemented by any probabilistic classification model that can be quickly learned online. This requirement excludes most of the batch or offline classification methods. In this study, we introduce a scalable online sparse multinomial logistic regression algorithm to incrementally learn a view. The proposed algorithm optimizes an objective function similar to the group lasso [33] which has been recently suggested for efficient feature selection among a very large set of features [27].

Algorithm 4 The mDAGL algorithm.

Input: $\lambda$, $\alpha$
Let $h(W)$ be a strongly convex function with modulus 1
Let $W_1 = W_0 = \arg\min_W h(W)$
Let $\bar{G}_0 = \bar{0}$
for $t = 1, 2, 3, \ldots$ do
  $(y_t, x_t) \leftarrow$ observe data
  $(W_{t+1}, \bar{G}_t) \leftarrow$ mDAGL-update$(t, y_t, x_t, W_t, \bar{G}_{t-1}, \lambda, \alpha)$
end for

Multinomial logistic regression is a simple yet effective classification method. Assuming $K$ classes of $d$-dimensional vectors $x \in \mathbb{R}^d$, we represent each class $k$ with a $d$-dimensional prototype vector $W_k$. Classification of an input vector $x$ is based on how "similar" it is to the prototype vectors. Similarity is measured with the inner product $\langle W_k, x \rangle = \sum_{i=1}^{d} W_{ki} x_i$, where $x_i$ denotes feature $i$. The probability of a class is defined by $P(y = k \mid x; W) \propto e^{\langle W_k, x \rangle}$. The parameter vectors of the model form the rows of a matrix $W = (W_1, \dots, W_K)^T$.
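As an illustration of the class probabilities, the following Python sketch normalizes the exponentiated inner products over all classes (a softmax). The function name `class_probabilities` and the toy matrix are hypothetical, and subtracting the maximum score is a standard numerical safeguard rather than part of the model definition.

```python
import math

def class_probabilities(W, x):
    """Softmax over the inner products <W_k, x>; W is a list of K prototype rows."""
    scores = [sum(w_ki * x_i for w_ki, x_i in zip(row, x)) for row in W]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # K = 3 classes, d = 2 features
probs = class_probabilities(W, [2.0, 0.0])
```

The input is most similar to the first prototype, so the first class receives the largest probability, and the other two classes tie.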

Let $l^t(W^t) = -\log P(y^t \mid x^t; W^t)$ denote the item-wise log-loss of a model with coefficient matrix $W^t$ predicting a data point $(y^t, x^t)$ observed at time $t$. A typical objective of an online learning system is to minimize the total loss by updating its $W^t$ over time. However, the resulting model will often be very complicated and over-fitting. To achieve a parsimonious model, we express our a priori belief that most features are irrelevant or superfluous by introducing a regularization term $\Psi(W) = \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W_{\cdot i}\|_2$, where $\|W_{\cdot i}\|_2$ denotes the 2-norm of the $i$th column of $W$, and $\lambda$ is a positive constant. This regularization is similar to that of group lasso. It communicates the idea that it is likely that a whole column of $W$ has zero values (especially for large $\lambda$). A column of all zeros suggests that the corresponding feature is not necessary for classification.
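The column-wise penalty can be sketched in Python as follows, assuming the regularizer $\lambda \sum_i \sqrt{K}\, \|W_{\cdot i}\|_2$ with one group per feature column. The helper names `group_lasso_penalty` and `selected_features` are hypothetical.

```python
import math

def group_lasso_penalty(W, lam):
    """lam * sum_i sqrt(K) * ||W[:, i]||_2 for a K x d matrix given as a list of rows."""
    K, d = len(W), len(W[0])
    column_norms = [math.sqrt(sum(W[k][i] ** 2 for k in range(K))) for i in range(d)]
    return lam * math.sqrt(K) * sum(column_norms)

def selected_features(W):
    """Indices of features whose coefficient column is not identically zero."""
    return [i for i in range(len(W[0])) if any(row[i] != 0.0 for row in W)]

W = [[3.0, 0.0, 0.0],
     [4.0, 0.0, 0.0]]          # K = 2 classes, d = 3 features; only feature 0 is used
penalty = group_lasso_penalty(W, lam=0.05)
active = selected_features(W)
```

Only the first column contributes to the penalty, and only that feature counts as selected; the all-zero columns correspond to features the model ignores.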

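The objective function can now be written as

$$L^{(T)} = \sum_{t=1}^{T} \left( l^t(W^t) + \Psi(W^t) \right) = \sum_{t=1}^{T} \left( -\log \frac{e^{\langle W^t_{y^t}, x^t \rangle}}{\sum_{k} e^{\langle W^t_k, x^t \rangle}} + \lambda \sum_{i=1}^{d} \sqrt{K}\, \|W^t_{\cdot i}\|_2 \right),$$

where $W^t$ is the coefficient matrix learned using $t - 1$ previously observed data items. The quality of a sequence of parameter matrices $W^t$, $t \in (1, \dots, T)$, with respect to a fixed parameter matrix $W$ can be measured by the amount of extra loss, or regret

$$R_T(W) = L^{(T)} - L_W^{(T)} = \sum_{t=1}^{T} \left( l^t(W^t) + \Psi(W^t) \right) - \sum_{t=1}^{T} \left( l^t(W) + \Psi(W) \right).$$

We want to learn a series of parameters $W^t$ to achieve small regret with respect to a good model $W$ that has a small loss $L_W^{(T)}$.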

4.1. Online learning of multinomial regularized logistic regression with group lasso: mDAGL

Recently, Xiao et al. [26] introduced a dual averaging method for solving lasso online, and Yang et al. [27] extended the work for solving group lasso. The methods are simple, efficient, and scalable for learning the regularized regression models. Following the same approach, we introduce mDAGL (Algorithm 4), a new algorithm to incrementally learn $W$ to optimize the specified objective function in the sparse multinomial logistic regression. Typically, group lasso is used to regularize groups of coefficients where each coefficient corresponds to a particular feature. In our case of multinomial logistic regression, a group comprises the coefficients of a feature; in other words, a group is a column in coefficient matrix $W$.

In the standard online stochastic gradient descent method, after observing the data vector $(y^t, x^t)$, we adjust the parameters of the model toward the directions that maximize the likelihood of the observation (or equivalently, minimize the item-wise log-loss $l^t$). The dual averaging method is a version of the stochastic gradient descent, thus we again have to compute the gradient $G^t$ of the log-loss function for each observation $(y^t, x^t)$. But instead of moving the parameters based on these directions $G^t$, we use $G^t$ to update the average of gradients, $\bar{G}^t$, for observations encountered thus far, and move the parameters away from those average directions (away, since we are minimizing). We will next describe the method more formally.

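Algorithm 5 mDAGL-update.
Input: $t$, $y^t$, $x^t$, $W^t$, $\bar{G}^{t-1}$, $\lambda$, $\alpha$
$G^t \leftarrow$ use equation (2) with $(y^t, x^t)$, $W^t$
$\bar{G}^t \leftarrow \frac{t-1}{t} \bar{G}^{t-1} + \frac{1}{t} G^t$
$W^{t+1} \leftarrow$ use equation (A.1) with $\bar{G}^t$, $\beta_t = \alpha \sqrt{t}$, $\lambda$
return $(W^{t+1}, \bar{G}^t)$

Algorithm 6 The loreRL algorithm.
Input: mDAGL regularization parameters $\lambda$, $\alpha$, CMDP variables $S$, $f$, $A$, $E$, $R$, $\gamma$, exploration $\epsilon$.
Let $W = (W_1, W_2, \dots, W_{|A|}) = (W^0, W^0, \dots, W^0)$
Let $\bar{G} = (\bar{G}_1, \bar{G}_2, \dots, \bar{G}_{|A|}) = (0, 0, \dots, 0)$
$s_0 \leftarrow$ random initial state
for $t = 1, 2, 3, \dots$ do
    $\pi \leftarrow$ Solve MDP using transition model $T(W)$
    $a \leftarrow \pi(s_t, \epsilon)$    # $\epsilon$-greedy action selection
    Take action $a$ yielding effect $e$, next state $s_{t+1}$
    $(W_a, \bar{G}_a) \leftarrow$ mDAGL$(t, e, x(s_t), W_a, \bar{G}_a, \lambda, \alpha)$
end for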

We initialize the parameters $W$ to a $K \times d$ matrix of all zeros. Let $G^t_{ki}$ be the partial derivatives of function $l^t(W)$ with respect to $W_{ki}$ at $W^t$ ($G^t_{ki} = \frac{\partial l^t}{\partial W_{ki}}(W^t)$). We define $\bar{G}^t$ to be a matrix of average partial derivatives, i.e., $\bar{G}^t_{ki} = \frac{1}{t} \sum_{\tau=1}^{t} G^{\tau}_{ki}$, where taking the gradient of the loss function with respect to the parameter $W_{ki}$ gives us the formula

$$G^{\tau}_{ki} = -x^{\tau}_i \left( I(y^{\tau} = k) - P(k \mid x^{\tau}; W^{\tau}) \right). \qquad (2)$$

We notice that this partial derivative points either away or towards the observed feature $i$ depending on whether the observation belongs to the class $k$ or not.
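The gradient formula can be checked numerically. The sketch below computes $G_{ki} = -x_i (I(y = k) - P(k \mid x; W))$ for a toy model and compares one entry against a finite-difference estimate of the log-loss derivative. All names and numbers are hypothetical, chosen only for illustration.

```python
import math

def softmax(W, x):
    """Class probabilities proportional to exp(<W_k, x>)."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(W, x, y):
    """Item-wise log-loss -log P(y | x; W)."""
    return -math.log(softmax(W, x)[y])

def log_loss_gradient(W, x, y):
    """G[k][i] = -x_i * (I(y == k) - P(k | x; W))."""
    p = softmax(W, x)
    return [[-x_i * ((1.0 if k == y else 0.0) - p[k]) for x_i in x]
            for k in range(len(W))]

W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]   # K = 3 classes, d = 2 features
x, y = [1.0, 2.0], 1
G = log_loss_gradient(W, x, y)

# Finite-difference estimate of the derivative with respect to W[0][1].
step = 1e-6
W_shift = [row[:] for row in W]
W_shift[0][1] += step
numeric = (log_loss(W_shift, x, y) - log_loss(W, x, y)) / step
```

For positive features the row of the observed class gets negative entries (its coefficients will be increased), while the other rows get positive entries.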

For any data observed at time $t$, we use the average gradient $\bar{G}^t$ to update the coefficient matrix $W$ via

$$W^{t+1} = \arg\min_{W} \left\{ \langle \bar{G}^t, W \rangle + \Psi(W) + \frac{\beta_t}{t} h(W) \right\}, \qquad (3)$$

where $\beta_t$ is a non-negative, non-decreasing sequence of real numbers, and $\langle \cdot, \cdot \rangle$ denotes an inner product between two matrices; $\langle \bar{G}^t, W \rangle = \sum_{k,i} \bar{G}^t_{ki} W_{ki}$. The first term is minimized by $W$ that points to the direction $-\bar{G}^t$. While the first term prefers long vectors, the regularization term $\Psi(W)$ balances this out. The third term introduces an extra regularization in terms of a strongly convex function $h(W)$ that is needed for convergence and sparsity. In practice we use the squared Frobenius norm $h(W) = \sum_{k,i} W_{ki}^2$ and $\beta_t = \alpha \sqrt{t}$.

Solving the minimization problem above leads to an update rule (Algorithm 5) in which each column of $W^{t+1}$ is a scaled version of the corresponding column of $\bar{G}^t$. Furthermore, if the length of the average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This amounts to feature selection. The full definition and proof of the rule are detailed in Appendix A.
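To make the scaling and truncation concrete, the following Python sketch solves the minimization column by column under explicit assumptions: $h(W) = \sum_{k,i} W_{ki}^2$, $\beta_t = \alpha \sqrt{t}$, and the penalty $\lambda \sqrt{K} \|W_{\cdot i}\|_2$ per column. The closed form is derived here by setting the column-wise subgradient to zero; it is an illustration and is not claimed to be identical to equation (A.1) in Appendix A. The function name `dual_averaging_update` is hypothetical.

```python
import math

def dual_averaging_update(G_bar, t, lam, alpha):
    """Column-wise minimizer of <G_bar, W> + lam*sqrt(K)*sum_i ||W_i||_2
    + (beta_t / t) * sum W_ki^2, with beta_t = alpha * sqrt(t) (assumed)."""
    K, d = len(G_bar), len(G_bar[0])
    beta_t = alpha * math.sqrt(t)
    threshold = lam * math.sqrt(K)
    W = [[0.0] * d for _ in range(K)]
    for i in range(d):
        norm = math.sqrt(sum(G_bar[k][i] ** 2 for k in range(K)))
        if norm <= threshold:
            continue                      # column truncated to zero: feature dropped
        scale = -(t / (2.0 * beta_t)) * (1.0 - threshold / norm)
        for k in range(K):
            W[k][i] = scale * G_bar[k][i]  # scaled, sign-flipped average gradient
    return W

G_bar = [[0.30, 0.01],
         [-0.40, -0.02]]                  # K = 2 classes, d = 2 features
W_next = dual_averaging_update(G_bar, t=4, lam=0.05, alpha=1.5)
```

In this toy example the second feature has a short average gradient column and is removed, while the first column of the result points opposite to the average gradient.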

A regret analysis confirms that the solution will converge and that the average maximal regret asymptotically approaches zero with rate $O(1/\sqrt{t})$. The full analysis is detailed in Appendix B.

4.2. Model-based RL with multinomial logistic regression: loreRL

Our main task is to turn transition model learning into the learning of conditional distributions $P(E \mid s, f(s), a)$ using multinomial logistic regression, for which attention to relevant features can be efficiently implemented online via mDAGL.

The key steps of our method, called loreRL (RL with regularized logistic regression), are presented in Algorithm 6. Inputs to loreRL are the CMDP components (except the transition function), regularization parameters $\lambda$ and $\alpha$ of the mDAGL algorithm, and the $\epsilon$ that determines the probability of taking a random action. We first initialize logistic regression parameters $W_a$ and the average gradient matrices $\bar{G}_a$ for each action $a \in A$. We also randomly select a starting state $s_0$.

At each time step, a random action $a$ is chosen with a small probability $\epsilon$, but otherwise we calculate the optimal policy $\pi$ for an MDP with the transition model $T(W)$ that is based on the current effect predictors. While we have used value iteration (like in Rmax) for finding the optimal policy, any other model-based RL technique can be used as well. We do not focus on the planning part of RL here, but search heuristics such as those used in Dyna-Q [34] or Prioritized Sweeping [35] can be deployed for a more scalable algorithm. After performing an action $a$ in state $s_t$ and observing its effect $e$, the experience $(e, s_t, f(s_t))$ will be presented to the mDAGL algorithm that updates the parameter matrix $W_a$ and the gradient matrix $\bar{G}_a$.
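The planning and action selection steps can be sketched on a toy problem. The Python code below runs plain value iteration on a hypothetical three-state corridor and then applies $\epsilon$-greedy selection around the resulting policy. The transition function, the reward on arrival, and all names are assumptions made for illustration; in loreRL the transition probabilities would instead come from the learned effect predictors.

```python
import random

def value_iteration(n_states, actions, transition, reward, gamma, sweeps=200):
    """Plain value iteration. transition(s, a) returns [(prob, next_state), ...];
    reward(next_state) is received on arrival (a toy assumption)."""
    def q(V, s, a):
        return sum(p * (reward(s2) + gamma * V[s2]) for p, s2 in transition(s, a))
    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [max(q(V, s, a) for a in actions) for s in range(n_states)]
    policy = [max(actions, key=lambda a, s=s: q(V, s, a)) for s in range(n_states)]
    return V, policy

def epsilon_greedy(policy_action, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else follow the policy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return policy_action

ACTIONS = ["left", "right"]

def toy_transition(s, a):
    # Hypothetical corridor with states 0, 1, 2: a move succeeds with probability 0.8.
    target = max(0, s - 1) if a == "left" else min(2, s + 1)
    return [(0.8, target), (0.2, s)]

def toy_reward(s_next):
    return 1.0 if s_next == 2 else -0.01

V, policy = value_iteration(3, ACTIONS, toy_transition, toy_reward, gamma=0.95)
rng = random.Random(0)
picks = [epsilon_greedy(policy[0], ACTIONS, 0.05, rng) for _ in range(1000)]
greedy_fraction = picks.count(policy[0]) / len(picks)
```

The planner prefers moving right in every state, and with $\epsilon = 0.05$ the agent follows that choice in the large majority of steps while still occasionally exploring.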

As we just do $\epsilon$-greedy random sampling, it is impossible to guarantee PAC convergence to an optimal policy. Assuming that observed data is i.i.d., we can prove that the difference in optimal value functions of two CMDPs with different logistic regression based transition functions is bounded by the difference in their parameters. This leads to a corollary for convergence to a near optimal policy.

Theorem 1 (Difference in value function). Let $M_1 = (S, f, A, T(W^{M_1}), E, R, \gamma)$ and $M_2 = (S, f, A, T(W^{M_2}), E, R, \gamma)$ be two CMDPs with optimal policies $\pi_1$ and $\pi_2$ respectively. Let us denote by $V^{M}_{\pi}$ the value function for policy $\pi$ in CMDP $M$. Let

$$\Delta_1 = 2 \max_{a \in A,\, e \in E} \left\| W^{(a),M_1}_e - W^{(a),M_2}_e \right\|_1 \sup_{s} \|x(s)\|_1,$$

then

$$\max_{s \in S} \left| V^{M_2}_{\pi_1}(s) - V^{M_2}_{\pi_2}(s) \right| \le \frac{V_{\max}\, \Delta_1}{1 - \gamma},$$

where $W^{(a),M_1}_e$ and $W^{(a),M_2}_e$ refer to the vector of coefficients corresponding to class $E = e$ under action $a$ in model $M_1$ and $M_2$ respectively, $\|\cdot\|_1$ is the 1-norm of a vector, and $V_{\max}$ is the maximum value of any state for any policy in either of the CMDPs.

By taking $M_2$ to be a CMDP based on the optimal $W^*$ and $M_1$ an estimated CMDP based on mDAGL, we can derive a vanishing bound for the value difference of policies. In case the true transition model is representable by a sparse $W^*$, we would most probably converge to a near optimal policy. The full proof for the theorem is in Appendix C.

When we cannot express the true transition dynamics as logistic regression based on available state features, it is hard to give guarantees of performance. However, we can still have some confidence in doing well. The logistic regression model $P_l^*$ closest (in Kullback–Leibler distance) to the true model $P_{true}$ (possibly not a logistic regression model) is the one² that has the smallest expected log-loss. While our optimality criterion is the expected regularized log-loss, we expect the regularized log-loss optimal model $P^*$ to be close to $P_l^*$, thus almost as close to $P_{true}$ as we can get. This relatively small KL-distance can be converted to relatively small distances in actual transition probabilities, which can then further be converted to a relatively small bound on value differences by the same arguments used in proving Theorem 1 (in Appendix C). Therefore, since our model would very likely converge close to $P^*$, we can expect to do almost as well as $P^*$.

5. Experiments

We examine the performance of our expectation transfer algorithm TES that transfers views to speed up the learning process across different complex, feature-rich, heterogeneous, and dynamic environments. We show that TES can efficiently:

1. learn the appropriate views online;

2. select views using the proposed scoring metric;

3. achieve a good jumpstart; and

4. perform well in the long run.

We first evaluate our approach in a simulated navigation domain where the assumptions hold. We then conduct a case study in a real robotic domain to see if the theoretical results are useful in practice.

5.1. Simulated robot navigation

Environment. We consider a grid-based robot navigation problem in which each grid cell has the surface of either sand, soil, water, brick, or fire. In addition, there may be walls between cells. The surfaces and walls determine the stochastic dynamics of the world. However, the agent also observes numerous other features in the environment. The agent has to learn to focus on the important features to learn the environment dynamics model, and consequently to achieve its goal.

Action, states, and rewards. The agent can perform four actions (move up, down, left, right), which will lead it to one of the four states around it or leave it in its current state. Effects of the actions are captured in five outcomes (moved up, left, down, right, did not move). The states are defined by the (x, y)-coordinates of the agent. The agent spends 0.01 units of energy to perform an action. It loses 1 unit if falling into a state of fire, but gains 1 unit when successfully reaching an exit door.

Goal. The goal is to reach any exit door in the world consuming as little energy as possible. A task ends when the agent reaches a terminal state, i.e., any exit door or state with fire.

Fig. 3. Accumulated rewards in a 900 state CMDP for various model-based RL-methods.

Table 1
Average running time per episode in 800 episodes when acting in an environment with 210 features. (Slow RL-DT, LSE-Rmax could only be run with 10 features.) Run on Intel Xeon CPU 2.13 GHz, 32 GB RAM.

| Algorithm | fRmax | fEpsG | RL-DT | LSE-Rmax | bloreRL | loreRL |
| --- | --- | --- | --- | --- | --- | --- |
| Time (sec.) | 0.26 | 0.25 | 9.09 | 67.53 | 4.3 | 0.55 |

² Such model may not always exist since the parameter set is open. However, for our argument, any model with almost infimum distance to the true

Tasks. We design fifteen tasks with grid sizes ranging from 20 × 20 to 30 × 30. Each task has a different state space and different terminal states. Each state (cell) is characterized by its surface materials and the walls around it, and 200 additional irrelevant, random binary features. The tasks may involve different dynamics as well as different distributions of the surface materials. In our experiments, the environment transition dynamics is generated using three different sets of multinomial logistic regression models; each combination of cell surfaces and walls around the cell will lead to different transition dynamics at the cell. The probability of going through a wall is rounded to zero and the freed probability mass is evenly distributed to other effects. The agent's starting position is randomly picked in each episode.
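The wall rule can be sketched in a few lines of Python. The sketch assumes that the predicted effect distribution is given as a dictionary and that the freed mass is shared equally by all effects that are not blocked, including "did not move". The function name `block_effects` and the example probabilities are hypothetical.

```python
def block_effects(probs, blocked):
    """Set the probability of blocked effects to zero and spread the freed mass
    evenly over the remaining effects. probs maps effect name -> probability."""
    freed = sum(p for effect, p in probs.items() if effect in blocked)
    open_effects = [effect for effect in probs if effect not in blocked]
    share = freed / len(open_effects)
    return {effect: (0.0 if effect in blocked else p + share)
            for effect, p in probs.items()}

# Hypothetical prediction for a cell with a wall on its upper side.
predicted = {"up": 0.6, "left": 0.1, "down": 0.05, "right": 0.1, "stay": 0.15}
adjusted = block_effects(predicted, blocked={"up"})
```

After the adjustment the blocked effect has probability zero and the distribution still sums to one.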

Transfer learning. The maximum size $M$ of the view library, initially empty, is set to be 20; threshold $c = \log 300$. In a new environment, the TES-agent mainly relies on its transferred knowledge. However, we allow some $\epsilon$-greedy exploration with $\epsilon = 0.05$. The parameters for the view learning algorithm are $\lambda = 0.05$, $\alpha = 1.5$. We conduct a leave-one-out cross-validation experiment with fifteen different tasks. In each scenario the agent is first allowed to experience fourteen tasks, over 100 episodes in each, and it is then tested on the remaining task. No recency weighting is used to calculate the goodness of the views in the library. We next discuss experimental results averaged over 20 runs showing 95% confidence intervals for some representative tasks.

5.1.1. Online view learning in feature-rich environments

We show the empirical evaluation results of loreRL in a 900 cell/state world. We aim to demonstrate that the "single expectation" model-based RL, loreRL, can a) learn views that generalize and approximate the transition model to achieve fast convergence to a near optimal policy, and b) with feature selection, perform well in complex, feature-rich environments. We also want to see if the theoretical promises derived under the assumption of i.i.d. sampling can be realized in practice. We compare accumulated rewards of loreRL with factored Rmax (fRmax), in which the network structures of transition models are known [36], and with factored $\epsilon$-greedy (fEpsG), in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy. We also compare our method with RL-DT [37] and LSE-Rmax [38], which are the state of the art model-based RL algorithms for learning transition models. For these tests, we run loreRL with $\alpha = 1.5$, $\lambda = 0.05$, $\gamma = 0.95$, exploration $\epsilon = 0.05$, parameter $m = 10$ for fRmax, $m = 5$ for Rmax ($m = 5$ is small for Rmax, but increasing it did not yield better results), and fixed $m = 10$, $\sigma = 0.99$ for LSE-Rmax (values originally used in [38]).

Generalization and convergence. We first show that when the feature space is small, loreRL performs as efficiently as the state of the art methods. RL-DT employs a decision tree to generalize transition dynamics knowledge over states, but it is implemented with an $\epsilon$-greedy exploration strategy. LSE-Rmax appears to be the best structure learning method for ergodic factored MDPs [38]. fRmax and fEpsG have correct DBN structures provided by an oracle. All the methods are implemented with our customized DBN to incorporate domain knowledge. Rmax is included as a reference to show the effect of knowledge generalization.

As seen in Fig. 3a, loreRL can approximate the world dynamics using samples in all the states, thus it converges as fast as fEpsG and RL-DT to a near optimal policy. Although fRmax is provided with the correct DBN structure, its accumulated rewards are lower due to aggressive exploration to find the optimal model. After exploration the policy is guaranteed to be near optimal, but it may still take a long time (or forever) to catch up with loreRL. While LSE-Rmax follows the Rmax scheme, it starts with a simple model and explores a bit less aggressively than fRmax, gaining some advantage in early episodes. However, LSE-Rmax appears to require much more data to choose a more complex model. Its accumulated reward drops below fRmax after 150 episodes, and the angle of the curve suggests that its DBN structure is still not correct. We do not run LSE-Rmax for more episodes, as the algorithm is computationally very demanding (Table 1).

When the feature set includes many irrelevant features (Fig. 3b), loreRL is able to learn the relevant ones and still gain nearly as high accumulated rewards as fEpsG, which has relevant features provided by an oracle. loreRL's running time is also not much longer than fRmax's or fEpsG's (Table 1). Other methods are too slow to run in this high-dimensional environment. These results also suggest that with $\epsilon$-greedy exploration and random restarts, a near optimal policy can be found even without i.i.d. data sampling.

Fig. 4. Performance difference to TES in early trials in a) same dynamics, b) heterogeneous environments. c) Convergence.

Table 2
Cumulative reward after first episodes. For example, in Task 1 TES could save (0.616 − 0.113)/0.01 = 50.3 actions compared to LWT.

| Methods | Task 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| loreRL | −0.681 | −0.826 | −0.814 | −1.068 | −0.575 | −0.810 | −0.529 | −0.398 | −0.653 | −0.518 | −0.528 | −0.244 | −0.173 | −1.176 | −0.692 |
| LWT | 0.113 | −0.966 | −0.300 | 0.024 | −1.205 | −0.345 | −1.104 | −1.98 | −0.057 | −0.664 | −0.230 | −1.228 | 0.034 | 0.244 | −0.564 |
| TES | 0.616 | −0.369 | 0.230 | −0.044 | −0.541 | −0.784 | −0.265 | 0.255 | 0.001 | −0.298 | −1.184 | −0.077 | 0.209 | 0.389 | −0.407 |

Feature selection. To understand the role of feature selection, we compare loreRL with a bloreRL that is based on multinomial logistic regression without feature selection (without the regularization term). fEpsG and fRmax are baselines.

Fig. 3b shows the accumulated rewards when the environment has 200 irrelevant binary features. As seen, loreRL is still able to quickly converge to the optimal policy, and outperforms fRmax and bloreRL. Fig. 3c shows performances of loreRL and bloreRL after 800 episodes as a function of the number of irrelevant features. Only minimally affected by the actual number of irrelevant features, loreRL can quickly select the relevant features and outperform bloreRL. loreRL does not lose much to fEpsG either. While fRmax may find an optimal policy before loreRL due to aggressive exploration, its accumulated rewards are still lower than loreRL's. We also observe that loreRL, through selecting a small set of features, runs much faster than bloreRL (Table 1).

5.1.2. View selection and multi-view transfer in complex environments

Transferring expectations between same dynamics tasks. To ensure that TES is capable of basic model transfer, we first evaluate it on a simple task. We train and test TES on two environments which have the same dynamics and 200 irrelevant binary features that challenge the agent's ability to learn a compact model for transfer. Fig. 4a shows how much the other methods lose to TES in terms of accumulated rewards in the test task. loreRL is an implementation of TES equipped with the view learning algorithm that does not transfer knowledge. fRmax is the factored Rmax in which the network structures of transition models are provided by an oracle [36]; its parameter $m$ is set to be 10 in all the experiments. fEpsG is a heuristic in which the optimistic Rmax exploration of fRmax is replaced by an $\epsilon$-greedy strategy ($\epsilon = 0.1$). The results show that these oracle methods still have to spend time to learn the model parameters, so they gain lower accumulated rewards than TES. This also suggests that the transferred view of TES is likely not only compact but also accurate. Fig. 4a further shows that loreRL and fEpsG are more effective than fRmax in early episodes.

View selection vs. random views. Fig. 4b shows how different views lead to different policies and accumulated rewards over the first 50 episodes in a given task. The Rands curves show the accumulated reward differences with respect to TES when the agent follows some random combinations of views from the library. For clarity we show only 5 such random combinations. For all these curves, the differences quickly turn negative in the beginning, indicating less reward in early episodes. We conclude that our view selection criterion outperforms random selection.

Multiple views vs. single view, and non-transfer. We compare the multi-view learning TES agent with a non-transfer agent loreRL, and an LWT [39] agent that tries to learn only one good model for transfer. We also compare with the oracle method fEpsG. As seen in Fig. 4b, TES outperforms LWT which, due to differences in the tasks, also performs worse than loreRL. When the earlier training tasks are similar to the test task, the LWT agent performs well. However, the TES agent also quickly picks the correct views, thus we never lose much but often gain a lot. We also notice that TES achieves higher accumulated rewards than loreRL and fEpsG, which are bound to make uninformed decisions in the beginning.

We also notice that due to its fast capability of capturing the world dynamics, TES running time is just slightly longer than LWT's and loreRL's, which do not perform extra work for view switching but need more time and data to learn the dynamics models.

5.1.3. Jumpstart

Table 2 shows the average cumulative reward after the first episode (the jumpstart effect) for each test task in the leave-one-out cross-validation. We observe that TES usually outperforms both the non-transfer and the LWT approach.

Fig. 5. Three different environments. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

5.1.4. Convergence

To study the asymptotic performance of TES, we compare with the oracle method fRmax, which is known to converge to a (near) optimal policy. Notice that in this feature-rich domain, fRmax without the pre-defined DBN structure is just similar to Rmax. Therefore, we also compare with Rmax. For Rmax, the number of visits to any state before it is considered "known" is set to 5, and the exploration probability $\epsilon$ for known states starts to decrease from value 0.1.

Fig. 4c shows the accumulated rewards and their statistical dispersions over episodes. Average performance is reflected by the angles of the curves. As seen, TES can achieve a (near) optimal policy very fast and sustain its good performance over the long run. It is only gradually caught up by fRmax and Rmax. This suggests that TES can successfully learn a good library of views in heterogeneous environments and efficiently utilize those views in novel tasks.

5.2. Real robot navigation

The theoretical analyses presented above have shown the advantages of loreRL and TES over the state of the art model-based RL algorithms. We have also demonstrated the efficiency of our methods through a set of empirical evaluations in simulated domains. These results, though valuable, are obtained under assumptions that are favorable for our approach. In this section, we aim to further evaluate the proposed methods in a real robotic domain where we cannot expect the effects of actions to follow a logistic regression model.

5.2.1. Experiment set-up

Environments. Fig. 5 shows the three environments used in our case studies. They are designed so that the robot's actions would have different effects at different locations, and the environment surfaces are the main factors affecting the action effects. The surfaces are made of various materials such as beans, soil, hay, leaves, shells, paperboard, and nylon Berber carpet. These materials have different physical effects on the objects moving on them. The slopes and obstacles on the surfaces also contribute to the different effects of the actions. In some areas, the surfaces may change because of the robot's actions. We restore the surfaces to the original conditions after every episode.

For a robot to efficiently plan its path in these environments, only a small set of features based on the slopes, the obstacles, and the materials of the surfaces in different areas of the environments are relevant. These features, however, are very hard to define. It is preferable to leave the robot to automatically select the relevant features from a large set of simple features. To test our approach, therefore, we simply draw green and blue marks on the surfaces. The robot is marked with two red marks. There are also a blue ball and several death-marked spots in the environment. Numerous features can be derived from these artifacts. The robot will need to select a few features that may serve as proxies to the true factors that affect its action effects.

Environment 1 and Environment 3 are deliberately designed so that the robot should learn its views (transition model) based on the blue marks. The transition dynamics in these two environments are very similar. However, the two environments are also different (in terms of irrelevant features): the blue balls, the death places, and the green marks are at different locations. We will explain the features in more detail shortly. Environment 2 is very different from Environment 1 and Environment 3. It is designed so that the robot should learn its views based on the green marks instead of the blue ones.

We treat the environments as discrete MDPs. We discretize Environment 2 into a state space consisting of 8 × 8 (x, y)-locations and 8 different orientations of a robot, which yields a state space of 512 states. Environments 1 and 3 are larger, so we discretize them into a state space consisting of 10 × 10 (x, y)-locations and 8 different orientations of a robot. The environments are in two different sizes, 5 × 5 feet and 6 × 6 feet (Fig. 5).
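The size of the discretized state spaces follows directly from the grid dimensions. The Python sketch below counts the states and shows one possible way to flatten a (column, row, orientation) triple into a single index; the indexing scheme and the function names are assumptions for illustration only.

```python
def state_count(n_rows, n_cols, n_orientations=8):
    """Number of discrete states for a grid with a fixed set of orientations."""
    return n_rows * n_cols * n_orientations

def state_index(x, y, orientation, n_cols, n_orientations=8):
    """Flatten (column x, row y, orientation) into one index (hypothetical scheme)."""
    return (y * n_cols + x) * n_orientations + orientation

states_env2 = state_count(8, 8)       # Environment 2
states_env1 = state_count(10, 10)     # Environments 1 and 3
last_index = state_index(7, 7, 7, n_cols=8)
```

The 8 × 8 grid gives the 512 states mentioned above, and the same count for the 10 × 10 grids gives 800 states.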

Fig. 6. The robot. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Fig. 7. The system architecture.

Robot. We use the LEGO Mindstorms NXT v1.1 kit to build a three-wheel robot as depicted in Fig. 6. Two front wheels of the robot are attached to two separate motors; the back wheel is free rolling. The track width is 11.2 cm. The robot carries a white panel on top with a big and a small red mark for positioning and orientation detection.

The robot system comprises three main components: a central processor, an observatory system, and a command controller (see Fig. 7). The positioning system is a sub-component of the observatory system. Information of the environment and the robot's position is captured by a webcam and sent from the observatory system to the central processor to update the robot's knowledge base as well as to plan the next action. The action command is then transmitted via Bluetooth to the command controller embedded in the robot for execution. We implement the controller in leJOS.³

Actions. The robot is programmed to rotate its left and right wheels in three different ways, corresponding to three actions. For the first action, the robot rotates both its left and right wheels 246 degrees. For the second action, the robot rotates its left wheel 90 and right wheel −90 degrees at the same time. For the third action, the robot rotates its left wheel −90 and right wheel 90 degrees. As the robot may be still moving after each action, we let the system idle for 200 milliseconds after an action, waiting for the robot to stop completely. These actions, under ideal situations, correspond to the actions of move-forward, turn-left, and turn-right, respectively.

Action effects. Due to inaccurate robot motors, sensors, and various real world factors such as the surface materials, slopes of the surface, and obstacles, the actions may change the robot's relative location in four different ways: moved forward one cell, moved diagonally forward to the cell on the robot's left, moved diagonally forward to the cell on the robot's right, and did not move. The robot's orientation can also be changed in five different ways: turned to the next orientation on left, the second next orientation on left, the next orientation on right, the second next orientation on right, and did not turn. That results in a total of 20 different effects.
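The 20 effect classes are simply all pairs of a location change and an orientation change. A short Python sketch, with hypothetical labels for the individual changes, enumerates them and assigns each joint effect a class index that a multinomial classifier could predict.

```python
from itertools import product

# Hypothetical labels for the four location changes and five orientation changes.
location_changes = ["forward", "diagonal-left", "diagonal-right", "no-move"]
orientation_changes = ["left-1", "left-2", "right-1", "right-2", "no-turn"]

# Each joint effect pairs one location change with one orientation change.
effects = list(product(location_changes, orientation_changes))
effect_index = {effect: k for k, effect in enumerate(effects)}
```

The dictionary `effect_index` then maps, for example, the pair ("no-move", "no-turn") to the last of the 20 classes.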

Sensors. We mainly process information from the webcam in the observatory system. The webcam is attached to the ceiling above the area. The robot, therefore, can fully observe the environment. However, the robot can only capture the big and small red marks on the top of the robot itself, and the information of the locations of the green, blue, red marks, and the ball in the environment. As the features are simple, we use the basic algorithms in the OpenCV library⁴ to detect them. The result is nearly perfect.

Fig. 8. Accumulated rewards by various methods.

Table 3
Average running time per episode in 50 episodes. Run on Intel Centrino Duo T2400 (1.83 GHz), 2 × 512 MB RAM.

| Algorithm | fRmax | fEpsG | man-loreRL | loreRL |
| --- | --- | --- | --- | --- |
| Time (sec.) | 13.44 | 12.77 | 9.35 | 10.81 |

State attributes and features. As mentioned, an environment is discretized to $n$ rows × $m$ columns × $o$ orientations, so the full environment state space can be identified or factorized by those three state attributes. However, these attributes alone do not contain enough information for predicting action effects or transition dynamics. Therefore, it is critical that each state is also defined with a long vector of binary state features. The "green" binary indicator $f^G_i(s)$ of a state $s$ is set to 1 iff there is a green mark that is further than $i$ units but closer than $i + 1$ units from the (x, y)-center of the state $s$ ($i \in \{0, \dots, 99\}$). A unit equals the width of the environment divided by 100. Similar features are defined for blue marks and for the blue target ball, yielding 300 binary features. Eight indicators for different robot orientations are also included in the feature base together with four intentionally redundant "there is/is-not a green/blue mark in a state" bits. Altogether these yield 312 binary features per state. The intuition behind these features is that they serve as proxies to surface materials, slopes on the surfaces, obstacles, etc., which appear to be important factors in determining the dynamics in the environments but which the robot's sensors cannot capture. Although only few among these 312 features are important for modeling the robot's actions, the robot does not know which ones actually matter. The robot has to learn to select the relevant features based on feedback while interacting with the environment.
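The distance-ring indicators can be sketched in Python as follows. The sketch assumes Euclidean distance between a mark and the state center and uses strict inequalities as in the definition above; the function name `ring_features` and the mark coordinates are hypothetical.

```python
import math

def ring_features(center, marks, env_width, n_bins=100):
    """Binary vector f where f[i] = 1 iff some mark lies further than i units
    but closer than i + 1 units from the state center (unit = env_width / n_bins)."""
    unit = env_width / n_bins
    features = [0] * n_bins
    for mx, my in marks:
        dist = math.hypot(mx - center[0], my - center[1]) / unit
        i = int(dist)
        if i < n_bins and i < dist < i + 1:   # strict inequalities as in the definition
            features[i] = 1
    return features

green_marks = [(30.0, 41.0), (3.5, 0.0)]      # hypothetical mark coordinates
f_green = ring_features(center=(0.0, 0.0), marks=green_marks, env_width=100.0)
```

Repeating the same computation for blue marks and for the ball would give the three blocks of 100 indicators described above.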

Task. The robot is assumed to know the reward model before any start. The robot's task is to travel in the environments from a random starting point to reach the blue ball, which will earn it a reward of 2 points. The robot will receive −1 point if it falls out of the area or into "death spots" marked with orange rectangles, and −0.05 points for reaching any other states. An episode ends if the robot reaches a terminal state or gets stuck, i.e., could not move, for four consecutive actions. In other words, the robot aims for the highest cumulative reward in each episode. It tries to reach the blue ball as fast as possible, but it will avoid visiting the death spots or moving out of the map. In case it is very costly or impossible to reach the ball, the robot could give up by running into a death spot or moving out of the map.

5.2.2. Online view learning for model-based RL

The robot battery does not allow us to compare our algorithm with the slow RL-DT and LSE-Rmax algorithms, thus we will only compare loreRL with the fine-tuned algorithms, including fRmax, fEpsG, and man-loreRL, in which we manually select important features and specify the DBN structures for the transition models. man-loreRL is based on multinomial logistic regression models with the 12 manually selected features, including eight indicators for different robot orientations, and four indicators telling if there is/is-not a green/blue mark in a state. We run the experiments with loreRL and man-loreRL setting $\alpha = 0.5$, $\lambda = 0.05$, $\gamma$
