Decentralized optimization of fluctuating urban traffic using reinforcement learning


As'ad Salkham

A thesis submitted to the University of Dublin, Trinity College

in fulfillment of the requirements for the degree of

Doctor of Philosophy (Computer Science)

I, the undersigned, declare that this work has not previously been submitted to this or any other University, and that unless otherwise stated, it is entirely my own work.

As'ad Salkham

I, the undersigned, agree that Trinity College Library may lend or copy this thesis upon request.

As'ad Salkham

Acknowledgements

I would like to thank my supervisor Prof. Vinny Cahill for his support and guidance throughout the Ph.D. process.

This thesis would not have reached its current stage without the mentorship of Dr. Raymond Cunningham. He has shown a great deal of patience and scientific professionalism in advising and helping me. A big thank you to Ivana Dusparic who has shared the same frustrations of traffic simulations and the jolly feelings when one sees agents learning! I appreciate a lot her witty proofreading of many of my chapters and her sense of humour that kept us both from edging towards insanity.

I would like to thank every member of DSG, old and new ones. The exhaustive list might take a page length and if the affiliated people are added then it is going to be two pages! So, thank you all for your support and friendship. However, special ones have to be mentioned: Marcin Karpiński, Małgorzata Jaksik, Bartek and Ilona Biskupski, Serena Fritsch and Sanad Ghgaraibeh... many thanks.

There are no words that could show my love and appreciation for my family, their sacrifices for me and their patience. My parents have always stressed the higher priority of education and they were right. They have raised me with the best they can afford and for that I am deeply grateful. My love and gratitude to my three sisters and cousins for standing beside me for all these years.

As'ad Salkham

University of Dublin, Trinity College

Abstract

Increasing traffic congestion levels are causing high worldwide economic, environmental and social costs. Efficient urban traffic control (UTC) is part of the solution to the traffic congestion problem. However, UTC optimization is a challenging task. Urban traffic is characterized by constantly fluctuating traffic patterns. Daily variations in traffic volume and direction, driver behaviour, unexpected emergency situations and traffic accidents all result in traffic fluctuations. Consequently, urban traffic networks exhibit non-stationary behaviour and UTC systems are complex. Furthermore, any local traffic control decisions carried out at a given signalized junction controller may affect both upstream and downstream junctions. Hence, uncoordinated or poor local decisions can negatively impact the traffic network. Modelling UTC as an optimization problem is also complicated by the heterogeneous interlinked layouts of signalized junctions and the scale of the system.

UTC has been a widely studied problem for a long time. Numerous systems and methodologies have been proposed to address it over the last four decades. Classical UTC systems are controlled either by a dedicated central server or in a distributed manner. The majority rely on complex mathematical and predictive models to optimize specific settings of a given traffic controller. With the increasing costs of congestion, the performance of these systems, which are still in service in the major cities of the world, has prompted questions concerning their effectiveness and adaptiveness in saturated traffic conditions. Other approaches range from rule-based systems and those modelled using fuzzy/heuristic and dynamic optimization techniques, to evolutionary game theory and genetic programming based approaches. However, these approaches are still challenged to provide scalable and yet real-time adaptive and responsive performance. In addition, reinforcement learning (RL) and numerous decentralized RL methods are being increasingly studied for UTC optimization. The nature of RL as an unsupervised learning approach, and particularly Q-Learning, as a model-free learning strategy, allows for uncomplicated problem modelling and control of the exploration process towards a near optimal solution. Such characteristics are advantageous for developing a real-time adaptive and

of the major sources of that uncertainty is the non-stationary nature of traffic. An RL approach to UTC optimization must be designed in a manner through which it is firstly capable of distinguishing between stable situations and secondly able to efficiently optimize for each. Moreover, the performance of existing RL-based UTC approaches is often evaluated using simplified grid-like maps. Some approaches use model-based RL and partially observable Markov decision processes (POMDPs) that add unjustifiable complexity. When trying to handle the non-stationary nature of traffic while using RL, strict assumptions are needed, e.g., that a small number of stationary traffic conditions recur, that traffic patterns change infrequently and that such changes are independent of traffic controller decisions. In addition, some of these approaches presume the availability of knowledge that is key to their operation but impractical to obtain from the real world.

Our contribution is a decentralized multi-agent RL UTC strategy that models heterogeneous signalized junctions and optimizes UTC in an adaptive and responsive manner. It is motivated by the lack of a model-free decentralized RL approach for UTC optimization that can deal efficiently with the non-stationary nature of traffic without limiting assumptions, and by the possibility of taking advantage of the increasing availability of floating vehicle data (FVD). The growing adoption of vehicle-to-vehicle/infrastructure communication and the pervasiveness of different positioning systems both motivate the consideration of FVD as a means of providing a rich view of local traffic conditions. We have designed a UTC optimization scheme based on RL that deploys an adaptive round robin controller agent paired with a non-parametric traffic-pattern change-detection mechanism per signalized junction, namely, a Soilse agent. The Soilse agent optimizes phase timings using RL in a non-collaborative manner. The agent is referred to as SoilseC when it also collaborates with neighbours. It adapts to local traffic conditions and responds to different traffic patterns when required. In order to provide for such responsiveness, it quantifies the degree of change per junction using information about local traffic on incoming lanes and its local performance. Essentially, our design allows for agents to relearn upon detecting a persistent local traffic pattern change. The relearning parameters are mainly based on an average sample of the relevant degree of pattern change. An evaluation of our approach shows its effectiveness against a non-adaptive fixed-time UTC system and a saturation balancing algorithm that emulates the Sydney Coordinated Adaptive Traffic System (SCATS). The evaluation is based on simulations of real Dublin maps of different scale and near-realistic traffic

Contents

Acknowledgements
Abstract
List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Reinforcement Learning
    1.1.1 Decentralized Reinforcement Learning
  1.2 Urban Traffic Control
    1.2.1 UTC Facts and Challenges
    1.2.2 Floating Vehicle Data
    1.2.3 Common UTC Concepts
    1.2.4 UTC Optimization Trends
  1.3 Hypothesis
  1.4 Principal Contribution
  1.5 Thesis Organization

Chapter 2 State of the Art
  2.1 Reinforcement Learning
    2.1.1 Markov Decision Processes
      2.1.1.1 Value Iteration
      2.1.1.2 Policy Iteration

      2.1.2.1 Learning Strategies
      2.1.2.2 Action Selection Strategies
    2.1.3 Decentralized Reinforcement Learning
  2.2 Uncertainty in UTC
    2.2.1 Traffic Patterns
  2.3 Classical UTC Approaches
    2.3.1 SCATS
    2.3.2 SCOOT
  2.4 Non-RL UTC Approaches
    2.4.1 Centralized
      2.4.1.1 TUC
      2.4.1.2 DISCO
      2.4.1.3 MOTION
      2.4.1.4 Others
    2.4.2 Hierarchical
      2.4.2.1 RHODES/COP
      2.4.2.2 UTOPIA
      2.4.2.3 PRODYN-H
      2.4.2.4 Others
    2.4.3 Decentralized
      2.4.3.1 PRODYN-D
      2.4.3.2 ALLONS-D
      2.4.3.3 SuRJE
      2.4.3.4 Others
    2.4.4 Summary
  2.5 RL-Based UTC Approaches
    2.5.1 Q-Learning-Based Approaches
    2.5.2 Evolutionary Programming & RL
    2.5.3 Fuzzy Neural Networks & RL
    2.5.4 Model-Based Vehicle-Centric RL
    2.5.5 Specific RL-Based UTC Approaches for Non-Stationary Environments

  3.2 Overview and Motivations
  3.3 Pattern Change Detection
    3.3.1 Design
    3.3.2 Sensitivity and Parameters
    3.3.3 Algorithm
  3.4 Phases
  3.5 Signalized Junction Model - Soilse and SoilseC
    3.5.1 Soilse
      3.5.1.1 Local Reward Model
      3.5.1.2 Relearning
      3.5.1.3 Soilse Algorithm
    3.5.2 SoilseC
      3.5.2.1 Neighbours
      3.5.2.2 Collaborative Reward Model
      3.5.2.3 SoilseC Algorithm
  3.6 Summary

Chapter 4 Implementation
  4.1 The CRL framework
    4.1.1 Learning Strategy
    4.1.2 Action Selection
    4.1.3 RLAgent
    4.1.4 CRLAgent
  4.2 Soilse and SoilseC Agent Generator
    4.2.1 PCD
    4.2.2 Relearn
  4.3 Summary

Chapter 5 Evaluation
  5.1 UTC Simulation
    5.1.1 The UTC Simulator

    5.2.2 Soilse and SoilseC Specifics
    5.2.3 Baselines for Comparison
    5.2.4 Performance Metrics
    5.2.5 Evaluation Objectives
  5.3 Trinity Scenario
    5.3.1 Baseline Performance
    5.3.2 Soilse
      5.3.2.1 Initial Learning vs. Relearning
      5.3.2.2 Relearning Behaviour
    5.3.3 SoilseC
      5.3.3.1 SoilseC vs. Soilse
      5.3.3.2 Collaboration Mode
      5.3.3.3 Relearning Behaviour
    5.3.4 Comparison Against Baselines
    5.3.5 Summary
  5.4 Dublin Inner City Centre Scenario
    5.4.1 Baselines Performance
    5.4.2 Initial Learning vs. Relearning
    5.4.3 Relearning Behaviour
    5.4.4 Soilse and SoilseC vs. Baselines
      5.4.4.1 SoilseC's Collaboration Mode
    5.4.5 Summary
  5.5 Summary

Chapter 6 Conclusions and Future Work
  6.1 Thesis Contribution
  6.2 Future Work

List of Tables

3.1 PCD parameters
3.2 SoilseC neighbourhood
5.1 Trinity scenario traffic patterns
5.2 Experimental parameters
5.3 Trinity - selected baselines AWT performance comparison - Trinity map
5.4 Trinity - Soilse vs. SoilseInit based on best AWT performance per action selection strategy
5.5 Trinity - best AWT Soilse performance per action selection strategy
5.6 Trinity - best AWT Soilse performance per exploration factor (ExpFactor)
5.7 Trinity - SoilseC vs. Soilse for best performance based on AWT
5.8 Trinity - SoilseC vs. Soilse for best performance based on AvgStops
5.9 Trinity - key parameters of best performing Soilse
5.10 Trinity - key parameters of best performing SoilseC
5.11 Trinity - SoilseC best AWT performance per collaboration mode - ε-greedy
5.12 Trinity - best AWT performance of Soilse and SoilseC against the selected baselines
5.13 DublinICC - baselines performance - RR and SAT
5.14 DublinICC - SoilseInit vs. Soilse - best AWT performance
5.15 DublinICC - SoilseC vs. Soilse best performance

List of Figures

1.1 Sketch of the world's first traffic signal that was installed on the junction of George and Bridge streets in London in 1868 (Mueller, 1970)
1.2 Road motor vehicles per thousand inhabitants in selected OECD countries (OECD, 2008)
2.1 A typical RL agent interacting with the underlying environment s: state, r: reward, a: action
2.2 A decentralized RL structure
2.3 SCATS local controller data (Sims & Dobinson, 1980)
3.1 Design overview
3.2 Example - CUSUM samples
3.3 Junction PCD high-level scheme
3.4 Fixed thresholding technique
3.5 Two phases of a four-approach signalized junction
3.6 Complete set of phases for a T-shaped junction
3.7 Simplistic set of phases for a T-shaped junction
3.8 Concise set of phases for a T-shaped junction
3.9 Soilse agent structure
3.10 Soilse agent state-action space for a signalized junction with three phases
3.11 Phase traffic count - used to determine phase status
3.12 SoilseC agent structure
3.13 NPV example
3.14 Neighbours example

4.4 RLAgent class diagram
4.5 CRLAgent class diagram
4.6 CRLAgent receiveSR function sequence diagram
4.7 Soilse and SoilseC agents generator class diagram
4.8 Soilse and SoilseC classes relation to the CRL framework classes
4.9 High-level agent generation scheme
5.1 UTC simulator - viewer snapshot
5.2 Trinity map
5.3 Dublin inner city centre map
5.4 Trinity - RR total vehicle waiting time throughout the simulation time
5.5 Trinity - SAT total vehicle waiting time throughout the simulation time
5.6 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - AWT
5.7 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - AvgStops
5.8 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - number of arrived vehicles
5.9 Trinity - RR20s vs. SAT_2_1.5 - total vehicle waiting time throughout the simulation time
5.10 Trinity - RR20s vs. SAT_2_1.5 - accumulated total vehicle waiting time
5.11 Trinity - RR20s vs. SAT_2_1.5 - number of stopped vehicles throughout the simulation time
5.12 Trinity - Soilse using Boltzmann vs. (ε-)greedy - total waiting time throughout the simulation time
5.13 Trinity - Soilse using Boltzmann vs. (ε-)greedy - accumulated total waiting time throughout the simulation time
5.14 Trinity - Soilse using Boltzmann vs. (ε-)greedy - number of stopped vehicles throughout the simulation time
5.15 Trinity - junction #1226 DPC under Soilse using ε-greedy for different ExpFactor values
5.16 Trinity - junction #1226 ε-greedy epsilon change using Soilse under different ExpFactor values

5.18 Trinity - Soilse using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.19 Trinity - Soilse vs. SoilseC using ε-greedy - accumulated total vehicle waiting time, total vehicle waiting and number of stopped vehicles throughout the simulation time - best AWT performance
5.20 Trinity - Soilse vs. SoilseC using Boltzmann - accumulated total vehicle waiting time, total vehicle waiting and number of stopped vehicles throughout the simulation time
5.21 Trinity - Soilse vs. SoilseC using greedy - accumulated total vehicle waiting time, total vehicle waiting time and number of stopped vehicles throughout the simulation time
5.22 Trinity - SoilseC using ε-greedy (re)learning start and end times - best AWT performance
5.23 Trinity - SoilseC using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.24 Trinity - SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - total vehicle waiting time throughout the simulation time - best AWT performance
5.25 Trinity - Soilse ε-greedy vs. (RR20s and SAT_2_1.5) - total vehicle waiting time throughout the simulation time - best AWT performance
5.26 Trinity - Soilse and SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - accumulated total vehicle waiting time throughout the simulation time - best AWT performance
5.27 Trinity - SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - (accumulated) number of stopped vehicles throughout the simulation time - best AWT performance
5.28 Trinity - Soilse ε-greedy vs. (RR20s and SAT_2_1.5) - (accumulated) number of stopped vehicles throughout the simulation time - best AWT performance
5.29 DublinICC - Soilse using ε-greedy (re)learning start and end times - best AWT performance
5.30 DublinICC - Soilse using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.31 DublinICC - SoilseC using ε-greedy (re)learning start and end times - best AWT performance
5.32 DublinICC - SoilseC using ε-greedy ratio of relearning time to simulation time - best

total vehicle waiting time and total number of stopped vehicles throughout the simulation time
5.34 DublinICC - Soilse and SoilseC best performance vs. SAT - total vehicle waiting time and total number of stopped vehicles throughout the simulation time
5.35 DublinICC - Soilse and SoilseC best performance vs. SAT - accumulated total vehicle waiting time and accumulated total number of stopped vehicles throughout the

List of Algorithms

1 The value iteration DP method
2 The policy iteration DP method
3 Generic Q-Learning
4 Generic SARSA
5 PCD - calculate DPC
6 The PCD process for a single signalized junction
7 Soilse initialization
8 Soilse process
9 Reparameterize per action selection strategy
10 SoilseC initialization

Chapter 1

Introduction

This thesis presents a new decentralized approach to online optimization of urban traffic control (UTC) using Reinforcement Learning (RL). In our approach, each RL agent learns to control a specific signalized junction through environmental feedback and potential collaboration with neighbouring agents. Agents adapt to local traffic conditions by learning a sequence of traffic light phases to be used. They respond to fluctuating traffic patterns or unsatisfactory performance by relearning based on a local non-parametric traffic-pattern change-detection mechanism. The novelty of our approach stems from its online decentralized UTC optimization scheme, which uses RL without a priori knowledge of traffic models and deals with fluctuating traffic in an adaptive and responsive manner. Essentially, by providing such an adaptive and responsive UTC scheme we aim to reduce congestion in urban areas.

This chapter introduces RL, including centralized and decentralized RL schemes. It also provides a historical background concerning UTC and introduces the relevant facts and challenges in the domain. An emerging source of data for UTC optimization, namely floating vehicle data (FVD), is introduced, as well as the common UTC concepts and current trends in UTC optimization. Furthermore, we present our hypothesis, which is based on a number of arguments concerning the decentralization of UTC online optimization using RL and on the viability of local non-parametric traffic-pattern change-detection. We also present our contribution, which provides a scheme for UTC optimization using RL while dealing with fluctuating urban traffic in a decentralized and online manner. Finally, the organization of the

1.1 Reinforcement Learning

The essence of RL can be traced to the manner by which nature's intelligent elements can learn by interacting with the surrounding environment. Sutton & Barto (1998) define RL as learning how to map situations to actions so as to maximise a numerical reward signal. RL is an unsupervised learning approach in the sense that an agent does not rely on a knowledgeable master that might have specific domain knowledge. On the contrary, agents explore their environment by sensing the stimuli of different situations and then executing some selected action(s), which results in feedback in the form of a reward.

Any RL solution is based on two basic elements, namely, a reward function and a value function. Optionally, some RL solutions make use of a model of the environment to predict the reward and next state after taking an action in a given state. The reward function is meant to provide an immediate goodness measure for a certain action in a given state. The value function, as opposed to the reward function, tries to indicate the long-run goodness of a given action, i.e., the expected rewards that can be accumulated over the future starting from the current state. Interaction with the environment eventually provides the RL agent with a policy, i.e., a mapping between all states and their respective best actions at any given time. Moreover, action selection can occur using exploratory strategies (e.g., ε-greedy or Boltzmann (Sutton & Barto, 1998)) or non-exploratory strategies (e.g., greedy). Finding the limit to which exploration should last is known as the exploration versus exploitation dilemma. Exploitation is the phase during which the agent puts the previously learnt policy into control. The essence of the dilemma is that an agent cannot run purely on exploration or purely on exploitation: otherwise it will either be learning forever without actually putting the learnt policy into control, or it will allow a given policy to control forever. Hence, a balance is needed. Q-Learning (Watkins & Dayan, 1992) is a well-established model-free off-policy (explained below) RL strategy based on the concept of discounted expected rewards. An RL agent that uses Q-Learning usually learns with a specific rate α : 0 ≤ α < 1 and a certain discount rate γ : 0 ≤ γ < 1 through a Markov Decision Process (MDP) representation of the environment. It is a model-free approach in the sense that it does not require some a priori likelihood model for the actions that could be executed on the environment. Q-Learning is considered an off-policy strategy as it learns and updates the agent's knowledge even while taking actions that could prove to be non-optimal in the future (Abdulhai et al., 2003). Being an off-policy learning strategy, as well as allowing for short period knowledge updating per action taken, Q-Learning is an ideal candidate for UTC optimization given the non-stationary nature of traffic (Abdulhai et al.,
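The elements described above (a tabular value function, exploratory action selection and the one-step Q-Learning update with learning rate α and discount rate γ) can be illustrated with a minimal sketch. This is a generic textbook formulation, not the thesis's implementation; the function names, the toy states `"s0"` and `"s1"` and all numeric values are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def boltzmann_probabilities(q_row, temperature):
    """Softmax over Q-values; a high temperature flattens the distribution."""
    top = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """One-step Q-Learning: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


# Toy usage with two actions per state (hypothetical states and numbers).
N_ACTIONS = 2
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)
rng = random.Random(0)

action = epsilon_greedy(q_table["s0"], epsilon=0.1, rng=rng)
q_update(q_table, "s0", action, reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
```

The update bootstraps from the greedy value of the next state regardless of which action the agent actually takes next, which is what makes the strategy off-policy.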

1.1.1 Decentralized Reinforcement Learning

Classical RL is a centralized optimization approach. This makes problem modelling more difficult as the system's complexity increases, due to the increase in the number of system states that need to be represented, which could also be accompanied by an increase in the number of decisions/actions. The UTC problem, for example, deals with numerous interconnected signalized junctions, some of which have a heterogeneous road layout. For a relatively small city like Dublin, the city centre has roughly 250 signalized junctions that need to be simultaneously controlled. A distributed/decentralized version of RL can be useful (Abdulhai & Pringle, 2003) for such a system, while a classical (centralized) RL view poses problem modelling complexity as the network of signalized junctions increases in size. Many decentralized RL approaches where no single RL agent models and controls the global problem have been proposed. They provide distributed optimization approaches that break the global optimization problem into manageable sub-problems. Each RL agent deals with its assigned sub-problem locally with the possibility of collaboration (i.e., knowledge exchange) with other agents. This could be seen as a specialized Multi-Agent System (MAS) where agents use RL for optimization (Buşoniu et al., 2008; Panait & Luke, 2005). Furthermore, several collaborative RL approaches (Dowling et al., 2006; Kok & Vlassis, 2006; Hoen et al., 2006; Goldman & Zilberstein, 2004; Tesauro, 2003; Guestrin et al., 2002; Ahmadabadi et al., 2001; Abul et al., 2000; Tan, 1998; Hu & Wellman, 1998; Claus & Boutilier, 1997; Littman, 1994) have been proposed and we will discuss them later in Chapter 2. Dowling et al. (2006) use the term CRL to refer to a specific form of Collaborative Reinforcement Learning. We adopt the term CRL only in describing our framework implementation in Chapter 4; however, our CRL view is different from theirs. We use the term CRL to refer to a scheme where RL agents can collaborate, i.e., exchange knowledge of any nature that can be used in updating the agent's local knowledge besides the use of its local rewards.
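To make the modelling argument concrete, the following sketch compares how many actions a single centralized controller would have to represent with the total handled by one agent per junction. It assumes, purely for illustration, that every junction offers the same small number of selectable phases; the figure of 250 junctions is the approximate Dublin value quoted above, while the phase count of 3 and the function names are hypothetical.

```python
def joint_action_count(n_junctions: int, phases_per_junction: int) -> int:
    """Number of joint actions a single centralized controller must consider."""
    return phases_per_junction ** n_junctions


def decentralized_action_count(n_junctions: int, phases_per_junction: int) -> int:
    """Total actions across independent agents, one agent per junction."""
    return n_junctions * phases_per_junction


N_JUNCTIONS = 250  # approximate figure quoted for Dublin city centre
PHASES = 3         # hypothetical: three selectable phases at every junction

central = joint_action_count(N_JUNCTIONS, PHASES)
local = decentralized_action_count(N_JUNCTIONS, PHASES)

print(f"centralized joint actions: a {len(str(central))}-digit number")
print(f"decentralized actions in total: {local}")
```

The joint space grows exponentially with the number of junctions whereas the decentralized total grows linearly, which is the practical reason for splitting the global problem into per-junction sub-problems.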

1.2 Urban Traffic Control

The history of traffic management arguably extends back to the Roman era. It is interesting to note that the proverb "all roads lead to Rome" is based on the fact that a reference point, in the form of a golden milestone, was positioned in the Forum in the ancient city of Rome. Road builders in Rome, in their turn, used milestones as a primitive means to inform road users about their location relative to the golden milestone in Rome (Mueller, 1970). These distributed milestones worked as indicators or signals of reassurance for road users that they were on the right route towards Rome.

Figure 1.1: Sketch of the world's first traffic signal that was installed on the junction of George and Bridge streets in London in 1868 (Mueller, 1970)

inform and even control traffic created by the increasing number of road users have indeed changed dramatically.

As roads became wider and traffic grew heavier, the need to manage movement within cities and the increasing number of road fatalities became urgent issues. It was in the British parliament in the second half of the nineteenth century that it was first suggested to borrow a method deployed in railways to be used in controlling traffic on roads. A traffic superintendent from the south eastern British railway named J. P. Knight had suggested to Earl Granville that the concept of a railway semaphore signal could be ported onto the road network to allow for traffic control (Ishaque & Noland, 2006). The British parliament agreed to Earl Granville's suggestion and installed the world's first traffic signal (see Figure 1.1) in December 1868 on a junction near the Houses of Parliament in London. That traffic signal was paradoxically put in place to ease road access for members of parliament rather than improve pedestrians' safety. The traffic signal functioned in a way that combined red and green gas lights with semaphore arms. The arms extended horizontally to denote a stop signal and at a 45° angle to denote caution. At night, the stop sign was accompanied by a red light on the top while the caution signal was accompanied by a green one. The reader is referred to (Mueller, 1970) for a more in-depth history of traffic signals. Henceforth, the terms traffic signal and traffic light are used interchangeably.

Early traffic signals were controlled by policemen, which became increasingly impractical as wider deployment took place in different cities. A greater number of junctions had to be controlled in

be to provide what is known as a green wave or a series of go signals along a desired path of controlled junctions. Advances in electronics and computer science have made it possible to devise computerized UTC systems that can manage traffic, in terms of efficient automated operation and performance optimization, on a larger number of controlled junctions, i.e., signalized junctions. Such a system was first deployed in Toronto in 1959 using an IBM 650 computer to control nine signalized junctions (Gazis, 1971). Early UTC systems were centrally controlled and relied on detectors such as magnetic loops, radar and sonar. The main functionalities provided by the control software were electrical actuation of junction controllers, traffic-light state monitoring and detector data processing. The latter data was typically stored for potential offline analysis while some selected data was used for better online control strategies. Such control strategies were mainly based on the concept of synchronizing a line of junctions, usually an arterial road, in order to allow vehicles to travel at a constant speed with minimal stops. However, those strategies were fixed for certain situations and constrained by the number of controlled junctions. Moreover, with the increasing number of vehicles on roads and the growing scale of urban road networks, the UTC problem has become more challenging. Consequently, the need for more sophisticated and coordinated UTC systems to provide efficient traffic control strategies has arisen. The ultimate goal for such computerized UTC systems is to provide an efficient traffic control strategy that runs in an optimal manner in order to minimize road congestion. This optimality is directly related to achieving minimum vehicle delay, less-interrupted traffic flow or a minimum number of vehicle stops, and increased vehicle velocity. Details on the progression of early UTC systems can be found in (Gazis, 1971).
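The synchronization idea mentioned above, where a line of junctions is timed so that vehicles travelling at a constant speed meet successive green signals, can be sketched as a simple offset calculation. This is an idealised illustration and not a description of any deployed system: it assumes a common cycle length at every junction, a single direction of travel and a fixed progression speed, and the function name and the example distances are hypothetical.

```python
def green_wave_offsets(distances_m, speed_mps, cycle_s):
    """Return the green-start offset (seconds) for each junction along an arterial.

    distances_m: distance of each junction from the first one, in metres.
    speed_mps:   desired constant progression speed, in metres per second.
    cycle_s:     common signal cycle length shared by all junctions, in seconds.
    """
    if speed_mps <= 0 or cycle_s <= 0:
        raise ValueError("speed and cycle length must be positive")
    # A platoon released at time 0 reaches a junction after distance / speed
    # seconds; wrapping by the cycle length gives the offset within one cycle.
    return [(d / speed_mps) % cycle_s for d in distances_m]


# Hypothetical arterial: junctions at 0 m, 300 m, 600 m and 1200 m,
# a 15 m/s progression speed and a 60 s common cycle.
offsets = green_wave_offsets([0, 300, 600, 1200], speed_mps=15.0, cycle_s=60.0)
```

Because the offsets are derived from one assumed speed and one fixed cycle, such a plan only suits the traffic situation it was computed for, which is the limitation of fixed strategies noted in the text.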

1.2.1 UTC Facts and Challenges

Urban traffic is an evolving problem closely related to population growth and world economic factors. Many countries are seeing an increase in vehicles per capita with each passing year. As far as the Organisation for Economic Co-operation and Development (OECD) countries are concerned, road motor vehicles per thousand inhabitants have increased over the period from 1990 until 2006 in all studied countries except the United States (OECD, 2008), for no clear reason but possibly due to its mature status and increasing environmental public awareness programs. Considerable increases were noticed in countries like Portugal, Iceland, Greece and Poland (see Figure 1.2).

As urbanization is increasing, road networks in different countries are expanding as well. For

Figure 1.2: Road motor vehicles per thousand inhabitants in selected OECD countries (OECD, 2008). [Bar chart; axes: country, and road motor vehicles per thousand inhabitants (0 to 900); series: 1996, 2000, 2003, 2006; countries shown: Australia, Belgium, Canada, France, Germany, Greece, Iceland, Ireland, Italy, Japan, Poland, Portugal, Spain, Sweden, Turkey, United Kingdom, United States, India, Russian Federation.]

have mature road networks. The annual growth in road network size can vary from 6% in countries like Korea, Poland, Portugal, Ireland and Greece to a lower rate of 2% in countries with mature road networks like the United States, Germany, Canada, the Russian Federation and the Netherlands (OECD, 2008).

As the network of signalized junctions grew along with the increasing number of vehicles on roads, the problem of providing an efficient UTC system naturally became more complex. Evidently, the problem has not yet been solved; for instance, the United States has roughly 330,000 traffic signals of which 75% can be adjusted to be made more efficient using, but not exclusively, different timing plans (United States DOT, 2007). However, scale is not the sole issue: road users also exhibit different travelling routines, while unexpected emergency and accident situations make traffic networks non-stationary in nature. Such traffic characteristics increase the UTC optimization challenge. Another modelling and control challenge that faces UTC systems is the heterogeneous structure of interlinked signalized junctions. The effects of controller decisions carried out at one junction will propagate in the road network, affecting the performance of others, especially their immediate neighbours. Consequently, a well-designed collaboration scheme is vital in providing efficient UTC systems (Bazzan, 2004).

The negative impact of poor UTC systems is massive and can be essentially summed up in one word: congestion. It is true that better and more efficient UTC systems cannot alone solve this increasing problem but they can surely help to reduce it (European Commission, 2007a; United States DOT, 2007). Congestion causes worldwide environmental, economic and social problems. In the EU

& Gargett, 2007). In 2007, congestion cost the United States ∼US$87.2 billion in 439 urban areas, calculated based on wasted time and fuel (Schrank & Lomax, 2009). As far as the environment is concerned, congestion is a major cause of air and noise pollution. Urban mobility in the EU contributes 40% of the overall CO₂ emissions caused by road transportation while this percentage increases to 70% of all other pollutants (European Commission, 2007a). These considerable percentages are due to increasing traffic growth and to the stop-go nature of driving in cities despite the advances in vehicle emission reduction technologies (European Commission, 2007b). Furthermore, a recent survey by the Department of Transportation in the United States has shown that 47% of Americans agree that delay caused by traffic congestion is a top community concern (United States FHWA, 2001).

Part of the solution to traffic congestion is evidently better and more efficiently responsive UTC systems (European Commission, 2007b,a; United States DOT, 2007). Adaptive and responsive UTC systems have proved to be promising in many cases in the United States. Compared to previously deployed systems, according to (United States DOT, 2007) for example, a new Texas Light Synchronization program managed to reduce traffic delay by 24.6%, fuel consumption by 9.1% and the number of vehicle stops by 14.2%, all through signal timing optimization and equipment update. In California, a new fuel-efficient traffic signal management program managed to reduce fuel consumption by 8%. Los Angeles' Adaptive Traffic Control System (ATCS), which operates as the city's main traffic control system, managed to diminish average delay by 21.4% and the average number of vehicle stops by 31% through real-time response (signal timing adjustment) to traffic demands. The results above encouraged further research in providing more efficient UTC systems in the US through dedicated Federal and State funding programs (United States DOT, 2007). The above advances might have been the result of a long awaited improvement in the poor performance of legacy UTC systems. Recently, new approaches have been considered to come up with smarter UTC solutions to deal with the increasing congestion problems; for instance, a recent 2009 governmental report on Australia's Digital Economy: Future Directions has identified the use of Artificial Intelligence (AI) and more advanced traffic sensor technologies for developing better UTC systems as a strategic research goal (Commonwealth of Australia, 2009).

The enabling technologies to design and deploy an efficient UTC system that tackles these challenges are becoming increasingly pervasive. The domain that encompasses such technologies is referred to as Intelligent Transportation Systems (ITS), where information processing and communication technologies are being applied to the transportation domain (Yang & Wang, 2007). This ranges from devising better UTC optimization schemes to navigation systems and real-time traffic monitoring. An

means. They provide a rich real-time view of traffic status in cities that can be exploited for several applications including UTC optimization.

In summary, it is clear that traffic congestion is a worldwide problem that has been causing economic, environmental and social problems (see statistics above). This problem is worsening with the increase in urbanization, vehicle numbers, population and the possible inefficiency of legacy UTC systems. Different efforts are being exerted to develop more efficient UTC systems that are more adaptive and responsive to traffic changes. However, innovative and smart UTC optimization schemes that make use of progressive traffic management technologies like FVD and AI have only recently come into focus.

1.2.2 Floating Vehicle Data

The core idea behind FVD (European Commission, 2003) is to provide different means to communicate various data associated with vehicles in a more pervasive and cost-effective manner using vehicle-to-infrastructure (V2I) or vehicle-to-vehicle (V2V) communication. Such data is usually spatio-temporal, for example, the location of an anonymous (or possibly known) vehicle at a given point of time on the road network. Furthermore, with the increasing availability of in-vehicle sensors, data could range from air pressure levels in tires to fuel consumption and accurate speed data at a given time.

Standardization efforts are also playing a major role in helping the spread and adoption of FVD-based technologies and solutions. The International Organization for Standardization (ISO) and the European Committee for Standardization (CEN) are leading the efforts in providing standards for V2V and V2I communication technologies. Most notable are the Dedicated Short Range Communications (DSRC) (Bai & Krishnan, 2006) and the Continuous Air-interface, Long and Medium Range (CALM) (Williams, 2004) standards, which make use of the wireless access in vehicular environments (WAVE) enabling IEEE protocol, namely, IEEE 802.11p (Eichler, 2007). The latter aims at providing a wide platform of different communication technologies working seamlessly together including, for example, DSRC, General Packet Radio Service (GPRS), Global System for Mobile communications (GSM) and International Mobile Telecommunications-2000 (IMT-2000) or 3G.

Traditionally, traffic demand data is gathered through sensors embedded in the road infrastructure such as inductive loop detectors or cameras. With the standardization of FVD technologies and the increasing pervasiveness of wireless positioning systems, e.g., Global Positioning System (GPS), as well as the considerable investments in V2I and V2V communication technologies, it is now possible to establish an FVD-enriched environment with a significantly lower cost compared to the traditional


those of the European Union, have resulted in a promising satellite positioning project namely, Galileo (European Commission, 2001), which is expected to be more accurate than current GPS technology. This will potentially have a positive impact on traffic management solutions (Kuhne, 2003).

Moreover, there has been a recent focus on enriching the set of typical FVD information, e.g., position, speed and time (Messelodi et al., 2009). Through dealing with vehicles as moving sensors, e.g., cameras and traffic level analyzers, typical FVD is enriched with information resulting from vehicle surroundings analysis, e.g., road construction notification and traffic level. The reader is referred to the survey by (Luo & Hubaux, 2004) for more information on FVD.

1.2.3 Common UTC Concepts

There are a number of concepts that are used in describing the functionality within a UTC system. An introduction to some common UTC concepts is provided in this subsection.

Signalized junction: a junction that is controlled by a traffic light.

Phase: a phase is characterized by the exclusive set of traffic directions allowed to proceed at a given signalized junction from certain approaches at a given time. Only one phase can be active at a time where all its approaches have a green signal to go.

Offset (time): the time difference between the start of some phase on a given signalized junction and the start of a different phase on an adjacent signalized junction. Typically relevant when adjacent junctions need to coordinate their phase activation that may affect connecting links.

Cycle (time): the time needed to complete a sequence of phases on a given signalized junction including offsets.

Split: the proportioned green time allocated per phase for all phases in a cycle.

Oversaturation: a situation where links connecting signalized junctions reach their maximum capacity in terms of number of vehicles.
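The relation between cycle time, split and oversaturation defined above can be illustrated with a small Python sketch. The function names, the phase names and the 90-second cycle are hypothetical and chosen only for illustration; the sketch also assumes that the split shares sum to 1 and it ignores offsets and inter-green times.

```python
def green_times(cycle_time, splits):
    """Green time in seconds per phase, given the cycle time and the split.

    `splits` maps a phase name to its share of the cycle; the shares are
    assumed to sum to 1. Offsets and inter-green times are ignored here.
    """
    if abs(sum(splits.values()) - 1.0) > 1e-9:
        raise ValueError("split shares must sum to 1")
    return {phase: cycle_time * share for phase, share in splits.items()}


def is_oversaturated(vehicles_on_link, link_capacity):
    """A link is oversaturated once it holds as many vehicles as it can."""
    return vehicles_on_link >= link_capacity


# Hypothetical three-phase junction with a 90-second cycle.
plan = green_times(90, {"north_south": 0.5, "east_west": 0.3, "turning": 0.2})
```

Under these assumptions, tuning the split amounts to changing the shares while the cycle time stays fixed.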

Certain classical UTC systems, as discussed in Chapter 2, base their optimization methodology on tuning signalized junctions' timing parameters such as the offset, the cycle time and the split. Some non-classical approaches, however, follow different optimization methodologies based on phase


1.2.4 UTC Optimization Trends

Several UTC systems have been proposed over the past four decades. Specifically, two systems, the Sydney Coordinated Adaptive Traffic System (SCATS) (Sims & Dobinson, 1980; Lowrie, 1982) and the Split Cycle Offset Optimisation Technique (SCOOT) (Hunt et al., 1982) have been deployed in many major cities. These systems are based on complex mathematical models to optimize specific timing settings of a traffic controller, namely, the offset, split and cycle time. However, traffic control strategies in such systems are either centrally or hierarchically formulated. Numerous other approaches have been proposed as computational problem solving methodologies have evolved. Such approaches mainly use Dynamic Programming, evolutionary game theory and genetic programming or a combination of those. Others simply use fuzzy/heuristic models and rule-based methods with possible integration with evolutionary approaches. However, RL has emerged as a promising approach for UTC optimization in which true adaptiveness can be achieved (Abdulhai & Pringle, 2003; Abdulhai et al., 2003). We concentrate on decentralized RL that specifically uses Q-Learning for UTC optimization given its scalability and applicability to online (re)learning that allows for the adaptiveness and responsiveness needed by UTC.

1.3 Hypothesis

Our hypothesis is based on the following arguments concerning an efficient UTC system:

Local traffic signals controlled by RL agents that can adapt and respond to changing traffic are advantageous compared to fixed-time and SCATS-inspired traffic light controllers.

Designing an RL agent using an adaptive round-robin scheme based on phases to control a given traffic signal is possible.

Decentralization through assigning a controlling RL agent per signalized junction that collaborates with neighbouring agents can achieve better global performance.

Detecting traffic changes as they occur is possible based on traffic filtering per lane and the performance of the assigned RL agent without a priori traffic models.

Responsiveness can be achieved by relearning based on a quantified local degree of traffic change.

The proposed design does not presume specific sources of sensor information but rather exposes


We evaluate our combined hypothesis using a microscopic simulator that takes as inputs varying traffic patterns simulated on different real maps of Dublin city. The evaluation includes different scenarios characterized by map scale, changing traffic and collaboration. Comparisons are made against scenarios using fixed-time controllers and against a SCATS-inspired algorithm, namely, SAT (Richter, 2006).

1.4 Principal Contribution

This thesis provides a decentralized UTC optimization approach using RL and collaboration schemes that is efficient, adaptive and yet responsive to the non-stationary nature of urban traffic. Our principal contribution is a scalable scheme in which each signalized junction is controlled by an RL agent that is autonomously capable of detecting unsatisfactory performance and local traffic-pattern change, to which it responds by relearning based on the degree of change observed. The RL agent can potentially collaborate with neighbouring agents in order to provide better global performance. With all their characteristics, we name our agents Soilse, which means traffic lights in the Irish language. Henceforth, a Soilse agent is RL-based whereas a SoilseC agent uses RL and collaborates with its neighbours. Furthermore, the approach assumes neither domain knowledge nor predefined models of traffic.

1.5 Thesis Organization

The remaining chapters of this thesis are organized as follows. Chapter 2 presents the state-of-the-art in UTC including classical widely deployed de facto systems, as well as RL and non-RL approaches. The chapter also discusses RL and decentralized RL including the main learning and action selection strategies. In Chapter 3 we detail the design of our UTC optimization agents, namely, Soilse and SoilseC including the pattern change detection mechanism and the relearning strategy. In Chapter 4 we present our implementation using a CRL framework that we built as a C++ library and we describe the interaction between the UTC simulator and the Soilse and SoilseC instances of that framework. Chapter 5 presents our evaluation results based on different axes such as scale, collaboration, responsiveness and action selection strategies. We finally conclude and discuss future


State of the Art

This thesis brings together two wide domains, i.e., reinforcement learning (RL) and urban traffic control (UTC) optimization. Since it addresses RL-based optimization of UTC, in this chapter we introduce the background necessary for understanding our approach, as well as related work to position our contribution and distinguish our approach from existing approaches. We introduce Markov Decision Processes (MDPs) and discuss the essentials of RL and the most popular learning and action selection strategies. We also discuss the decentralization of RL. In addition, we review different classical (i.e., currently deployed and de facto) approaches to UTC along with the related work in non-RL-based and RL-based UTC optimization techniques.

2.1 Reinforcement Learning

In this section we introduce RL. We begin by describing MDPs given their close relation to modelling RL problems. We also introduce some well-known approaches to solving MDPs in the sense of seeking an optimal policy.

2.1.1 Markov Decision Processes

Often, RL problems are modelled using MDPs. An agent or any entity that perceives and acts within an environment could cause a new underlying state. Such a state could be the direct result of the agent's actions or due to other factors such as other agents' actions or the natural dynamics of the environment, e.g., the popular prey and predator or multiple predators problem (Kok & Vlassis, 2004).


through:

$S$: a discrete set of states representing the possible environmental settings

$A$: a discrete set of actions available to the agent

$R(s_t, a_t)$: a reward function that returns a reward for taking action $a$ in state $s$ at time $t$

$T(s_t, a_t, s_{t+1})$: a transition probability model known a priori that provides the probability $p(s_{t+1} \mid s_t, a_t)$ of transiting to state $s_{t+1}$ if action $a_t$ is taken from state $s_t$

Any problem modelled as an MDP must naturally satisfy the Markov property, i.e., the future behaviour depends on the current state $s_t$ but not on the past states. Such a property ensures that a given state captures the effect of a previously taken chain of actions, which allows simpler rules to solve for the MDP's optimal policy $\pi^*$, where $\pi^*$ is a mapping from the states to the best actions. It is possible in this case to write one-step formulas that can be, in some form, iterated upon in order to discover $\pi^*$. An immediate reward $r_{t+1}$ gives a goodness measure for the action $a_t$ executed in state $s_t$. It can be calculated based on the reward function $R(s_t, a_t)$ or sometimes using $R(s_t)$, which returns a reward for being in $s_t$. Such a reward, however, might be insufficient to capture the expected future effect or the long-term usefulness of taking a given action unless it is combined with future rewards. Hence, the concept of future discounted rewards emerges. Naturally an agent will not be likely to wait forever in order to acquire a very high reward; however, it makes sense for it to include future rewards while decreasing their importance as they occur further away in time. Such a behaviour can be achieved using a discount rate known as $\gamma \in [0, 1]$. The Bellman optimality equation (2.1) is a well-known optimality equation based on the concept of discounted rewards and states' expected utility. It is used to find the optimal utility $U^*$ for all states, which in many cases is referred to as $V^*$ as well. The Bellman equation aims at optimizing the value-function $V$ / $U$ that gives a goodness measure for being in a certain state, or alternatively, the state-action value-function $Q(s_t, a_t)$ which provides such a measure per action per state.

$$U^*(s_t) = R(s_t) + \gamma \max_{a_t \in A(s_t)} \sum_{s_{t+1}} T(s_t, a_t, s_{t+1})\, U^*(s_{t+1}) \tag{2.1}$$
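The discounting idea behind equation (2.1) can be illustrated with a minimal Python sketch. The function name and the reward sequence are hypothetical; the sketch only shows how the choice of $\gamma$ weights rewards that lie further away in time.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [1.0, 1.0, 1.0, 1.0]                  # hypothetical reward sequence
near_sighted = discounted_return(rewards, 0.0)  # only the immediate reward counts
far_sighted = discounted_return(rewards, 1.0)   # all rewards count equally
balanced = discounted_return(rewards, 0.5)      # 1 + 0.5 + 0.25 + 0.125
```

With $\gamma = 0$ the agent values only the immediate reward, while values closer to 1 give later rewards almost the same weight as immediate ones.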

Several classical methods have been proposed to solve MDPs. The majority are considered to fall into the Dynamic Programming (DP) paradigm (Sutton & Barto, 1998). Two well-known DP methods for solving MDPs are the value and policy iteration methods. Although the DP paradigm assumes a perfect world model as in MDPs, it is an important basis for understanding RL which does


2.1.1.1 Value Iteration

As the name implies, this method iterates through each state in an MDP using the Bellman optimality equation (2.1) as an update rule in order to reach an optimal policy. The stopping rule for such an iteration is usually based on the maximum difference between subsequent state utility approximations. If that difference is less than $\epsilon(1-\gamma)/\gamma$ then it is guaranteed that the error is less than some value of $\epsilon$. Such an approach relies on an MDP with clearly predefined transition model and state rewards. The method returns the final optimal utility for all states $U^*$. An algorithm describing the value iteration method is shown in Algorithm (1).

Algorithm 1 The value iteration DP method

V_I($S, A, T, R, \gamma, \epsilon$)
  $U_t \leftarrow 0$
  Do
    $\lambda \leftarrow 0$
    For $s \in S$
      $U_{t+1}(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s_{t+1}} T(s, a, s_{t+1})\, U_t(s_{t+1})$
      $\lambda \leftarrow \max(|U_{t+1}(s) - U_t(s)|, \lambda)$
    End
  Until $\lambda < \epsilon(1-\gamma)/\gamma$
  Return $U^*$

The optimal utility for all states $U^*$ can then be used to devise an optimal policy $\pi^*$ by selecting the action with the maximum expected utility for each state, denoted as $Q^*(s, a)$, based on equation (2.2).

$$Q^*(s_t, a_t) = R(s_t) + \gamma \sum_{s_{t+1}} T(s_t, a_t, s_{t+1})\, U^*(s_{t+1}) \tag{2.2}$$

$$\pi^*(s) = \operatorname*{argmax}_{a \in A(s)} Q^*(s, a) \tag{2.3}$$

The optimal policy $\pi^*$ can hence be formulated by finding the set of actions of maximum utility
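The value iteration method and the policy extraction of equations (2.2) and (2.3) can be sketched in Python as follows. This is a minimal sketch rather than the thesis implementation: the two-state MDP (states "jam" and "free", actions "stay" and "switch"), its rewards and the dictionary layout of the transition model are all assumptions made for illustration.

```python
def value_iteration(states, actions, T, R, gamma, epsilon):
    """Iterate the Bellman update until the largest change is small enough."""
    U = {s: 0.0 for s in states}
    threshold = epsilon * (1 - gamma) / gamma
    while True:
        U_next, largest_change = {}, 0.0
        for s in states:
            # Expected utility of the best action available in s.
            best = max(sum(p * U[s2] for s2, p in T[s][a].items())
                       for a in actions)
            U_next[s] = R[s] + gamma * best
            largest_change = max(largest_change, abs(U_next[s] - U[s]))
        U = U_next
        if largest_change < threshold:
            return U


def greedy_policy(states, actions, T, R, gamma, U):
    """Pick, per state, the action with the highest Q value (eqs. 2.2, 2.3)."""
    def q(s, a):
        return R[s] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}


# Hypothetical two-state model: "switch" always moves to the other state.
states = ["jam", "free"]
actions = ["stay", "switch"]
T = {
    "jam": {"stay": {"jam": 1.0}, "switch": {"free": 1.0}},
    "free": {"stay": {"free": 1.0}, "switch": {"jam": 1.0}},
}
R = {"jam": -1.0, "free": 1.0}

U = value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6)
policy = greedy_policy(states, actions, T, R, 0.9, U)
```

In this toy model the utilities approach 10 for "free" and 8 for "jam", and the extracted policy switches out of "jam" and stays in "free".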


2.1.1.2 Policy Iteration

The policy iteration method consists of two parts through which an agent firstly produces a given policy using the Bellman update equation (2.1) and secondly tries to ameliorate that policy if possible. In essence, the method runs as a sequence of producing policies and testing their stability until an optimal stable policy is found. An algorithm describing the policy iteration method is shown in Algorithm (2).

Algorithm 2 The policy iteration DP method

P_I($S, A, T, R, \gamma, \epsilon$)
  Initialize $U, \pi$
1 Do
    $\lambda \leftarrow 0$
    For $s \in S$
      $U_{t+1}(s) \leftarrow R(s) + \gamma \sum_{s_{t+1}} T(s, \pi(s), s_{t+1})\, U_t(s_{t+1})$
      $\lambda \leftarrow \max(|U_{t+1}(s) - U_t(s)|, \lambda)$
    End
  Until $\lambda < \epsilon(1-\gamma)/\gamma$
  For $s \in S$
    $temp \leftarrow \pi(s)$
    $\pi(s) \leftarrow \operatorname*{argmax}_{a \in A(s)} \left[ R(s) + \gamma \sum_{s_{t+1}} T(s, a, s_{t+1})\, U(s_{t+1}) \right]$
    If $temp \neq \pi(s)$ Then goto 1
  End
  Return $\pi^*$
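A Python sketch of the policy iteration method is given here for illustration. It follows the evaluate-then-improve structure of Algorithm (2) but replaces the goto with a loop; the two-state MDP and all names are hypothetical and not taken from the thesis.

```python
def policy_iteration(states, actions, T, R, gamma, epsilon):
    """Alternate policy evaluation and greedy improvement until stable."""
    U = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    threshold = epsilon * (1 - gamma) / gamma

    def q(s, a):
        return R[s] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())

    while True:
        # Evaluation: iterate the value update under the current policy.
        while True:
            U_next = {s: q(s, policy[s]) for s in states}
            largest_change = max(abs(U_next[s] - U[s]) for s in states)
            U = U_next
            if largest_change < threshold:
                break
        # Improvement: act greedily on the evaluated utilities.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, U


# Hypothetical two-state model: "switch" always moves to the other state.
states = ["jam", "free"]
actions = ["stay", "switch"]
T = {
    "jam": {"stay": {"jam": 1.0}, "switch": {"free": 1.0}},
    "free": {"stay": {"free": 1.0}, "switch": {"jam": 1.0}},
}
R = {"jam": -1.0, "free": 1.0}

policy, U = policy_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6)
```

Starting from the policy that always stays, one improvement step changes the action in "jam" to "switch", after which the policy is stable.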

The value iteration method is a compact version of the policy iteration method. The latter iteratively checks the stability of the resulting policy after a number of value function updates on all states seeking exact convergence. On the other hand, the value iteration method ignores such a rule and acts greedily on the value function updates without seeking exact convergence but still resulting


2.1.1.3 Partially Observable MDPs

In certain situations an agent may not be able to determine the state which it is currently in. Such a case is often the result of dealing with an uncertain environment where sensor inputs, fusion and inference techniques are unable to deduce a given state with certainty. Consequently, MDPs can be extended in order to encompass a belief model that can provide a probability distribution over the possible set of agent states, namely Partially Observable MDPs (POMDPs) (Kaelbling et al., 1998). For example, an agent can potentially be in three states with a belief state distribution of $\langle B(s_0) = 0.5, B(s_1) = 0, B(s_2) = 0.5 \rangle$, meaning that the agent cannot be in $s_1$ but has an equal chance of being in either $s_0$ or $s_2$ at a given time. As the agent interacts with the uncertain environment, it will naturally need to update its belief model, hence an observation model $O(s, o)$ is used to inform the agent about the probability of an expected observation in a given state. Consequently, the belief model can be determined according to equation (2.4) where $\alpha$ is a normalization factor.

$$\forall s_{t+1}: \quad B_{t+1}(s_{t+1}) = \alpha\, O(s_{t+1}, o) \sum_{s_t} T(s_t, a, s_{t+1})\, B_t(s_t) \tag{2.4}$$
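The belief update of equation (2.4) can be sketched in Python as follows. The three states and the initial belief are those of the example above; the action name "wait", the observation name "queue", the identity transition model and the observation probabilities are hypothetical values chosen for illustration.

```python
def update_belief(belief, action, observation, T, O):
    """Apply equation (2.4): predict with T, weight by O, then normalize."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(T[s][action].get(s_next, 0.0) * belief[s]
                        for s in states)
        unnormalized[s_next] = O[s_next][observation] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under the current belief")
    # Dividing by the total plays the role of the normalization factor alpha.
    return {s: value / total for s, value in unnormalized.items()}


belief = {"s0": 0.5, "s1": 0.0, "s2": 0.5}
# Hypothetical model: the action "wait" leaves the state unchanged.
T = {s: {"wait": {s: 1.0}} for s in belief}
# Hypothetical probability of observing "queue" in each state.
O = {"s0": {"queue": 0.8}, "s1": {"queue": 0.5}, "s2": {"queue": 0.2}}

new_belief = update_belief(belief, "wait", "queue", T, O)
```

In this example the observation shifts the belief towards $s_0$ (0.8 against 0.2 for $s_2$), while $s_1$ remains excluded.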

Regardless of the indefinite number of states resulting from the continuous values in the belief model and the intractability of finding an optimal solution in such a case, some approaches have been proposed under assumed constraints in order to provide approximate solutions (Murphy, 1999).

2.1.2 Reinforcement Learning Structure

Reinforcement Learning (Sutton & Barto, 1998; Kaelbling et al., 1996) is an extensively studied approach to solving a wide range of optimization problems. RL is an unsupervised learning approach that aims at arriving at a setting through which states are optimally mapped to actions, i.e., in a manner that maximizes the long-term expected rewards received after executing a certain action in a given state at a given time. Such a setting achieved by an RL agent constitutes the agent's optimal policy. An RL agent typically discovers its environment through interaction, more specifically, by trial and error. Hence, the learning process through which an RL agent eventually tries to reach an optimal policy occurs by executing an action in a given environmental state and consequently evaluating the utility associated with that action in that state using the received reward and next state information. Such an RL approach is normally referred to as a model-free approach in the sense that it has no a


each state, i.e., a transition model. The contrary is naturally referred to as a model-based approach, which in certain cases predicts/estimates the outcome, in terms of new state and reward. A typical RL agent, see Figure (2.1), represents its local environment through a state-action space in the form of an MDP.

Figure 2.1: A typical RL agent interacting with the underlying environment
s: state, r: reward, a: action

The reward model in RL can be discrete or continuous. One could design a discretized reward model where a constant value is returned based on some goodness conditions. For example, in a grid-world problem, an agent is required to navigate a grid to reach a goal state/square by taking a series of actions from a set of actions $A_{grid} = \{Left, Right, North, South\}$. The agent receives a high positive reward $r_{goal} = 100$ when an action $a \in A_{grid}$ leads to the goal state. On the other hand, any other action $a$ that does not lead to the goal state receives a negative reward $r = -1$. An RL agent trying to find all optimal paths leading to the goal state will try to maximize the expected future rewards in order to achieve its task. Indeed, the agent can receive hints (positive rewards) on the way to its goal square if the problem is modelled in such a way as to hasten achieving an optimal policy. That model will allow for squares positioned one action away from the goal state to return a high positive reward but relatively lower than the goal reward, $r = 50$ for example. Other reward models can be continuous in the sense that they can fall in a given range of values. For instance, an RL-based traffic light controller that tries to arrive at an optimal control policy, which allows for the maximum number of vehicles to pass through, could have a reward model such that $r$ = number of vehicles passed through after a given lights setting. In that case, the range of $r$ is relative to the incoming traffic volume. Deciding on whether to use a discrete or a continuous reward
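The two kinds of reward model described above can be sketched as plain Python functions. The reward values 100, 50 and -1 come from the grid-world example in the text; the function names, the cell coordinates and the idea of passing hint cells as a parameter are assumptions made for illustration.

```python
def grid_reward(next_cell, goal_cell, hint_cells=()):
    """Discrete reward: a constant value per goodness condition."""
    if next_cell == goal_cell:
        return 100   # goal reward
    if next_cell in hint_cells:
        return 50    # optional hint for cells one action away from the goal
    return -1        # any other move


def throughput_reward(vehicles_passed):
    """Continuous-style reward: grows with the number of vehicles served."""
    return float(vehicles_passed)


goal = (3, 3)                  # hypothetical goal square
hints = {(2, 3), (3, 2)}       # hypothetical neighbouring squares
```

The discrete model can only return one of three values, while the range of the throughput reward depends on how much traffic arrives.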


Essentially an RL agent would model the underlying environment as an MDP. However, in complex dynamic problems such as UTC, it is often very difficult to obtain a definite probabilistic transition model for the MDP to be solved assuming that the resulting state is based on traffic for instance. The same argument applies to obtaining a reward prediction model. However, it is possible to design a reward model that translates environmental feedback. Q-Learning is one of the learning strategies that allows an RL agent, through its value function, to arrive at an optimal policy without the need for a transition model or a reward prediction model.

For an RL agent to function, it relies on a learning strategy, an action selection strategy, a reward model and, vitally, a representation of the underlying environment. We discussed reward models and MDPs as the environmental representation. Next, we discuss different learning and action selection strategies.

2.1.2.1 Learning Strategies

A learning strategy allows the RL agent to gradually build its knowledge on how to optimally deal with the surrounding environment. That knowledge is cumulatively built through incorporating sensor information in a manner that affects the RL agent's view on the environment. Incorporation is mainly done through a value or a state-action value function update rule of some form.

Q-Learning

Q-Learning was first introduced in 1989 in Watkins' Ph.D. thesis (Watkins, 1989). Since then, it has been gaining more popularity as a model-free RL technique. Q-Learning falls in the category of off-policy Temporal Difference (TD) learning strategies (Sutton & Barto, 1998). Those strategies are model-free and can update a certain RL agent's policy estimate based on the estimates of other elements in the policy as well as on the incoming rewards. Convergence is assured regardless of the action selection strategy or exploration technique as long as updating all state-action value pairs is continuous. Q-Learning typically behaves in an off-policy manner, which means that it learns even while taking actions that might prove to be non-optimal in the future.

Q-Learning controls the RL agent's learning pace through a learning rate variable $\alpha: (0 \le \alpha < 1)$ and the level by which it discounts future rewards through a discount rate variable $\gamma: (0 \le \gamma < 1)$. The Q-Learning update equation is presented in (2.5).


$r_{t+1}$: reward received after executing $a_t$

A high learning rate implies that the agent is more eager to adopt ongoing changes denoted by $r_{t+1}$ and the future effects of action $a_t$ denoted by $\max_a Q(s_{t+1}, a)$ in its updated policy. The RL agent becomes more near-sighted the lower its discount rate is by minimizing the future effect of action $a_t$ denoted by $\max_a Q(s_{t+1}, a)$.

Algorithm 3 Generic Q-Learning

Initialize lookup table $\forall\, Q(s, a)$
QL($S, A, \alpha, \gamma$)
  For all episodes
    $s_t \leftarrow s_{initial}$
    For each step in the episode Do
      Select_Execute $a_t : a_t \in A(s_t)$ using some action selection strategy
      Receive $s_{t+1}, r_{t+1}$
      $Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q_t(s_t, a_t)]$
      $s_t \leftarrow s_{t+1}$
    Until $s_t == s_{terminal}$
  End
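The generic Q-Learning algorithm can be sketched in Python on a toy episodic problem. Everything specific here is an assumption made for illustration: the environment is a one-dimensional corridor of five cells with the goal at the right end, the rewards (100 at the goal, -1 otherwise) mirror the grid-world example, and an $\epsilon$-greedy rule stands in for "some action selection strategy".

```python
import random


def q_learning(n_states, goal, alpha, gamma, epsilon, episodes, seed=0):
    """Tabular Q-Learning on a corridor of cells 0 .. n_states - 1."""
    rng = random.Random(seed)
    actions = (-1, +1)  # move left / move right along the corridor
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0  # every episode starts in the leftmost cell
        while s != goal:
            # Action selection: explore with probability epsilon.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            # Execute the action; walls keep the agent inside the corridor.
            s_next = min(max(s + a, 0), n_states - 1)
            r = 100.0 if s_next == goal else -1.0
            # Off-policy update: bootstrap from the best next action.
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning(n_states=5, goal=4, alpha=0.5, gamma=0.9,
               epsilon=0.2, episodes=500)
```

After training, moving right has the higher value in every non-terminal cell, so the greedy policy read from the lookup table heads straight for the goal.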

An RL agent using Q-Learning normally keeps a lookup table for all possible combinations of state-action pairs based on the MDP representation of its environment. Its MDP is built with neither a transition model for action occurrence likelihood nor a reward prediction model. Such a transition model is essentially learnt through interaction with the environment and can be deduced from the state-action values lookup table using some action selection strategy. A generic Q-Learning algorithm is presented in (3). An episode, for example, in a grid-world scenario, could last until the agent arrives at a predefined goal state/square after starting from a different state. The number of learning episodes needed naturally depends on some form of convergence test where the agent terminates if the result of that test is satisfactory. However, in an infinite horizon problem, i.e., a problem that has no specific goal/terminal state, the notion of an episode disappears. In such a case, it is more likely to use a gradually decreasing learning rate paired with a biased action selection strategy that balances


SARSA

The SARSA RL algorithm gets its name from the knowledge update manner it follows as an on-policy approach. Learning progresses in SARSA from a given state-action pair to another state-action pair and hence the name SARSA, i.e., State-Action Reward State-Action. The learning update rule in SARSA depends on selecting the next action $a_{t+1}$ for the next state $s_{t+1}$ using a common action selection strategy. In contrast, Q-Learning uses the best next action in $s_{t+1}$. A generic SARSA algorithm is presented in (4).

Algorithm 4 Generic SARSA

Initialize lookup table $\forall\, Q(s, a)$
SARSA($S, A, \alpha, \gamma$)
  For all episodes
    $s_t \leftarrow s_{initial}$
    Choose $a_t : a_t \in A(s_t)$ using some action selection strategy
    For each step in the episode Do
      Execute $a_t$
      Receive $s_{t+1}, r_{t+1}$
      Choose $a_{t+1} : a_{t+1} \in A(s_{t+1})$ using some action selection strategy
      $Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha\,[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]$
      $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
    Until $s_t == s_{terminal}$
  End
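The difference between the two update rules can be made concrete with a single transition. The Python sketch below applies the SARSA update and the Q-Learning update to identical lookup tables; the states, actions, Q values and parameters are hypothetical and chosen so that the arithmetic is easy to follow.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: bootstrap from the action actually chosen next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


actions = ("left", "right")
initial = {(0, "left"): 0.0, (0, "right"): 0.0,
           (1, "left"): 2.0, (1, "right"): 10.0}

# Same transition for both rules: in state 0 take "right", receive r = -1
# and land in state 1, where the selection strategy explores with "left".
q_sarsa = dict(initial)
sarsa_value = sarsa_update(q_sarsa, 0, "right", -1.0, 1, "left", 0.5, 0.5)

q_ql = dict(initial)
ql_value = q_learning_update(q_ql, 0, "right", -1.0, 1, actions, 0.5, 0.5)
```

SARSA bootstraps from the exploratory action (target $-1 + 0.5 \cdot 2 = 0$), whereas Q-Learning bootstraps from the best next action (target $-1 + 0.5 \cdot 10 = 4$), so the two rules assign different values to the same experience.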

SARSA tends to be a safer approach than Q-Learning in the sense that it selects the next action based on a given strategy, while Q-Learning takes the risk of executing actions that might not be optimal but still learns from them. As a result, Q-Learning is possibly able to reach the optimal policy with less accumulated rewards while SARSA will converge to a near optimal one (Takadama & Fujita, 2005; Sutton & Barto, 1998). Indeed, the design of the reward model is essential in that case.

2.1.2.2 Action Selection Strategies

The RL cycle cannot be complete without affecting the underlying environment through selected


has to efficiently explore (i.e., visit different states and try different actions) the state-action space, i.e., the MDP. Naturally, and often in the case of an infinite optimization problem, an RL agent should be able to gradually switch from exploring for an (optimal) policy to exploiting that policy. The role of a biased action selection strategy (i.e., one that allows for controlling the exploration period), e.g., Boltzmann and $\epsilon$-greedy, hence becomes essential.

Greedy & $\epsilon$-Greedy

The most natural short-sighted strategy that an agent can be following is to always select the action with the maximum positive outcome. Such a strategy is referred to as being greedy given its continuous preference for actions with maximum estimated goodness. However, this type of a strategy alone is problematic for efficient exploration in RL as other possibly less favourable current actions could result in better performance in the long run. To overcome that problem, a randomly greedy action selection strategy was devised, namely, $\epsilon$-greedy. In such a strategy, the current best action is only selected with a probability $(1 - \epsilon)$ where $\epsilon: 0 \le \epsilon \le 1$. The greediness of the RL agent is hence defined by the value of $\epsilon$. Moreover, this strategy is independent from the state-action value estimates $Q(s, a)$ in terms of the probability distribution used for selecting the actions.
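An $\epsilon$-greedy selection rule can be sketched in a few lines of Python. The function and phase names are hypothetical. Note one assumption: in this common variant the exploratory branch draws uniformly from all actions, so the current best action can also be picked while exploring and its overall selection probability is slightly above $(1 - \epsilon)$.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Select an action from a dict mapping actions to Q value estimates."""
    if rng.random() < epsilon:
        # Explore: uniform choice that ignores the Q values entirely.
        return rng.choice(list(q_values))
    # Exploit: the action with the highest current estimate.
    return max(q_values, key=q_values.get)


# Hypothetical estimates for two phases of a junction.
q_values = {"north_south": 3.0, "east_west": 5.0}
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)
```

With $\epsilon = 0$ the rule is purely greedy, and with $\epsilon = 1$ it selects uniformly at random regardless of the estimates.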

Boltzmann

The Boltzmann action selection strategy is a customization of the softmax approach (Sutton & Barto, 1998) where the Boltzmann (also known as Gibbs) probability distribution is used to model the action selection strategy. Actions are selected based on a Boltzmann probability distribution built using their $Q(s, a)$ values, see equation (2.6).

$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b \in A} e^{Q(b)/\tau}} \tag{2.6}$$

The extent of Boltzmann exploration is controlled by the temperature parameter $\tau: 0 < \tau$. The higher the value of $\tau$ is, the more explorative the RL agent is, i.e., actions tend to have nearly equal chances of being selected. As the temperature cools down, the shift towards exploitation becomes greater and the RL agent becomes more greedy. However, deciding on the best initial value of $\tau$ is
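Equation (2.6) and the effect of the temperature can be illustrated with the following Python sketch. The function names and the Q values are hypothetical; subtracting the largest Q value before exponentiating is a standard numerical safeguard that leaves the probabilities of equation (2.6) unchanged.

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Selection probability per action according to equation (2.6)."""
    if tau <= 0:
        raise ValueError("the temperature tau must be positive")
    largest = max(q_values.values())
    # Shifting by the largest value avoids overflow without changing P(a).
    weights = {a: math.exp((q - largest) / tau) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def boltzmann_select(q_values, tau, rng=random):
    """Draw one action according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(q_values, tau)
    actions = list(probabilities)
    return rng.choices(actions, weights=[probabilities[a] for a in actions])[0]


q_values = {"a": 1.0, "b": 2.0}                  # hypothetical estimates
hot = boltzmann_probabilities(q_values, 1000.0)  # nearly uniform
warm = boltzmann_probabilities(q_values, 1.0)
cold = boltzmann_probabilities(q_values, 0.01)   # nearly greedy
```

At a high temperature both actions are almost equally likely, while at a low temperature nearly all of the probability mass moves to the action with the higher estimate.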


2.1.3 Decentralized Reinforcement Learning

As optimization problems become more complex in terms of scale, problem modelling in a classical centralized RL manner becomes more difficult and the solution might become intractable. Hence, the need for RL decentralization emerged. Such a decentralization is partially realized through the Multi-Agent RL (MARL) realm. The latter has resulted in a considerable amount of literature.

An important classical differentiation between MARL implementations is presented in (Claus & Boutilier, 1997) between what they refer to as independent learners, where agents learn based on their pure interaction with the environment without realizing the existence of other agents, and joint action learners, where an agent learns a so-called joint action by observing other agents' actions and interpreting their local effects. Moreover, an interesting classical study by (Tan, 1998) shows that cooperation among RL agents, if done intelligently, may result in better performance than independent learning. Cooperation there includes communicating an agent's local information such as learning episodes, policies, selected actions, rewards and sensor information. The views presented by (Claus & Boutilier, 1997; Tan, 1998) form the foundations of modern RL decentralization where the single RL agent world has been transformed into a world of RL agents either trying to compete or collaborate.

Concentrating on competitive behaviour, the minimax-Q-Learning algorithm (Littman, 1994), where an agent learns to win as a result of another agent's loss, has emerged. An extension to that approach is presented in (Hu & Wellman, 1998). A MARL scheme where coordination among agents is based on having a notion of other agents incorporated in the local state descriptions is presented in (Abul et al., 2000). They concentrate on problems with large state-action spaces where they use generalization and function approximation (Sutton & Barto, 1998). In a coordinated RL scheme (Guestrin et al., 2002), coordination among RL agents is based on coordination graphs where agents select an optimal joint action with one-hop neighbours without searching the large joint action space. A similar Q-Learning specific approach is presented in (Kok & Vlassis, 2006, 2004). Following the idea of learning from the best, a cooperative learning approach for agents using Q-Learning with weighted agent expertness is described in (Ahmadabadi & Asadpour, 2002; Ahmadabadi et al., 2001). Furthermore, a distributed value function learning scheme is presented in (Jeff Schneider, 1999) where an RL agent exchanges its value function estimations with neighbouring agents. An agent in that scheme can learn a value function based on the sum of all other agents' discounted expected rewards. Cooperation through sharing rewards among RL agents is rationalized in (Miyazaki & Kobayashi, 1999) where they provide a minimum precondition to preserve that rationality. On the other hand, an alternative approach to incorporating other agents' rewards or Q-values is presented in (Tesauro,


strategies and predict other agents' strategies through Bayesian inference (Berger, 1993). In a partially observable problem, Goldman & Zilberstein (2004) propose a group of NEXP and P problems where agents share a common decentralized POMDP and try to maximize a global goal through different communication manners.

Figure 2.2: A decentralized RL structure

We see the decentralization of RL as a means to break down a global problem into manageable local RL problems. As a result, local RL agents try to act (collaboratively) towards a (near) optimal solution for the common global problem, see Figure (2.2). In (Hoen et al., 2006) a study of the different behaviours (cooperative vs competitive) of learning agents in a multi-agent system (MAS) is provided. As far as the latter study is concerned, we are interested in what they define as concurrent learning where each agent learns using a dedicated learning process. Overall, we refer the reader to three surveys (Buşoniu et al., 2008; Yang & Gu, 2005; Panait & Luke, 2005) that could provide a wider view on multi-agent (reinforcement) learning approaches.

2.2 Uncertainty in UTC

The nature of a UTC system is complex and often hard to predict. Numerous factors could shape the unpredictability in a UTC system. Humans' varying behavioural patterns, equipment wearing out and communication noise could all be seen as sources of uncertainty in UTC systems. Consider fluctuating city traffic over different periods of time and the resulting difficulty in adapting control decisions. Such decisions might not only have instantaneous effects but also long-term ones making


due to humans, accidents, road works and nature. Interestingly, (Satyanarayanan, 2003) elaborates on uncertainty holistically: "it is ironic that in today's all-digital world, uncertainty reappears as a major concern at a higher level of representation". In (Viti, 2006), a thorough study is provided on the uncertainty and dynamics of road users' travelling and delay times. It is argued that uncertainty in a transportation network originates from the variability in supply and demand (Viti, 2006). This could be of a cyclic nature or sporadic (e.g., hosting a world football championship or possibly a railway workers' strike).

As far as uncertainty is concerned, we are mainly interested in fluctuating urban traffic and the means to generically detect traffic changes online and respond adequately using decentralized RL.

2.2.1 Traffic Patterns

As (Visser & Molenkamp, 2004) put it when discussing the identification of traffic patterns:

"Determining the daily and weekly patterns is a bit of an art, more than a science: results are partly dependent on every individual's own frame of reference (e.g., Is the Thursday before Easter a regular weekday as far as traffic is concerned?)."

Most common approaches to determining traffic patterns are offline approaches that require the analysis of massive historical data and engineering expertise (Venkatanarayana et al., 2007). Furthermore, a question arises concerning the constituents of a traffic pattern. Volume and directionality could be intuitively seen as important characteristics of a given traffic pattern. Several approaches including incident and traffic pattern or state detection have been proposed upon the introduction of Floating Vehicle Data (FVD) technologies (Kerner et al., 2005; Kamran & Haas, 2007; Matschke, 2004; Chen et al., 2007). Most of these approaches centrally process communicated FVD such as travel time and velocity in order to provide a global image of traffic status, or in certain cases, traffic accidents (Kamran & Haas, 2007). Also, they mai