Reinforcement Learning
As'ad Salkham
A thesis submitted to the University of Dublin, Trinity College
in fulfillment of the requirements for the degree of
Doctor of Philosophy (Computer Science)
I, the undersigned, declare that this work has not previously been submitted to this or any other University, and that unless otherwise stated, it is entirely my own work.
As'ad Salkham
I, the undersigned, agree that Trinity College Library may lend or copy this thesis upon request.
As'ad Salkham
I would like to thank my supervisor Prof. Vinny Cahill for his support and guidance throughout the Ph.D. process.
This thesis would not have reached its stage now without the mentorship of Dr. Raymond Cunningham. He has shown a great deal of patience and scientific professionalism in advising and helping me. A big thank you to Ivana Dusparic who has shared the same frustrations of traffic simulations and the jolly feelings when one sees agents learning! I appreciate a lot her witty proofreading of many of my chapters and her sense of humour that kept us both from edging insanity.
I would like to thank every member of DSG, old and new ones. The exhaustive list might take a page length and if the affiliated people are added then it is going to be two pages! So, thank you all for your support and friendship. However, special ones have to be mentioned, Marcin Karpiński, Małgorzata Jaksik, Bartek and Ilona Biskupski, Serena Fritsch and Sanad Ghgaraibeh... many thanks.
There are no words that could show my love and appreciation for my family, their sacrifices for me and their patience. My parents have always stressed on the higher priority for education and they were right. They have raised me with the best they can afford and for that I am deeply grateful. My love and gratitude to my three sisters and cousins for standing beside me for all these years.
As'ad Salkham
University of Dublin, Trinity College
Increasing traffic congestion levels are causing high worldwide economic, environmental and social costs. Efficient urban traffic control (UTC) is part of the solution to the traffic congestion problem. However, UTC optimization is a challenging task. Urban traffic is characterized by constantly fluctuating traffic patterns. Daily variations in traffic volume and direction, driver behaviour, unexpected emergency situations and traffic accidents all result in traffic fluctuations. Consequently, urban traffic networks exhibit non-stationary behaviour and UTC systems are complex. Furthermore, any local traffic control decisions carried out at a given signalized junction controller may affect both upstream and downstream junctions. Hence, uncoordinated or poor local decisions can negatively impact on the traffic network. Modelling UTC as an optimization problem is also complicated by the heterogeneous interlinked layouts of signalized junctions and the scale of the system.
UTC has been a widely studied problem for a long time. Numerous systems and methodologies have been proposed to address it over the last four decades. Classical UTC systems are either controlled by a dedicated central server or in a distributed manner. The majority rely on complex mathematical and predictive models to optimize specific settings of a given traffic controller. With the increasing costs of congestion, the performance of these systems, which are still in service in the major cities of the world, has prompted questions concerning their effectiveness and adaptiveness in saturated traffic conditions. Other approaches range from rule-based systems and those modelled using fuzzy/heuristic and dynamic optimization techniques, to evolutionary game theory and genetic programming based approaches. However, these approaches are still challenged to provide scalable and yet real-time adaptive and responsive performance. In addition, reinforcement learning (RL) and numerous decentralized RL methods are being increasingly studied for UTC optimization. The nature of RL as an unsupervised learning approach, and particularly Q-Learning, as a model-free learning strategy, allows for uncomplicated problem modelling and control of the exploration process towards a near optimal solution. Such characteristics are advantageous for developing a real-time adaptive and
of the major sources of that uncertainty is the non-stationary nature of traffic. An RL approach to UTC optimization must be designed in a manner through which it is firstly capable of distinguishing between stable situations and secondly able to efficiently optimize for each. Moreover, the performance of existing RL-based UTC approaches is often evaluated using simplified grid-like maps. Some approaches use model-based RL and partially observable Markov decision processes (POMDPs) that add unjustifiable complexity. When trying to handle the non-stationary nature of traffic while using RL, strict assumptions are needed, e.g., that a small number of stationary traffic conditions recur, that traffic patterns change infrequently and that such changes are independent of traffic controller decisions. In addition, some of these approaches presume the availability of knowledge that is key to their operation but impractical to obtain from the real world.
Our contribution is a decentralized multi-agent RL UTC strategy that models heterogeneous signalized junctions and optimizes UTC in an adaptive and responsive manner. It is motivated by the lack of a model-free decentralized RL approach for UTC optimization that can deal efficiently with the non-stationary nature of traffic without limiting assumptions and the possibility of taking advantage of the increasing availability of floating vehicle data (FVD). The growing adoption of vehicle-to-vehicle/infrastructure communication and the pervasiveness of different positioning systems both motivate the consideration of FVD as a means of providing a rich view of local traffic conditions. We have designed a UTC optimization scheme based on RL that deploys an adaptive round robin controller agent paired with a non-parametric traffic-pattern change-detection mechanism per signalized junction, namely, a Soilse agent. The Soilse agent optimizes phase timings using RL in a non-collaborative manner. The agent is referred to as SoilseC when it also collaborates with neighbours. It adapts to local traffic conditions and responds to different traffic patterns when required. In order to provide for such responsiveness, it quantifies the degree of change per junction using information about local traffic on incoming lanes and its local performance. Essentially, our design allows for agents to relearn upon detecting a persistent local traffic pattern change. The relearning parameters are mainly based on an average sample of the relevant degree of pattern change. An evaluation of our approach shows its effectiveness against a non-adaptive fixed-time UTC system and a saturation balancing algorithm that emulates the Sydney Coordinated Adaptive Traffic System (SCATS). The evaluation is based on simulations of real Dublin maps of different scale and near-realistic traffic
Acknowledgements
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Reinforcement Learning
1.1.1 Decentralized Reinforcement Learning
1.2 Urban Traffic Control
1.2.1 UTC Facts and Challenges
1.2.2 Floating Vehicle Data
1.2.3 Common UTC Concepts
1.2.4 UTC Optimization Trends
1.3 Hypothesis
1.4 Principal Contribution
1.5 Thesis Organization
Chapter 2 State of the Art
2.1 Reinforcement Learning
2.1.1 Markov Decision Processes
2.1.1.1 Value Iteration
2.1.1.2 Policy Iteration
2.1.2.1 Learning Strategies
2.1.2.2 Action Selection Strategies
2.1.3 Decentralized Reinforcement Learning
2.2 Uncertainty in UTC
2.2.1 Traffic Patterns
2.3 Classical UTC Approaches
2.3.1 SCATS
2.3.2 SCOOT
2.4 Non-RL UTC Approaches
2.4.1 Centralized
2.4.1.1 TUC
2.4.1.2 DISCO
2.4.1.3 MOTION
2.4.1.4 Others
2.4.2 Hierarchical
2.4.2.1 RHODES/COP
2.4.2.2 UTOPIA
2.4.2.3 PRODYN-H
2.4.2.4 Others
2.4.3 Decentralized
2.4.3.1 PRODYN-D
2.4.3.2 ALLONS-D
2.4.3.3 SuRJE
2.4.3.4 Others
2.4.4 Summary
2.5 RL-Based UTC Approaches
2.5.1 Q-Learning-Based Approaches
2.5.2 Evolutionary Programming & RL
2.5.3 Fuzzy Neural Networks & RL
2.5.4 Model-Based Vehicle-Centric RL
2.5.5 Specific RL-Based UTC Approaches for Non-Stationary Environments
3.2 Overview and Motivations
3.3 Pattern Change Detection
3.3.1 Design
3.3.2 Sensitivity and Parameters
3.3.3 Algorithm
3.4 Phases
3.5 Signalized Junction Model - Soilse and SoilseC
3.5.1 Soilse
3.5.1.1 Local Reward Model
3.5.1.2 Relearning
3.5.1.3 Soilse Algorithm
3.5.2 SoilseC
3.5.2.1 Neighbours
3.5.2.2 Collaborative Reward Model
3.5.2.3 SoilseC Algorithm
3.6 Summary
Chapter 4 Implementation
4.1 The CRL framework
4.1.1 Learning Strategy
4.1.2 Action Selection
4.1.3 RLAgent
4.1.4 CRLAgent
4.2 Soilse and SoilseC Agent Generator
4.2.1 PCD
4.2.2 Relearn
4.3 Summary
Chapter 5 Evaluation
5.1 UTC Simulation
5.1.1 The UTC Simulator
5.2.2 Soilse and SoilseC Specifics
5.2.3 Baselines for Comparison
5.2.4 Performance Metrics
5.2.5 Evaluation Objectives
5.3 Trinity Scenario
5.3.1 Baseline Performance
5.3.2 Soilse
5.3.2.1 Initial Learning vs. Relearning
5.3.2.2 Relearning Behaviour
5.3.3 SoilseC
5.3.3.1 SoilseC vs. Soilse
5.3.3.2 Collaboration Mode
5.3.3.3 Relearning Behaviour
5.3.4 Comparison Against Baselines
5.3.5 Summary
5.4 Dublin Inner City Centre Scenario
5.4.1 Baselines Performance
5.4.2 Initial Learning vs. Relearning
5.4.3 Relearning Behaviour
5.4.4 Soilse and SoilseC vs. Baselines
5.4.4.1 SoilseC's Collaboration Mode
5.4.5 Summary
5.5 Summary
Chapter 6 Conclusions and Future Work
6.1 Thesis Contribution
6.2 Future Work
3.1 PCD parameters
3.2 SoilseC neighbourhood
5.1 Trinity scenario traffic patterns
5.2 Experimental parameters
5.3 Trinity - selected baselines AWT performance comparison - Trinity map
5.4 Trinity - Soilse vs. SoilseInit based on best AWT performance per action selection strategy
5.5 Trinity - best AWT Soilse performance per action selection strategy
5.6 Trinity - best AWT Soilse performance per exploration factor (ExpFactor)
5.7 Trinity - SoilseC vs. Soilse for best performance based on AWT
5.8 Trinity - SoilseC vs. Soilse for best performance based on AvgStops
5.9 Trinity - key parameters of best performing Soilse
5.10 Trinity - key parameters of best performing SoilseC
5.11 Trinity - SoilseC best AWT performance per collaboration mode - ε-greedy
5.12 Trinity - best AWT performance of Soilse and SoilseC against the selected baselines
5.13 DublinICC - baselines performance - RR and SAT
5.14 DublinICC - SoilseInit vs. Soilse - best AWT performance
5.15 DublinICC - SoilseC vs. Soilse best performance
1.1 Sketch of the world's first traffic signal that was installed on the junction of George and Bridge streets in London in 1868 (Mueller, 1970)
1.2 Road motor vehicles per thousand inhabitants in selected OECD countries (OECD, 2008)
2.1 A typical RL agent interacting with the underlying environment s: state, r: reward, a: action
2.2 A decentralized RL structure
2.3 SCATS local controller data (Sims & Dobinson, 1980)
3.1 Design overview
3.2 Example - CUSUM samples
3.3 Junction PCD high-level scheme
3.4 Fixed thresholding technique
3.5 Two phases of a four-approach signalized junction
3.6 Complete set of phases for a T-shaped junction
3.7 Simplistic set of phases for a T-shaped junction
3.8 Concise set of phases for a T-shaped junction
3.9 Soilse agent structure
3.10 Soilse agent state-action space for a signalized junction with three phases
3.11 Phase traffic count - used to determine phase status
3.12 SoilseC agent structure
3.13 NPV example
3.14 Neighbours example
4.4 RLAgent class diagram
4.5 CRLAgent class diagram
4.6 CRLAgent receiveSR function sequence diagram
4.7 Soilse and SoilseC agents generator class diagram
4.8 Soilse and SoilseC classes relation to the CRL framework classes
4.9 High-level agent generation scheme
5.1 UTC simulator - viewer snapshot
5.2 Trinity map
5.3 Dublin inner city centre map
5.4 Trinity - RR total vehicle waiting time throughout the simulation time
5.5 Trinity - SAT total vehicle waiting time throughout the simulation time
5.6 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - AWT
5.7 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - AvgStops
5.8 Trinity - RR (20s, 30s, 40s) vs. SAT (2_1.1, 2_1.5, 5_1.1, 5_1.5) - number of arrived vehicles
5.9 Trinity - RR20s vs. SAT_2_1.5 - total vehicle waiting time throughout the simulation time
5.10 Trinity - RR20s vs. SAT_2_1.5 - accumulated total vehicle waiting time
5.11 Trinity - RR20s vs. SAT_2_1.5 - number of stopped vehicles throughout the simulation time
5.12 Trinity - Soilse using Boltzmann vs. (ε-)greedy - total waiting time throughout the simulation time
5.13 Trinity - Soilse using Boltzmann vs. (ε-)greedy - accumulated total waiting time throughout the simulation time
5.14 Trinity - Soilse using Boltzmann vs. (ε-)greedy - number of stopped vehicles throughout the simulation time
5.15 Trinity - junction #1226 DPC under Soilse using ε-greedy for different ExpFactor values
5.16 Trinity - junction #1226 ε-greedy epsilon change using Soilse under different ExpFactor values
5.18 Trinity - Soilse using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.19 Trinity - Soilse vs. SoilseC using ε-greedy - accumulated total vehicle waiting time, total vehicle waiting and number of stopped vehicles throughout the simulation time - best AWT performance
5.20 Trinity - Soilse vs. SoilseC using Boltzmann - accumulated total vehicle waiting time, total vehicle waiting and number of stopped vehicles throughout the simulation time - best AWT performance
5.21 Trinity - Soilse vs. SoilseC using greedy - accumulated total vehicle waiting time, total vehicle waiting time and number of stopped vehicles throughout the simulation time - best AWT performance
5.22 Trinity - SoilseC using ε-greedy (re)learning start and end times - best AWT performance
5.23 Trinity - SoilseC using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.24 Trinity - SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - total vehicle waiting time throughout the simulation time - best AWT performance
5.25 Trinity - Soilse ε-greedy vs. (RR20s and SAT_2_1.5) - total vehicle waiting time throughout the simulation time - best AWT performance
5.26 Trinity - Soilse and SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - accumulated total vehicle waiting time throughout the simulation time - best AWT performance
5.27 Trinity - SoilseC ε-greedy vs. (RR20s and SAT_2_1.5) - (accumulated) number of stopped vehicles throughout the simulation time - best AWT performance
5.28 Trinity - Soilse ε-greedy vs. (RR20s and SAT_2_1.5) - (accumulated) number of stopped vehicles throughout the simulation time - best AWT performance
5.29 DublinICC - Soilse using ε-greedy (re)learning start and end times - best AWT performance
5.30 DublinICC - Soilse using ε-greedy ratio of relearning time to simulation time - best AWT performance
5.31 DublinICC - SoilseC using ε-greedy (re)learning start and end times - best AWT performance
5.32 DublinICC - SoilseC using ε-greedy ratio of relearning time to simulation time - best
total vehicle waiting time and total number of stopped vehicles throughout the simulation time
5.34 DublinICC - Soilse and SoilseC best performance vs. SAT - total vehicle waiting time and total number of stopped vehicles throughout the simulation time
5.35 DublinICC - Soilse and SoilseC best performance vs. SAT - accumulated total vehicle waiting time and accumulated total number of stopped vehicles throughout the
1 The value iteration DP method
2 The policy iteration DP method
3 Generic Q-Learning
4 Generic SARSA
5 PCD - calculate DPC
6 The PCD process for a single signalized junction
7 Soilse initialization
8 Soilse process
9 Reparameterize per action selection strategy
10 SoilseC initialization
Introduction
This thesis presents a new decentralized approach to online optimization of urban traffic control (UTC) using Reinforcement Learning (RL). In our approach, each RL agent learns to control a specific signalized junction through environmental feedback and potential collaboration with neighbouring agents. Agents adapt to local traffic conditions by learning a sequence of traffic light phases to be used. They respond to fluctuating traffic patterns or unsatisfactory performance by relearning based on a local non-parametric traffic-pattern change-detection mechanism. The novelty of our approach stems from its online decentralized UTC optimization scheme using RL without a priori knowledge of traffic models in an adaptive and responsive manner that deals with fluctuating traffic. Essentially, by providing such an adaptive and responsive UTC scheme we aim to reduce congestion in urban areas.
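As a rough illustration of what a traffic-pattern change detector does, the sketch below flags a sustained shift in a stream of per-interval vehicle counts using a generic two-sided CUSUM test. It is not the mechanism designed in this thesis, which is presented in Chapter 3; the function name, the `target`, `drift` and `threshold` parameters and the sample counts are all assumptions chosen for the example.

```python
def cusum_change_points(samples, target, drift, threshold):
    """Return the indices at which a two-sided CUSUM statistic exceeds the threshold.

    target:    assumed reference level of the samples (hypothetical parameter)
    drift:     slack subtracted at every step so that small noise is ignored
    threshold: accumulated deviation needed before a change is reported
    """
    high = 0.0  # accumulated evidence of an upward shift
    low = 0.0   # accumulated evidence of a downward shift
    alarms = []
    for index, value in enumerate(samples):
        high = max(0.0, high + (value - target - drift))
        low = max(0.0, low + (target - value - drift))
        if high > threshold or low > threshold:
            alarms.append(index)
            high = 0.0  # restart accumulation after reporting a change
            low = 0.0
    return alarms


# Invented vehicle counts per interval: steady around 10, then a jump to about 19.
counts = [10, 11, 9, 10, 10, 18, 19, 20, 18, 19]
alarms = cusum_change_points(counts, target=10, drift=1, threshold=15)
```

A single noisy interval does not trigger an alarm here because the drift term absorbs small deviations; only a deviation that persists over several intervals accumulates enough evidence, which mirrors the idea of reacting to persistent rather than momentary pattern changes.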
This chapter introduces RL including centralized and decentralized RL schemes. It also provides a historical background concerning UTC and introduces the relevant facts and challenges in the domain. An emerging source of data for UTC optimization namely, floating vehicle data (FVD) is introduced as well as the common UTC concepts and current trends in UTC optimization. Furthermore, we present our hypothesis which is based on a number of arguments concerning the decentralization of UTC online optimization using RL and on the viability of local non-parametric traffic-pattern change-detection. We also present our contribution that provides a scheme for UTC optimization using RL while dealing with fluctuating urban traffic in a decentralized and online manner. Finally, the organization of the
1.1 Reinforcement Learning
The essence of RL can be traced to the manner by which nature's intelligent elements can learn by interacting with the surrounding environment. Sutton & Barto (1998) define RL as learning how to map situations to actions so as to maximise a numerical reward signal. RL is an unsupervised learning approach in the sense that an agent does not rely on a knowledgeable master that might have specific domain knowledge. On the contrary, agents explore their environment by sensing the stimuli of different situations and then executing some selected action(s) which result in a feedback in the form of a reward.
Any RL solution is based on two basic elements, namely, a reward function and a value function. Optionally, some RL solutions make use of a model of the environment to predict the reward and next state after taking an action in a given state. The reward function is meant to provide an immediate goodness measure for a certain action in a given state. The value function, as opposed to the reward function, tries to indicate the long-run goodness of a given action, i.e., the expected rewards that can be accumulated over the future starting from the current state. Interaction with the environment eventually provides the RL agent with a policy, i.e., a mapping between all states and their respective best actions at any given time. Moreover, action selection can occur using exploratory strategies, (e.g., ε-greedy or Boltzmann (Sutton & Barto, 1998)) or non-exploratory strategies, (e.g., greedy). Finding the limit to which exploration should last is known as the exploration versus exploitation dilemma. Exploitation is the phase during which the agent puts the previously learnt policy into control. The essence of the dilemma is in the fact that an agent cannot run purely on exploration or exploitation otherwise it will be learning forever without actually putting the learnt policy into control or it will allow a given policy to control forever, hence a balance is needed. Q-Learning (Watkins & Dayan, 1992) is a well-established model-free off-policy (explained below) RL strategy based on the concept of discounted expected rewards. An RL agent that uses Q-Learning usually learns with a specific rate α : 0 ≤ α < 1 and a certain discount rate γ : 0 ≤ γ < 1 through a Markov Decision Process (MDP) representation of the environment. It is a model-free approach in the sense that it does not require some a priori likelihood model for the actions that could be executed on the environment. Q-Learning is considered an off-policy strategy as it learns and updates the agent's knowledge even while taking actions that could prove to be non-optimal in the future (Abdulhai et al., 2003). Being an off-policy learning strategy, as well as allowing for short period knowledge updating per action taken, Q-Learning is an ideal candidate for UTC optimization given the non-stationary nature of traffic (Abdulhai et al.,
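The Q-Learning update and the ε-greedy strategy described above can be sketched as follows. This is a minimal tabular illustration of standard Q-Learning (Watkins & Dayan, 1992), not the thesis implementation; the state and action names, the reward value and the two helper functions are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda action: q[(state, action)])


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q(s, a) toward reward + gamma * max over a' of Q(s', a').

    The max over next actions is what makes the rule off-policy: the update
    does not depend on which action the agent actually takes next.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical junction example: unseen state-action pairs start at 0.
q = defaultdict(float)
actions = ["keep_phase", "next_phase"]
q_update(q, "queue_long", "next_phase", reward=1.0,
         next_state="queue_short", actions=actions, alpha=0.5, gamma=0.9)

# With epsilon = 0 the strategy is purely greedy (no exploration).
chosen = epsilon_greedy(q, "queue_long", actions, epsilon=0.0)
```

Because each step touches a single table entry, the agent's knowledge is refreshed after every action taken, which is the short-period updating property mentioned in the text.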
1.1.1 Decentralized Reinforcement Learning
Classical RL is a centralized optimization approach. This makes problem modelling more difficult as the system's complexity increases due to the increase in the number of system states that need to be represented, which could also be accompanied by an increase in the number of decisions/actions. The UTC problem, for example, deals with numerous interconnected signalized junctions, some with a heterogeneous road layout. For a relatively small city like Dublin, the city centre has roughly 250 signalized junctions that need to be simultaneously controlled. A distributed/decentralized version of RL can be useful (Abdulhai & Pringle, 2003) for such a system while a classical (centralized) RL view poses problem modelling complexity as the network of signalized junctions increases in size. Many decentralized RL approaches where no single RL agent models and controls the global problem have been proposed. They provide optimization approaches of a distributed manner that break the global optimization problem into manageable sub-problems. Each RL agent deals with its assigned sub-problem locally with the possibility of collaboration, (i.e., knowledge exchange) with other agents. This could be seen as a specialized Multi-Agent System (MAS) where agents use RL for optimization (Buşoniu et al., 2008; Panait & Luke, 2005). Furthermore, several collaborative RL approaches (Dowling et al., 2006; Kok & Vlassis, 2006; Hoen et al., 2006; Goldman & Zilberstein, 2004; Tesauro, 2003; Guestrin et al., 2002; Ahmadabadi et al., 2001; Abul et al., 2000; Tan, 1998; Hu & Wellman, 1998; Claus & Boutilier, 1997; Littman, 1994) have been proposed and we will discuss them later in Chapter 2. Dowling et al. (2006) use the term CRL to refer to a specific form of Collaborative Reinforcement Learning. We adopt the term CRL only in describing our framework implementation in Chapter 4; however, our CRL view is different from theirs. We use the term CRL to refer to a scheme where RL agents can collaborate, i.e., exchange knowledge of any nature that can be used in updating the agent's local knowledge besides the use of its local rewards.
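To make the modelling argument concrete, the following sketch compares the number of states a single centralized agent would have to represent with the total number handled when one agent is assigned per junction. It assumes, purely for illustration, that every junction has the same small number of local states and that agents ignore their neighbours; the figure of three states per junction is invented, while the 250 junctions come from the Dublin example above.

```python
def joint_state_count(states_per_junction: int, junctions: int) -> int:
    """States a centralized agent must distinguish: every combination of local states."""
    return states_per_junction ** junctions


def decentralized_state_count(states_per_junction: int, junctions: int) -> int:
    """Total states across independent agents: each agent sees only its own junction."""
    return states_per_junction * junctions


# Assumed values: 3 local states per junction (hypothetical), 250 junctions.
centralized = joint_state_count(3, 250)
decentralized = decentralized_state_count(3, 250)
print(f"centralized: a number with {len(str(centralized))} decimal digits")
print(f"decentralized: {decentralized} states in total")
```

Under these assumptions the centralized count grows exponentially with the number of junctions while the decentralized count grows only linearly, which is the motivation for splitting the global problem into per-junction sub-problems.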
1.2 Urban Traffic Control
The history of traffic management arguably extends back to the Roman era. It is interesting to note that the proverb "all roads lead to Rome" is based on the fact that a reference point, in the form of a golden milestone, was positioned in the Forum in the ancient city of Rome. Road builders in Rome, in their turn, used milestones as a form of primitive means to inform road users about their relative location to the golden milestone in Rome (Mueller, 1970). These distributed milestones worked as indicators or signals of reassurance for road users that they were on the right route towards Rome.
Figure 1.1: Sketch of the world's first traffic signal that was installed on the junction of George and Bridge streets in London in 1868 (Mueller, 1970)
inform and even control traffic created by the increasing number of road users have indeed changed dramatically.
As roads became wider and traffic grew heavier, the need to manage movement within cities and the increasing number of road fatalities became an urgent issue with which to deal. It was in the British parliament in the nineteenth century that it was first suggested to borrow a method deployed in railways to be used in controlling traffic on roads. A traffic superintendent from the south eastern British railway named J. P. Knight had suggested to Earl Granville that the concept of a railway semaphore signal could be ported onto the road network to allow for traffic control (Ishaque & Noland, 2006). The British parliament agreed to Earl Granville's suggestion and installed the world's first traffic signal (see Figure 1.1) in December 1868 on a junction near the Houses of Parliament in London. That traffic signal was paradoxically put in place to ease road access for members of parliament rather than improve pedestrians' safety. The traffic signal functioned in a way that combined red and green gas lights with semaphore arms. The arms extended horizontally to denote a stop signal and at a 45° angle to denote caution. At night, the stop sign was accompanied by a red light on the top while the caution signal was accompanied by a green one. The reader is referred to (Mueller, 1970) for a more in-depth history of traffic signals. Henceforth, the terms traffic signal and traffic light are used interchangeably.
Early traffic signals were controlled by policemen which became increasingly impractical as wider deployment took place in different cities. A greater number of junctions had to be controlled in
be to provide what is known as a green wave or a series of go signals along a desired path of controlled junctions. Advances in electronics and computer science have made it possible to devise computerized UTC systems that can manage traffic, in terms of efficient automated operation and performance optimization, on a larger number of controlled junctions, i.e., signalized junctions. Such a system was first deployed in Toronto in 1959 using an IBM 650 computer to control nine signalized junctions (Gazis, 1971). Early UTC systems were centrally controlled and relied on detectors such as magnetic loops, radar and sonar. The main functionalities provided by the control software were electrical actuation of junction controllers, traffic-light state monitoring and detector data processing. The latter data was typically stored for potential offline analysis while some selected data was used for better online control strategies. Such control strategies were mainly based on the concept of synchronizing a line of junctions, usually an arterial road, in order to allow for vehicles to travel at a constant speed with minimal stops. However, those strategies were fixed for certain situations and constrained by the number of controlled junctions. Moreover, with the increasing number of vehicles on roads and the growing scale of urban road networks, the UTC problem has become more challenging. Consequently, the need for more sophisticated and coordinated UTC systems to provide efficient traffic control strategies has arisen. The ultimate goal for such computerized UTC systems is to provide an efficient traffic control strategy that runs in an optimal manner in order to minimize road congestion. This optimality is directly related to achieving minimum vehicle delay, less-interrupted traffic flow or a minimum number of vehicle stops and increased vehicle velocity. Details on the progression of early UTC systems can be found in (Gazis, 1971).
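The idea of synchronizing a line of junctions can be illustrated with a small offset calculation: each downstream signal turns green later than the first one by the travel time from the first junction, wrapped to the signal cycle. This is a textbook-style simplification rather than the logic of any system cited here; the spacing, speed and cycle values are made up, and queues, turning traffic and two-way coordination are ignored.

```python
def green_wave_offsets(spacings_m, speed_mps, cycle_s):
    """Return the green start offset (seconds) for each junction along an arterial.

    spacings_m: distances between consecutive junctions, in metres
    speed_mps:  assumed constant progression speed, in metres per second
    cycle_s:    common signal cycle length, in seconds
    """
    if speed_mps <= 0 or cycle_s <= 0:
        raise ValueError("speed and cycle length must be positive")
    offsets = [0.0]  # the first junction is the reference
    distance = 0.0
    for spacing in spacings_m:
        distance += spacing
        # Travel time from the first junction, wrapped into one cycle.
        offsets.append((distance / speed_mps) % cycle_s)
    return offsets


# Invented arterial: three junctions 200 m and 300 m apart, 10 m/s, 60 s cycle.
offsets = green_wave_offsets([200, 300], speed_mps=10, cycle_s=60)
```

Fixed offsets of this kind only suit the speed and demand they were computed for, which is one way to read the remark that early synchronization strategies were fixed for certain situations.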
1.2.1 UTC Facts and Challenges
Urban traffic is an evolving problem closely related to population growth and world economic factors. Many countries are seeing an increase in vehicles per capita with each passing year. As far as the Organisation for Economic Co-operation and Development (OECD) countries are concerned, road motor vehicles per thousand inhabitants have increased over the period from 1990 until 2006 in all studied countries except the United States (OECD, 2008) for no clear reason but possibly due to its mature status and increasing environmental public awareness programs. Considerable increases were noticed in countries like Portugal, Iceland, Greece and Poland (see Figure 1.2).
As urbanization is increasing, road networks in different countries are expanding as well. For
[Figure 1.2: bar chart. x-axis: Country; y-axis: Road motor vehicles per thousand inhabitants; series: 1996, 2000, 2003, 2006.]
Figure 1.2: Road motor vehicles per thousand inhabitants in selected OECD countries (OECD, 2008)
have mature road networks. The annual growth in road network size can vary from 6% in countries like Korea, Poland, Portugal, Ireland and Greece to a lower rate of 2% in countries with mature road networks like the United States, Germany, Canada, the Russian Federation and the Netherlands (OECD, 2008).
As the network of signalized junctions grew along with the increasing number of vehicles on roads, the problem of providing an efficient UTC system naturally became more complex. Evidently, the problem has not yet been solved; for instance, the United States has roughly 330,000 traffic signals of which 75% can be adjusted to be made more efficient using, but not exclusively, different timing plans (United States DOT, 2007). However, the scale is not the sole issue: road users also exhibit different travelling routines while unexpected emergency and accident situations make traffic networks non-stationary in nature. Such traffic characteristics increase the UTC optimization challenge. Another modelling and control challenge that faces UTC systems is the heterogeneous structure of interlinked signalized junctions. The effects of controller decisions carried out at one junction will propagate in the road network affecting the performance of others, especially their immediate neighbours. Consequently, a well-designed collaboration scheme is vital in providing efficient UTC systems (Bazzan, 2004).
The negative impact of poor UTC systems is massive and can be essentially summed up in one word: congestion. It is true that better and more efficient UTC systems cannot alone solve this increasing problem but they can surely help to reduce it (European Commission, 2007a; United States DOT, 2007). Congestion causes worldwide environmental, economic and social problems. In the EU
& Gargett, 2007). In 2007, congestion cost the United States approximately US$87.2 billion in 439 urban areas calculated based on wasted time and fuel (Schrank & Lomax, 2009). As far as the environment is concerned, congestion is a major cause of air and noise pollution. Urban mobility in the EU contributes 40% of the overall CO2 emissions caused by road transportation while this percentage increases to 70% of all other pollutants (European Commission, 2007a). These considerable percentages are due to increasing traffic growth and to the stop-go nature of driving in cities despite the advances in vehicle emission reduction technologies (European Commission, 2007b). Furthermore, a recent survey by the Department of Transportation in the United States has shown that 47% of Americans agree that delay caused by traffic congestion is a top community concern (United States FHWA, 2001).
Part of the solution to traffic congestion is evidently better and more efficiently responsive UTC systems (European Commission, 2007b,a; United States DOT, 2007). Adaptive and responsive UTC systems have proved to be promising in many cases in the United States. Compared to previously deployed systems, according to (United States DOT, 2007) for example, a new Texas Light Synchronization program managed to reduce traffic delay by 24.6%, fuel consumption by 9.1% and the number of vehicle stops by 14.2%, all through signal timing optimization and equipment update. In California, a new fuel-efficient traffic signal management program managed to reduce fuel consumption by 8%. Los Angeles' Adaptive Traffic Control System (ATCS), which operates as the city's main traffic control system, managed to diminish average delay by 21.4% and vehicle average number of stops by 31% through real-time response (signal timing adjustment) to traffic demands. The results above encouraged further research in providing more efficient UTC systems in the US through dedicated Federal and State funding programs (United States DOT, 2007). The above advances might have been the result of a long awaited improvement in the poor performance of legacy UTC systems. Recently, new approaches are being considered to come up with smarter UTC solutions to deal with the increasing congestion problems, for instance, a recent 2009 governmental report on "Australia's Digital Economy: Future Directions" has identified the use of Artificial Intelligence (AI) and more advanced traffic sensor technologies for developing better UTC systems as a strategic research goal (Commonwealth of Australia, 2009).
The enabling technologies to design and deploy an efficient UTC system that tackles these challenges are increasingly becoming pervasive. The domain that encompasses such technologies is referred to as Intelligent Transportation Systems (ITS) where information processing and communication technologies are being applied to the transportation domain (Yang & Wang, 2007). This ranges from devising better UTC optimization schemes to navigation systems and real-time traffic monitoring. An
means. They provide a rich real-time view of traffic status in cities that can be exploited for several applications including UTC optimization.
In summary, it is clear that traffic congestion is a worldwide problem that has been clearly causing economic, environmental and social problems (see statistics above). This problem is worsening with the increase in urbanization, vehicle numbers, population and the possible inefficiency of legacy UTC systems. Different efforts are being exerted to develop more efficient UTC systems that are more adaptive and responsive to traffic changes. However, innovative and smart UTC optimization schemes that make use of progressive traffic management technologies like FVD and AI have only recently come into focus.
1.2.2 Floating Vehile Data
The core idea behind FVD (European Commission, 2003) is to provide different means to communicate various data associated with vehicles in a more pervasive and cost-effective manner using vehicle-to-infrastructure (V2I) or vehicle-to-vehicle (V2V) communication. Such data is usually spatio-temporal, for example, the location of an anonymous (or possibly known) vehicle at a given point in time on the road network. Furthermore, with the increasing availability of in-vehicle sensors, data could range from air pressure levels in tires to fuel consumption and accurate speed data at a given time.
Standardization efforts are also playing a major role in helping the spread and adoption of FVD-based technologies and solutions. The International Organization for Standardization (ISO) and the European Committee for Standardization (CEN) are leading the efforts in providing standards for V2V and V2I communication technologies. Most notable are the Dedicated Short Range Communications (DSRC) (Bai & Krishnan, 2006) and the Continuous Air-interface, Long and Medium Range (CALM) (Williams, 2004) standards, which make use of the enabling IEEE protocol for wireless access in vehicular environments (WAVE), namely IEEE 802.11p (Eichler, 2007). The latter standard, CALM, aims at providing a wide platform of different communication technologies working seamlessly together, including, for example, DSRC, General Packet Radio Service (GPRS), Global System for Mobile communications (GSM) and International Mobile Telecommunications-2000 (IMT-2000) or 3G.
Traditionally, traffic demand data is gathered through sensors embedded in the road infrastructure such as inductive loop detectors or cameras. With the standardization of FVD technologies and the increasing pervasiveness of wireless positioning systems, e.g., the Global Positioning System (GPS), as well as the considerable investments in V2I and V2V communication technologies, it is now possible to establish an FVD-enriched environment at a significantly lower cost compared to the traditional
those of the European Union, have resulted in a promising satellite positioning project, namely Galileo (European Commission, 2001), which is expected to be more accurate than current GPS technology. This will potentially have a positive impact on traffic management solutions (Kuhne, 2003).
Moreover, there has been a recent focus on enriching the set of typical FVD information, e.g., position, speed and time (Messelodi et al., 2009). By treating vehicles as moving sensors, e.g., cameras and traffic level analyzers, typical FVD is enriched with information resulting from the analysis of vehicle surroundings, e.g., road construction notification and traffic level. The reader is referred to the survey by Luo & Hubaux (2004) for more information on FVD.
1.2.3 Common UTC Concepts
A number of concepts are used in describing the functionality within a UTC system. An introduction to some common UTC concepts is provided in this subsection.
Signalized junction: a junction that is controlled by a traffic light.
Phase: a phase is characterized by the exclusive set of traffic directions allowed to proceed at a given signalized junction from certain approaches at a given time. Only one phase can be active at a time, during which all its approaches have a green signal.
Offset (time): the time difference between the start of some phase on a given signalized junction and the start of a different phase on an adjacent signalized junction. It is typically relevant when adjacent junctions need to coordinate phase activations that may affect connecting links.
Cycle (time): the time needed to complete a sequence of phases on a given signalized junction, including offsets.
Split: the proportion of green time allocated to each phase in a cycle.
Oversaturation: a situation where links connecting signalized junctions reach their maximum capacity in terms of number of vehicles.
Certain classical UTC systems, as discussed in Chapter 2, base their optimization methodology on tuning the timing parameters of signalized junctions, such as the offset, the cycle time and the split. Some non-classical approaches, however, follow different optimization methodologies based on phase
1.2.4 UTC Optimization Trends
Several UTC systems have been proposed over the past four decades. Specifically, two systems, the Sydney Coordinated Adaptive Traffic System (SCATS) (Sims & Dobinson, 1980; Lowrie, 1982) and the Split Cycle Offset Optimisation Technique (SCOOT) (Hunt et al., 1982), have been deployed in many major cities. These systems are based on complex mathematical models to optimize specific timing settings of a traffic controller, namely the offset, split and cycle time. However, traffic control strategies in such systems are either centrally or hierarchically formulated. Numerous other approaches have been proposed as computational problem-solving methodologies have evolved. Such approaches mainly use Dynamic Programming, evolutionary game theory and genetic programming, or a combination of those. Others simply use fuzzy/heuristic models and rule-based methods with possible integration with evolutionary approaches. However, RL has emerged as a promising approach for UTC optimization in which true adaptiveness can be achieved (Abdulhai & Pringle, 2003; Abdulhai et al., 2003). We concentrate on decentralized RL that specifically uses Q-Learning for UTC optimization given its scalability and applicability to online (re)learning, which allows for the adaptiveness and responsiveness needed by UTC.
1.3 Hypothesis
Our hypothesis is based on the following arguments concerning an efficient UTC system:
Local traffic signals controlled by RL agents that can adapt and respond to changing traffic are advantageous compared to fixed-time and SCATS-inspired traffic light controllers.
Designing an RL agent using an adaptive round-robin scheme based on phases to control a given traffic signal is possible.
Decentralization through assigning a controlling RL agent per signalized junction that collaborates with neighbouring agents can achieve better global performance.
Detecting traffic changes as they occur is possible based on traffic filtering per lane and the performance of the assigned RL agent without a priori traffic models.
Responsiveness can be achieved by relearning based on a quantified local degree of traffic change.
The proposed design does not presume specific sources of sensor information but rather exposes
We evaluate our combined hypothesis using a microscopic simulator that takes as inputs varying traffic patterns simulated on different real maps of Dublin city. The evaluation includes different scenarios characterized by map scale, changing traffic and collaboration. Comparisons are made against scenarios using fixed-time controllers and against a SCATS-inspired algorithm, namely SAT (Richter, 2006).
1.4 Principal Contribution
This thesis provides a decentralized UTC optimization approach using RL and collaboration schemes that is efficient, adaptive and yet responsive to the non-stationary nature of urban traffic. Our principal contribution is a scalable scheme in which each signalized junction is controlled by an RL agent that is autonomously capable of detecting unsatisfactory performance and local traffic-pattern change, to which it responds by relearning based on the degree of change observed. The RL agent can potentially collaborate with neighbouring agents in order to provide better global performance. We name our agents Soilse, which means traffic lights in the Irish language. Henceforth, a Soilse agent is RL-based, whereas a SoilseC agent uses RL and collaborates with its neighbours. Furthermore, the approach assumes neither domain knowledge nor predefined models of traffic.
1.5 Thesis Organization
The remaining chapters of this thesis are organized as follows. Chapter 2 presents the state of the art in UTC, including classical, widely deployed de facto systems as well as RL and non-RL approaches. The chapter also discusses RL and decentralized RL, including the main learning and action selection strategies. In Chapter 3 we detail the design of our UTC optimization agents, namely Soilse and SoilseC, including the pattern change detection mechanism and the relearning strategy. In Chapter 4 we present our implementation using a CRL framework that we built as a C++ library, and we describe the interaction between the UTC simulator and the Soilse and SoilseC instances of that framework. Chapter 5 presents our evaluation results along different axes such as scale, collaboration, responsiveness and action selection strategies. We finally conclude and discuss future
State of the Art
This thesis spans two significantly wide domains, i.e., reinforcement learning (RL) and urban traffic control (UTC) optimization. As it addresses RL-based optimization of UTC, in this chapter we introduce the background necessary for understanding our approach, as well as related work to position our contribution and distinguish our approach from existing approaches. We introduce Markov Decision Processes (MDPs) and discuss the essentials of RL and the most popular learning and action selection strategies. We also discuss the decentralization of RL. In addition, we review different classical (i.e., currently deployed and de facto) approaches to UTC along with the related work in non-RL-based and RL-based UTC optimization techniques.
2.1 Reinforcement Learning
In this section we introduce RL. We begin by describing MDPs given their close relation to modelling RL problems. We also introduce some well-known approaches to solving MDPs in the sense of seeking an optimal policy.
2.1.1 Markov Decision Processes
Often, RL problems are modelled using MDPs. An agent, or any entity that perceives and acts within an environment, could cause a new underlying state. Such a state could be the direct result of the agent's actions or due to other factors such as other agents' actions or the natural dynamics of the environment, e.g., the popular prey and predator or multiple predators problem (Kok & Vlassis, 2004).
through:
$S$: a discrete set of states representing the possible environmental settings
$A$: a discrete set of actions available to the agent
$R(s_t, a_t)$: a reward function that returns a reward for taking action $a$ in state $s$ at time $t$
$T(s_t, a_t, s_{t+1})$: a transition probability model known a priori that provides the probability $p(s_{t+1} \mid s_t, a_t)$ of transiting to state $s_{t+1}$ if action $a_t$ is taken from state $s_t$
Any problem modelled as an MDP must naturally satisfy the Markov property, i.e., the future behaviour depends on the current state $s_t$ but not on the past states. Such a property ensures that a given state captures the effect of a previously taken chain of actions, which allows simpler rules to solve for the MDP's optimal policy $\pi^*$, where $\pi^*$ is a mapping from the states to the best actions. It is possible in this case to write one-step formulas that can be, in some form, iterated upon in order to discover $\pi^*$. An immediate reward $r_{t+1}$ gives a goodness measure for the action $a_t$ executed in state $s_t$. It can be calculated based on the reward function $R(s_t, a_t)$ or sometimes using $R(s_t)$, which returns a reward for being in $s_t$. Such a reward, however, might be insufficient to capture the expected future effect or the long-term usefulness of taking a given action unless it is combined with future rewards. Hence, the concept of future discounted rewards emerges. Naturally, an agent is unlikely to wait forever in order to acquire a very high reward; however, it makes sense for it to include future rewards while decreasing their importance as they occur further away in time. Such behaviour can be achieved using a discount rate $\gamma \in [0, 1]$. The Bellman optimality equation (2.1) is a well-known optimality equation based on the concept of discounted rewards and states' expected utility. It is used to find the optimal utility $U^*$ for all states, which in many cases is referred to as $V^*$ as well. The Bellman equation aims at optimizing the value function $V$ / $U$ that gives a goodness measure for being in a certain state or, alternatively, the state-action value function $Q(s_t, a_t)$, which provides such a measure per action per state.

$$U^*(s_t) = R(s_t) + \gamma \max_{a_t \in A(s_t)} \sum_{s_{t+1}} T(s_t, a_t, s_{t+1}) \, U^*(s_{t+1}) \tag{2.1}$$

Several classical methods have been proposed to solve MDPs. The majority are considered to fall into the Dynamic Programming (DP) paradigm (Sutton & Barto, 1998). Two well-known DP methods for solving MDPs are the value and policy iteration methods. Although the DP paradigm assumes a perfect world model as in MDPs, it is an important basis for understanding RL which does
2.1.1.1 Value Iteration
As the name implies, this method iterates through each state in an MDP using the Bellman optimality equation (2.1) as an update rule in order to reach an optimal policy. The stopping rule for such an iteration is usually based on the maximum difference between subsequent state utility approximations. If that difference is less than $\epsilon(1 - \gamma)/\gamma$, then it is guaranteed that the error is less than some value of $\epsilon$. Such an approach relies on an MDP with a clearly predefined transition model and state rewards. The method returns the final optimal utility for all states, $U^*$. The value iteration method is shown in Algorithm 1.
Algorithm 1 The value iteration DP method
V_I($S$, $A$, $T$, $R$, $\gamma$, $\epsilon$)
  $U_t \leftarrow 0$
  Do
    $\lambda \leftarrow 0$
    For $s \in S$
      $U_{t+1}(s) \leftarrow R(s_t) + \gamma \max_a \sum_{s_{t+1}} T(s_t, a_t, s_{t+1}) \, U_t(s_{t+1})$
      $\lambda \leftarrow \max(|U_{t+1}(s) - U_t(s)|, \lambda)$
    End
  Until $\lambda < \epsilon(1 - \gamma)/\gamma$
  Return $U^*$
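The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch: the dictionary layout for `T` and `R`, the function name `value_iteration` and the two-state MDP are assumptions made for the example and are not taken from the thesis. The stopping threshold is the $\epsilon(1 - \gamma)/\gamma$ rule described above, which requires $0 < \gamma < 1$.

```python
from typing import Dict, List, Tuple

# Assumed layout: T[(s, a)] lists (next_state, probability) pairs,
# and R[s] is the reward for being in state s.
Transitions = Dict[Tuple[str, str], List[Tuple[str, float]]]


def value_iteration(states, actions, T: Transitions, R, gamma, epsilon):
    """Approximate U* using the stopping rule lambda < epsilon * (1 - gamma) / gamma.

    Requires 0 < gamma < 1 so that the threshold is positive and finite.
    """
    U = {s: 0.0 for s in states}
    threshold = epsilon * (1.0 - gamma) / gamma
    while True:
        U_next = {}
        largest_change = 0.0  # plays the role of lambda in Algorithm 1
        for s in states:
            # Bellman update: best expected utility over all actions.
            best = max(sum(p * U[s2] for s2, p in T[(s, a)]) for a in actions)
            U_next[s] = R[s] + gamma * best
            largest_change = max(largest_change, abs(U_next[s] - U[s]))
        U = U_next
        if largest_change < threshold:
            return U


# Hypothetical two-state MDP, invented purely for illustration.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"): [("B", 1.0)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"): [("A", 1.0)],
}
R = {"A": 0.0, "B": 1.0}

U_star = value_iteration(STATES, ACTIONS, T, R, gamma=0.5, epsilon=1e-6)
```

In this toy problem the fixed point of the Bellman equation can be checked by hand: staying in B satisfies $U(B) = 1 + 0.5\,U(B)$, and moving from A to B satisfies $U(A) = 0.5\,U(B)$.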
The optimal utility for all states $U^*$ can then be used to devise an optimal policy $\pi^*$ by selecting the action with the maximum expected utility for each state, denoted as $Q^*(s, a)$, based on equation (2.2).

$$Q^*(s_t, a_t) = R(s_t) + \gamma \sum_{s_{t+1}} T(s_t, a_t, s_{t+1}) \, U^*(s_{t+1}) \tag{2.2}$$

$$\pi^*(s) = \operatorname{argmax}_{a \in A(s)} Q^*(s, a) \tag{2.3}$$

The optimal policy $\pi^*$ can hence be formulated by finding the set of actions of maximum utility
2.1.1.2 Policy Iteration
The policy iteration method consists of two parts, through which an agent firstly produces a given policy using the Bellman update equation (2.1) and secondly tries to improve that policy if possible. In essence, the method runs as a sequence of producing policies and testing their stability until an optimal stable policy is found. The policy iteration method is shown in Algorithm 2.
Algorithm 2 The policy iteration DP method
P_I($S$, $A$, $T$, $R$, $\gamma$, $\epsilon$)
  Initialize $U$, $\pi$
  1: Do
    $\lambda \leftarrow 0$
    For $s \in S$
      $U_{t+1}(s) \leftarrow R(s) + \gamma \sum T(s_t, a_t, s_{t+1}) \, U_t(s_{t+1})$
      $\lambda \leftarrow \max(|U_{t+1}(s) - U_t(s)|, \lambda)$
    End
  Until $\lambda < \epsilon(1 - \gamma)/\gamma$
  For $s \in S$
    $temp \leftarrow \pi(s)$
    $\pi(s) \leftarrow \operatorname{argmax}_{a \in A(s)} [R(s_t) + \gamma \sum T(s_t, a_t, s_{t+1}) \, U(s_{t+1})]$
    If $temp \neq \pi(s)$ Then goto 1
  End
  Return $\pi^*$
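A possible Python rendering of the evaluate-then-improve cycle of Algorithm 2 is sketched below. It is an assumption-laden illustration rather than the thesis implementation: the helper names, the data layout of `T` and `R`, the small tolerance used to decide whether an action is strictly better, and the two-state MDP are all hypothetical. Instead of a `goto`, the sketch repeats evaluation until a full improvement pass leaves the policy unchanged.

```python
def expected_utility(s, a, T, U):
    """Expected utility of the successor states when taking a in s."""
    return sum(p * U[s2] for s2, p in T[(s, a)])


def evaluate_policy(policy, states, T, R, gamma, epsilon):
    """Iterate the fixed-policy update until the largest change is small."""
    U = {s: 0.0 for s in states}
    threshold = epsilon * (1.0 - gamma) / gamma  # needs 0 < gamma < 1
    while True:
        U_next = {
            s: R[s] + gamma * expected_utility(s, policy[s], T, U)
            for s in states
        }
        largest_change = max(abs(U_next[s] - U[s]) for s in states)
        U = U_next
        if largest_change < threshold:
            return U


def policy_iteration(states, actions, T, R, gamma, epsilon):
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        U = evaluate_policy(policy, states, T, R, gamma, epsilon)
        stable = True
        for s in states:
            best = max(actions, key=lambda a: expected_utility(s, a, T, U))
            # Switch only on a strict improvement to avoid flip-flopping on ties.
            gain = expected_utility(s, best, T, U) - expected_utility(s, policy[s], T, U)
            if gain > 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, U


# Hypothetical two-state MDP, invented purely for illustration.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"): [("B", 1.0)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"): [("A", 1.0)],
}
R = {"A": 0.0, "B": 1.0}

pi_star, U_pi = policy_iteration(STATES, ACTIONS, T, R, gamma=0.5, epsilon=1e-6)
```

Because only state B is rewarded in this toy problem, the stable policy moves from A to B and then stays there.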
The value iteration method is a compact version of the policy iteration method. The latter iteratively checks the stability of the resulting policy after a number of value function updates on all states, seeking exact convergence. On the other hand, the value iteration method ignores such a rule and acts greedily on the value function updates without seeking exact convergence but still resulting
2.1.1.3 Partially Observable MDPs
In certain situations an agent may not be able to determine the state it is currently in. Such a case is often the result of dealing with an uncertain environment where sensor inputs, fusion and inference techniques are unable to deduce a given state with certainty. Consequently, MDPs can be extended to encompass a belief model that provides a probability distribution over the possible set of agent states, yielding Partially Observable MDPs (POMDPs) (Kaelbling et al., 1998). For example, an agent can potentially be in three states with a belief state distribution of $\langle B(s_0) = 0.5, B(s_1) = 0, B(s_2) = 0.5 \rangle$, meaning that the agent can never be in $s_1$ but has an equal chance of being in either $s_0$ or $s_2$ at a given time. As the agent interacts with the uncertain environment, it will naturally need to update its belief model; hence an observation model $O(s, o)$ is used to inform the agent about the probability of an expected observation in a given state. Consequently, the belief model can be determined according to equation (2.4), where $\alpha$ is a normalization factor.

$$\forall s_{t+1}: \quad B_{t+1}(s_{t+1}) = \alpha \, O(s_{t+1}, o) \sum_{s_t} T(s_t, a, s_{t+1}) \, B_t(s_t) \tag{2.4}$$

Despite the indefinite number of states resulting from the continuous values in the belief model and the intractability of finding an optimal solution in such a case, some approaches have been proposed under assumed constraints in order to provide approximate solutions (Murphy, 1999).
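The belief update of equation (2.4) can be illustrated with a short Python sketch. The function name `update_belief`, the dictionary layouts for `T` and `O`, the action `"wait"`, the observation `"o"` and all probabilities below are hypothetical values chosen for the example; only the three-state belief $\langle 0.5, 0, 0.5 \rangle$ comes from the text above.

```python
def update_belief(belief, action, observation, T, O):
    """One step of equation (2.4).

    belief: {state: probability}
    T[(s, a)]: {next_state: probability}   (assumed layout)
    O[(s, o)]: probability of observing o in state s   (assumed layout)
    """
    unnormalized = {}
    for s_next in belief:
        # Predicted probability of reaching s_next before seeing the observation.
        predicted = sum(
            T[(s, action)].get(s_next, 0.0) * b for s, b in belief.items()
        )
        unnormalized[s_next] = O[(s_next, observation)] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    # alpha in equation (2.4) is 1 / total.
    return {s: v / total for s, v in unnormalized.items()}


# Belief from the text: the agent is in s0 or s2 with equal probability.
belief = {"s0": 0.5, "s1": 0.0, "s2": 0.5}

# Hypothetical models: "wait" leaves the state unchanged, and the
# observation "o" is far more likely in s0 than in s2.
T = {(s, "wait"): {s: 1.0} for s in belief}
O = {("s0", "o"): 0.9, ("s1", "o"): 0.5, ("s2", "o"): 0.1}

new_belief = update_belief(belief, "wait", "o", T, O)
```

With these assumed numbers the unnormalized weights are 0.45, 0 and 0.05, so the observation shifts most of the belief mass towards $s_0$ while $s_1$ stays impossible.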
2.1.2 Reinforcement Learning Structure
Reinforcement Learning (Sutton & Barto, 1998; Kaelbling et al., 1996) is an extensively studied approach to solving a wide range of optimization problems. RL is an unsupervised learning approach that aims at arriving at a setting through which states are optimally mapped to actions, i.e., in a manner that maximizes the long-term expected rewards received after executing a certain action in a given state at a given time. Such a setting achieved by an RL agent constitutes the agent's optimal policy. An RL agent typically discovers its environment through interaction, more specifically, by trial and error. Hence, the learning process through which an RL agent eventually tries to reach an optimal policy occurs by executing an action in a given environmental state and consequently evaluating the utility associated with that action in that state using the received reward and next state information. Such an RL approach is normally referred to as a model-free approach in the sense that it has no a
each state, i.e., a transition model. The contrary is naturally referred to as a model-based approach, which in certain cases predicts/estimates the outcome in terms of new state and reward. A typical RL agent, see Figure 2.1, represents its local environment through a state-action space in the form of an MDP.
Figure 2.1: A typical RL agent interacting with the underlying environment
s: state, r: reward, a: action
The reward model in RL can be discrete or continuous. One could design a discretized reward model where a constant value is returned based on some goodness conditions. For example, in a grid-world problem, an agent is required to navigate a grid to reach a goal state/square by taking a series of actions from a set of actions $A_{grid} = \{Left, Right, North, South\}$. The agent receives a high positive reward $r_{goal} = 100$ when an action $a \in A_{grid}$ leads to the goal state. On the other hand, any other action $a$ that does not lead to the goal state receives a negative reward $r = -1$. An RL agent trying to find all optimal paths leading to the goal state will try to maximize the expected future rewards in order to achieve its task. Indeed, the agent can receive hints (positive rewards) on the way to its goal square if the problem is modelled in such a way as to hasten achieving an optimal policy. Such a model allows squares positioned one action away from the goal state to return a high positive reward that is nevertheless lower than the goal reward, for example $r = 50$. Other reward models can be continuous in the sense that they can fall in a given range of values. For instance, an RL-based traffic light controller that tries to arrive at an optimal control policy, which allows the maximum number of vehicles to pass through, could have a reward model such that $r$ = number of vehicles passed through after a given lights setting. In that case, the range of $r$ is relative to the incoming traffic volume. Deciding on whether to use a discrete or a continuous reward
Essentially, an RL agent would model the underlying environment as an MDP. However, in complex dynamic problems such as UTC, it is often very difficult to obtain a definite probabilistic transition model for the MDP to be solved, assuming that the resulting state is based on traffic, for instance. The same argument applies to obtaining a reward prediction model. However, it is possible to design a reward model that translates environmental feedback. Q-Learning is one of the learning strategies that allow an RL agent, through its value function, to arrive at an optimal policy without the need for a transition model or a reward prediction model.
To function, an RL agent relies on a learning strategy, an action selection strategy, a reward model and, vitally, a representation of the underlying environment. We have discussed reward models and MDPs as the environmental representation. Next, we discuss different learning and action selection strategies.
2.1.2.1 Learning Strategies
A learning strategy allows the RL agent to gradually build its knowledge on how to optimally deal with the surrounding environment. That knowledge is cumulatively built through incorporating sensor information in a manner that affects the RL agent's view of the environment. Incorporation is mainly done through a value or a state-action value function update rule of some form.
Q-Learning
Q-Learning was first introduced in 1989 in Watkins' Ph.D. thesis (Watkins, 1989). Since then, it has been gaining popularity as a model-free RL technique. Q-Learning falls in the category of off-policy Temporal Difference (TD) learning strategies (Sutton & Barto, 1998). Those strategies are model-free and can update a certain RL agent's policy estimate based on the estimates of other elements in the policy as well as on the incoming rewards. Convergence is assured regardless of the action selection strategy or exploration technique as long as all state-action value pairs continue to be updated. Q-Learning typically behaves in an off-policy manner, which means that it learns even while taking actions that might prove to be non-optimal in the future.
Q-Learning controls the RL agent's learning pace through a learning rate variable $\alpha$ $(0 \le \alpha < 1)$ and the level by which it discounts future rewards through a discount rate variable $\gamma$ $(0 \le \gamma < 1)$. The Q-Learning update equation is presented in (2.5).

$$Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q_t(s_t, a_t) \right] \tag{2.5}$$

$r_{t+1}$: reward received after executing $a_t$

A high learning rate implies that the agent is more eager to adopt ongoing changes, denoted by $r_{t+1}$, and the future effects of action $a_t$, denoted by $\max_a Q(s_{t+1}, a)$, in its updated policy. The lower its discount rate, the more near-sighted the RL agent becomes, since the future effect of action $a_t$, denoted by $\max_a Q(s_{t+1}, a)$, is minimized.

Algorithm 3 Generic Q-Learning
QL($S$, $A$, $\alpha$, $\gamma$)
  Initialize lookup table $\forall Q(s, a)$
  For all episodes
    $s_t \leftarrow s_{initial}$
    For each step in the episode Do
      Select_Execute $a_t : a_t \in A(s_t)$ using some action selection strategy
      Receive $s_{t+1}$, $r_{t+1}$
      $Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q_t(s_t, a_t)]$
      $s_t \leftarrow s_{t+1}$
    Until $s_t == s_{terminal}$
  End

An RL agent using Q-Learning normally keeps a lookup table for all possible combinations of state-action pairs based on the MDP representation of its environment. Its MDP is built with neither a transition model for action occurrence likelihood nor a reward prediction model. Such a transition model is essentially learnt through interaction with the environment and can be deduced from the state-action values lookup table using some action selection strategy. A generic Q-Learning algorithm is presented in Algorithm 3. An episode, for example, in a grid-world scenario, could last until the agent arrives at a predefined goal state/square after starting from a different state. The number of learning episodes needed naturally depends on some form of convergence test, where the agent terminates if the result of that test is satisfactory. However, in an infinite horizon problem, i.e., a problem that has no specific goal/terminal state, the notion of an episode disappears. In such a case, it is more common to use a gradually decreasing learning rate paired with a biased action selection strategy that balances
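To make the generic Q-Learning loop concrete, the following Python sketch applies it to a one-dimensional corridor, a deliberately simplified stand-in for the grid world discussed earlier. Everything specific here is an assumption made for illustration: the corridor layout, the function names, the fixed random seed and the parameter values. The rewards (100 on reaching the goal, -1 otherwise) follow the discrete reward model described above, and actions are chosen with a simple exploring rule that picks a random action with probability `epsilon`.

```python
import random

ACTIONS = ("Left", "Right")
N_CELLS = 5            # cells 0..4; cell 4 is the goal (terminal) state
GOAL = N_CELLS - 1


def step(state, action):
    """Deterministic corridor dynamics (hypothetical environment)."""
    nxt = max(state - 1, 0) if action == "Left" else state + 1
    reward = 100.0 if nxt == GOAL else -1.0
    return nxt, reward


def select_action(Q, state, epsilon, rng):
    """With probability epsilon explore, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Lookup table over all state-action pairs, initialized to zero.
    Q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = select_action(Q, s, epsilon, rng)
            s_next, r = step(s, a)
            # Off-policy target: best value in the next state, whatever is done next.
            best_next = max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
greedy_policy = {
    s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)
}
```

No transition or reward prediction model is passed to the learner; it only sees the `(s_next, r)` pairs returned by `step`, which is what makes the sketch model-free in the sense used above.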
SARSA
The SARSA RL algorithm gets its name from the manner in which it updates its knowledge as an on-policy approach. Learning progresses in SARSA from a given state-action pair to another state-action pair, hence the name SARSA, i.e., State-Action Reward State-Action. The learning update rule in SARSA depends on selecting the next action $a_{t+1}$ for the next state $s_{t+1}$ using a common action selection strategy. In contrast, Q-Learning uses the best next action in $s_{t+1}$. A generic SARSA algorithm is presented in Algorithm 4.

Algorithm 4 Generic SARSA
SARSA($S$, $A$, $\alpha$, $\gamma$)
  Initialize lookup table $\forall Q(s, a)$
  For all episodes
    $s_t \leftarrow s_{initial}$
    Choose $a_t : a_t \in A(s_t)$ using some action selection strategy
    For each step in the episode Do
      Execute $a_t$
      Receive $s_{t+1}$, $r_{t+1}$
      Choose $a_{t+1} : a_{t+1} \in A(s_{t+1})$ using some action selection strategy
      $Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]$
      $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
    Until $s_t == s_{terminal}$
  End

SARSA tends to be a safer approach than Q-Learning in the sense that it selects the next action based on a given strategy, while Q-Learning takes the risk of taking actions that might not be optimal but still learns. As a result, Q-Learning is possibly able to reach the optimal policy with less accumulated rewards while SARSA will converge to a near-optimal one (Takadama & Fujita, 2005; Sutton & Barto, 1998). Indeed, the design of the reward model is essential in that case.
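The difference between the two update rules can be seen in a single hand-checkable step. The sketch below is illustrative only: the function names, the tiny Q-table and the parameter values are assumptions made for the example. Both learners start from the same table and observe the same transition, but SARSA bootstraps from the action actually chosen next (here an exploratory "Left"), whereas Q-Learning bootstraps from the best next action.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: uses Q(s_next, a_next) for the action actually chosen."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy update: uses the maximum of Q(s_next, .) over all actions."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


ACTIONS = ("Left", "Right")
# Hypothetical Q-table: in state 1, "Right" currently looks better than "Left".
initial_Q = {
    (0, "Left"): 0.0, (0, "Right"): 0.0,
    (1, "Left"): 10.0, (1, "Right"): 20.0,
}

# Same observed transition for both learners: (s=0, a="Right", r=-1, s_next=1).
sarsa_value = sarsa_update(dict(initial_Q), 0, "Right", -1.0, 1, "Left",
                           alpha=0.5, gamma=0.9)
q_value = q_learning_update(dict(initial_Q), 0, "Right", -1.0, 1, ACTIONS,
                            alpha=0.5, gamma=0.9)
```

Working the arithmetic by hand, SARSA moves the estimate to $0.5 \times (-1 + 0.9 \times 10) = 4.0$, while Q-Learning moves it to $0.5 \times (-1 + 0.9 \times 20) = 8.5$, because it ignores the exploratory choice made in the next state.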
2.1.2.2 Action Selection Strategies
The RL cycle cannot be complete without affecting the underlying environment through selected
has to efficiently explore (i.e., visit different states and try different actions) the state-action space, i.e., the MDP. Naturally, and often in the case of an infinite optimization problem, an RL agent should be able to gradually switch from exploring for an (optimal) policy to exploiting that policy. The role of a biased (i.e., one that allows for controlling the exploration period) action selection strategy (e.g., Boltzmann and $\epsilon$-greedy) hence becomes essential.

Greedy & $\epsilon$-Greedy
The most natural short-sighted strategy that an agent can follow is to always select the action with the maximum positive outcome. Such a strategy is referred to as being greedy given its continuous preference for actions with maximum estimated goodness. However, this type of strategy alone is problematic for efficient exploration in RL, as other, currently less favourable actions could result in better performance in the long run. To overcome that problem, a randomly greedy action selection strategy was devised, namely $\epsilon$-greedy. In such a strategy, the current best action is only selected with a probability $(1 - \epsilon)$ where $0 \le \epsilon \le 1$. The greediness of the RL agent is hence defined by the value of $\epsilon$. Moreover, this strategy is independent from the state-action value estimates $Q(s, a)$ in terms of the probability distribution used for selecting the actions.

Boltzmann
The Boltzmann action selection strategy is a customization of the softmax approach (Sutton & Barto, 1998) where the Boltzmann (also known as Gibbs) probability distribution is used to model the action selection strategy. Actions are selected based on a Boltzmann probability distribution built using their $Q(s, a)$ values, see equation (2.6).

$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b \in A} e^{Q(b)/\tau}} \tag{2.6}$$
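Equation (2.6) translates almost directly into code. The Python sketch below is a minimal illustration under stated assumptions: the function names and the example Q-values are hypothetical, and subtracting the maximum Q-value before exponentiating is a standard numerical-stability step added here that leaves the probabilities of (2.6) unchanged. The parameter `tau` is the temperature $\tau$ appearing in the equation and must be positive.

```python
import math
import random


def boltzmann_probabilities(q_values, tau):
    """Selection probabilities of equation (2.6) for {action: Q-value}."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shifting by the maximum avoids overflow and cancels in the ratio.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / tau) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def boltzmann_select(q_values, tau, rng=random):
    """Draw one action according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical Q-values for a single state.
q_values = {"Left": 1.0, "Right": 2.0}
hot = boltzmann_probabilities(q_values, tau=100.0)   # close to uniform
cold = boltzmann_probabilities(q_values, tau=0.1)    # close to greedy
choice = boltzmann_select(q_values, tau=1.0, rng=random.Random(0))
```

Unlike a strategy that explores uniformly at random, the probabilities here depend on the Q-values themselves, so clearly worse actions are tried less often than nearly optimal ones.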
The extent of Boltzmann exploration is controlled by the temperature parameter $\tau$ $(\tau > 0)$. The higher the value of $\tau$, the more explorative the RL agent is, i.e., actions tend to have nearly equal chances of being selected. As the temperature cools down, the shift towards exploitation becomes greater and the RL agent becomes more greedy. However, deciding on the best initial value of $\tau$ is
2.1.3 Decentralized Reinforcement Learning
As optimization problems become more complex in terms of scale, problem modelling in a classical centralized RL manner becomes more difficult and the solution might become intractable. Hence, the need for RL decentralization emerged. Such a decentralization is partially realized through the Multi-Agent RL (MARL) realm, which has resulted in a considerable amount of literature.
An important classical differentiation between MARL implementations is presented by Claus & Boutilier (1997), who distinguish between what they refer to as independent learners, where agents learn based purely on their interaction with the environment without realizing the existence of other agents, and joint action learners, where an agent learns a so-called joint action by observing other agents' actions and interpreting their local effects. Moreover, an interesting classical study by Tan (1998) shows that cooperation among RL agents, if done intelligently, may result in better performance than independent learning. Cooperation there includes communicating an agent's local information such as learning episodes, policies, selected actions, rewards and sensor information. The views presented by Claus & Boutilier (1997) and Tan (1998) form the foundations of modern RL decentralization, where the single RL agent world has been transformed into a world of RL agents trying to either compete or collaborate.
Concentrating on competitive behaviour, the minimax-Q-Learning algorithm (Littman, 1994), where an agent learns to win as a result of another agent's loss, has emerged. An extension to that approach is presented in (Hu & Wellman, 1998). A MARL scheme where coordination among agents is based on having a notion of other agents incorporated in the local state descriptions is presented in (Abul et al., 2000). They concentrate on problems with large state-action spaces where they use generalization and function approximation (Sutton & Barto, 1998). In a coordinated RL scheme (Guestrin et al., 2002), coordination among RL agents is based on coordination graphs where agents select an optimal joint action with one-hop neighbours without searching the large joint action space. A similar Q-Learning-specific approach is presented in (Kok & Vlassis, 2006, 2004). Following the idea of learning from the best, a cooperative learning approach for agents using Q-Learning with weighted agent expertness is described in (Ahmadabadi & Asadpour, 2002; Ahmadabadi et al., 2001). Furthermore, a distributed value function learning scheme is presented in (Jeff Schneider, 1999) where an RL agent exchanges its value function estimations with neighbouring agents. An agent in that scheme can learn a value function based on the sum of all other agents' discounted expected rewards. Cooperation through sharing rewards among RL agents is rationalized in (Miyazaki & Kobayashi, 1999) where they provide a minimum precondition to preserve that rationality. On the other hand, an alternative approach to incorporating other agents' rewards or Q-values is presented in (Tesauro,
strategies and predict other agents' strategies through Bayesian inference (Berger, 1993). In a partially observable problem, Goldman & Zilberstein (2004) propose a group of NEXP and P problems where agents share a common decentralized POMDP and try to maximize a global goal through different communication manners.
Figure 2.2: A decentralized RL structure
We see the decentralization of RL as a means to break down a global problem into manageable local RL problems. As a result, local RL agents try to act (collaboratively) towards a (near-)optimal solution for the common global problem, see Figure 2.2. In (Hoen et al., 2006) a study of the different behaviours (cooperative vs. competitive) of learning agents in a multi-agent system (MAS) is provided. As far as that study is concerned, we are interested in what they define as concurrent learning, where each agent learns using a dedicated learning process. Overall, we refer the reader to three surveys (Buşoniu et al., 2008; Yang & Gu, 2005; Panait & Luke, 2005) that provide a wider view on multi-agent (reinforcement) learning approaches.
2.2 Uncertainty in UTC
The nature of a UTC system is complex and often hard to predict. Numerous factors could shape the unpredictability in a UTC system. Humans' varying behavioural patterns, equipment wearing out and communication noise could all be seen as sources of uncertainty in UTC systems. Consider fluctuating city traffic over different periods of time and the resulting difficulty in adapting control decisions. Such decisions might not only have instantaneous effects but also long-term ones, making
due to humans, accidents, road works and nature. Interestingly, Satyanarayanan (2003) elaborates on uncertainty holistically: "it is ironic that in today's all-digital world, uncertainty reappears as a major concern at a higher level of representation". In (Viti, 2006), a thorough study is provided on the uncertainty and dynamics of road users' travelling and delay times. It is argued that uncertainty in a transportation network originates from the variability in supply and demand (Viti, 2006). This variability could be of a cyclic nature or sporadic (e.g., hosting a world football championship or possibly a railway workers' strike).
As far as uncertainty is concerned, we are mainly interested in fluctuating urban traffic and the means to generically detect traffic changes online and respond adequately using decentralized RL.
2.2.1 Traffic Patterns
As Visser & Molenkamp (2004) put it when discussing the identification of traffic patterns:
"Determining the daily and weekly patterns is a bit of an art, more than a science: results are partly dependent on every individual's own frame of reference (e.g., Is the Thursday before Easter a regular weekday as far as traffic is concerned?)."
Most common approaches to determining traffic patterns are offline approaches that require the analysis of massive historical data and engineering expertise (Venkatanarayana et al., 2007). Furthermore, a question arises concerning the constituents of a traffic pattern. Volume and directionality could be intuitively seen as important characteristics of a given traffic pattern. Several approaches including incident and traffic pattern or state detection have been proposed upon the introduction of Floating Vehicle Data (FVD) technologies (Kerner et al., 2005; Kamran & Haas, 2007; Matschke, 2004; Chen et al., 2007). Most of these approaches centrally process communicated FVD such as travel time and velocity in order to provide a global image of traffic status or, in certain cases, traffic accidents (Kamran & Haas, 2007). Also, they mai