THE EFFECTS OF CLUSTERING ON THE MEDIUM AND LARGE-SCALE CAPACITATED LOCATION-ROUTING PROBLEM
Jacoba Hendrina Bührmann
Supervisor: Professor Montaz Ali
Co-supervisor: Doctor Ian Campbell
A thesis submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfilment of the
requirements for the degree of Doctor of Philosophy.
Declaration
I declare that this thesis is my own unaided work. It is being submitted for the Degree of Doctor of Philosophy to the University of the Witwatersrand, Johannesburg. It has not been submitted before for any degree or examination at any other University.
(Signature of Candidate)
Abstract
This work investigates the effectiveness of using clustering methods in solving various capacitated location-routing problems (CLRP) for medium- and large-scale datasets, with up to 20 000 data points. Different clustering methods as well as hybrid clustering methods are tested and compared.
A new problem called the planar CLRP (plCLRP) is introduced. Based on the results from the clustering methods, cluster-based approaches are suggested to solve variants of the CLRP. These include the Hamiltonian p–median problem (HpMP), the planar CLRP (plCLRP), the concentrator discrete CLRP (cdCLRP) and the standard discrete CLRP (sdCLRP). A new method called the two-phased proportional regret ordering based unconstrained to constrained (PROBUC) method is also proposed to create capacitated clusters.
The focus falls on finding effective non-exponential time algorithms that can be used to solve large-scale problems with good results. A full set of results for each problem is presented and comparisons are made with known results from the literature where possible.
The PLRP (periodic location-routing problem), introduced by Prodhon and Prins (2008), is also investigated. A change in the current problem formulation, as provided by Prodhon (2011), is proposed to enforce single-source constraints across the time horizon and limit the maximum number of vehicles.
An approach to solve the PLRP, based on the cluster-based approaches to solve the discrete CLRPs, is suggested. The results of the cluster-based approach are compared to best-known solutions for existing PLRP instances given by Prodhon (2009a). A set of large-scale PLRP instances is introduced, based on instances generated by Harks et al. (2013) for the sdCLRP.
Samevatting (Summary)
This study investigates the use of clustering methods to solve various capacitated location-routing problems (CLRP). This includes solving medium- and large-scale datasets with up to a maximum of 20 000 data points. Different clustering methods, as well as hybrid clustering methods, are compared with one another and tested.
Based on the above investigation, algorithms that make use of clustering methods are proposed to solve four variants of the CLRP. These include the Hamiltonian p-median problem, the continuous CLRP and two discrete CLRP variants, namely the concentrator discrete and the standard discrete problems. A new method, called the two-phase proportional regret ordering based unconstrained to constrained method, is introduced to adapt clustered solutions according to capacity constraints.
The focus falls on non-exponential-time algorithms that can be used to solve large-scale problems with good results. The proposed algorithms were tested on large datasets for all four variants and compared with existing results where possible.
The periodic CLRP, introduced by Prodhon and Prins (2008), is also investigated. An adaptation of the current mathematical model, as given by Prodhon (2011), is proposed. Constraints to prevent customers from being served by more than one facility, as well as a maximum limit on the number of vehicles, are proposed. An algorithm based on the cluster-based algorithms proposed for the discrete CLRPs is proposed for the periodic CLRP. The algorithm is compared using academic datasets from Prodhon (2009a). Three large-scale datasets from Harks et al. (2013) for the discrete CLRP are adapted for the periodic CLRP, and the results are presented here.
To my parents. You always stand by me. Thank you for everything you do for me and for your unceasing love and support.
To Jan-Rudolph. Thank you for your love, your support and the wonderful life we have together.
To Sunelle and Johann. You are very special. Thank you for brightening my life anew every day.
Acknowledgements
I would like to acknowledge the following people for their contributions towards this thesis:
• Professor Montaz Ali and Doctor Ian Campbell from the University of the Witwatersrand, as supervisor and co-supervisor, thank you for your feedback and encouragement in completing this study.
• An anonymous source for the medium size case study data.
• Professor David Lubinsky and colleagues for opening up the world of logistics.
• Ms Jean le Roux, Operations Research senior lecturer UNISA, for your
Contents
Declaration
Abstract
Samevatting
Dedication
Acknowledgements
List of Figures
List of Tables
List of Acronyms
1 Introduction
1.1 Background and motivation
1.2 Problem statement
1.2.1 Limitations and scope
1.2.2 Thesis objectives
1.3 Methodology
1.3.1 Data preparation and experiments
1.3.2 Approach
1.4 Thesis contribution
1.5 Thesis overview
2 Literature review - clustering methods
2.1 SAHN clustering methods
2.1.1 The single linkage method
2.1.2 The complete linkage method
2.1.3 The average linkage method
2.1.4 The weighted average-linkage method
2.1.5 The centroid linkage method
2.1.6 The weighted centroid (median) linkage method
2.1.8 The Lance-Williams formula for SAHN methods
2.1.9 CPU times of the SAHN methods
2.1.10 Stopping rules in SAHN clustering
2.2 Iterative partitioning methods
2.2.1 The k–means method
2.2.2 The h–means method
2.2.3 The j–means (k–medoid) method
2.2.4 The PAM (h–medoid) method
2.2.5 The k–median method
2.2.6 The h–median method
2.2.7 Starting solutions for iterative partitioning methods
2.2.8 Methods to prevent the number of clusters from declining
2.2.9 The CLUSTER partitioning methods
2.2.10 The AS 136 method
2.3 Graph–based methods
2.3.1 MST graph–based clustering method
2.3.2 RNG graph–based clustering method
2.3.3 GG graph–based clustering method
2.4 Nearest neighbour methods
2.4.1 The k–near neighbours method
2.4.2 The mutual neighbourhood value (MNV) method
2.5 Density–based methods
2.5.1 The kth nearest neighbour clustering method
2.5.2 The hybrid clustering density–based method
2.6 Hybrid clustering methods
2.7 Clustering criteria
2.8 Distance calculations
2.9 Summary
3 Literature review - distribution network problems
3.1 The facility location problem (FLP)
3.1.1 The discrete CFLP
3.1.2 Solution approaches for the discrete CFLP
3.1.3 The continuous CFLP
3.1.4 Solution approaches for the continuous CFLP
3.2 The vehicle routing problem (VRP)
3.2.1 The CVRP
3.2.2 Solution approaches for the CVRP
3.2.3 The PVRP
3.2.4 Solution approaches for the PVRP
3.3 The location-routing problem (LRP)
3.3.2 Solution approaches for the HpMP
3.3.3 The planar or continuous location-routing problem (plLRP)
3.3.4 Solution approaches for the plLRP
3.3.5 The discrete capacitated location-routing problem (CLRP)
3.3.6 Solution approaches for the CLRP
3.4 The periodic location-routing problem (PLRP)
3.4.1 Solution approaches for the PLRP
3.5 Summary
4 An analysis of clustering methods
4.1 Methodology
4.1.1 Data preparation
4.1.2 Evaluation criteria
4.1.3 Method
4.2 Computational results
4.3 Discussion
4.3.1 SAHN clustering
4.3.2 Iterative partitioning clustering
4.3.3 Graph–based clustering
4.3.4 Nearest neighbour clustering
4.3.5 Density–based clustering
4.3.6 Hybrid clustering methods
4.4 Conclusions
5 Cluster-based approaches to solve variants of the CLRP
5.1 Solving the HpMP
5.1.1 Proposed new problem formulation
5.1.2 A new cluster-based method to solve the HpMP
5.1.3 Methodology and data preparation
5.1.4 Computational results
5.1.5 Discussion
5.1.6 Conclusions - HpMP
5.2 Solving the plCLRP
5.2.1 Proposed new problem formulation
5.2.2 Determining an initial number for depots and vehicles
5.2.3 Creating capacitated depot and route clusters
5.2.4 The two-phase PROBUC method
5.2.5 A new cluster-based method to solve the plCLRP
5.2.6 Methodology and data preparation
5.2.7 Computational results
5.2.8 Discussion
5.3 Solving the cdCLRP
5.3.1 A new cluster-based method to solve the cdCLRP
5.3.2 Methodology and data preparation
5.3.3 Computational results
5.3.4 Discussion
5.3.5 Conclusions - cdCLRP
5.4 Solving the sdCLRP
5.4.1 The effect of using the cdCLRP location phase in the sdCLRP
5.4.2 A new cluster-based method to solve the sdCLRP
5.4.3 Methodology and data preparation
5.4.4 Computational results
5.4.5 Discussion
5.4.6 Conclusions - sdCLRP
5.5 Conclusions
6 A cluster-based approach to solve the PLRP
6.1 Proposed adapted problem formulation
6.2 A new cluster-based method to solve the PLRP
6.3 Methodology and data preparation
6.4 Computational results
6.5 Discussion
6.6 Conclusions
7 Conclusions
7.1 Contribution
7.1.1 Clustering methods
7.1.2 The HpMP
7.1.3 The plCLRP
7.1.4 The cdCLRP
7.1.5 The sdCLRP
7.1.6 The PLRP
7.1.7 Test instances
7.2 Conclusive remarks
7.2.1 Observations regarding clustering methods
7.2.2 Visualisation of data points
7.3 Areas for future research
References
Appendix A Clustering method results
A.1 Clustering method results
A.1.1 Clustering method results - DataS1
A.1.3 Clustering method results - DataS3
A.1.4 Clustering method results - USA13509
A.2 Variations of the hybrid density method
A.3 Hybrid clustering methods
A.3.1 Hybrid SAHN - SAHN clustering methods - DataS3
A.3.2 Hybrid SAHN - partitioning clustering methods - DataS3
A.3.3 Stepwise aggregation partitioning clustering methods - DataS3
A.3.4 Hybrid partitioning clustering - DataS3
Appendix B Displaying points in Google maps
Appendix C Program optimisation techniques
C.1 The use of arrays versus recalculations
C.2 Methods to store a big distance matrix
C.3 Sorting methods
C.4 Keeping track of a cluster’s nearest neighbours
C.5 Using coordinates instead of distance calculations for comparisons
C.6 Identifying adjacent clusters
C.7 Arrays versus text string variables
C.8 Linked lists
List of Figures
2.1 A dendrogram representation of SAHN clustering.
2.2 An illustration of the hierarchical clustering distance calculation used by the Lance-Williams parameters.
2.3 An illustration of the k–means method.
2.4 The regions of influence for the RNG and GG (Jain and Dubes, 1998, p. 125).
3.1 Typical costs associated with the FLP problem.
3.2 An illustration of the discrete FLP problem.
3.3 The tree algorithm as illustrated by Papadimitriou and Steiglitz (1998).
3.4 Christofides algorithm as illustrated by Papadimitriou and Steiglitz (1998).
3.5 The four basic neighbourhood exchanges of the Granular Tabu Search (Toth and Vigo, 2003).
3.6 An illustration of the discrete LRP problem.
3.7 An example of the impact of outlier points on the placement of a depot in the LRP.
3.8 An example of customer assignments based on shortest distance in the CLRP.
3.9 The ECWA method as illustrated by Prins et al. (2006).
4.1 A summary of the results of the SSR measure for the various clustering methods using DataS3.
4.2 A summary of the results of the MSR measure for the various clustering methods using DataS3.
4.3 A summary of the results of the PD measure for the various clustering methods using DataS3.
4.4 A summary of the results of the ADM measure for the various clustering methods using DataS3.
4.5 A summary of the results of the AWCD measure for the various clustering methods using DataS3.
4.6 The CPU times for the various clustering methods using DataS3, up to 5 000 seconds.
4.7 An illustration of the single linkage method using DataS3.
4.8 An illustration of the complete linkage method using DataS3.
4.9 An illustration of the average linkage method using DataS3.
4.11 An illustration of the centroid linkage method using DataS3.
4.12 An illustration of the weighted centroid linkage method using DataS3.
4.13 An illustration of the Ward’s method using DataS3.
4.14 The results of the PAM clustering method using DataS2.
4.15 An illustration of the k–means partitioning method using DataS3.
4.16 An illustration of the h–means partitioning method using DataS3.
4.17 An illustration of the j–means partitioning method using DataS3.
4.18 An illustration of the PAM (h–medoid) partitioning method using DataS3.
4.19 An illustration of the k–median partitioning method using DataS3.
4.20 An illustration of the h–median partitioning method using DataS3.
4.21 An illustration of the best h–median starting solution using DataS3.
4.22 An illustration of the CLUSTER k–means partitioning method using DataS3.
4.23 An illustration of the CLUSTER h–means partitioning method using DataS3.
4.24 An illustration of the CLUSTER j–means partitioning method using DataS3.
4.25 An illustration of the CLUSTER PAM partitioning method using DataS3.
4.26 An illustration of the CLUSTER k–median partitioning method using DataS3.
4.27 An illustration of the CLUSTER h–median partitioning method using DataS3.
4.28 A zoomed-in view of the solution created by the CLUSTER PAM method using DataS3.
4.29 An illustration of the AS 136 algorithm using DataS3.
4.30 An illustration of the MST graph-based method using DataS3.
4.31 An illustration of the GG graph-based method using DataS3.
4.32 An illustration of the RNG graph-based method using DataS1.
4.33 An illustration of the k-near neighbours clustering method using DataS3.
4.34 An illustration of the density-based kth nearest neighbour clustering method using DataS3.
4.35 An illustration of the CLUSTER h–median density-based variant 1 method using DataS3.
4.36 An illustration of the CLUSTER h–median density-based variant 3 method using DataS3.
4.37 The SSR values for the hybrid single and complete linkage - Ward’s method.
4.38 The SSR values for the hybrid single and complete linkage - Ward’s method with 3 000 – 4 000 aggregated clusters.
4.39 The CPU times for the hybrid single and complete linkage - Ward’s method.
4.40 The SSR values for the hybrid single linkage - partitioning methods.
4.41 The SSR values for the hybrid complete linkage - partitioning methods.
4.42 The SSR values for the best SAHN - partitioning methods with 2 500 – 7 500 aggregated clusters.
4.43 The CPU times of the hybrid SAHN - partitioning methods.
4.44 The CPU times of the hybrid SAHN - partitioning methods with best SSR values.
4.45 The hybrid single linkage - h-median method with 6 500 aggregated clusters.
4.46 The SSR values for the aggregated stepwise- and hybrid SAHN - h–median methods.
4.47 The hybrid partitioning method with 70 clusters created with the h–means method in phase one followed by the h–median method.
5.1 The proposed cluster-based method to solve the HpMP.
5.2 The routing costs for various clustering methods with 50 clusters using DataS1.
5.3 The results from the complete linkage clustering method where capacitated clusters are ignored using DataS3.
5.4 The work flow diagram of the suggested two-phase PROBUC method.
5.5 The re-assignment phase of the suggested two-phased PROBUC method.
5.6 Different regret functions for the single linkage method using DataS3.
5.7 Different regret functions for the single linkage method using DataS3 (cont. a).
5.8 Different regret functions for the single linkage method using DataS3 (cont. b).
5.9 Different regret functions for the single linkage method using DataS3 (cont. c).
5.10 Different regret functions for the complete linkage method using DataS3.
5.11 Different regret functions for the complete linkage method using DataS3 (cont. a).
5.12 Different regret functions for the complete linkage method using DataS3 (cont. b).
5.13 Different regret functions for the complete linkage method using DataS3 (cont. c).
5.14 The proposed cluster-based method to solve the plCLRP.
5.15 The calculation of the depot locations for the plCLRP.
5.16 The proposed cluster-based method to solve the cdCLRP.
5.17 The difference between using the closest two customers as endpoints versus lollipop routes to calculate the distance of a route.
5.18 An example of using the location phase method suggested for the cdCLRP to solve the sdCLRP.
5.19 The proposed cluster-based method to solve the sdCLRP.
List of Tables
2.1 The Lance-Williams parameters for different SAHN methods.
3.1 An overview of the distribution network problem definitions.
4.1 The three derived datasets from a food distribution company used to compare the clustering methods.
5.1 The routing costs for the HpMP and clustering measures when creating 50 clusters using DataS1.
5.2 The routing cost for the HpMP using three TSPLIB instances given by Gollowitzer et al. (2014).
5.3 The results for the HpMP using the USA13509 TSPLIB instance.
5.4 The different proximity measures in the two-phase PROBUC method using DataS3.
5.5 A list of instances tested using the TSP13509 dataset.
5.6 A list of instances tested using the WTSP17237 dataset.
5.7 The results for the plCLRP using the USA13509 TSPLIB instances.
5.8 The results for the plCLRP using the WTSP17237 instances.
5.9 The results for the cdCLRP using the USA13509 TSPLIB instances.
5.10 The results for the cdCLRP using the WTSP17237 instances.
5.11 A list of instances tested from Harks et al. (2013) for the sdCLRP.
5.12 The results for the sdCLRP using the CLRlib instances generated by Harks et al. (2013).
6.1 A list of instances tested from Harks et al. (2013) adapted for the PLRP.
6.2 The results for the PLRP for instances given by Prodhon (2009a).
6.3 The results for the PLRP using the CLRlib instances generated by Harks et al. (2013).
6.4 The results for the PLRP using XL size instances generated by Harks et al. (2013), excluding the branch–and–bound method in the MCA.
6.5 A comparison of the results for the PLRP, between including and excluding the branch–and–bound method in the MCA.
A.1 The results of the SSR measure for the various clustering methods using
A.2 The results of the SSR measure for the various clustering methods using DataS1 (continue).
A.3 The results of the MSR measure for the various clustering methods using DataS1.
A.4 The results of the MSR measure for the various clustering methods using DataS1 (continue).
A.5 The results of the PD measure for the various clustering methods using DataS1.
A.6 The results of the PD measure for the various clustering methods using DataS1 (continue).
A.7 The results of the ADM measure for the various clustering methods using DataS1.
A.8 The results of the ADM measure for the various clustering methods using DataS1 (continue).
A.9 The results of the AWCD measure for the various clustering methods using DataS1.
A.10 The results of the AWCD measure for the various clustering methods using DataS1 (continue).
A.11 The CPU times for the various clustering methods using DataS1.
A.12 The CPU times for the various clustering methods using DataS1 (continue).
A.13 The results of the SSR measure for the various clustering methods using DataS2.
A.14 The results of the SSR measure for the various clustering methods using DataS2 (continue).
A.15 The results of the MSR measure for the various clustering methods using DataS2.
A.16 The results of the MSR measure for the various clustering methods using DataS2 (continue).
A.17 The results of the PD measure for the various clustering methods using DataS2.
A.18 The results of the PD measure for the various clustering methods using DataS2 (continue).
A.19 The results of the ADM measure for the various clustering methods using DataS2.
A.20 The results of the ADM measure for the various clustering methods using DataS2 (continue).
A.21 The results of the AWCD measure for the various clustering methods using DataS2.
A.22 The results of the AWCD measure for the various clustering methods using DataS2 (continue).
A.23 The CPU times for the various clustering methods using DataS2.
A.24 The CPU times for the various clustering methods using DataS2 (continue).
A.25 The results of the SSR measure for the various clustering methods using DataS3.
A.26 The results of the SSR measure for the various clustering methods using DataS3 (continue).
A.27 The results of the MSR measure for the various clustering methods using DataS3.
A.28 The results of the MSR measure for the various clustering methods using DataS3 (continue).
A.29 The results of the PD measure for the various clustering methods using DataS3.
A.30 The results of the PD measure for the various clustering methods using DataS3 (continue).
A.31 The results of the ADM measure for the various clustering methods using DataS3.
A.32 The results of the ADM measure for the various clustering methods using DataS3 (continue).
A.33 The results of the AWCD measure for the various clustering methods using DataS3.
A.34 The results of the AWCD measure for the various clustering methods using DataS3 (continue).
A.35 The CPU times for the various clustering methods using DataS3.
A.36 The CPU times for the various clustering methods using DataS3 (continue).
A.37 The results of the SSR and MSR measures for various clustering methods using USA13509.
A.38 The results of the PD and ADM measures for various clustering methods using USA13509.
A.39 The results of the AWCD measure and CPU times for various clustering methods using USA13509.
A.40 The results for the three variants of the hybrid density method using DataS3.
A.41 The results for the hybrid single linkage Ward’s and complete linkage Ward’s methods using DataS3.
A.42 The results of the SSR and MSR measures for the hybrid single linkage - partitioning clustering methods using DataS3.
A.43 The results of the PD and ADM measures for the hybrid single linkage - partitioning clustering methods using DataS3.
A.44 The results of the AWCD measure and CPU times for the hybrid single linkage - partitioning clustering methods using DataS3.
A.45 The results of the SSR and MSR measures for the hybrid complete linkage
A.46 The results of the PD and ADM measures for the hybrid complete linkage - partitioning clustering methods using DataS3.
A.47 The results of the AWCD measure and CPU times for the hybrid complete linkage - partitioning clustering methods using DataS3.
A.48 The results for the stepwise aggregated partitioning methods using DataS3.
A.49 The results of the SSR and MSR measures for the hybrid partitioning methods using DataS3.
A.50 The results of the PD and ADM measures for the hybrid partitioning methods using DataS3.
A.51 The results of the AWCD measure and CPU times for the hybrid partitioning
List of Acronyms
Acronym Meaning
ADM Average distance measure
AGS Active guided search
ACO Ant colony optimisation
ATL Alternating transportation-location heuristic
BKS Best-known solution
CCP Capacitated clustering problem
CCCP Capacitated centred clustering problem
CFLP Capacitated facility location problem
CCLP Capacitated concentrator location problem
cdCLRP Concentrator discrete capacitated location–routing problem
CLRGTS Cooperative Lagrangian relaxation - granular tabu search
CLRP Capacitated location–routing problem
CLS Composite local search
CpMP Capacitated p–median problem
CVRP Capacitated vehicle routing problem
CWA Clarke and Wright algorithm
DVRP Distance vehicle routing problem
ECWA Extended Clarke and Wright algorithm
ELS Evolutionary Local Search
FDR Furthest distance rule
FIFO First in first out
GA Genetic algorithm
GAP Generalised assignment problem
GG Gabriel graph
GIS Geographic information system
GLS Guided local search
GRASP Greedy randomised adaptive search procedure
HES Hybrid extended savings method
HpMP Hamiltonian p–median problem
ILP Integer linear programming
LH–ACS Lagrangian heuristic - ant colony system hybrid method
LTL Less-than-truckload shipments
LP Linear programming
LR Lagrangian relaxation
LRP Location–routing problem
MACS Multiple ant colony system
MA|PM Memetic algorithm with population management
MCA Modified Christofides algorithm
MDVRP Multi-depot vehicle routing problem
MILP Mixed integer linear programming
MNV Mutual neighbourhood value
MSR Mean of Squared Residuals
MST Minimum spanning tree
MSWP Multi-source Weber problem
mTSP Multiple travelling salesman problem
NN Nearest neighbours
PAM Partitioning around medoids method
PD Partition Diameter
plLRP planar LRP (uncapacitated)
PLRP Periodic location–routing problem
POPMUSIC Partial optimization metaheuristic under special intensification conditions
PROBUC Proportional regret ordering based unconstrained to constrained
pTSP Periodic travelling salesman problem
PVN Penalty variable neighbourhood
PVRP Periodic vehicle routing problem
RCL Restricted candidate list
RECWA Randomised extended Clarke and Wright algorithm
RNG Relative neighbourhood graph
RNN Reciprocal nearest neighbours
RVNS Reduced variable neighbourhood search
SAHN Sequential agglomerative hierarchical non-overlapping clustering
sdCLRP Standard discrete capacitated location–routing problem
SNC Subtour number constraints
SOM Self organised map
SSR Sum of squared residuals
TL Truckload shipments
TP Transportation problem
TSP Travelling salesman problem
VLSN Very large scale neighbourhood search
VNS Variable neighbourhood search
VRP Vehicle routing problem
VRPTW Vehicle routing problem with time windows
Chapter 1
Introduction
The design of a distribution network for a big company has never been a task for the faint of heart. Many companies spend hours searching for the optimal placement of depots in their distribution network in the light of their individual service delivery footprint. According to Langevin and Riopel (2005) the location of depots or facilities is one of the most difficult network decisions to make because of the huge costs involved.
With the rapid growth of available data and technological advances, GIS (geographic information system) applications are becoming commonly available. The ability to create visualisations and maps from GIS data is becoming more accessible to everyone∗. Because of this, GIS data is also becoming more useful and therefore more reliable. With this comes a motivation for companies to utilise the GIS data in their day-to-day operations while solving distribution problems like the vehicle routing problem.
Many goods distribution companies have customer footprints of tens of thousands of customers. For example, Klose (1990) refers to actual distribution network case studies with several thousand customers and Alvim and Taillard (2013) refer to a case study with 32 000 customers. The GIS datasets that companies have to deal with are therefore becoming larger over time.
1.1 Background and motivation
The capacitated location-routing problem (CLRP) is a well-known twofold optimisation problem. It addresses the placement of depots while at the same time constructing vehicle routes to ensure that all customers’ demands are met. The main objective is to minimise both the depot and vehicle costs. While the depots and vehicles can have capacity limitations, the customers can have varying demands that need to be met.
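To make the two parts of the objective concrete, the following is a minimal sketch of how the cost and the capacity feasibility of a candidate CLRP solution could be evaluated. It is an illustration only and not the formulation used in this thesis: the names `route_length`, `clrp_cost` and `is_feasible`, the Euclidean distances, and the split of vehicle cost into a fixed cost per vehicle plus a cost per unit distance are all assumptions.

```python
from math import hypot

def route_length(depot, customers):
    """Length of the closed tour depot -> customers (in order) -> depot."""
    stops = [depot] + list(customers) + [depot]
    return sum(hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(stops, stops[1:]))

def clrp_cost(solution, depot_cost, vehicle_cost, cost_per_distance=1.0):
    """Total cost of a solution given as {depot: [route, route, ...]}.

    Assumption: one opening cost per used depot, one fixed cost per
    vehicle (route), and a travel cost proportional to route length.
    """
    total = 0.0
    for depot, routes in solution.items():
        total += depot_cost
        for route in routes:
            total += vehicle_cost + cost_per_distance * route_length(depot, route)
    return total

def is_feasible(solution, demand, vehicle_capacity, depot_capacity):
    """Check the vehicle and depot capacity limits for every depot."""
    for routes in solution.values():
        loads = [sum(demand[c] for c in route) for route in routes]
        if any(load > vehicle_capacity for load in loads):
            return False
        if sum(loads) > depot_capacity:
            return False
    return True

# Hypothetical example: one depot at the origin serving two routes.
solution = {(0, 0): [[(3, 0), (3, 4)], [(0, -2)]]}
demand = {(3, 0): 4, (3, 4): 5, (0, -2): 3}
cost = clrp_cost(solution, depot_cost=100, vehicle_cost=10)
```

In this hypothetical instance the two routes have lengths 12 and 4, so the total cost is 100 + 2 × 10 + 16 = 136, and the solution is feasible for a vehicle capacity of 10 and a depot capacity of 12.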
∗ An example is the free GoogleAPI tool that can be used to display geographical points on Google Maps, used in this thesis and described in Appendix B.
Being derived from both the facility location problem (FLP) and vehicle routing problem (VRP), the CLRP is known to be NP–hard (Min, 1996; Lopes, 2011), and no polynomial-time algorithm is known that solves it to optimality.
Numerous heuristic and metaheuristic solution approaches for the CLRP have been suggested. Some are listed in review papers like Nagy and Salhi (2007) and Drexl et al. (2014). Up to 2013, the solution approaches have mostly been tested on small instances, like those given by Prins (2004), where the biggest academic dataset with benchmarks had 200 customers and 10 potential depots. Unfortunately, these approaches often become impractical and too time-consuming for bigger case studies.
Recently the focus has turned to solving location-routing problems with bigger datasets. Harks et al. (2013) proposed an approximate algorithm† to solve the standard discrete CLRP (sdCLRP). Their largest instance is a randomly generated dataset with 10 000 customers and 1 000 potential depots. Alvim and Taillard (2013) proposed a solution approach to solve a variant of the CLRP, called the concentrator discrete CLRP (cdCLRP), where all customer locations are considered potential depots. This case study does not cater for depot capacity constraints and allows an unlimited number of vehicles per depot. Their largest instance has over 1.9 million customers and is derived from the World TSP instance (Cook, 2013).
From papers like Toth and Vigo (2003) and Lam et al. (2009) it is clear that customers within close proximity of each other should be grouped on the same routes and assigned to the same depots in order to create good CLRP solutions. Identifying customers within close proximity can easily be done with the help of clustering methods. Some clustering methods can quite effectively create clusters for large instances. It is therefore worth investigating how these clustering methods can be used to solve the CLRP for big case studies as well.
Min (1996), Barreto et al. (2007) and Lam et al. (2009) have investigated the use of clustering methods in the CLRP in the past. These studies were limited to hierarchical clustering methods. The solution approaches were also not tested on big instances.
1.2 Problem statement
The purpose of this thesis is to evaluate and compare the results from different clustering methods for large case studies with more than ten thousand data points. Based on the above investigation, cluster-based approaches are proposed to solve four variants of the CLRP and the PLRP (periodic location-routing problem). These solution approaches are then evaluated to determine if they are useful to solve bigger case studies.
†
Approximate algorithms find solutions guaranteed to be within a bounded optimality gap.
1.2.1 Limitations and scope
The study is limited to the following five clustering methods:
• Sequential agglomerative hierarchical non-overlapping (SAHN) clustering methods;
• Iterative partitioning methods;
• Graph–based methods;
• Nearest neighbour methods; and
• Density–based methods.
Combinations of the above mentioned methods are also tested as hybrid clustering methods. The methods are described in Chapter 2.
The following variants of the CLRP are studied:
• HpMP (Hamiltonian p–median problem);
• plCLRP (planar capacitated location-routing problem);
• cdCLRP (concentrator discrete capacitated location-routing problem); and
• sdCLRP (standard discrete capacitated location-routing problem).
The PLRP (periodic location-routing problem) is also studied. An overview of the various distribution network problems and their characteristics is given in Table 3.1.
A number of case studies are considered with instances containing between 1 000 and 20 000 customer points and up to 1 000 potential depots in the case of the discrete CLRP. This is in line with realistic case study sizes and big enough to give an indication of the effectiveness of the solution approaches while at the same time considering software limitations such as variable declarations.
The study is limited to single-source problems where all customer demands are to be met in full, from a single depot. Each customer is therefore assigned to one and only one depot in all cases. In the case of the PLRP, customers are assigned to a single depot across the complete time horizon.
1.2.2 Thesis objectives
• Investigate, code and compare various clustering methods with regards to CPU times and quality of clustering. This includes:
– Investigate methods to create capacitated clusters using clustering methods.
– Investigate the impact of starting solutions on the various iterative partitioning methods.
– Investigate the combining of the most effective clustering methods investigated above into hybrid clustering methods.
• Propose solution approaches using clustering methods to solve variants of the CLRP. This includes:
– Propose a method to create capacitated clusters from clustering methods.
– Propose a simple and quick heuristic to create routes once the customers have been clustered.
• Adapt the newly proposed solution approaches for the discrete CLRP to also solve the PLRP.
1.3 Methodology
The following methodology with regard to data preparation and approach has been used in this thesis:
1.3.1 Data preparation and experiments
All clustering methods and suggested solution approaches in this thesis were coded and tested by the author using Visual Basic for MS Excel 2013. Excel was chosen because it is an application that lends itself to the storing and analysis of big datasets. It can also be used as a data input tool and the code can easily be accessed and modified.
The experiments were carried out on an Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz computer with 8GB RAM, running the Windows 7 operating system. Instead of using standard functions or software packages, each of the clustering methods was coded so that different variations and modifications in the algorithms could also be tested.
Three types of datasets are used to test the cluster-based methods proposed in this thesis. These include:
• a dataset of 7 681 geographical data points from a South African food distribution company. The dataset represents the delivery locations of customer points in the distribution network of the company. The customer locations are data points used in the clustering methods with decimal Euclidean coordinates (x, y) used in GIS applications to determine positioning. This data was used in Chapter 4 to test the relevant clustering methods‡;
• academic instances found in the literature for the HpMP, sdCLRP, cdCLRP and PLRP (Gollowitzer et al., 2014; Alvim and Taillard, 2013; Harks et al., 2013; Prodhon, 2009a); and
• a large-scale TSP instance by Reinelt (2015) with 13 509 customers from the USA.
Not all of the problems have benchmarked datasets and some only have small instances. For these problems, the instances were adapted to test the proposed solutions for big case studies.
1.3.2 Approach
Several challenges have to be addressed in order to use clustering methods in solving the CLRP. These include:
• The methods have to be modified to cater for capacity constraints and variations in demand size.
• While the number of depots and vehicles to use are unknown in the CLRP, they are needed as inputs for the clustering methods.
• Some iterative partitioning clustering methods rely heavily on good starting positions to generate good results, which means that another method must first be identified to generate good starting solutions before these methods will become useful.
• Clustering methods determine their own cluster centres and do not inherently keep the centres of clusters at fixed potential depot locations. For the discrete CLRP, these cluster centres have to be moved back to the depot locations afterwards.
Based on these challenges, the research was divided into six steps. These are as follows:
1.3.2.1 An analysis of clustering methods
The effectiveness of the five different clustering methods is analysed and compared in Chapter 4. Five criteria are used to determine the quality of a clustering solution. The SSR (sum of squared residuals) measure is used the most in comparisons because it is the closest to the objective function value of the facility location phase of the CLRP. The impact of starting solutions on the different iterative partitioning methods is also critically examined.
‡
This is a privately owned dataset from a food distribution company. Approval has been given to use the data in this thesis. The data is not made publicly available but visualisations can be seen in Chapter 4.
Some of the SAHN and iterative partitioning methods are then combined in hybrid clustering methods. Here the data is first aggregated with one method before being clustered with a second method.
1.3.2.2 Solution approach for the HpMP
Section 5.1 deals with the creation of Hamiltonian cycles using clustering methods. A heuristic algorithm called the Modified Christofides Algorithm (MCA) is introduced to create the route cycles. This method is based on the existing Christofides algorithm. A modified problem formulation for the HpMP is also proposed.
1.3.2.3 Solution approach for the plCLRP
A new problem called the plCLRP is introduced in Section 5.2. Here the multi-depot planar LRP (plLRP) is adapted to a fixed charge problem and multi-depot capacity constraints are added. Based on the current plLRP problem formulation, a new capacitated fixed charge problem formulation for the plCLRP is introduced.
A new algorithm to create capacitated clusters from the clustering methods is proposed. The method is called the two-phased PROBUC (proportional regret ordering based unconstrained to constrained) method. This is described in Section 5.2.
1.3.2.4 Solution approach for the cdCLRP
The solution approach used in the plCLRP is adapted for the concentrator case in the cdCLRP. Different continuous to discrete methods are discussed. The concept of lollipop routes to approximate route costs is discussed. With the use of lollipop routes the depot locations can be determined without changing the routes. If any route has a round trip cost larger than the cost of opening a depot, the number of depots is increased. Once a decision is made on the number of depots to use, the final distribution costs can be calculated.
1.3.2.5 Solution approach for the sdCLRP
The solution approach for the cdCLRP will be ineffective for the sdCLRP if the depot clusters are not close to potential depots. This solution approach also does not cater for big variations in depot supplies and costs. An alternative heuristic is proposed for the location phase of the solution approach to solve the sdCLRP.
1.3.2.6 Solution approach for the PLRP
A heuristic to cater for the periodic component of the PLRP is proposed. This heuristic makes use of the solution approaches for the discrete CLRP to solve the PLRP. A modified version of the problem formulation for the PLRP is also introduced to enforce the single-source constraint across multiple visits and limit the maximum number of vehicles. These topics are discussed in Chapter 6.
1.4 Thesis contribution
The contributions of this study are as follows:
• Various clustering methods and hybrid combinations are coded, tested and compared using the same instances. The comparisons are based on five measuring criteria, CPU times and illustrations of the clustering solutions in maps.
• A new problem called the plCLRP is introduced and a new problem formulation based on the plLRP and CVRP is introduced.
• Modified problem formulations are provided for the HpMP and PLRP.
• Cluster-based solution approaches to solve the HpMP, plCLRP, cdCLRP, sdCLRP and PLRP are suggested.
• New large scale instances are introduced for the HpMP, plCLRP and PLRP and benchmarks are provided.
• Results are compared to benchmarked instances for the cdCLRP and sdCLRP and a couple of the smaller instances for the HpMP and PLRP.
1.5 Thesis overview
The remainder of this research can be summarised as follows. Chapter 2 contains a literature review of clustering methods while Chapter 3 discusses the relevant distribution network problems. In Chapter 4 the effectiveness of the different clustering methods is analysed and compared. Based on these findings, Chapter 5 proposes new cluster-based solution approaches to solve four variants of the CLRP. Chapter 6 proposes a cluster-based solution approach to solve the PLRP. Chapter 7 provides conclusions and identifies future areas of research.
Chapter 2
Literature review - clustering methods
The goal of clustering methods is to group items together to minimise the total dissimilarity between the points in a cluster. This goes hand-in-hand with maximising the dissimilarity between points from different clusters, (Everitt et al., 2011). The idea of clustering items together to simplify a problem is very intuitive and for most people, it is the first thought that comes to mind when looking at a map of customers to be serviced.
Multiple clustering methods have been developed over the years; the most well-known are discussed in books like Hartigan (1975), Jain and Dubes (1998) and Everitt et al. (2011). Five different types of clustering methods for geographical data are discussed; these include hierarchical, iterative partitioning, graph-based, nearest neighbour and density-based clustering methods.
2.1 SAHN clustering methods
Hierarchical clustering methods use a nested sequence of partitions. There are two types of hierarchical clustering methods: agglomerative and divisive. The agglomerative hierarchical methods start with each customer or point placed in its own cluster. The clusters are then merged together based on inter-cluster distances, (Jain and Dubes, 1998). In contrast, divisive methods start with all points in one cluster and incrementally divide the points into an increasing number of clusters. According to Everitt et al. (2011, p 84), divisive methods are computationally more demanding than agglomerative methods. If used with an exhaustive search, the time complexity of the divisive methods is $O(2^n)$, where $n$ is the number of points to be clustered. This is higher than the maximum estimated time complexity of $O(n^3)$ for agglomerative methods, (Müllner, 2013). For these reasons the focus in this work will be on agglomerative hierarchical methods. These methods are referred to as SAHN (sequential agglomerative hierarchical non-overlapping) methods, (Jain and Dubes, 1998, p 79).
SAHN clustering is often depicted with dendrograms, (Jain and Dubes, 1998, p 65). A dendrogram is a tree diagram that graphically represents the merging of the clusters. The horizontal axis represents the points and the vertical axis represents the critical distance measure. Horizontal lines are drawn to represent the merging of two clusters or points. An example of a dendrogram is given in Figure 2.1.
Figure 2.1: A dendrogram representation of SAHN clustering.
In this figure, the dendrogram illustrates that SAHN methods can continue clustering until all customers are in one cluster. This can be prevented by the use of stopping rules as discussed in Section 2.1.10. Since the vertical axis represents the distance measure, a stopping rule can be represented by one or more horizontal lines through the dendrogram and is called a "cut", (Everitt et al., 2011, p 95).
SAHN methods iteratively calculate the inter-cluster distances and then merge the pair of clusters with the shortest distance between them. Different inter-cluster distances can be used, for example: shortest distance, furthest distance, average distance, distance between centroids and the sum of intra-cluster variance, also known as Ward’s method (Everitt et al., 2011). The different SAHN methods are now discussed in more detail.
2.1.1 The single linkage method
The single linkage method is also referred to as the "nearest-neighbour" or minimum spanning tree (MST) method, (Han and Kamber, 2006). The distance between two clusters is defined as the distance between the two closest points, one from each cluster. The two clusters with the smallest distance between them are then merged. The process is repeated until all points belong to one cluster. A stopping rule can also be used to stop earlier, for example when a specified number of clusters is reached. According to Lam (2008), this method tends to produce string-like clusters, known as chaining. The chain effect could be useful when creating routes, but it is not efficient for the location phase, (Lam et al., 2009; Barreto et al., 2007).
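The merging procedure described above can be illustrated with a minimal Python sketch. It is a naive implementation for illustration only, not the code used in this thesis (which was written in Visual Basic); the function name `single_linkage` and the example points are assumptions, and the stopping rule is simply a required number of clusters.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(points, num_clusters):
    """Naive single linkage SAHN clustering; stops at num_clusters clusters."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest points.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical data: a chain of points one unit apart plus a distant pair.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
result = single_linkage(points, 2)
```

In this example the four points that lie one unit apart are chained into a single string-like cluster, which illustrates the chaining effect mentioned above.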
2.1.2 The complete linkage method
A second method, called the complete linkage method, uses the furthest inter-cluster distance to determine which clusters to merge. Similar to the single linkage method, the distances between all points in the different clusters are determined. The maximum distance between any two points in two different clusters is identified as the furthest inter-cluster distance for the cluster pair. All furthest inter-cluster distances are ordered based on size and the cluster pair with the shortest furthest inter-cluster distance is merged. All furthest inter-cluster distances for the new merged cluster are calculated and the process is repeated to find the next two clusters to merge. Han and Kamber (2006) call this method the furthest-neighbour clustering algorithm. Everitt et al. (2011) state that this method tends to create compact clusters with equal diameters, but does not consider different cluster structures.
2.1.3 The average linkage method
This method merges clusters based on the average distance between all inter-cluster pairs, (Everitt et al., 2011). Similar to the other SAHN methods, the distances are ordered and the cluster pair with the shortest average distance are merged. The process is repeated until a stopping rule is reached or all points are in one cluster. The method is also referred to as the unweighted pair-group method using averages (UPGMA). It is called "unweighted" because all inter-cluster pair distances weigh the same when calculating the average distance. Weighting based on previous merges is not considered.
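The three inter-cluster distances discussed so far differ only in how the pairwise point distances are summarised. The following Python sketch makes the difference explicit; the function name `linkage_distance` and the example clusters are illustrative assumptions.

```python
from math import dist


def linkage_distance(points_a, points_b, kind):
    """Inter-cluster distance between two clusters given as lists of points."""
    pair_distances = [dist(a, b) for a in points_a for b in points_b]
    if kind == "single":      # closest pair of points
        return min(pair_distances)
    if kind == "complete":    # furthest pair of points
        return max(pair_distances)
    if kind == "average":     # unweighted mean over all inter-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage kind: {kind}")


# Hypothetical clusters on a line; the pairwise distances are 3, 5, 2 and 4.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]
single = linkage_distance(cluster_a, cluster_b, "single")
complete = linkage_distance(cluster_a, cluster_b, "complete")
average = linkage_distance(cluster_a, cluster_b, "average")
```

For these two clusters the single, complete and average linkage distances are 2, 5 and 3.5 respectively.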
2.1.4 The weighted average-linkage method
The weighted average-linkage method is also called the weighted pair-group method using averages (WPGMA). The merging of the clusters is based on the shortest weighted average inter-cluster distance. It is called "weighted" because the distances used to calculate the average, are weighted based on the number of points in each of the two clusters from the previous cluster merge, (Everitt et al., 2011).
The weighted average distance can be described based on the illustration given in Figure 2.2. Consider two mutually exclusive clusters i and j, each with ni and nj number of points respectively. In the figure, cluster i and j have merged to form cluster k. The total number of points in cluster k is then nk= ni+ nj. Let the sets I, J and H be the set of points in clusters i, j and h respectively. Then I ∪ J = K is the set of points in cluster k.
The weighted average inter-cluster distance between an arbitrary cluster h and cluster k (Everitt et al., 2011) can now be calculated as
$$d_{hk} = \frac{1}{n_i}\left(\sum_{x=1}^{n_i}\sum_{y=1}^{n_h} d(xy)\right) + \frac{1}{n_j}\sum_{z=1}^{n_j}\sum_{y=1}^{n_h} d(zy) \tag{2.1}$$
$$x \in I,\; y \in H,\; z \in J \text{ and } I \cup J = K, \tag{2.2}$$
where $d(xy)$ is the distance between points x and y and $d_{hk}$ is the weighted average distance between clusters h and k. The weighting is therefore based on the number of points in clusters i and j and each inter-cluster pair does not contribute the same weight towards the average. Instead, this method ensures that the inter-cluster distances from both clusters i and j contribute equally to the new average.
If the clusters have an equal number of points, the average would be the same as for the average linkage method. If one of the clusters is smaller, the points in the smaller cluster will weigh more than in the big cluster in order to get the clusters to weigh the same. The rationale behind this is to prevent small clusters from being dominated by bigger ones.
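The difference between the unweighted and weighted averages can be seen in a small numerical sketch. It uses the update form of the two methods given by the Lance-Williams parameters in Table 2.1, applied to the average distances from cluster h to clusters i and j. The function names and the numbers are illustrative assumptions.

```python
def upgma_update(d_hi, d_hj, n_i, n_j):
    """Average linkage: every point in i and j carries the same weight."""
    return (n_i * d_hi + n_j * d_hj) / (n_i + n_j)


def wpgma_update(d_hi, d_hj):
    """Weighted average linkage: clusters i and j carry the same weight."""
    return 0.5 * d_hi + 0.5 * d_hj


# Hypothetical case: a single-point cluster i far from h (average distance 10)
# merges with a three-point cluster j close to h (average distance 2).
unweighted = upgma_update(10.0, 2.0, n_i=1, n_j=3)
weighted = wpgma_update(10.0, 2.0)
```

Here the unweighted average is 4 while the weighted average is 6, so the single point in the small cluster has a larger influence under the weighted method. With equal cluster sizes the two updates coincide.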
2.1.5 The centroid linkage method
In the centroid linkage method, the average of the longitudes and latitudes of all points in a cluster is calculated. This is referred to as the centroid of each cluster, (Everitt et al., 2011). Clusters are then merged based on closest distance between the centroids of the clusters. The centroid linkage method calculates the distances between all centroids.
If each point i is represented by a vector in Euclidean space $x_i = (x_i^1, x_i^2, \ldots, x_i^l)$, for $x_i \in \mathbb{R}^l$, then the centroids are determined by the mean of the coordinates, (Everitt et al., 2011). The mean of the rth coordinate for cluster k is represented by
$$\bar{x}_k^r = \frac{1}{n_k}\sum_{i=1}^{n_k} x_i^r, \tag{2.3}$$
where $x_i^r$ is the value of the rth coordinate of point i assigned to cluster k and $n_k$ is the number of points in cluster k. The centroid is the vector with all mean values as coordinates, $\bar{x}_k = (\bar{x}_k^1, \bar{x}_k^2, \ldots, \bar{x}_k^l)$.
The centroid method is also referred to as the unweighted pair-group method using the centroid approach (UPGMC). Similar to the average linkage method, it is called unweighted because each point in a cluster is equally weighted when determining the centroid.
2.1.6 The weighted centroid (median) linkage method
The weighted centroid linkage method is also called the weighted pair-group method using the centroid approach (WPGMC). The method also uses the distances between the centroids of the clusters to determine the inter-cluster distances, similar to the centroid linkage method. The difference between the centroid and weighted centroid linkage methods lies in how the centroid of cluster k is calculated, (Everitt et al., 2011). Similar to the weighted average method, this method gives the two clusters that have just merged an equal contribution when calculating the centroid of the new merged cluster. If the two merged clusters differ in size, points in the smaller cluster are given a bigger weight compared to the points in the bigger cluster to ensure an equal contribution per cluster.
This is called the median of the two merged clusters and the method is also referred to as the median SAHN method, (Everitt et al., 2011).
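A short Python sketch shows how the centre of a merged cluster differs between the centroid (UPGMC) and median (WPGMC) methods. The centroid follows the coordinate means described above, while the median is taken as the midpoint of the two merged cluster centres. The function names and the example points are illustrative assumptions.

```python
def centroid(points):
    """Mean of each coordinate over all points in a cluster."""
    n = len(points)
    return tuple(sum(p[r] for p in points) / n for r in range(len(points[0])))


def median_centre(centre_i, centre_j):
    """WPGMC centre: midpoint of the centres of the two merged clusters."""
    return tuple((a + b) / 2 for a, b in zip(centre_i, centre_j))


cluster_i = [(0.0, 0.0)]                          # one point
cluster_j = [(4.0, 0.0), (4.0, 2.0), (4.0, 4.0)]  # three points

upgmc = centroid(cluster_i + cluster_j)
wpgmc = median_centre(centroid(cluster_i), centroid(cluster_j))
```

The centroid of the merged cluster is (3.0, 1.5), pulled towards the larger cluster, while the median is (2.0, 1.0), halfway between the two original centres.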
2.1.7 Ward’s method
Ward (1963) suggested a method to minimise the incremental sum of square variance within clusters. The sum of squares, $E_k$, within a cluster k is determined by the sum of the squared distances between the points in the set K, containing all the points in cluster k, and their mean. If the points are defined by more than one Euclidean space variable, the sum of square variance of all the variables needs to be added together to determine $E_k$.
The two clusters whose merge results in the smallest increase in the total sum of squares are merged in every iteration. $E_k$ can be calculated as
$$\begin{aligned} E_k &= \sum_{i=1}^{n_k} \lVert x_i - \bar{x}_k \rVert_2^2 \qquad i \in K && (2.4)\\ &= \sum_{r=1}^{l}\sum_{i=1}^{n_k} (x_i^r - \bar{x}_k^r)^2 \qquad i \in K, && (2.5) \end{aligned}$$
where $\bar{x}_k^r$ is the mean of the rth variable in cluster k, $x_i^r$ is the value of the rth variable for point i in cluster k, $n_k$ is the number of points in cluster k and l is the total number of variables defined in the Euclidean space. The total sum of square variance for all the clusters is then
$$E = \sum_{k=1}^{p} E_k \tag{2.6}$$
where p is the total number of clusters. Everitt et al. (2011, p. 77) refer to this as the within-cluster sum of square errors, while Jain and Dubes (1998) call it the sum of intra–cluster variance.
According to Everitt et al. (2011), this method tends to create equal sized clusters with a spherical form, but it is sensitive to outliers. Barreto et al. (2007) and Lam et al. (2009) found this method to be the most effective in the routing phase of the LRP.
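As an illustration of the merge criterion, the following Python sketch computes the sum of squares within a cluster and the increase in the total sum of squares caused by a candidate merge. The function names and example clusters are assumptions made for this sketch.

```python
def centroid(points):
    """Mean of each coordinate over all points in a cluster."""
    n = len(points)
    return tuple(sum(p[r] for p in points) / n for r in range(len(points[0])))


def sum_of_squares(points):
    """E_k: sum of squared Euclidean distances to the cluster mean."""
    c = centroid(points)
    return sum(sum((p[r] - c[r]) ** 2 for r in range(len(c))) for p in points)


def ward_increase(cluster_i, cluster_j):
    """Increase in the total sum of squares E if clusters i and j are merged."""
    return (sum_of_squares(cluster_i + cluster_j)
            - sum_of_squares(cluster_i) - sum_of_squares(cluster_j))


# Hypothetical clusters on a line.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 0.0), (12.0, 0.0)]
c = [(3.0, 0.0)]
```

Merging a with the nearby point c increases E by 8/3, while merging a with the distant cluster b increases E by 100, so the merge of a and c would be preferred.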
2.1.8 The Lance-Williams formula for SAHN methods
The Lance-Williams formula can be used as an alternative to calculate the new inter-cluster distances after every merge, based on the previous inter-cluster distances, (Lance and Williams, 1967). The advantage of the formula is that the inter-cluster distance calculations are hugely simplified. The authors define a general recurring distance formula to calculate the new inter-cluster distances between the clusters after every merge.
The method starts by calculating the distances between all customers and storing the values in a matrix, called a distance matrix (Li et al., 2010). This is used at the start because all customers are in their own clusters. These distances can be calculated in different ways, as discussed in Section 2.8. After every merge the inter-cluster distances are updated based on the previous values. In Figure 2.2, the inter-cluster distance between cluster h and cluster k is represented by $d_{hk} = d_{h(ij)}$ and can be determined as
$$d_{hk} = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \lambda |d_{hi} - d_{hj}|, \tag{2.7}$$
where the Lance-Williams parameters $\alpha_i$, $\alpha_j$, $\beta$ and $\lambda$ differ per method.
Figure 2.2: An illustration of the hierarchical clustering distance calculation used by the Lance-Williams parameters.
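A summary of the Lance-Williams parameters for the different SAHN methods is given in Table 2.1.
Method                      $\alpha_i$                          $\alpha_j$                          $\beta$                          $\lambda$
Single linkage              $\frac{1}{2}$                       $\frac{1}{2}$                       $0$                              $-\frac{1}{2}$
Complete linkage            $\frac{1}{2}$                       $\frac{1}{2}$                       $0$                              $\frac{1}{2}$
Average linkage             $\frac{n_i}{n_i+n_j}$               $\frac{n_j}{n_i+n_j}$               $0$                              $0$
Weighted average linkage    $\frac{1}{2}$                       $\frac{1}{2}$                       $0$                              $0$
Centroid linkage            $\frac{n_i}{n_i+n_j}$               $\frac{n_j}{n_i+n_j}$               $-\frac{n_i n_j}{(n_i+n_j)^2}$   $0$
Weighted centroid linkage   $\frac{1}{2}$                       $\frac{1}{2}$                       $-\frac{1}{4}$                   $0$
Ward’s method               $\frac{n_h+n_i}{n_h+n_i+n_j}$       $\frac{n_h+n_j}{n_h+n_i+n_j}$       $-\frac{n_h}{n_h+n_i+n_j}$       $0$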
Table 2.1: The Lance-Williams parameters for different SAHN methods.
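The update in Equation 2.7 can be sketched in Python as a single function. The parameter values are the standard Lance-Williams values, with entries not listed for a method taken as zero. The function name and method labels are assumptions for this sketch. It is also assumed, as is conventional, that the centroid, weighted centroid and Ward updates are applied to squared Euclidean distances.

```python
def lance_williams(method, d_hi, d_hj, d_ij, n_h=1, n_i=1, n_j=1):
    """Distance between cluster h and the merged cluster k = i + j."""
    n_k = n_i + n_j
    params = {  # method: (alpha_i, alpha_j, beta, lambda)
        "single": (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, 0.5),
        "average": (n_i / n_k, n_j / n_k, 0.0, 0.0),
        "weighted_average": (0.5, 0.5, 0.0, 0.0),
        "centroid": (n_i / n_k, n_j / n_k, -n_i * n_j / n_k ** 2, 0.0),
        "weighted_centroid": (0.5, 0.5, -0.25, 0.0),
        "ward": ((n_h + n_i) / (n_h + n_k), (n_h + n_j) / (n_h + n_k),
                 -n_h / (n_h + n_k), 0.0),
    }
    alpha_i, alpha_j, beta, lam = params[method]
    return (alpha_i * d_hi + alpha_j * d_hj + beta * d_ij
            + lam * abs(d_hi - d_hj))


# Hypothetical distances: d_hi = 3, d_hj = 5, d_ij = 2.
single = lance_williams("single", 3.0, 5.0, 2.0)      # minimum of 3 and 5
complete = lance_williams("complete", 3.0, 5.0, 2.0)  # maximum of 3 and 5

# Single points h, i and j at positions 0, 4 and 6 on a line, using squared
# distances: the centroid of i and j lies at 5, so the result should be 25.
centroid_sq = lance_williams("centroid", 16.0, 36.0, 4.0)
```

The single and complete linkage updates reduce to the minimum and maximum of the two previous distances, and the centroid update reproduces the squared distance to the new centroid without revisiting the individual points.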
2.1.9 CPU times of the SAHN methods
The time complexity of the SAHN methods gives an indication of the order of the CPU times. Both the single linkage and complete linkage methods only need the calculation of a distance matrix between all points once. This means that the time complexity for these two methods is $O(n^2)$, (Murtagh, 1983). When symmetrical distances are assumed, as we do in this thesis, the distance between points i and j is the same in both directions, i.e. $d_{ij} = d_{ji}$. Even in symmetrical cases where only $n(n-1)/2 = (n^2 - n)/2$ calculations are needed to calculate the distance matrix, the order of the time complexity remains $O(n^2)$, (Murtagh, 1983).
The other SAHN methods have higher time complexities when the Lance-Williams formula is not used, because recalculations are needed after the merge of every cluster pair. The advantage, on the other hand, of not using the Lance-Williams formula is that the methods can work without a distance matrix, which can save a lot of memory space. When the Lance-Williams formula is used to calculate the inter-cluster distances, the time complexity is $O(n^2 \log(n))$, (Everitt et al., 2011, p 80; Hansen and Jaumard, 1997).
Murtagh (1983) describes another method to lower the CPU time with the use of nearest neighbours (NN). If a point (or cluster) q is the nearest neighbour of point (cluster) p then q = NN(p). The author also defines the concept of mutual or reciprocal nearest neighbours (RNN) as q = NN(p) and p = NN(q). So the two points (clusters) are RNN if both are the nearest neighbour of the other.
According to Murtagh (1983) any RNN found can immediately be merged in the SAHN methods with the exception of the centroid and median methods. The author claims that even though the order of the merging will depend on the order in which the check is performed, the resulting hierarchy will be unique and provide the same results.
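The definition of reciprocal nearest neighbours can be illustrated with a brief Python sketch that works on individual points. The function names and the example data are assumptions for this sketch, and ties are broken in favour of the lowest index.

```python
from math import dist


def nearest_neighbour(points, p):
    """Index of the point closest to point p (ties go to the lowest index)."""
    return min((q for q in range(len(points)) if q != p),
               key=lambda q: dist(points[p], points[q]))


def reciprocal_nearest_neighbours(points):
    """All pairs (p, q), p < q, with q = NN(p) and p = NN(q)."""
    nn = [nearest_neighbour(points, p) for p in range(len(points))]
    return [(p, q) for p, q in enumerate(nn) if p < q and nn[q] == p]


# Hypothetical points on a line.
points = [(0, 0), (1, 0), (3, 0), (7, 0), (8, 0)]
pairs = reciprocal_nearest_neighbours(points)
```

Points 0 and 1 are RNN, as are points 3 and 4. Point 2 has point 1 as its nearest neighbour, but the reverse does not hold, so point 2 is not part of an RNN pair.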
2.1.10 Stopping rules in SAHN clustering
Since SAHN clustering methods can continue to merge clusters until all points are in one cluster, a rule is needed to determine when to stop clustering. This is referred to as a stopping rule by Lam et al. (2009). The simplest rule is to stop when the required number of clusters is reached, effectively cutting the dendrogram, illustrated in Figure 2.1, at a certain height.
Care must be taken when using this stopping rule as-is, because the clusters are unconstrained and can have completely different sizes. This is particularly true when dealing with outliers: the outliers can create single–point clusters while all the other points are grouped together in one cluster. When using the SAHN methods to solve the CLRP, the number of clusters needed in the CLRP is not always known beforehand. If this is the case, the method will have to be repeated over a range of different numbers of clusters to identify the best solution.
Everitt et al. (2011) suggest using a distance measure as cut-off to determine the best number of clusters to use. The dendrogram is then cut at this distance and the number of clusters is determined. This is defined as the Best Cut. The same problem as above can occur with outliers forming their own clusters, which will in turn result in expensive routes or extra fixed depot opening costs when used in the LRP. Another type of cut to overcome this problem is the dynamic tree cut, where clusters are cut at different levels of the dendrogram.
Min (1987) and Barreto et al. (2007) explore the use of vehicle capacity as a stopping rule to ensure that clusters do not exceed capacity limits. Barreto et al. (2007) suggest excluding a cluster if it becomes fully capacitated, and continue with the merging of clusters that still have capacity available. Their conclusion is that this can result in two far-off sub-clusters being merged together, giving suboptimal routing solutions.
To prevent points too far from each other from being clustered together, Lam (2008) suggests using the ratio of between-cluster distance variation to intra-cluster distance variation as a stopping rule. This is called the pseudo F–statistic.
2.2 Iterative partitioning methods
The iterative partitioning methods assume that all data points, also referred to as customers, need to be assigned to p mutually exclusive clusters. The methods start by selecting p customers as seed points. The seed points will serve as the centres of the clusters and all remaining customers are assigned to the closest seed points. This is referred to as a starting solution. The objective is to minimise the sum of all the distances between the customers and their assigned cluster centres.
Unlike SAHN clustering, iterative partitioning clustering methods therefore start off with the data points already in p different clusters. After a starting solution has been created, the methods iteratively cycle between re-assigning points to cluster centres and recalculating the cluster centres. An example of an iteration that illustrates the movement of the cluster centres can be seen in Figures 2.3(a) and 2.3(b). An iteration consists of two steps:
1. Assign customers to the closest cluster centre; and
2. Recalculate the cluster centres.
A popular criterion to use as the objective function during clustering is to minimise the sum of square residuals (SSR) or variances, of the distances between all the points and their cluster centres. Although minimising the SSR is only effective for spherical and equal sized clusters, (Everitt et al., 2011), it is still the most commonly used criterion for clustering methods.
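The SSR criterion can be written as a short Python function. This is a sketch under the assumption that the residual of a point is its Euclidean distance to its assigned cluster centre; the function name and the example data are illustrative.

```python
def ssr(points, centres, assignment):
    """Sum of squared Euclidean distances from each point to its centre."""
    total = 0.0
    for point, cluster in zip(points, assignment):
        centre = centres[cluster]
        total += sum((a - b) ** 2 for a, b in zip(point, centre))
    return total


# Hypothetical data: two natural groups of two points each.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centres = [(1, 0), (11, 0)]
good = ssr(points, centres, [0, 0, 1, 1])  # each point to its nearby centre
poor = ssr(points, centres, [0, 1, 0, 1])  # two points assigned far away
```

The natural assignment gives an SSR of 4, while the poor assignment gives 164, so the criterion strongly penalises points that are assigned to a distant centre.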
When recalculating the cluster centres, three different calculations are often referred to: the means, medoid and median calculations. All three calculation methods will form the basis of the different types of iterative partitioning methods discussed below.
(a) Step 1: Customers are assigned to the closest cluster centres.
(b) Step 2: The cluster centres are recalculated based on the assigned customers.
Figure 2.3: An illustration of the k–means method.
One can further distinguish the methods between "k–" and "h–" variants. Both variants start off with all data points already assigned to p different clusters, but the distinction is based on the number of customer points re-assigned during every iteration. There is much confusion in the literature regarding which variant of the method is referred to under which name. In this thesis the naming convention suggested by Hansen and Mladenović (2001) is used. This is described as follows:
• The "k–" variant refers to the re-assignment of only one customer point to the closest centre at a time before recalculating the cluster centres. The point to be re-assigned is the one that will give the biggest distance saving.
• The "h–" variant refers to re-assignment of all customer points to their closest centres during every cycle, before recalculating the cluster centres.
On the basis of different centre calculations and variants, six different iterative partitioning methods are now described as follows:
2.2.1 The k–means method
In the k–means method, the centres of the clusters are iteratively recalculated as the means of the coordinates of the customer points assigned to the cluster. These are called the centroids of the clusters, (Geetha et al., 2009). For $x_i \in \mathbb{R}^2$, the coordinates $(x_j, y_j)$ of the jth centroid are therefore calculated as follows:
$$x_j = \sum_{i=1}^{n_j} \frac{x_i}{n_j}, \qquad y_j = \sum_{i=1}^{n_j} \frac{y_i}{n_j}, \tag{2.8}$$
where $n_j$ is the number of points in cluster j, $j = 1, 2, \ldots, p$ for p clusters in total and $\sum_{j=1}^{p} n_j = n$.
The distance savings for re-assigning every customer to the closest centre are calculated. In the k–variant, only the customer with the best distance saving is re-assigned, (Hansen and Mladenović, 2001). After every re-assignment the cluster centres of the involved clusters are recalculated. Iterations, consisting of assigning customers and recalculating cluster centres, are repeated until no re-assignments will result in a cost saving or a predefined number of iterations is reached. The method is more sensitive to outliers than other partitioning methods, (Han and Kamber, 2006).
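The k–variant can be sketched in Python as follows. This is a simplified illustration and not the implementation used in this thesis. It assumes that the distance saving of a customer is the distance to its current centre minus the distance to a candidate centre, that the starting assignment has no empty clusters, and that all centres are recalculated after each re-assignment for simplicity. All function names are hypothetical.

```python
from math import dist


def centroid(points):
    """Mean of each coordinate over all points in a cluster."""
    n = len(points)
    return tuple(sum(p[r] for p in points) / n for r in range(len(points[0])))


def best_reassignment(points, assignment, centres):
    """Return (saving, customer, new_cluster) for the largest distance saving."""
    best = (0.0, None, None)
    for i, point in enumerate(points):
        current = dist(point, centres[assignment[i]])
        for j, centre in enumerate(centres):
            saving = current - dist(point, centre)
            if saving > best[0]:
                best = (saving, i, j)
    return best


def k_means(points, assignment, p, max_iterations=100):
    """Re-assign one customer at a time until no saving remains."""
    assignment = list(assignment)
    for _ in range(max_iterations):
        centres = [centroid([pt for pt, c in zip(points, assignment) if c == j])
                   for j in range(p)]
        _, customer, cluster = best_reassignment(points, assignment, centres)
        if customer is None:  # no re-assignment gives a saving
            break
        assignment[customer] = cluster
    return assignment


# Hypothetical data: the third customer starts in the wrong cluster.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
final = k_means(points, [0, 0, 1, 1, 1, 1], 2)
```

In this example a single re-assignment moves the third customer to the first cluster, after which no further saving is possible and the method stops.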
According to Negreiros and Palhano (2006) the time complexity of the method is O(n) per iteration, while Han and Kamber (2006) goes into more detail and place the total computation time at O(npt), where t is the number of iterations. The k–means method is not very scalable to large datasets, because of the large number of iterations needed before the method converges, (Negreiros and Palhano, 2006). To make the k–means easier to use on large datasets, a suggestion was made by Han and Kamber (2006) to first group points into "micro clusters", before performing the k–means. The customer points in the micro clusters are then treated as a single point in the k–means. This is also known as the aggregation of data. The single linkage
SAHN method is often used with a stopping rule based on a specified distance to create the micro clusters.
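The aggregation step can be illustrated with the sketch below. It relies on the fact that single linkage clustering stopped at a given distance yields the connected components of the graph that links every pair of points at most that distance apart. Representing each micro cluster by its mean point and a weight equal to its size is an assumption made here for illustration; the names `micro_clusters` and `max_link` are hypothetical, and the all-pairs scan is quadratic in the number of points, so it only demonstrates the idea.

```python
import math


def micro_clusters(points, max_link):
    """Group points whose single-linkage distance is at most max_link.

    Returns a list of ((mean_x, mean_y), size) pairs, one per micro cluster.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of points that are close enough.
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            if math.dist(points[a], points[b]) <= max_link:
                parent[find(a)] = find(b)

    groups = {}
    for i, pt in enumerate(points):
        groups.setdefault(find(i), []).append(pt)

    # Assumption: each micro cluster is represented by its mean and its size.
    return [((sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)), len(g))
            for g in groups.values()]


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
aggregated = micro_clusters(pts, max_link=1.0)
print(sorted(aggregated))  # [((1.0, 0.0), 3), ((10.0, 0.5), 2)]
```

The first three points form one micro cluster through chaining, although the outer two are further apart than `max_link`; this is characteristic of single linkage.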
2.2.2 The h–means method
The k–means method described above can be quite time-consuming because only one re-assignment is done per iteration. This makes the method unsuitable for large datasets. To overcome this problem, all customer points can be re-assigned before recalculating the cluster centres in every iteration. This is called the h–means method (Hansen and Mladenović, 2001).
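A minimal sketch of the h–variant is given below for comparison: every customer point is re-assigned to its closest centre before any centre is recalculated. The names `h_means`, `centres` and `max_iter`, the rule that an empty cluster keeps its previous centre, and the sample data are illustrative assumptions.

```python
import math


def h_means(points, centres, max_iter=100):
    """Re-assign all points to their closest centre, then recalculate all centres."""
    centres = list(centres)
    assignment = None
    for _ in range(max_iter):
        # Step 1: re-assign every point to its closest current centre.
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(pt, centres[j]))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the method has converged
        assignment = new_assignment
        # Step 2: recalculate each centre as the mean of its assigned points.
        for j in range(len(centres)):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its previous centre
                centres[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return assignment, centres


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
h_assignment, h_centres = h_means(points, centres=[(0, 0), (1, 0)])
print(h_assignment)
```

Starting from two poorly placed centres, this hypothetical example converges after two full re-assignment passes.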
Care should be taken to establish which variant of the means method is being referred to in the literature; the h–means variant is often meant when authors refer to the k–means method. The method has the same time complexity of O(npt) as the k–means method (Han and Kamber, 2006), but the number of iterations is far smaller than for the k–means method. Han and Kamber (2006) also claim that the method is not as effective on large datasets because it often finds a local minimum instead of the global minimum, especially if the number of clusters or the number of iterations is far smaller than the number of points. Negreiros and Palhano (2006) refer to the method as the Forgy method.
Hansen and Mladenović (2001) state that the k–means method cannot be improved by the h–means, because the k–means is more accurate. They suggest using the h–means followed by the k–means in large datasets. They referred to this as the hk–means method.
2.2.3 The j–means (k–medoid) method
The k–medoid method is very similar to the k–means, but is not as sensitive to the impact of outliers. Instead of calculating the means as the cluster centres, this method calculates the medoids. The medoid of a cluster is the most centrally located cluster point. To calculate the most central points, the absolute error criterion is used. The objective criterion is to minimise E, where E is the absolute error value for all clusters summed together. It is formulated and described by Han and Kamber (2006), as follows:
\begin{align}
E &= \sum_{j=1}^{p} \sum_{i=1}^{n_j} \lVert x_i - o_j \rVert_1 \tag{2.9} \\
  &= \sum_{j=1}^{p} \sum_{i=1}^{n_j} \sum_{r=1}^{l} \lvert x_i^r - o_j^r \rvert, \tag{2.10}
\end{align}