4.2. Methods adopted in this study
4.2.3. Formation of regions in RFFA
Identification of homogeneous regions is a difficult task in RFFA, particularly in Australia, which has highly variable hydrology. The aim is to form groups of streamflow gauging sites that approximately satisfy the homogeneity criteria. Cluster analysis is widely used to identify groups of catchments with similar hydrologic characteristics, and it is adopted in this research.
Cluster analysis
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. The objective of clustering the observations of a data set is to seek partitioning of observations into distinct groups so that the observations within each group are quite similar to each other (in relation to some attributes of the data), while observations in different groups are quite different from each other.
Figure 4.4 Different Clustering Techniques
Clusters are formed with sites having similar site characteristics. When the regions are intended for use in RFFA, some special considerations apply to cluster analysis. Most clustering algorithms can be classified into two categories (Jain and Dubes, 1988): hierarchical clustering and partitional clustering.
Hierarchical clustering procedures provide a nested sequence of partitions, whereas partitional clustering procedures generate a single partition of the data in an attempt to recover the natural grouping present in the data. In this subsection, a brief description of these clustering procedures is presented.
Hierarchical clustering
The hierarchical clustering process (both agglomerative and divisive) can be represented as a nested sequence or tree, called a dendrogram, which shows how the clusters formed at the various steps of the process are related. Hierarchical clustering algorithms can be subdivided into two categories: agglomerative and divisive.
Agglomerative hierarchical clustering begins with singleton clusters and proceeds by successively merging smaller clusters into larger ones. For a given set of N feature vectors, the procedure begins with N singleton clusters, each consisting of a single feature vector. A distance measure, such as the Euclidean distance, is chosen to evaluate the dissimilarity between any two clusters. The two least dissimilar clusters are found and merged, which leaves N-2 singleton clusters and one cluster with two feature vectors. The process of identifying and merging the two closest clusters is repeated until the desired number of clusters is obtained.
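The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in this study: the function name `agglomerate`, the `linkage` argument and the sample feature vectors are all hypothetical, and the brute-force search over cluster pairs is chosen for readability rather than speed.

```python
import math


def agglomerate(vectors, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain.

    Clusters are lists of indices into `vectors`. The distance between two
    clusters is `linkage` applied to all pairwise Euclidean distances
    (min gives single linkage, max gives complete linkage).
    """
    clusters = [[i] for i in range(len(vectors))]  # N singleton clusters
    while len(clusters) > n_clusters:
        best = None  # (distance, index a, index b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a is unaffected
    return clusters


# Hypothetical rescaled feature vectors for five sites (assumed values).
sites = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
regions = agglomerate(sites, 3)
print(regions)
```

With these assumed values, the two tight pairs are merged first and the distant fifth site remains a singleton cluster.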
Algorithms that are representative of the agglomerative hierarchical method of clustering include: (i) single linkage or nearest neighbour; (ii) complete linkage or furthest neighbour; (iii) average linkage; and (iv) Ward's algorithm. These algorithms differ from each other in the strategy used to define the distance between clusters. In each case, the clusters with the smallest distance between them are merged.
Different linkage algorithms for agglomerative hierarchical clustering
The algorithms begin with N singleton clusters, each comprising a rescaled feature vector. Among the N singleton clusters, the two closest clusters xi and xj are identified and merged to form a new cluster [xi, xj].
In the single linkage algorithm, the distance between the new cluster [xi, xj] and any other singleton cluster xk is the smaller of the distances between xi and xk, or xj and xk. In general, the distance between two non-singleton clusters is the smallest of the distances between all possible pairs of feature vectors in the two clusters. This algorithm tends to form a small number of large clusters, with the remaining small outlying clusters on the fringes of the space of site characteristics, and is not likely to yield good regions for regional flood frequency analysis (Hosking and Wallis, 1997; Rao and Srinivas, 2006).
In the complete linkage algorithm, the distance between the new cluster [xi, xj] and any other singleton cluster xk is the greater of the distances between xi and xk, or xj and xk. In general, the distance between two non-singleton clusters is the largest of the distances between all possible pairs of feature vectors in the two clusters. This algorithm tends to form small, tightly bound clusters. It is usually not suitable for application to large data sets.
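As a small worked illustration of the single and complete linkage rules, the sketch below computes both distances between a merged cluster [xi, xj] and a singleton xk. The function names and the coordinates are assumptions made for the example; Euclidean distance is used as the dissimilarity measure.

```python
import math


def single_linkage(c1, c2):
    """Smallest pairwise distance between members of the two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)


def complete_linkage(c1, c2):
    """Largest pairwise distance between members of the two clusters."""
    return max(math.dist(a, b) for a in c1 for b in c2)


# Hypothetical feature vectors lying on a line (assumed values).
xi, xj, xk = (0.0, 0.0), (3.0, 0.0), (7.0, 0.0)
merged = [xi, xj]

d_single = single_linkage(merged, [xk])      # distance from xj to xk
d_complete = complete_linkage(merged, [xk])  # distance from xi to xk
print(d_single, d_complete)
```

Here the single linkage distance is set by the nearer member xj and the complete linkage distance by the farther member xi, so the two rules can rank candidate merges differently.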
In the average linkage algorithm, the distance between two clusters is defined as the average distance between them. Several methods are available for computing the average distance, including the unweighted pair-group average, weighted pair-group average, unweighted pair-group centroid and weighted pair-group centroid.
Unweighted pair-group average (UPGA): The distance between two clusters is defined as the average distance between all pairs of feature vectors, where each pair takes one vector from each of the two clusters.
Weighted pair-group average (WPGA): This method is identical to the UPGA, except that in the computations, the size of the respective clusters (i.e., the number of feature vectors contained in them) is used as a weight. This method is preferred when the cluster sizes are suspected to be greatly uneven.
Unweighted pair-group centroid (UPGC): The distance between two clusters is defined as the distance between their centroids. The centroid of a cluster is the mean vector of all the feature vectors contained in the cluster. In this method, if two clusters to be merged are very different in their size, the centroid of the cluster resulting from the merger tends to be closer to the centroid of the larger cluster.
Weighted pair-group centroid (WPGC): This method is identical to the UPGC, except that feature vectors are weighted in proportion to the size of clusters.
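The difference between the pair-group average and the pair-group centroid definitions can be seen in a short sketch. Only the two unweighted variants are shown; the function names `upga` and `upgc` and the sample clusters are hypothetical, and Euclidean distance is assumed.

```python
import math


def upga(c1, c2):
    """Average of the distances between all cross-cluster pairs."""
    total = sum(math.dist(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def centroid(cluster):
    """Mean vector of the feature vectors in a cluster."""
    n = len(cluster)
    return tuple(sum(v[k] for v in cluster) / n for k in range(len(cluster[0])))


def upgc(c1, c2):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(c1), centroid(c2))


# Hypothetical clusters (assumed values).
c1 = [(0.0, 0.0), (0.0, 6.0)]
c2 = [(4.0, 3.0)]

d_average = upga(c1, c2)   # mean of the two pairwise distances
d_centroid = upgc(c1, c2)  # distance from (0, 3) to (4, 3)
print(d_average, d_centroid)
```

For these assumed clusters the centroid distance is smaller than the average pairwise distance, which shows that the two definitions are not interchangeable.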
Ward's algorithm (Ward, 1963) is a frequently used technique for regionalisation studies in hydrology and climatology (Acreman and Sinclair, 1986; Hosking and Wallis, 1997; Kalkstein and Corrigan, 1986; Nathan and McMahon, 1990; Willmott and Vernon, 1980; Winkler, 1985).
Ward's algorithm (Ward, 1963) minimizes an objective function, W, defined as the sum of squares of deviations of the feature vectors from the centroid of their respective clusters. It is based on the assumption that if two clusters are merged, the resulting loss of information, or change in the value of the objective function, will depend only on the relationship between the two merged clusters and not on the relationships with any other clusters. The governing equation of Ward's algorithm is written as:
$$W = \sum_{k=1}^{K} \sum_{j=1}^{n} \sum_{i=1}^{N_k} \left( x_{ij}^{k} - x_{\cdot j}^{k} \right)^{2} \tag{4.15}$$
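The objective function can be evaluated directly for a given partition. The sketch below assumes that W sums, over all clusters, all attributes and all member feature vectors, the squared deviation of each attribute value from the cluster mean of that attribute. The function name `ward_objective` and the sample clusters are hypothetical.

```python
def ward_objective(clusters):
    """Sum of squared deviations of feature vectors from their cluster centroids."""
    total = 0.0
    for members in clusters:              # loop over clusters
        n_attr = len(members[0])
        for j in range(n_attr):           # loop over attributes
            mean_j = sum(v[j] for v in members) / len(members)
            total += sum((v[j] - mean_j) ** 2 for v in members)
    return total


# Hypothetical clusters of rescaled feature vectors (assumed values).
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 4.0)]

w_before = ward_objective([a, b])  # a and b kept as separate clusters
w_after = ward_objective([a + b])  # a and b merged into one cluster
print(w_before, w_after, w_after - w_before)
```

The increase in W caused by a merge is the quantity that Ward's algorithm compares across candidate pairs; at each step, the pair giving the smallest increase is merged.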
Divisive hierarchical clustering
Divisive hierarchical clustering begins with one large cluster comprising all N feature vectors and proceeds by splitting it into smaller clusters. The feature vector that has the greatest dissimilarity to the other vectors of the cluster is identified and separated to form a splinter group. The dissimilarity values of the remaining feature vectors in the original cluster are then examined to determine whether any additional vectors are to be added to the splinter group. This step divides the original cluster into two parts. The larger cluster is subjected to the same procedure in the next step. The process continues until a stopping criterion (such as the requested number of clusters) is achieved. If no stopping criterion is specified, the algorithm terminates when all clusters resulting from the analysis are singleton clusters. Descriptions of divisive clustering algorithms can be found in Murtagh (1983) and Guenoche et al. (1991). Savaresi et al. (2002) discussed strategies for the selection of a cluster to be split in divisive clustering algorithms. The divisive clustering methods are yet to be applied in regionalization studies.
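A single splitting step of this kind can be sketched as follows. The text does not state the exact rule for adding vectors to the splinter group, so the sketch assumes one common choice: a vector moves to the splinter group when its average distance to the rest of the original cluster exceeds its average distance to the splinter group. The function names and the sample data are hypothetical.

```python
import math


def avg_dist(v, group):
    """Average Euclidean distance from vector v to the members of group."""
    return sum(math.dist(v, w) for w in group) / len(group)


def split_cluster(cluster):
    """One divisive step: return (remaining cluster, splinter group)."""
    rest = list(cluster)
    # Seed the splinter group with the vector most dissimilar to the others.
    seed = max(rest, key=lambda v: avg_dist(v, [w for w in rest if w is not v]))
    rest.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for v in list(rest):
            others = [w for w in rest if w is not v]
            # Assumed rule: move v if it is closer, on average, to the splinter group.
            if others and avg_dist(v, others) > avg_dist(v, splinter):
                rest.remove(v)
                splinter.append(v)
                moved = True
    return rest, splinter


# Hypothetical rescaled feature vectors (assumed, distinct values).
sites = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (4.0, 4.0), (4.1, 4.0)]
rest, splinter = split_cluster(sites)
print(rest, splinter)
```

Repeating `split_cluster` on the larger of the two parts until the requested number of clusters is reached would give the full divisive procedure.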
Divisive hierarchical clustering algorithms always split clusters. In contrast, agglomerative algorithms always merge clusters.
While hierarchical clustering procedures are not influenced by initialization and local minima, partitional clustering procedures are influenced by initial guesses (e.g. number of clusters, cluster centres, etc.). The partitional clustering procedures are dynamic in the sense that feature vectors can move from one cluster to another to minimize the objective function. In contrast, the feature vectors committed to a cluster in the early stages cannot move to another in hierarchical clustering procedures.
Steps in regionalisation by cluster analysis
The steps in cluster analysis for RFFA applications are noted below:
1. Selection of attributes: It is important to select the attributes influencing the flood responses in the study region. Therefore, data exploration of various predictor variables to identify the attributes is carried out in this step.
2. Preparing feature vectors: The data available for each attribute are rescaled to nullify differences in their variance and relative magnitude. The rescaling may involve transforming the values of attributes by an appropriate transformation function (such as logarithmic) and dividing the transformed values by their standard deviation. Each feature vector consists of rescaled (dimensionless) attributes of a catchment.
3. Forming clusters: This step involves selection of a clustering algorithm to partition feature vectors prepared in step 2 into disjoint or overlapping clusters. The catchments represented by feature vectors in a cluster constitute a region for flood frequency analysis. In general, distance (or dissimilarity) measure and a clustering criterion characterize a clustering algorithm.
4. Selecting optimum number of regions: The clusters formed in step 3 are interpreted visually and by using cluster validity indices to determine the optimum number of regions.
5. Visual interpretation: Clusters obtained in step 3 are visually interpreted by plotting them in the geographical space of the study region to identify stable regions. The stable regions do not change their configuration drastically with a change in the number of clusters formed by the clustering algorithm.
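Step 2 (preparing feature vectors) can be illustrated with a short sketch. It assumes a logarithmic transformation for selected attributes followed by division by the sample standard deviation. The function name `prepare_feature_vectors`, the choice of attributes and the numerical values are hypothetical and are not taken from the study data.

```python
import math
import statistics


def prepare_feature_vectors(attributes, log_columns=()):
    """Rescale catchment attributes into dimensionless feature vectors.

    attributes: one row per catchment, one column per attribute.
    log_columns: indices of columns to log-transform before rescaling.
    """
    rescaled_columns = []
    for k, column in enumerate(zip(*attributes)):
        values = [math.log(v) for v in column] if k in log_columns else list(column)
        sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
        rescaled_columns.append([v / sd for v in values])
    # Transpose back so that each row is the feature vector of one catchment.
    return [list(row) for row in zip(*rescaled_columns)]


# Hypothetical attributes: catchment area (km^2) and mean annual rainfall (mm).
raw = [(10.0, 800.0), (100.0, 1200.0), (1000.0, 600.0)]
vectors = prepare_feature_vectors(raw, log_columns={0})
print(vectors)
```

After rescaling, every attribute has unit standard deviation, so no single attribute dominates the distance computations used in step 3.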
Figure 4.5 Steps in Regionalization using Cluster Analysis
[Flowchart boxes, in order: Selection of Attributes; Preparing Feature Vectors; Forming Clusters; Selecting Optimum Number of Regions; Testing the Regions for Homogeneity; decision: Are the Regions Homogeneous?; Adjusting for homogeneous regions]
Dissimilarity measures for computing distance between cluster centroids, or feature vectors:
Distance measure: Equation:
Euclidean: $\sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^{2}}$ (4.16)
Squared Euclidean: $\sum_{k=1}^{n} (x_{ik} - x_{jk})^{2}$ (4.17)
Mahalanobis: $\sqrt{(x_i - x_j)^{T} \Sigma^{-1} (x_i - x_j)}$ (4.18)
Manhattan or City Block: $\sum_{k=1}^{n} \lvert x_{ik} - x_{jk} \rvert$ (4.19)
Canberra: $\sum_{k=1}^{n} \dfrac{\lvert x_{ik} - x_{jk} \rvert}{\lvert x_{ik} \rvert + \lvert x_{jk} \rvert}$ (4.20)
Chebychev: $\max_{1 \le k \le n} \lvert x_{ik} - x_{jk} \rvert$ (4.21)
Cosine: $1 - \dfrac{\sum_{k=1}^{n} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^{2} \sum_{k=1}^{n} x_{jk}^{2}}}$ (4.22)
Minkowski: $\left( \sum_{k=1}^{n} \lvert x_{ik} - x_{jk} \rvert^{t} \right)^{1/t}$ (4.23)
n: number of attributes; $x_{ik}$: attribute k of feature vector $x_i$ in cluster 1; $x_{jk}$: attribute k of feature vector $x_j$ in cluster 2. In the Mahalanobis distance measure, T denotes the matrix transpose and Σ is the covariance matrix. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. t denotes the order of the Minkowski distance.
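Several of the measures listed above are straightforward to compute. The sketch below implements the Euclidean, Manhattan, Canberra, Chebychev, cosine and Minkowski measures for two feature vectors. The Mahalanobis distance is omitted because it requires a covariance matrix inverse. The function names and the sample vectors are hypothetical, and the handling of zero denominators in the Canberra measure (the term is skipped) is an assumption.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Terms with a zero denominator are skipped (assumed convention).
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)


def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def minkowski(x, y, t):
    return sum(abs(a - b) ** t for a, b in zip(x, y)) ** (1 / t)


# Hypothetical rescaled feature vectors with n = 3 attributes.
xi, xj = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(xi, xj), manhattan(xi, xj), chebychev(xi, xj))
print(canberra(xi, xj), cosine(xi, xj), minkowski(xi, xj, 3))
```

The Minkowski measure of order t = 1 coincides with the Manhattan distance and that of order t = 2 with the Euclidean distance, which gives a quick consistency check on any implementation.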
Partitional clustering methods
In partitional clustering procedures, an attempt is made to recover the natural grouping present in the data through a single partition. These procedures are subdivided into K-means and K-medoids methods.
In the K-means method (Ball and Hall, 1965; MacQueen, 1967), each cluster is represented by its centroid, which is the mean (weighted or unweighted average) of the feature vectors within the cluster. This method is known for its efficiency in clustering large data sets with numerical attributes. However, it has limitations in clustering categorical data (Ralambondrainy, 1995; Huang and Ng, 2003). Further, the method is sensitive to the presence of outliers.
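The basic K-means iteration can be sketched as alternating assignment and centroid-update steps. This is a minimal illustration using unweighted means and Euclidean distance; the function name `k_means`, the initial cluster centres and the sample data are assumptions, and the result generally depends on the initial centres supplied.

```python
import math


def k_means(vectors, centres, max_iter=100):
    """Alternate assignment and centroid-update steps until centres stop changing."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        # Assignment step: each vector joins the cluster with the nearest centre.
        groups = [[] for _ in centres]
        for v in vectors:
            nearest = min(range(len(centres)), key=lambda c: math.dist(v, centres[c]))
            groups[nearest].append(v)
        # Update step: each centre becomes the mean of its members
        # (an empty cluster keeps its previous centre).
        new_centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centres[c]
            for c, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, groups


# Hypothetical rescaled feature vectors and initial guesses for two centres.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centres, final_groups = k_means(data, centres=[(0.0, 0.0), (10.0, 2.0)])
print(final_centres, final_groups)
```

Because each centroid is a mean, a single extreme feature vector can pull a centre towards it, which is one way to see the sensitivity to outliers noted above.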