4.2 Data Sample, Variable Estimation and Research Methods
4.2.4 Research methods
This study classifies the universe of common stocks into homogeneous clusters
based on basic corporate financing decisions and firm size, as explained under
sub-section 4.2. There are two main types of cluster analysis. The first is
hierarchical cluster analysis, which is suited to a limited number of observations. The
second is non-hierarchical (K-means) cluster analysis, which is suited to a large number
of observations. Details of cluster analysis and its techniques are given in Gordon (1999),
Hair et al. (2006), and Everitt et al. (2011). This essay employs the K-means cluster
analysis technique to cluster the firms and obtain internally homogeneous and externally
heterogeneous clusters.
4.2.4.1 K-means cluster analysis
Simple K-means cluster analysis is a pattern recognition technique that groups
objects (here, firms) to discover common practice styles in a dataset. Each group is called
a cluster. The technique joins firms together into clusters, thereby condensing the rows of
the dataset into a small number of groups. Firms within a cluster are similar to one
another, and firms in different clusters are dissimilar, in terms of the clustering
attributes. The grouping process starts from the similarity of the observed firms across
the variables. Cluster analysis has no dependent variable. Similarity is quantified through
distance measures between the objects within a group and between groups.
The cluster analysis employed in this research measures distance by the Euclidean
distance, that is, the length of a straight line between two observations. It helps to
identify relatively homogeneous clusters of firms based on the selected variables. Each
cluster has a centroid, also referred to as the cluster center, which is the mean of all
observations within the cluster. Euclidean distances are used to allocate observations to
clusters, and least squares estimation is used for the cluster centers. The following
iterative relocation algorithm of K-means is used, along with the discreteness constraints,
as given in Lauprete (1998):
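$$
\begin{aligned}
\text{Minimize}\quad & J(U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}\,(d_{ik})^2 \\
\text{s.t.}\quad & \sum_{i=1}^{c} u_{ik} = 1, \quad \forall k = 1, \dots, n \\
& u_{ik} \in \{0, 1\}, \quad \forall i = 1, \dots, c, \ \forall k = 1, \dots, n \\
& (d_{ik})^2 = (x_k - v_i)' A_i (x_k - v_i), \quad \forall i, \ \forall k
\end{aligned}
$$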
where U is a (c x n) matrix of weights $u_{ik}$, and v is a (p x c) matrix whose ith column $v_i$ is the vector representing the center of cluster i. Thus i indexes the cluster, c is the number of clusters, k indexes the point, n is the number of points, and $u_{ik}$ equals 1 if point k belongs to cluster i and 0 otherwise. $x_k$ is a p-dimensional vector representing the kth data point,
and $(d_{ik})^2$ is the squared distance of point k from cluster i, defined in terms of a positive definite symmetric matrix $A_i$ (the Euclidean distance corresponds to $A_i$ equal to the identity matrix). K-means cluster analysis is an optimization process that implies the following condition:

$$
v_i = \frac{\sum_{k=1}^{n} u_{ik}\, x_k}{\sum_{k=1}^{n} u_{ik}}
$$
where $v_i$ are the cluster centroids. This analysis partitions the firms into groups of
similar firms. It should be noted that the resulting partition is near-optimal but not necessarily optimal, because the iterative relocation can stop at a local minimum of the objective function. The distance measure that incorporates our corporate finance decisions and firm size is as follows:
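$$
\begin{aligned}
d(i,k)^2 ={} & (\mathrm{FIN\_FLEX}_i - \mathrm{FIN\_FLEX}_k)^2 + (\mathrm{ST\_CREDIT}_i - \mathrm{ST\_CREDIT}_k)^2 \\
& + (\mathrm{LT\_INV}_i - \mathrm{LT\_INV}_k)^2 + (\mathrm{CVT\_DEBT}_i - \mathrm{CVT\_DEBT}_k)^2 \\
& + (\mathrm{PSK\_USE}_i - \mathrm{PSK\_USE}_k)^2 + (\mathrm{LSIZE}_i - \mathrm{LSIZE}_k)^2
\end{aligned}
$$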
where FIN_FLEX, ST_CREDIT, LT_INV, CVT_DEBT, and PSK_USE are the latent
growth variables representing the change in firms' financial flexibility, short-term credit,
long-term investment, convertible debt usage, and preferred stock usage respectively,
and LSIZE is the firm-size variable.
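The iterative relocation described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure actually run in this study: the function names, the attribute order, the random choice of initial centers, and the six made-up firm vectors are all hypothetical, and the distance is the plain squared Euclidean distance over the six clustering attributes.

```python
import random

# Assumed attribute order for each firm vector (illustrative only).
ATTRIBUTES = ["FIN_FLEX", "ST_CREDIT", "LT_INV", "CVT_DEBT", "PSK_USE", "LSIZE"]


def squared_distance(x, v):
    """Squared Euclidean distance between observation x and center v."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def k_means(points, c, max_iter=100, seed=0):
    """Iterative relocation: assign each point to its nearest center,
    then recompute each center as the mean of its members."""
    rng = random.Random(seed)
    # Assumption: initial centers are c randomly drawn observations.
    centers = [list(p) for p in rng.sample(points, c)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(c), key=lambda i: squared_distance(x, centers[i]))
            for x in points
        ]
        if new_labels == labels:  # no point changed cluster: stop
            break
        labels = new_labels
        for i in range(c):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up data: three firms with low values and three with high values.
firms = [
    [0.1, 0.2, 0.1, 0.0, 0.0, 5.0],
    [0.2, 0.1, 0.2, 0.1, 0.0, 5.2],
    [0.0, 0.3, 0.1, 0.0, 0.1, 4.9],
    [2.0, 1.9, 2.1, 1.0, 1.1, 9.0],
    [2.1, 2.0, 1.9, 1.1, 1.0, 9.1],
    [1.9, 2.1, 2.0, 0.9, 1.0, 8.8],
]
labels, centers = k_means(firms, c=2)
```

In this sketch the assignment step plays the role of the weights $u_{ik}$, and the center update is the mean condition for $v_i$. Because the result can depend on the initial centers, it is common to rerun such a procedure from several starting points.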
According to Bushee (1998), there are no standard objective criteria for choosing the
number of clusters. It is a long-standing question with no clear solution, and is often a
matter of trial and error, educated guess, and judgement (Hair et al., 2006). This study
uses the pseudo-F statistic (the ratio of between-cluster variance to within-cluster
variance) and the cubic clustering criterion (CCC), which compares the observed
R-squared to the approximate expected R-squared using an approximate variance-stabilizing
transformation, to choose the number of clusters. Higher positive values of both
statistics are better, provided there are no sudden jumps. Negative values are
indicative of outliers. Positive CCC values indicate that the R-squared is greater than
would be expected if the data sample were uniformly distributed, and are thus a sign of the
possible presence of clusters.
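As a rough illustration of the pseudo-F statistic, the following Python sketch computes it for a given partition. It assumes the usual Calinski-Harabasz form, in which the between-cluster sum of squares is divided by (c - 1) and the within-cluster sum of squares by (n - c); the function name and the toy data are hypothetical, and the CCC is not reproduced here.

```python
def pseudo_f(points, labels):
    """Pseudo-F = (between-cluster SS / (c - 1)) / (within-cluster SS / (n - c))."""
    n = len(points)
    clusters = sorted(set(labels))
    c = len(clusters)
    if c < 2 or n <= c:
        raise ValueError("need at least 2 clusters and more points than clusters")
    grand_mean = [sum(col) / n for col in zip(*points)]
    between = 0.0
    within = 0.0
    for lab in clusters:
        members = [x for x, l in zip(points, labels) if l == lab]
        center = [sum(col) / len(members) for col in zip(*members)]
        # Between: cluster size times squared distance of center to grand mean.
        between += len(members) * sum(
            (a - b) ** 2 for a, b in zip(center, grand_mean)
        )
        # Within: squared distances of members to their own center.
        within += sum(
            sum((a - b) ** 2 for a, b in zip(x, center)) for x in members
        )
    if within == 0.0:
        return float("inf")
    return (between / (c - 1)) / (within / (n - c))


# Toy one-dimensional data with two obvious groups.
toy = [[0.0], [2.0], [10.0], [12.0]]
good = pseudo_f(toy, [0, 0, 1, 1])  # partition matching the groups
poor = pseudo_f(toy, [0, 1, 0, 1])  # partition cutting across the groups
```

For the toy data the partition that matches the two groups yields a much larger pseudo-F than the one that cuts across them, which is the sense in which higher values are preferred.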
The use of the K-means clustering technique has several advantages. First, it
provides a description of each individual cluster, which helps to identify what is unique
to that cluster. Second, firms with similar operating and financing styles can be identified
and allocated to one group. Third, clusters with different operating and financing styles
can be compared with regard to their performance, which might be explained by those
differences. However, with K-means clustering, several analyses need to be
performed before concluding how many clusters can be obtained. Thus, this
research hypothesises that firm groups exist. This should not be a problem for the business
analysis if one closely checks the pseudo-F and CCC statistics for as long as heterogeneity
remains in the data sample.
This study uses the FASTCLUS procedure of the SAS software to select the initial K
centroids, where K is the pre-specified number of clusters, and then each observation is
allocated to the nearest centroid. The group of observations allocated to a centroid is
known as a cluster. K-means cluster analysis updates the centroid every time a new
observation is allocated to the cluster.
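Updating a centroid as observations join a cluster can be sketched as a running mean. The Python snippet below is a generic illustration of that arithmetic and is not taken from the FASTCLUS implementation; the function name and the numbers are hypothetical.

```python
def add_to_centroid(centroid, count, x):
    """Return the updated (centroid, count) after observation x joins a cluster
    whose current mean is `centroid` over `count` observations."""
    new_count = count + 1
    # Running-mean update: move the mean toward x by 1 / new_count of the gap.
    new_centroid = [v + (xi - v) / new_count for v, xi in zip(centroid, x)]
    return new_centroid, new_count


# A cluster of two observations with mean (1, 1) receives the point (4, 7).
centroid, count = add_to_centroid([1.0, 1.0], 2, [4.0, 7.0])
```

The update gives the same result as recomputing the mean over all members, but it needs only the current centroid and the cluster size.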
This algorithm was first proposed in financial research by Elton and Gruber (1971)
for reducing uncertainty about the inputs to Markowitz's model. Later, among others,
Goetzmann and Wachter (1995) used this procedure in real estate portfolio diversification.
The authors claim that this procedure addresses the problem of estimation error in the
Markowitz mean-variance model by aggregating cross-sectional asset series. Further,
they suggest that this model produces forecasts with higher precision by decreasing
dimensionality. Apart from these findings, they suggest that clustering offers clear
recommendations for improving diversification.