4.2 Data Sample, Variable Estimation and Research Methods

4.2.4 Research methods

This study intends to classify the universe of common stocks into homogeneous clusters based on basic corporate financing decisions and firm size, as explained under sub-section 4.2. There are two main types of cluster analysis. The first is hierarchical cluster analysis, which is used for a limited number of observations. The second is non-hierarchical (K-means) cluster analysis, which is used for a large number of observations. Cluster analysis and its techniques are detailed in Gordon (1999), Hair et al. (2006), and Everitt et al. (2011). This essay employs the K-means cluster analysis technique to cluster the firms and obtain internally homogeneous and externally heterogeneous clusters.

4.2.4.1 K-means cluster analysis

The simple K-means cluster analysis is a pattern recognition technique that groups objects (here, firms) to discover the common practice styles in a dataset. Each group is called a cluster. The technique joins firms together into clusters, thereby reducing the number of rows in the dataset. Firms within a cluster are similar to one another, and firms in different clusters are dissimilar, in terms of the clustering attributes. The grouping process starts from the similarity of the observed firms across the variables. Cluster analysis has no dependent variable. Similarity is quantified through distance measures between the objects within a group and between groups.

The cluster analysis employed in this research measures distance by the Euclidean distance, which is the length of a straight line between two observations. It helps to identify relatively homogeneous clusters of firms based on the selected variables. Each cluster has a centroid, also referred to as the cluster center, which is the mean of all observations within the cluster. Euclidean distances are used to allocate observations to clusters, and least squares estimation is used for the cluster centers. The following iterative relocation algorithm of K-means is used, along with the discreteness constraints, as given in Lauprete (1998).

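$$
\text{Minimize}\quad J(U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}\,(d_{ik})^{2}
$$

$$
\text{s.t.}\quad \sum_{i=1}^{c} u_{ik} = 1, \quad \forall k = 1, \dots, n
$$

$$
u_{ik} \in \{0, 1\}, \quad \forall i = 1, \dots, c, \ \forall k = 1, \dots, n
$$

$$
(d_{ik})^{2} = (x_k - v_i)' A_i (x_k - v_i), \quad \forall i, \ \forall k
$$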

where $U$ is a $(c \times n)$ matrix of weights $u_{ik}$, and $v$ is a $(p \times c)$ matrix whose $i$th column $v_i$ is the vector representing the center of cluster $i$. Thus $i$ indexes the cluster, $c$ is the number of clusters, $k$ indexes the data point, $n$ is the number of points, and $u_{ik}$ equals 1 if point $k$ belongs to cluster $i$ and 0 otherwise. $x_k$ is a $p$-dimensional vector, the $k$th data point.

$(d_{ik})^{2}$ is the distance of point $k$ from cluster $i$, defined in terms of a positive definite symmetric matrix $A_i$. K-means cluster analysis is an optimization process that implies the following condition:

$$
v_i = \frac{\sum_{k=1}^{n} u_{ik}\, x_k}{\sum_{k=1}^{n} u_{ik}}
$$

where the $v_i$ are the cluster centroids. This analysis partitions the firms into groups of similar firms. It should be noted that the resulting partition is near-optimal, but not necessarily optimal, since iterative relocation can stop at a local minimum of $J$. The distance measure that incorporates our corporate finance decisions and firm size is as follows:

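$$
\begin{aligned}
d(i,k)^{2} ={} & (\mathrm{FIN\_FLEX}_i - \mathrm{FIN\_FLEX}_k)^{2} + (\mathrm{ST\_CREDIT}_i - \mathrm{ST\_CREDIT}_k)^{2} \\
& + (\mathrm{LT\_INV}_i - \mathrm{LT\_INV}_k)^{2} + (\mathrm{CVT\_DEBT}_i - \mathrm{CVT\_DEBT}_k)^{2} \\
& + (\mathrm{PSK\_USE}_i - \mathrm{PSK\_USE}_k)^{2} + (\mathrm{LSIZE}_i - \mathrm{LSIZE}_k)^{2}
\end{aligned}
$$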

where FIN_FLEX, ST_CREDIT, LT_INV, CVT_DEBT, and PSK_USE are the latent growth variables representing the change in firms' financial flexibility, short-term credit, long-term investment, convertible debt usage, and preferred stock usage respectively, and LSIZE is the firm size variable. In this expression the subscripts i and k both denote observations being compared.
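The iterative relocation scheme and the squared Euclidean distance described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study (which relies on SAS FASTCLUS); it assumes $A_i$ is the identity matrix, takes the initial centers as an explicit argument, and uses made-up data for six hypothetical firms measured on the six clustering variables. The function name `kmeans` and its parameters are illustrative only.

```python
import numpy as np


def kmeans(X, init_centers, n_iter=100):
    """Iterative-relocation K-means with squared Euclidean distance (A_i = I).

    X is an (n, p) array: one row per firm, one column per clustering variable.
    init_centers is a (c, p) array of starting cluster centers (an assumption
    of this sketch; real software chooses its own seeds).
    Returns (labels, centers), where labels[k] is the cluster of point k.
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    c = len(centers)
    for _ in range(n_iter):
        # d2[k, i] = squared Euclidean distance from point k to center i.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Discreteness constraint: each point belongs to exactly one cluster.
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for i in range(c):
            members = X[labels == i]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[i] = members.mean(axis=0)  # v_i = mean of members
        if np.allclose(new_centers, centers):
            break  # no center moved, so the allocation is stable
        centers = new_centers
    return labels, centers


# Made-up data: columns stand for FIN_FLEX, ST_CREDIT, LT_INV, CVT_DEBT,
# PSK_USE, LSIZE. The first three firms and the last three form two groups.
X = np.array([
    [0.0, 0.1, 0.0, 0.1, 0.0, 1.0],
    [0.1, 0.0, 0.1, 0.0, 0.1, 1.2],
    [0.0, 0.0, 0.1, 0.1, 0.0, 1.1],
    [1.0, 0.9, 1.0, 0.9, 1.0, 5.0],
    [0.9, 1.0, 1.0, 1.0, 0.9, 5.2],
    [1.0, 1.0, 0.9, 0.9, 1.0, 5.1],
])

# Start from one firm in each group (rows 0 and 3).
labels, centers = kmeans(X, init_centers=X[[0, 3]])
print(labels)
print(centers)
```

Because the result depends on the starting centers, a different choice of `init_centers` can lead to a different, locally optimal partition, which is the sense in which the solution is near-optimal rather than guaranteed optimal.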

According to Bushee (1998), there are no standard objective criteria for choosing the number of clusters. It is a long-standing question with no clear solution, and often a matter of trial and error, educated guesswork, and judgement (Hair et al., 2006). This study uses the pseudo-F statistic (the ratio of between-cluster variance to within-cluster variance) and the cubic clustering criterion, CCC (a comparison of the observed R-squared to the approximate expected R-squared using an approximate variance-stabilizing transformation) to choose the number of clusters. Higher positive values of both statistics are better, provided there are no sudden jumps. Negative values are indicative of outliers. Positive CCC values indicate that the R-squared is greater than would be expected if the data sample were uniformly distributed, and are thus a sign of the possible presence of clusters.
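As an illustration of the pseudo-F statistic, the following Python sketch computes the ratio of between-cluster to within-cluster variance for a given allocation of observations to clusters. It assumes the common form $[B/(c-1)]/[W/(n-c)]$, where $B$ and $W$ are the between-cluster and within-cluster sums of squares; the statistic reported by a particular statistics package may be computed differently. The CCC is not computed here. The function name `pseudo_f` and the toy data are illustrative assumptions.

```python
import numpy as np


def pseudo_f(X, labels):
    """Pseudo-F = [B / (c - 1)] / [W / (n - c)].

    B is the between-cluster sum of squares and W the within-cluster sum of
    squares. X is an (n, p) array; labels assigns each row to a cluster.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    c = len(clusters)
    grand_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for i in clusters:
        members = X[labels == i]
        center = members.mean(axis=0)
        # Each cluster contributes its size times its squared distance
        # from the overall mean.
        between += len(members) * ((center - grand_mean) ** 2).sum()
        # Squared distances of the members from their own cluster center.
        within += ((members - center) ** 2).sum()
    return (between / (c - 1)) / (within / (n - c))


# Toy one-variable data with two obvious groups: {0, 1} and {10, 11}.
X = np.array([[0.0], [1.0], [10.0], [11.0]])

good = pseudo_f(X, [0, 0, 1, 1])  # allocation that matches the groups
poor = pseudo_f(X, [0, 1, 0, 1])  # allocation that mixes the groups
print(good, poor)
```

In this toy case the allocation that matches the two groups yields a pseudo-F of 200, while the mixed allocation yields 0.02, which shows why larger values are read as evidence of better-separated clusters.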

The use of K-means clustering has several advantages. First, it provides descriptions of the individual clusters, which help to identify what is unique in each cluster. Second, firms with similar operating and financing styles can be identified and allocated to one group. Third, clusters with different operating and financing styles can be compared with regard to their performance, which might be explained by those differences. However, with K-means clustering, several analyses need to be performed before concluding how many clusters can be obtained. Thus, this research hypothesises that firm groups exist. This should not be a problem for business analysis, provided one closely checks the pseudo-F and CCC statistics for as long as heterogeneity remains in the data sample.

This study uses the FASTCLUS procedure of the SAS software to select the initial K

centroids, where K is the pre-specified number of clusters; each observation is then

allocated to the nearest centroid. The group of observations allocated to a centroid is

known as a cluster. K-means cluster analysis updates the centroid every time a new


This algorithm was first proposed in financial research by Elton and Gruber (1971) for reducing uncertainty about the inputs to Markowitz's model. Later, Goetzmann and Wachter (1995), among others, used this procedure in real estate portfolio diversification. The authors claim that this procedure addresses the problem of estimation error in the Markowitz mean-variance model by aggregating cross-sectional asset series. Further, they suggest that this model produces forecasts with higher precision by decreasing dimensionality. In addition to these findings, they suggest that clustering offers clear recommendations for improving diversification.

In document Pattern recognition techniques and financial analysis : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Finance at Massey University, Palmerston North, New Zealand (Page 176-180)