Chapter 3 – Data, Measurement and Methods
3.5 Statistical Methods of Analysis in Chapter 5
3.5.1 Cluster Analysis
Cluster analysis helps to identify natural groupings in the data. It groups like with like on the basis of selected attributes. Grouping firms with similar perceptions of the components of the entrepreneurial ecosystem helps to identify the different ecosystems existing in Pakistan. These entrepreneurial ecosystems are then used to assess their differential effect on the performance of the firms operating within those clusters.
A wide range of clustering techniques and procedures has been developed over the last four decades. These are divided into two major groups: hierarchical clustering methods and disjoint clustering methods (Li et al., 2015). Other than statistical differences, the major difference between the two is that in hierarchical clustering the decision about the optimal number of groups can be made after the clustering approach has been applied, whereas in disjoint clustering the number of groups has to be decided beforehand. Moreover, hierarchical cluster analysis seeks minimum intra-group variation and maximum inter-group variation (Everitt et al., 2011; Kaufman and Rousseeuw, 2008).
Our objective is to identify the patterns in the firms' responses about the components of the entrepreneurial ecosystem of Pakistan, and thereby to determine the optimal number of groups (i.e. ecosystems); this number cannot be fixed in advance. Therefore, hierarchical clustering methods have been adopted for the classification of the data. These methods are further classified into agglomerative and divisive methods according to the way the groups are formed. Agglomerative methods start by treating each firm as a separate group, then gradually form larger groups of similar firms, and end with all the firms in one main group. Divisive methods, by contrast, start with all firms in one main group, then keep refining the groups by splitting off firms with dissimilar characteristics, and end with each firm in a separate group. Agglomerative methods have been more commonly used in recent research studies, and are used for this chapter.
The responses of the firms on the components of the entrepreneurial ecosystem, including government regulations, tax rates, corruption, access to finance, infrastructure, political instability, competition with the informal sector, the non-availability of an educated workforce, and electricity supply, have been used as covariates to determine the clusters in the data. The next step is to choose a similarity or dissimilarity measure, so that closely related firms are clustered together. These measures differ for continuous, categorical and mixed data.
Since our data on the components of the entrepreneurial ecosystem are categorical in nature, one of several similarity measures can be used. These include the matching method, the Jaccard method, the Russell method and the Dice method. How these similarity measures work can be explained through a simple example. Table 3.2, below, represents the binary responses of two firms i and j on the covariates of interest. The rows represent the presence or absence (1, 0) of a characteristic in firm i. Similarly, the columns represent the presence or absence (1, 0) of a characteristic in firm j. The cell value 'a' is the number of characteristics present in both firms i and j. The cell values 'b' and 'c' are the numbers of characteristics present in only one of the two firms: in firm i but not in firm j (b), or in firm j but not in firm i (c). The cell value 'd' is the number of characteristics absent from both i and j.
Table 3.2: A 2x2 response table
Firm i \ Firm j | 1 | 0
1               | a | b
0               | c | d
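As an illustration, the four cell counts of Table 3.2 can be tallied from two binary response vectors. The sketch below is not part of the original analysis; the function name `response_table` and the two response vectors are hypothetical, chosen only to show how a, b, c and d arise.

```python
def response_table(firm_i, firm_j):
    """Return the counts (a, b, c, d) of Table 3.2 for two 0/1 response vectors."""
    if len(firm_i) != len(firm_j):
        raise ValueError("both firms must answer the same set of items")
    a = b = c = d = 0
    for x, y in zip(firm_i, firm_j):
        if x == 1 and y == 1:
            a += 1          # present in both firms
        elif x == 1 and y == 0:
            b += 1          # present in firm i only
        elif x == 0 and y == 1:
            c += 1          # present in firm j only
        else:
            d += 1          # absent from both firms (inputs assumed to be 0/1)
    return a, b, c, d


# Hypothetical answers of two firms on nine yes/no items (illustrative only).
firm_i = [1, 1, 0, 1, 0, 0, 1, 0, 1]
firm_j = [1, 0, 0, 1, 1, 0, 1, 0, 0]
print(response_table(firm_i, firm_j))  # (3, 2, 1, 3)
```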
The Russell method calculates the similarity between firms i and j as the proportion of characteristics that are present in both firms, as shown in equation 3.13:
\[ \frac{a}{a+b+c+d} \qquad (3.13) \]
The Jaccard method is similar to the Russell method, but it excludes the cases in which the characteristic is absent from both firms (d). The calculation in the Jaccard method is shown in equation 3.14:
\[ \frac{a}{a+b+c} \qquad (3.14) \]
The matching method is another variation on the Jaccard method. It counts both joint presences (a) and joint absences (d) as matches when calculating the similarity. The calculation according to the matching method is given below in equation 3.15:
\[ \frac{a+d}{a+b+c+d} \qquad (3.15) \]
The Dice coefficient is the final method. It is also closely related to the Jaccard method, except that it assigns more weight to the characteristics present in both firms. The calculation of the Dice coefficient is given in equation 3.16:
\[ \frac{2a}{2a+b+c} \qquad (3.16) \]
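To make the differences between the four measures concrete, the following sketch evaluates them for one set of cell counts. It takes the Russell coefficient as the proportion of jointly present characteristics, as described in the text. The function name, the example counts, and the convention of returning 0.0 when no characteristic is present in either firm are assumptions made for illustration only.

```python
def similarity_coefficients(a, b, c, d):
    """Russell, Jaccard, matching and Dice similarities from the 2x2 counts."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one characteristic is required")
    presence = a + b + c  # items present in at least one of the two firms
    return {
        "russell": a / n,
        # Assumed convention: similarity 0.0 if nothing is present in either firm.
        "jaccard": a / presence if presence else 0.0,
        "matching": (a + d) / n,
        "dice": 2 * a / (2 * a + b + c) if presence else 0.0,
    }


# Hypothetical counts: 3 joint presences, 2 + 1 mismatches, 4 joint absences.
for name, value in similarity_coefficients(3, 2, 1, 4).items():
    print(f"{name}: {value:.3f}")
# russell: 0.300
# jaccard: 0.500
# matching: 0.700
# dice: 0.667
```

In this example the matching coefficient is the largest of the four because it is the only one that credits the joint absences (d) as agreement.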
The matching method is the most commonly used similarity measure for categorical data (Finch, 2005; Murtagh and Legendre, 2014). It results in the smallest distances among the firms and a refined clustering of the data by considering both similar and dissimilar attributes of the firms. Therefore, Ward's linkage algorithm with the matching method as the similarity measure is adopted in this study for the classification of the firms.6
Hierarchical clustering with Ward's linkage algorithm and the matching method as the similarity measure begins by considering each firm as a separate cluster; in subsequent stages, the most similar firms and clusters are merged into larger clusters (Everitt et al., 2011). This process ends when all the firms become part of one cluster. The decision on the meaningful number of clusters is made on the basis of: (1) homogeneity within the clusters; (2) heterogeneity between the clusters; and (3) a balanced distribution of firms across the clusters. Moreover, a dendrogram is used to give a structural view of how firms are assigned to different clusters and, lower down the dendrogram, how different clusters merge to form bigger clusters of similar firms.
6 Stata 14 has been used to implement the hierarchical clustering methods, using Ward's linkage among the agglomerative methods. The selection of an appropriate set of covariates is the first step in the implementation of the routine.
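The agglomerative procedure with Ward's linkage can be sketched in a few lines. The example below is a minimal illustration in Python, not the Stata routine used in this study. It assumes binary responses and takes the number of mismatching items between two firms as the initial dissimilarity; for 0/1 data this equals the squared Euclidean distance and is the complement of the matching coefficient multiplied by the number of items. Dissimilarities are updated after each merge with the Lance-Williams recurrence for Ward's method. The function names and the six firms are hypothetical.

```python
from itertools import combinations


def mismatches(x, y):
    """Number of items on which two 0/1 response vectors disagree (b + c)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def ward_clusters(rows, n_clusters):
    """Agglomerate firms with Ward's criterion until n_clusters remain."""
    clusters = {i: [i] for i in range(len(rows))}   # cluster id -> member firms
    dist = {frozenset((i, j)): float(mismatches(rows[i], rows[j]))
            for i, j in combinations(clusters, 2)}
    next_id = len(rows)
    while len(clusters) > n_clusters:
        pair = min(dist, key=dist.get)              # closest pair of clusters
        i, j = tuple(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(clusters[i]), len(clusters[j])
        updated = {}
        for m in clusters:
            if m in (i, j):
                continue
            n_m = len(clusters[m])
            d_im = dist.pop(frozenset((i, m)))
            d_jm = dist.pop(frozenset((j, m)))
            # Lance-Williams update for Ward's linkage (squared distances).
            updated[m] = ((n_i + n_m) * d_im + (n_j + n_m) * d_jm
                          - n_m * d_ij) / (n_i + n_j + n_m)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for m, value in updated.items():
            dist[frozenset((next_id, m))] = value
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical yes/no responses of six firms on six items (illustrative only).
firms = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
print(ward_clusters(firms, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Stopping at two clusters recovers the two groups of firms with similar response patterns. Letting the loop run until a single cluster remains would give the full merge sequence that a dendrogram displays.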
The sensitivity of the cluster analysis was tested by using different similarity measures for categorical data, including the matching method, the Jaccard method and the Russell method. The results did not differ significantly; however, the matching method produced clusters that were more homogeneous within groups and more heterogeneous between them.
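One simple way to quantify how far two cluster solutions agree is the Rand index: the share of firm pairs that both solutions treat alike (together in both, or apart in both). It is offered here only as an illustration of such a comparison; the text does not state which agreement measure, if any, was used, and the label vectors below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of firm pairs on which two cluster assignments agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must label the same firms")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels for six firms under two similarity measures.
solution_matching = [0, 0, 0, 1, 1, 1]
solution_jaccard = [0, 0, 1, 1, 1, 1]
print(round(rand_index(solution_matching, solution_jaccard), 3))  # 0.667
```

A value of 1 means that the two solutions group the firms identically, regardless of how the clusters are numbered.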