3 RESEARCH METHODOLOGY
3.8 Data selection and cleaning procedures
(a) Data selection
The previous descriptive analysis confirmed that there were no missing data. This was expected, because the software did not allow the questionnaire to be finished without answering all the questions. Even so, all responses were inspected to check for possible problems. The questionnaire data were examined with respect to the employee function, the company size, the company nationality and unusual data values.
As stated before, the survey defined middle or top managers, from either the business or the IT side, as potential respondents. Some responses came from employees whose function was not adequate for answering the questionnaire, and these were discarded. Other responses were discarded because the company did not meet the minimum size criterion of a medium or large company; these answers came from respondents working at micro or small companies. A few further cases were rejected because the employees worked for non-Portuguese companies. Table 30 presents the number of rejected responses by reason for rejection, the share of each reason among all rejections, and the percentage of rejections relative to the total number of respondents.
Code   Rejection reason                Number of rejections   %
IEF    Inadequate Employee Function    23                     42%
ICS    Inadequate Company Size         7                      13%
NPC    Not a Portuguese Company        3                      5%
OR     Outlier Rejection               22                     40%
       Total number of rejections      55                     100%
       Total number of respondents     449
       % of rejection/respondents                             12.2%
Table 30. Number of rejected responses by reason for rejection, and percentage of rejections relative to the total number of respondents
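The percentages in Table 30 can be checked with a few lines of arithmetic. The following is a minimal sketch that only reuses the counts reported in the table; the variable names are illustrative and are not part of the original analysis.

```python
# Rejection counts copied from Table 30 (codes as used in the table).
rejections = {"IEF": 23, "ICS": 7, "NPC": 3, "OR": 22}
total_respondents = 449

total_rejections = sum(rejections.values())

# Share of each reason among all rejections, rounded to whole percentages.
shares = {code: round(100 * n / total_rejections) for code, n in rejections.items()}

# Share of rejected responses among all respondents, in percent.
rejection_rate = 100 * total_rejections / total_respondents

# Responses left after all rejections (simple subtraction, not stated in the table).
retained = total_respondents - total_rejections

print(shares, round(rejection_rate, 1), retained)
```

The rounded shares add up to 100%, and the overall rejection rate rounds to the 12.2% shown in the last row of the table.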
In addition, the answers were examined for possible outlier effects. Outliers can be univariate or multivariate. However, it would be imprudent to discard individuals just because they responded at the low or high end of a narrow scale such as the one used in this survey (a Likert scale ranging from 1 to 5). Therefore, each univariate distribution was tested for normality (using skewness and kurtosis tests), but not for univariate outliers.
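As an illustration of the statistics behind such normality checks, the sketch below computes the sample skewness and the excess kurtosis of one item from its population moments. It is only a sketch under stated assumptions: the function name and the example answers are hypothetical, and the specific test procedure and software used in the study are not reproduced here.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (second central moment)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("zero variance: skewness and kurtosis are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical answers to one Likert item (1 to 5); not taken from the survey.
answers = [1, 2, 3, 4, 5]
skew, kurt = skewness_and_excess_kurtosis(answers)
print(skew, kurt)
```

For a normal distribution both values are close to zero, so large absolute values indicate a departure from normality.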
Nevertheless, testing for multivariate outliers could make sense. One possible method of detecting outliers is to use a scatterplot, in which each point represents an answer, combined with a regression line against which all the points can be compared visually, so that possible outliers become evident.
Figure 41 presents a scatterplot graph with enterprises represented as single points in the incentive and alignment dimensions.
Figure 41: Scatterplot with enterprises plotted on the incentive and alignment axes, showing a rejected outlier response
Although they may not be easy to detect, some outliers may come from intentional or motivated misreporting, or from careless responses. Respondents who did not pay enough attention to each question and answered without careful reflection may repeat the same answer excessively, producing a large kurtosis value. A clear case is shown in Figure 41. The figure shows a point representing an outlier response that was rejected because it is clearly not correct: all answers in the incentive domain were classified as 1 and all answers in the alignment domain were classified as 5. Other outliers are more difficult to detect and, even when they are detected, it is less clear whether they should be rejected.
One possible way to help detect or confirm deviant responses and behaviours is to use the kurtosis and skewness tests in a multivariate approach. Indeed, the rejected response presented in Figure 41 has a high kurtosis value (infinite if the incentive and the alignment domains are analyzed separately).
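The pattern of the rejected response, with identical answers throughout each domain, can be flagged directly. The following is a minimal sketch, assuming a hypothetical response with four items per domain (the real number of items is not stated here). It checks for zero variance within a domain, which is the situation in which the kurtosis of that domain is not finite.

```python
def is_straight_lined(answers):
    """Return True when every answer in a domain is identical (zero variance)."""
    return len(set(answers)) == 1


# Hypothetical response shaped like the rejected case; the item counts are assumed.
response = {
    "incentive": [1, 1, 1, 1],
    "alignment": [5, 5, 5, 5],
}

# Flag each domain separately, then mark the response when all domains are flagged.
flags = {domain: is_straight_lined(items) for domain, items in response.items()}
suspicious = all(flags.values())
print(flags, suspicious)
```

A flag of this kind is only a prompt for closer inspection, not a rejection rule in itself.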
Another possible way to check for abnormal responses is to identify data points lying clearly outside the general linear pattern, whose midline is the regression line defined with alignment as the dependent variable and incentives as the independent variable. Usually, observations with high standardized residual values are likely to be outliers. A standardized residual above 2.24 or below -2.24 requires close scrutiny, since it indicates that an observation is unusual in its Y value (Aguinis, Gottfredson, & Joo, 2013). The case shown at the top left of Figure 41 clearly violates this rule as well (with a standardized residual of -5.41).
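The residual screening described above can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the data points are hypothetical, the function name is illustrative, and each residual is simply divided by the residual standard error of an ordinary least squares fit, without any leverage adjustment.

```python
from math import sqrt

THRESHOLD = 2.24  # cut-off cited in the text (Aguinis, Gottfredson, & Joo, 2013)


def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares and return residuals divided by their standard error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / s for e in residuals]


# Hypothetical scores: nine consistent points plus one deviant point (the last one).
incentive = [1, 2, 3, 4, 5, 1.5, 2.5, 3.5, 4.5, 1.0]
alignment = [1, 2, 3, 4, 5, 1.5, 2.5, 3.5, 4.5, 5.0]

z = standardized_residuals(incentive, alignment)
flagged = [i for i, zi in enumerate(z) if abs(zi) > THRESHOLD]
print(flagged)
```

Only the flagged observations need a closer qualitative look; the threshold marks them as unusual but does not decide whether they are rejected.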
Of course, detecting outliers does not mean that they should be discarded without reflection, nor that they should be ignored. Their detection is an opportunity to think about the reasons why an observation may be different. Only after closer analysis is the decision made to drop them or not. This is why, in case of doubt, several other responses were kept as valid in the analysis of this survey data. Indeed, they could be outliers representing legitimate cases sampled from the correct population, or outliers arising from faulty distributional assumptions (Osborne & Overbay, 2004).
[Figure 41 axis labels: Global Assessment of Alignment; Global Assessment of Incentive (both scaled from 1.0 to 5.0). Annotation: rejected outlier response.]
(b) Cleaning procedures
The objective of this section is to decide how to handle possible quality problems in the data.
Although it was difficult to check all the questions, it was at least possible to verify those relating to the enterprise, namely its activity sector and its size. These two questions could be roughly checked because each token was associated with a particular enterprise in the D&B database, which contains the same information about that enterprise. The activity sector and the size were analyzed and some corrections were made. An example of a cleaning procedure applied to the answers of one respondent is presented below, showing the possible complexity of the relations between the companies of a group of enterprises and their employees.
One respondent, an executive board member of a medium-sized Information Technology and Services company, answered the questionnaire and classified the company as an IT company. However, after consulting and analyzing the respondent's profile on the LinkedIn social network, it became clear that this professional was not only responsible for this company, but was also the Chief Information Officer of a large land transport infrastructure management company with more than 1000 employees.
Indeed, the medium-sized company is part of the group of enterprises headed by the second company. Consequently, the class code for the size of the company and the code for the economic activity were changed accordingly. In addition, a tax identification code had to be recorded for each respondent to act as a database key, allowing later aggregation at company level. This permits the computation of averages and other statistics at company level, and their later analysis. In this example, the tax identification code became that of the large company, the holding company of the group.
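The role of the tax identification code as an aggregation key can be illustrated with a short sketch. The records, the placeholder codes "TAX-A" and "TAX-B" and the scores below are all hypothetical and only show how responses sharing one key are averaged at company level.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (tax identification code, score); all values are invented.
responses = [
    ("TAX-A", 4.0),
    ("TAX-A", 3.0),
    ("TAX-B", 2.5),
]

# Group the individual scores under the company key.
scores_by_company = defaultdict(list)
for tax_id, score in responses:
    scores_by_company[tax_id].append(score)

# One average per company, computed over all of its respondents.
company_means = {tax_id: mean(scores) for tax_id, scores in scores_by_company.items()}
print(company_means)
```

Assigning the holding company's code to a respondent, as in the example above, therefore places that respondent's answers in the group-level average.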
As in the previous example, other responses were adjusted. This was a very time-consuming process, requiring a careful qualitative analysis of the respondents' professional experience, of the companies listed in their curricula, and of the possible relations between those and other companies. Even so, although a substantial effort was made to analyze most of these complex networks, the process is complicated and the available information is incomplete, so it cannot be stated that all the appropriate changes were made.