Partial Correlations and Linear Regression

IV. METHODOLOGY

4.4 Data Analysis

4.4.2 Partial Correlations and Linear Regression

The second portion of my methodology focused on preparing the metrics I gathered for analysis. This process occurred in three main phases: cleaning the data; understanding and normalizing the data; and analysing the data using SPSS.

In the first phase of my data analysis, I focused on cleaning the dataset, which I accomplished by eliminating redundancies (i.e. usernames that I accidentally entered twice), searching for any omissions during the collection, and formatting the data for use in SPSS. This included removing additional spaces at the end of usernames, replacing missing data with a period, and forming a new variable that measured the total duration of community membership in days (subtracting the first login from the most recent login).
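As an illustration only (the cleaning itself was prepared for SPSS), the steps described above could be sketched in Python as follows. The field names `username`, `first_login` and `last_login`, the function name `clean_records`, and the sample records are all hypothetical and are not taken from the study's dataset.

```python
from datetime import date

def clean_records(records):
    """Strip stray whitespace from usernames, drop repeated usernames,
    and add membership duration in days (last login minus first login)."""
    seen = set()
    cleaned = []
    for rec in records:
        username = rec["username"].strip()
        if username in seen:
            continue  # redundant entry: keep only the first occurrence
        seen.add(username)
        first, last = rec["first_login"], rec["last_login"]
        # Missing dates stay missing (None) rather than being guessed.
        days = (last - first).days if first and last else None
        cleaned.append({**rec, "username": username, "membership_days": days})
    return cleaned

# Invented records: one username entered twice (once with a trailing space)
# and one record with a missing login date.
records = [
    {"username": "alice ", "first_login": date(2010, 1, 1), "last_login": date(2010, 1, 31)},
    {"username": "alice", "first_login": date(2010, 1, 1), "last_login": date(2010, 1, 31)},
    {"username": "bob", "first_login": date(2010, 3, 1), "last_login": None},
]
cleaned = clean_records(records)
```

In this sketch a missing value is represented by `None`, which plays the role that the period plays in the SPSS data file.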

In the second phase of data analysis, I developed an overall ‘picture’ of the average expert in the community, looking at general commenting trends and practices. Specifically, I used SPSS to compute descriptive statistics such as the mean, mode, and standard deviation to develop a better understanding of the “expert group” identified in the course of my research. In this phase of my analysis, I developed a greater appreciation of the range of the data, and gathered information on the behaviour of the “average” expert member in the ENoP. I have discussed the issue of using expert nominations as a proxy for perceived expertise in 4.3.3. In essence, this was a small yet important step that allowed me to develop a general understanding of my data and the contributory patterns of expert members.

In addition, it was during this stage of analysis that I checked my data for normality and homoscedasticity by looking at the median, mode, skew and kurtosis of my data using Frequencies and Descriptives in SPSS. At this time, I confirmed that most of my data were positively skewed. Given that my data were highly susceptible to skew due to natural process limits, this is unsurprising. These natural process limits are in place because a contributor cannot contribute fewer than zero times, nor can they receive fewer than one expert nomination and still be part of my sample.
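To make the skew and kurtosis checks concrete, the following is a minimal Python sketch of the moment-based definitions. It is an illustration under stated assumptions, not the SPSS procedure: statistical packages such as SPSS typically report bias-adjusted sample estimates, so their values will differ somewhat from these simple population formulas. The function names and the `counts` data are invented for the example.

```python
def central_moment(xs, k):
    """k-th central moment of xs, dividing by n (population form)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs):
    """Moment-based skewness: positive when the right tail is longer."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

# Invented count data bounded below, with one very active contributor.
counts = [1, 1, 1, 2, 2, 3, 5, 40]
print("skew:", skewness(counts))
print("excess kurtosis:", excess_kurtosis(counts))
```

A lower bound on the scores combined with a few very large values pulls the mean to the right of the bulk of the data, which is what a positive skew value reflects.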

Following standard procedure to address the issue of skew, I first examined the data for extreme outliers. I found only one, which was erroneous, and corrected it. This outlier had no effect on the skew of my data. I then looked at the residuals as further confirmatory evidence that my data were not normally distributed. When the assumption of normality holds, one expects the residuals to be roughly normal, with a mean of zero and a constant variance around that mean. I checked my residuals by first looking at Q-Q plots to see whether my data exhibited skewness, which they did, and I therefore took the next step of examining a histogram of each of the constructs individually. To give an example, one of my constructs, Total Expert Nominations, had a median that was eight times its mode. Furthermore, process limits curtailed the development of a normal distribution by creating a “cut-off” for scores, as one could not receive a score below zero. A histogram of total expert nominations depicts the extent of this skew.

Part of my research questions focused on understanding the characteristics associated with perceptions of expertise. However, my data violated the assumption of normality. To be able to use the data in a linear regression (and thus compare among the various factors associated with expert nominations), it was necessary to normalize the data. Since I wanted to compare among variables, and examine each variable by itself, I opted, for ease of interpretation, not to use a Box-Cox transformation. Using logarithm transformations as much as possible allowed me to compare more directly between variables. I did so by following the steps prescribed by Tabachnick and Fidell (2007); specifically, I used the Transform function in SPSS to calculate the logarithm of the variable plus one.
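The log-plus-one transformation described above can be sketched in a few lines of Python. This is an illustration only: the base of the logarithm is not stated above, so base 10 is an assumption here, and the function name and sample values are invented.

```python
import math

def log_plus_one(xs):
    """Return log10(x + 1) for each value; the +1 keeps zero counts defined."""
    return [math.log10(x + 1) for x in xs]

# Invented counts spanning several orders of magnitude.
raw = [0, 9, 99, 999]
transformed = log_plus_one(raw)
print(transformed)
```

Adding one before taking the logarithm means a count of zero maps to zero, while each further order of magnitude in the raw counts adds roughly one unit on the transformed scale, which is how the transformation pulls in a long right tail.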

Furthermore, since my objective was to form a sense of how these variables related to each other, as opposed to forming causal relationships, the logarithm transformations were the most appropriate selection for what I wished to achieve in my analysis. Since I had strong reason to believe that many of the variables I was working with were correlated with each other, the selection of a linear regression made even more sense.

I performed logarithm transformations on the following variables, which had skew values between 3.8 and 7, and kurtosis between 39 and 80:

• Number of Logins

• Expert Nominations Given

• Total Expert Nominations Received

• Queries

• Replies

• Total Posts

• Forums Commented in (total number of unique forums one commented in)

• Specialist Areas Commented in (total number of unique speciality areas one commented in)

I performed a square root transformation on the number of specialist areas commented in, which displayed a smaller skew and kurtosis, in keeping with the smaller likelihood of outliers and the smaller standard deviation of the construct. These transformations were important because they allowed me to use multiple linear regression rather than non-parametric analyses that might be less appropriate for addressing my research questions. In the course of these analyses, I noticed that one of the metrics I had collected, FAQ written, displayed an extremely large skew and kurtosis, and I used the more intensive inverse transformation to normalize the data. Since the data included 0, I used the form 1/(x + 1) to complete the transformation. However, even this transformation did not produce meaningful data, because the variable was primarily dichotomous in nature. Since this factor did not feature heavily in theorising around ENoPs, and it exhibited very low variability in its measurement, I omitted it from my data set because these features made it extremely difficult to draw relevant conclusions related to the construct.
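The square root and inverse transformations mentioned above can be sketched in Python as follows. This is a hedged illustration rather than the SPSS computation; the function names are hypothetical and the `faq_counts` values are invented to mimic a variable that is almost entirely zeros and ones.

```python
import math

def sqrt_transform(xs):
    """Square root of each value: a milder correction for moderate skew."""
    return [math.sqrt(x) for x in xs]

def inverse_plus_one(xs):
    """1 / (x + 1) for each value; the +1 avoids division by zero."""
    return [1 / (x + 1) for x in xs]

# Invented, near-dichotomous data: most members wrote no FAQ, a few wrote one.
faq_counts = [0, 0, 0, 1, 0, 0, 1, 0]
faq_transformed = inverse_plus_one(faq_counts)

# A variable with two distinct values still has two after the transformation.
print(sorted(set(faq_counts)), sorted(set(faq_transformed)))
```

The sketch shows why no transformation could help in this case: a monotonic transformation only relabels the two values of a dichotomous variable, so it cannot create the spread that a roughly normal distribution requires.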

Finally, in the third phase of my data analysis, I conducted analyses using SPSS, which allowed me to examine the relationship between two or more variables. To do so, I performed several partial correlations, looking at the relationships between several variables I measured in my study. This was undertaken with the objective of understanding the direction of each relationship, as well as its strength. In this step, I examined a variety of relationships, exploring how the variables fit together. I performed correlations (and partial correlations) between:

• Expert Nominations Received and Total Posts

• Expert Nominations Received and Expert Nominations Given (controlling for total posts)

• Expert Nominations Received and Queries (controlling for total posts and expert nominations given)

• Expert Nominations Received and Replies (controlling for total posts and expert nominations given)

• Expert Nominations Received and Total Number of Forums Commented in (controlling for total posts and expert nominations given)

• Expert Nominations Received and Total Number of Specialties Commented in (controlling for total posts and expert nominations given)
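To illustrate what "controlling for" a variable means in the list above, the following Python sketch computes a partial correlation using the standard recursive formula. It is a sketch under stated assumptions and not the SPSS output of this study: the function names are hypothetical, the numbers are invented so that the result is easy to check by hand, and degenerate cases (for example a control variable perfectly correlated with another variable) are not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y with every series in `controls` held constant,
    removing one control at a time with the recursive formula."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented data: both nomination counts rise with total posts, but are
# otherwise unrelated to each other.
total_posts = [5, 5, 5, 5, 3, 3, 3, 3]
noms_received = [6, 6, 4, 4, 4, 4, 2, 2]
noms_given = [6, 4, 6, 4, 4, 2, 4, 2]

zero_order = pearson(noms_received, noms_given)
controlled = partial_corr(noms_received, noms_given, [total_posts])
print(zero_order, controlled)
```

In this invented example the zero-order correlation is 0.5, yet it vanishes once total posts is held constant, because the apparent association is produced entirely by the shared dependence on posting volume. Passing two series in `controls` corresponds to the cases above that control for both total posts and expert nominations given.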

Table 8 below extends Table 5 to include the types of techniques that will be used to address my research questions.

Table 8: Relationship between literature, research questions & methodology