

2. Methodology

2.6. User Centred Evaluation (UCE)

The evaluation of user centred systems is notoriously difficult, primarily because the field is subject to a large degree of bias and numerous factors must be taken into account. The first stage is a subjective evaluation, made on the basis of questionnaire responses and interview findings [75]. The second stage is an objective evaluation, based on the data log files generated through practical usage of the software [68].

The most widely used method of evaluating user experience is the user centred evaluation (UCE) framework, which analyses, from a subjective standpoint, the attitude of the users and their perception of the quality of service offered by the application. This approach is an effective means of appraising experimental systems [72, 140].

In this research, the evaluations concentrate primarily on effectiveness and efficiency, as these attributes can be used to determine which specific aspects of the software played a key role in satisfying businesses' and users' expectations, in eliciting their acceptance, and in achieving a high level of system performance.

Likert designed a summative rating scale referred to as the Likert scale [98]. This scale is widely used in research, particularly in questionnaires, because it is the simplest rating scale to compile [81]: respondents are asked to indicate the extent of their agreement or disagreement with a given statement [34]. For the purposes of the research presented in this thesis, a Likert scale was provided as the response option for the closed-ended questions. Each statement used five ordered response levels (1, 2, 3, 4 and 5) with a neutral midpoint (3), to prevent acquiescence bias.

Furthermore, as usability is generally connected to system functionality, this study assesses effectiveness and efficiency at two levels of system granularity: the overall system and the sub-system functionalities.

The highly regarded System Usability Scale (SUS) [22] is employed to evaluate the first level, the overall system. It consists of a ten-item Likert scale that provides a broad overview of business owners' and Internet users' perceptions of overall usability. The scale was designed by Brooke in 1986 to quickly determine the response of consumers to a specific product or service. It is widely used in research and business and is also cheap, as it is non-proprietary. Furthermore, as the scale is technology agnostic, SUS can easily be adapted to assess a variety of items, such as websites, applications, software or hardware. The scale is also quick and easy for both researchers and respondents to use, and it generates a single score, which is simple to interpret [10]. The ten items or statements are listed below.

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.

A five-point Likert scale, ranging from ‘strongly disagree (1)’ to ‘strongly agree (5)’, is used to rate the ten statements in the SUS. The statements alternate between positive and negative wording; thus, for a usable system, higher ratings are expected for Questions 1, 3, 5, 7 and 9 and lower ratings for Questions 2, 4, 6, 8 and 10. The score generated by the SUS falls between zero and one hundred, with a higher score indicating a higher degree of usability. Thus, an outstanding system would obtain a score of 90+, while a good system would obtain a score of between 70 and 80 [10].
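The conversion of the ten ratings into a single score can be sketched as follows. This is a minimal illustration of the standard SUS scoring rule (odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5); the function name `sus_score` and the sample ratings are hypothetical and are not taken from the evaluations in this thesis.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one respondent.

    `responses` holds the ten ratings (1-5) in questionnaire order.
    The function name and interface are illustrative assumptions.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Positively worded items (1, 3, 5, 7, 9): higher is better.
            total += rating - 1
        else:
            # Negatively worded items (2, 4, 6, 8, 10): lower is better.
            total += 5 - rating
    # Each item contributes 0-4, so the sum (0-40) is scaled to 0-100.
    return total * 2.5


# Hypothetical respondent: agrees with positive items, disagrees with negative ones.
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))
```

A respondent who selects the neutral midpoint for every statement obtains a score of 50 under this rule, which is why the alternating wording does not bias the final score in either direction.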

The Likert scale was also applied in further questions, which were posed to judge the effectiveness and efficiency of sub-system functionalities. Each question referred to a single system function or feature, with particular emphasis on its effectiveness and ease of use. A five-point Likert scale was provided, which ranged from ‘very useless/hard to use (1)’ to ‘very useful/easy to use (5)’. The evaluation processes discussed in Chapters 6-9, sections 6.3, 7.4, 8.3 and 9.4, utilised this scale.

The validity and credibility of the questionnaire must be guaranteed by the analysis method used to process the data collected using the research methods discussed [122]. Descriptive statistics were then applied to synthesise and discuss the findings.

For the purposes of this study, Cronbach’s Alpha [51] is employed to measure reliability, as this method is suitable for measuring internal consistency, particularly in studies that use Likert scales, as the present one does.

Cronbach’s Alpha normally ranges from 0 to 1. Although the coefficient has no strict lower bound, a higher level of internal consistency is indicated by a score close to 1.0. As illustrated in Table 2.1 [66], George and Mallery argue that a Cronbach’s Alpha value of at least 0.8 is desirable [69]. Therefore, to determine the reliability of the evaluation processes in this study, a baseline Cronbach’s Alpha value of 0.8 has been set (sections 6.3, 7.4, 8.3 and 9.4).

Table 2.1 Rule of thumb for describing internal consistency [66]

Cronbach’s alpha     Internal consistency
α ≥ 0.9              Excellent (High Stakes Testing)
0.7 ≤ α < 0.9        Good (Low Stakes Testing)
0.6 ≤ α < 0.7        Acceptable
0.5 ≤ α < 0.6        Poor
α < 0.5              Unacceptable
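As an illustration of how the coefficient is obtained and then read against Table 2.1, the sketch below computes Cronbach’s Alpha with the usual formula α = k/(k−1) · (1 − Σ item variances / variance of the respondents' total scores), where k is the number of items. The function names, the row-per-respondent layout and the sample data are assumptions made for this example; they do not reproduce the analysis carried out in the thesis.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item ratings.

    `rows` is a list with one inner list per respondent and one
    rating per questionnaire item (an assumed layout).
    """
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(rows[0])
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*rows)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def describe_consistency(alpha):
    """Map alpha to the labels of Table 2.1."""
    if alpha >= 0.9:
        return "Excellent"
    if alpha >= 0.7:
        return "Good"
    if alpha >= 0.6:
        return "Acceptable"
    if alpha >= 0.5:
        return "Poor"
    return "Unacceptable"


# Hypothetical ratings: four respondents, three items.
ratings = [[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 2]]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3), describe_consistency(alpha))
```

With the baseline adopted in this study, a questionnaire would be treated as reliable when the computed value is at least 0.8.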

In order to ascertain whether two sets of data are significantly different, it is necessary to apply a statistical hypothesis test. For the purposes of this study, this is the T-test [125], which specifically addresses the inference problems associated with "small" samples. In sections 6.3, 7.4, 8.3 and 9.4, the evaluations are all based on the paired T-test, and each compares the average score of all of the features and functions with the neutral response of (3). It was found that the result was significant at p ≤ 0.05 (the conventional threshold for statistical significance). Furthermore, the Mann-Whitney U test [125], which is a nonparametric test, has also been used in all of the evaluations in sections 6.3, 7.4, 8.3 and 9.4, again comparing the average score of all of the features and functions with the neutral response of (3). The Mann-Whitney U test is important as it performs two functions: firstly, it assesses whether two samples are drawn from the same population and, secondly, it tests whether the two population means are equal. In this case, it was found that the data reflected a normal distribution at p ≤ 0.05.
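The two test statistics described above can be sketched as follows. The example pairs every rating with the neutral response of 3, so the paired T-test reduces to a T-test on the differences from 3, and it computes the Mann-Whitney U statistic by direct pairwise comparison. The function names and ratings are hypothetical, and the critical value of 2.262 (two-tailed, 0.05 level, 9 degrees of freedom) is taken from standard tables rather than from the thesis; the p-value of the U statistic would still need to be looked up or computed with a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic_vs_neutral(scores, neutral=3.0):
    """Paired t statistic of `scores` against a constant neutral rating.

    Returns (t, degrees_of_freedom). Names are illustrative assumptions.
    """
    diffs = [s - neutral for s in scores]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


def mann_whitney_u(x, y):
    """U statistic for sample x: counts pairs where x beats y (ties count 0.5)."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical ratings from ten respondents for one feature.
scores = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]
t, df = t_statistic_vs_neutral(scores)
print(round(t, 3), df, abs(t) > 2.262)  # compare with the tabulated critical value

u = mann_whitney_u(scores, [3] * len(scores))
print(u, "out of a maximum of", len(scores) ** 2)
```

A U value close to its maximum (the product of the two sample sizes) indicates that the ratings lie predominantly above the neutral response, whereas a value near half the maximum indicates no systematic difference.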