
Measuring Teaching Quality in Hong Kong’s Higher Education:

Reliability and Validity of Student Ratings

Kwok-fai Ting
Department of Sociology
The Chinese University of Hong Kong
Hong Kong SAR, China

Tel: +852 2609 6626, Fax: +852 2603 5213, E-mail: [email protected]

Abstract: As an institutional response to the demand for public accountability, course evaluation has recently been adopted as a routine exercise among universities in Hong Kong. Having the appeal of being objective and precise, student ratings are treated as the yardstick for measuring and monitoring teaching quality. Most teachers and administrators focus primarily on the outcomes of the ratings without paying much attention to the measurement issues. Because course evaluation is a rather new practice in Hong Kong, it is not known whether students’ rating behavior follows patterns in the West. Using data collected from 11,190 students for 242 sociology classes between 1994 and 1998 at the Chinese University of Hong Kong, this study examines the reliability of student ratings across time and the cross-sectional agreement of student ratings within and between classes. This study further employs the construct-validation approach to examine the meanings of three overall course ratings: the overall satisfaction with lecturing performance, the overall satisfaction with course design, and students’ self-rated efforts in studying. Overall, Hong Kong students can be considered as reliable raters; findings indicate a moderate to high degree of stability in student ratings between classes across time and between students within the same classes. By and large, student ratings are valid measures of teaching quality although ratings on the satisfaction with course design are biased by their perceptions of teachers’ attributes. The three overall ratings do correlate considerably with elements from related domains in the expected directions. This evidence should help to clear up some of the skepticism that students’ opinions cannot be trusted and have little value for guiding teaching quality.

1. Introduction

The teacher-student relationship in Hong Kong has undergone fundamental changes during the past few decades as the traditional Chinese culture has gradually given way to Western values, ideas, and practices. In the past, Confucianism bestowed paternalistic authority and unconditional respect on teachers, which means students were not expected to criticize their teachers. But the rise of democratic institutions and the propagation of consumerism today have presented a serious challenge to such a relationship, especially in the higher education sector. The de-colonization process has further accelerated this change. Fearing a massive emigration of middle-class professionals, the colonial government took drastic steps to expand the higher education sector during the early 1990s. With the huge increase of student enrolment at the two existing universities and the addition of five more centrally funded universities in this short period, universities no longer aim at serving a minority of elites; rather, they assume a mission of providing quality education for the masses and are held accountable to them.

In response to the growing demand for public accountability, universities in Hong Kong have begun to adopt course evaluation as a routine exercise to monitor teaching quality. This is a revolutionary change in the traditional teacher-student relationship in Chinese society. Measuring teaching quality based on students’ opinions was previously unthinkable because, according to Confucian teaching, students should respect their teachers as they respect their own fathers. Moreover, the traditional criteria for assessing teachers are different from those which now apply. Rather than emphasizing teaching technique, classroom management, and course design, Confucian thought stresses the role of teachers’ moral character in achieving educational objectives. In light of this change, teachers in higher education institutions have to reconsider their relationships with students and pay closer attention to previously downplayed issues, such as presentation styles, curriculum designs and strategies, and grading schemes.

There are many potential benefits of course evaluation. First, course ratings provide feedback on a non-face-to-face basis where students can express their opinions frankly without the fear of sanctions. Second, from a service provider’s point of view, student ratings help educators to understand how the patrons define education quality. Third, student ratings can assist teachers to identify areas of improvement and encourage them to raise the standard of teaching quality. Fourth, student ratings provide a more objective and a broader base of assessment than evaluations conducted by the department chairperson and administrators. Fifth, course evaluation is less intrusive than other forms of classroom appraisal, such as peer ratings. These benefits, however, can only be realized if students are competent judges of teaching quality.

Unlike in the West, where the reliability and validity of student ratings have been well established, teachers in Hong Kong are uncertain how students rate their performance and what the ratings actually mean. As a result, teachers are generally skeptical of the new practice, especially when the ratings have implications for tenure and promotion decisions, as is the case in the West (Dowd, 1988; Hamilton, 1980). Often, student ratings are perceived as a threat to faculty, causing much resistance to this practice. These fears are not without grounds; some studies suggest that students may consider entertainment and generosity of grades, consciously or unconsciously, in the ratings of their learning experience (Hamilton, 1980; Ware & Williams, 1979, 1980). Without a clear understanding of what the ratings mean, teachers do not have much confidence in students’ judgement of their performance, let alone use it as a yardstick for improving education quality.

Course evaluation is a new practice in Hong Kong. It is not known whether students’ rating behavior is comparable to patterns in the West. Using data collected from 242 sociology classes between 1994 and 1998 at the Chinese University of Hong Kong, this study examines how students react to the newly-implemented course evaluation system, with a focus on the reliability and validity of student ratings as measures of teaching quality. As one of the first available data sources for analysts to explore the underlying meanings of student ratings in Hong Kong, results of this study can help teachers to assess the relevancy of the new evaluation system to their teaching practices.

2. Reliability and Validity of Course Ratings

All empirical measures, including student ratings, have measurement errors. For example, irritating noise due to the renovation next door during the evaluation exercise may affect the students’ rating of the class. Reliability refers to the part of a measure that is free from random measurement errors. If students’ ratings are indeed dominated by random errors, they are poor measures because ratings will tend to be unstable and inconsistent. Therefore, it is essential to determine whether student ratings are reliable before they can be used as a guide to enhance teaching quality. Reliability can be determined by the consistency of the measure. Typically, researchers correlate between different measures intended to gauge the same phenomenon or between raters on the same item to assess the reliability of a measurement.

Reliability of a measure is partially determined by research design. Many studies use factor analysis to isolate measurement error in order to obtain a reliable measure. A factor is constructed by extracting one common aspect of effective teaching from several indicators, and its consistency depends on the quality of the questionnaire. Previous experience suggests that the class average of a carefully constructed factor, such as those from the SEEQ (Students’ Evaluations of Educational Quality), can attain an acceptable reliability of 0.74 for classes with as few as ten students, and the reliability increases with larger class sizes (Marsh, 1987). This implies that using multiple indicators to capture a factor, i.e. the same aspect of a phenomenon, is preferable (Asher, 1997), and that class averages are more trustworthy for larger classes than smaller ones.
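The link between class size and the reliability of a class average can be illustrated with the Spearman-Brown formula. This is only a sketch under an assumption: the text does not say which formula underlies the 0.74 figure, and the function names below are hypothetical. The code backs out the single-rater reliability implied by 0.74 for ten students and projects it to larger classes.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Reliability of the mean of n_raters ratings (Spearman-Brown formula)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def implied_single_rater(reliability_of_mean: float, n_raters: int) -> float:
    """Invert Spearman-Brown: single-rater reliability implied by a class average."""
    big_r = reliability_of_mean
    return big_r / (n_raters - (n_raters - 1) * big_r)


# Figure quoted in the text: 0.74 for a class of ten students.
# Treating it as a Spearman-Brown value is an assumption made for illustration.
single = implied_single_rater(0.74, 10)

# Projected reliability of the class average for several class sizes.
projected = {size: spearman_brown(single, size) for size in (10, 25, 50, 100)}
for size, value in projected.items():
    print(size, round(value, 3))
```

Under this assumption the projected reliability rises steadily with class size, which matches the point that class averages from larger classes are more trustworthy.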

Research design aside, studies have demonstrated that students are consistent raters (Costin, Greenough & Menges, 1971; Marsh, 1984, 1987). For example, there is a strong agreement on the ratings of the same class between current and former students in cross-sectional studies (Centra, 1979; Marsh, 1977) and between two time points several years apart for the same raters in longitudinal studies (Marsh & Overall, 1979; Overall & Marsh, 1980). Moreover, the reliability of class-average measures holds for classes taught by the same teachers but not for classes identified by the same course code. This indicates that student ratings can be generalized for the same teachers across different course settings (Marsh, 1982). Overall, evidence suggests that students can be a reliable source of information on teaching quality.

It is quite possible to have a highly reliable but invalid measure because reliability only concerns the


validity of student ratings, as teaching quality is an elusive concept. The education community has yet to reach a consensus on what criteria constitute quality teaching, not to mention how they can be measured (Dowd, 1988; Hildebrand & Wilson, 1970; Mitzel, 1960; McKeachie, 1969). Researchers nowadays tend to view teaching quality as a multidimensional concept, which means educators themselves have to judge whether responses from students are relevant to their education objectives.

Critics charge that student ratings are biased measures of teaching quality. They claim that students favor an entertaining performer rather than an effective educator; thus, teachers’ personality traits outweigh the importance of efforts devoted to class management (Sherman & Blackburn, 1975). Students often equate expressiveness with good teaching, as vocal skills and expressive movement help to enhance ratings (Murray & Lawrence, 1980). Experiments also demonstrate that enthusiastic, expressive teachers tend to elicit favorable evaluations regardless of lecture content, even though the latter is far more relevant to students’ achievement – a phenomenon widely known as the Dr. Fox effect (Ware & Williams, 1979, 1980). Research findings also reveal that course ratings are biased by students’ satisfaction with their grades (Hamilton, 1980).

In light of these criticisms, it is important to assess the extent to which student ratings are biased by factors unrelated to teaching excellence. Most researchers use the construct-validation approach, where student ratings are related to other measures that are assumed to be indicative of effective teaching. If the two measures agree with each other, there is evidence of measurement validity (Marsh & Overall, 1980). Student ratings have been validated against such criteria as the affective consequences, cognitive achievement, and self-ratings of the faculty. Findings confirm that the best-rated classes tend to secure the most desirable outcomes. Other studies also come to the conclusion that the sources of biases as charged by critics do little harm to the overall validity of student ratings (Marsh & Overall, 1980; Costin, Greenough & Menges, 1971; Hildebrand, Wilson & Dienst, 1971; Marsh, 1980a, 1980b; McKeachie, 1973, 1979).

Although course evaluation has been a well-established practice in North America, few studies test whether it can be applied to other countries with the same degree of reliability and validity. Based on a comparison among students from Australia, Spain, and Papua New Guinea, Marsh (1987) concludes that results from the North American studies can be generalized to a wide variety of educational contexts. Watkins (1994) reaches a similar conclusion with samples from India, Nepal, Nigeria, the Philippines, New Zealand, and Hong Kong. Marsh et al. (1997) also test the SEEQ instrument with a Hong Kong sample with favorable results. Although studies attempting to replicate these results are limited, and findings in Watkins’s study are based on small samples (e.g. 87 respondents in the Hong Kong sample), these preliminary findings support the use of student ratings as a means to measure teaching quality in different cultural contexts. With a specific focus on Hong Kong’s higher education, this study uses a large sample to assess the value of student ratings in monitoring teaching quality.

3. Data and Method

The data for this study come from 242 classes, which involve 11,190 students, 25 teachers, and 44 sociology undergraduate courses, between 1994 and 1998 at the Chinese University of Hong Kong. Course evaluations are routinely administered at the end of each semester, and the results are released to all teachers and students. The course evaluation form has three major sections. The section on course design has four questions on the scope, content, difficulty, and workload of the course; the section on lecturing skills asks five questions on lecture organization, clarity, intellectual stimulation, lecturing pace, and teaching attitude; the last section includes three overall evaluation items: students’ overall satisfaction with the course design, overall satisfaction with the teacher, and their own effort devoted to studying.

In addition to students’ subjective opinions obtained from the evaluation form, this study also incorporates information on students’ backgrounds, teachers’ attributes, class characteristics, and course types in the analysis. The course evaluation form provides information on the student’s major and year of enrollment. The teacher-level variables include teachers’ rank, administrative experience, research publications, and leniency in grading. The class-level variables are class size, heterogeneity in student composition, and students’ attendance rate. The course-level attributes include the level of difficulty and course types classified in terms of elective, required, and general education courses.


This study uses two types of measures to assess the reliability of students’ ratings. The inter-class reliability measures compare the agreements of average ratings between classes. In Table 1, seventeen teachers are ranked annually according to their average ratings on the three overall evaluation items over a four-year period. The stability of rankings across time reflects the reliability of students’ judgement of their teachers. In Table 2, ratings on all twelve items on the evaluation form in one year are correlated with the same items in the following year for the same teachers. If students’ judgement is reliable, the ratings should be consistent between two adjacent years. Table 3 compares the teacher and course effects on the consistency of ratings between classes arranged in pairs. Items rated in one class are then correlated with the same items from another class, and these correlations are compared by whether the same teacher teaches them and whether they belong to the same course. The intra-class reliability measure compares the agreement between two randomly assigned groups of students within the same class. If students are capable of judging teaching quality, there should be a high correlation between the two groups on all evaluation items in Table 3.
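The intra-class measure described above can be sketched as a split-half computation: each class is divided at random into two groups, the group means are computed, and the two sets of means are correlated across classes. The data below are synthetic and the function names are hypothetical; this illustrates the idea and is not the procedure or data actually used in the study.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def split_half_correlation(classes, rng):
    """Randomly split each class in two and correlate the half-class means."""
    first_means, second_means = [], []
    for ratings in classes:  # each class needs at least two ratings
        shuffled = list(ratings)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first_means.append(sum(shuffled[:half]) / half)
        second_means.append(sum(shuffled[half:]) / (len(shuffled) - half))
    return pearson(first_means, second_means)


# Synthetic example (assumption): 50 classes of 40 students, where each rating
# is a class-level quality plus individual noise.
rng = random.Random(0)
classes = []
for _ in range(50):
    class_quality = rng.uniform(2.0, 5.0)
    classes.append([class_quality + rng.gauss(0.0, 0.5) for _ in range(40)])

r = split_half_correlation(classes, rng)
print(round(r, 3))
```

When students within a class largely agree, the two half-class means track each other and the correlation is high; if ratings were dominated by random error, it would fall toward zero.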

Using the construct-validation approach, this study examines the meanings of the three overall rating items of teaching quality, namely, students’ overall satisfaction with the lecturing performance, overall satisfaction with the course design, and effort devoted to studying. These three measures are correlated with 21 items pertaining to teaching quality in Table 4. Nine of the items are students’ opinions on various aspects of course design and lecturing skills, and they should correlate closely with students’ satisfaction on the same domain. Other items that are influential to students’ learning include students’ background, teachers’ attributes, class characteristics, and course types. In particular, teachers’ attributes are expected to correlate more closely with the satisfaction of lecturing performance, whereas class characteristics and course types should have a stronger association with the satisfaction of course design. Effective course designs and excellent lecturing skills together stimulate students’ interests in learning; therefore, items relevant to these two domains should correlate with students’ rating of their studying effort.

4. Results

A primary concern for educators is whether students can consistently discriminate between effective and ineffective teachers, between good and bad course designs, and between stimulating and boring courses. In Table 1, students consistently identify A, B, and C as effective teachers and N, O, and P as ineffective teachers during the four-year period. In terms of course designs, students repeatedly rate teachers A, B, and E as having good course designs and teachers N, O, and P as having poor course designs. As for effort in studying, teachers B, G, and M are regularly rated the best in motivating students to spend more effort in studying, whereas teachers I, O, and P fail to do so during the whole period. Clearly, students are capable of making reliable rankings among teachers at the top and bottom on these three measures. However, the reliability of rankings is far from perfect for those ranked in the middle.

Table 2 further examines the stability of student ratings on all items between two consecutive years for the same teacher. Except for a few items in the 95/96 to 96/97 period, all items are correlated above the 0.5 level and some of them almost reach the 0.9 level. Obviously, some items receive more consistent ratings than others. In terms of course design, difficulty and workload are more reliable measures than the scope and content of the course. The reliability for items on lecturing skills is very similar across time, with intellectual stimulation and teaching attitude being slightly more stable than organization, clarity, and pace. The reliability of the three overall ratings is fairly stable and similar to each other. Overall, the reliability of student ratings varies from moderate to high across time.


          Lecturing performance       Course design               Effort in studying
Teacher   94/95  95/96  96/97  97/98  94/95  95/96  96/97  97/98  94/95  95/96  96/97  97/98
A             1      1      1      1      1      1      1      2      6      5      3      4
B             2      2      5      4      2      2      4      5      1      1      2      6
C             3      7      6      7      9     10      5     11      7      6      6      8
D             4      9     11     10      6      9     14      9      4     15     13     11
E             5      8      3      6      3      8      3      3      8      7      7      7
F             6     12      7      5      4     11      7      7      5     12      9      5
G             7      3      4      3      8      3      6      4      2      2      1      3
H             8      6     10      2      7      6      9      1     12      8     10      1
I             9     14     14     15     11     14     12     14     15     16     11     14
J            10     13      8      9     15     12      8      6     17     13      8      9
K            11     10      9     11     10     13     13     12      9     10     15     16
L            12     11     13     12     12      7     11      8     13      9     12     12
M            13      5      2     13      5      5      2     10      3      3      5      2
N            14     15     15      8     17     17     16     13     14      4      4     10
O            15     17     16     16     16     16     15     16     10     14     16     17
P            16     16     12     14     14     15     10     15     16     17     14     13
Q            17      4     17     17     13      4     17     17     11     11     17     15

Table 1: Ranking of teachers by overall ratings, 1994-98
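One way to summarize the stability of the rankings in Table 1 is a Spearman rank correlation between adjacent years. The paper does not report this statistic, so the sketch below is an illustrative addition: it uses the lecturing-performance ranks from Table 1 and the textbook formula for untied ranks, and the variable and function names are hypothetical.

```python
# Lecturing-performance ranks from Table 1, ordered 94/95, 95/96, 96/97, 97/98.
lecturing_ranks = {
    "A": (1, 1, 1, 1),     "B": (2, 2, 5, 4),     "C": (3, 7, 6, 7),
    "D": (4, 9, 11, 10),   "E": (5, 8, 3, 6),     "F": (6, 12, 7, 5),
    "G": (7, 3, 4, 3),     "H": (8, 6, 10, 2),    "I": (9, 14, 14, 15),
    "J": (10, 13, 8, 9),   "K": (11, 10, 9, 11),  "L": (12, 11, 13, 12),
    "M": (13, 5, 2, 13),   "N": (14, 15, 15, 8),  "O": (15, 17, 16, 16),
    "P": (16, 16, 12, 14), "Q": (17, 4, 17, 17),
}
years = ("94/95", "95/96", "96/97", "97/98")


def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation; this formula is exact when there are no ties."""
    n = len(rank_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Correlate each year's ranking with the following year's ranking.
rhos = []
for i in range(len(years) - 1):
    this_year = [ranks[i] for ranks in lecturing_ranks.values()]
    next_year = [ranks[i + 1] for ranks in lecturing_ranks.values()]
    rhos.append(spearman_rho(this_year, next_year))
    print(years[i], "to", years[i + 1], round(rhos[-1], 3))
```

A value near 1 indicates that teachers keep roughly the same rank from one year to the next; large jumps by a single teacher (such as Q between 94/95 and 95/96) pull the coefficient down.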

Evaluation items             94/95 - 95/96   95/96 - 96/97   96/97 - 97/98       All years
Course designs
  Narrow scope                       0.767           0.768           0.611           0.678
  Inappropriate content              0.652           0.396           0.543           0.526
  Low difficulty                     0.904           0.809           0.790           0.820
  Light workload                     0.594           0.884           0.793           0.755
Lecturing skills
  Organization                       0.604           0.450           0.785           0.584
  Clarity                            0.550           0.334           0.866           0.578
  Intellectual stimulation           0.679           0.508           0.862           0.689
  Fast pace                          0.511           0.659           0.766           0.609
  Serious attitude                   0.687           0.479           0.799           0.641
Overall ratings
  Lecturing performance              0.582           0.277           0.795           0.562
  Course design                      0.792           0.466           0.691           0.650
  Effort in studying                 0.702           0.764           0.655           0.605
Average correlation                  0.669           0.566           0.746           0.641

Table 2: Correlation coefficients of average ratings of teachers between two consecutive years
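The summary row of Table 2 and the remark about items falling below the 0.5 level can be checked directly from the published coefficients. The short sketch below recomputes the column averages and lists the items below 0.5 in each period; the dictionary layout and names are hypothetical, and only the numbers come from Table 2.

```python
periods = ("94/95 - 95/96", "95/96 - 96/97", "96/97 - 97/98", "All years")

# Coefficients copied from Table 2, one tuple per evaluation item.
table2 = {
    "Narrow scope":             (0.767, 0.768, 0.611, 0.678),
    "Inappropriate content":    (0.652, 0.396, 0.543, 0.526),
    "Low difficulty":           (0.904, 0.809, 0.790, 0.820),
    "Light workload":           (0.594, 0.884, 0.793, 0.755),
    "Organization":             (0.604, 0.450, 0.785, 0.584),
    "Clarity":                  (0.550, 0.334, 0.866, 0.578),
    "Intellectual stimulation": (0.679, 0.508, 0.862, 0.689),
    "Fast pace":                (0.511, 0.659, 0.766, 0.609),
    "Serious attitude":         (0.687, 0.479, 0.799, 0.641),
    "Lecturing performance":    (0.582, 0.277, 0.795, 0.562),
    "Course design":            (0.792, 0.466, 0.691, 0.650),
    "Effort in studying":       (0.702, 0.764, 0.655, 0.605),
}

# Average correlation per column, rounded as in the table.
column_means = [
    round(sum(row[j] for row in table2.values()) / len(table2), 3)
    for j in range(len(periods))
]

# Items whose correlation falls below 0.5 in each period.
below_half = {
    period: [item for item, row in table2.items() if row[j] < 0.5]
    for j, period in enumerate(periods)
}

for period, mean in zip(periods, column_means):
    print(period, mean, below_half[period])
```

The recomputed averages agree with the table's last row, and every coefficient below 0.5 falls in the 95/96 to 96/97 period.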

Tables 1 and 2 may underestimate the reliability of student ratings as teachers may adjust their teaching strategies over time. The intra-class correlation coefficients in Table 3 examine the rating agreement between two random groups of students within the same class. The correlation coefficients of all items range from 0.515 to 0.836, suggesting a moderate-to-high degree of consistency between raters within the same class. In the same table, the between-class correlation coefficients compare the consistency of ratings between teachers and courses. Findings show that student ratings are consistent among classes taught by the same teachers, with an average correlation above 0.5 for the same courses and above 0.2 for different courses. When different teachers teach the classes, the average between-class correlation coefficients are negligible whether or not the courses are the same. Student ratings, as suggested by these findings, are directed toward their teachers rather than the courses.

                                        Between-class correlation
Evaluation items           Intra-class   Same teacher,     Same teacher,  Different teacher,  Different teacher,
                           correlation     same course  different course         same course    different course
Course design
  Narrow scope                   0.576           0.498             0.143               0.022              -0.026
  Inappropriate content          0.515           0.333             0.220               0.087              -0.020
  Low difficulty                 0.717           0.705             0.397              -0.016              -0.029
  Light workload                 0.836           0.658             0.130               0.132              -0.040
Lecturing skills
  Organization                   0.767           0.454             0.313              -0.033              -0.019
  Clarity                        0.794           0.499             0.292              -0.050              -0.019
  Intellectual stimulation       0.786           0.580             0.303              -0.083              -0.014
  Fast pace                      0.753           0.522             0.328              -0.062              -0.025
  Attitude                       0.731           0.490             0.364              -0.083              -0.021
Overall ratings
  Lecturing performance          0.774           0.506             0.235              -0.027              -0.023
  Course design                  0.650           0.547             0.103               0.051              -0.021
  Effort in studying             0.730           0.714            -0.042               0.180              -0.043
Average correlation              0.719           0.542             0.232               0.010              -0.025

Table 3: Intra-class and between-class correlation coefficients

To ascertain the validity of student ratings, the three overall ratings are correlated with the 21 items pertaining to quality education and background characteristics in Table 4. Items associated with lecturing skills, including organization, clarity, intellectual stimulation, and teaching attitudes, as rated by students, exhibit the strongest association with the overall satisfaction with lecturing performance. In addition, course content, an item from course design, also has a substantial correlation with this overall rating. These results support the satisfaction rating as a valid measure of lecturing performance. Surprisingly, teachers’ attributes, including administrative experience, research activities, rank, and leniency in grading, have little effect on students’ perception of teachers’ lecturing performance.

The satisfaction with course design, as a valid measure, should be strongly related to ratings of its components, but findings reveal only a moderate relationship. Students’ satisfaction with the course design has a moderate correlation with the scope and content of course materials, and a weak association with course difficulty and workload. In contrast, elements of lecturing skills, such as organization, clarity, intellectual stimulation, and teaching attitude, have a much stronger association with the satisfaction rating on course design. This indicates that student ratings, like those in the United States, are more teacher-oriented even when the evaluation is not directed toward the person.

Students’ effort in studying provides an alternative to the satisfaction measures and gives educators another angle to look at quality education. Besides excellent course design and skillful teaching, students’ involvement with their studies should also depend on students’ backgrounds and the learning environment. This is confirmed by the results. The scope and workload are the two components of course design that have a substantial correlation with students’ effort in studying. Intellectual stimulation is the item in lecturing skills most strongly associated with students’ effort. In terms of students’ backgrounds, the percentage of freshmen and the percentage of students in the same major as the course correlate with the effort measure. As for the learning environment, class size, course difficulty, and general education courses are associated with students’ effort.

Variables                     Lecturing performance  Course design  Effort in studying
Course design
  Narrow scope                                0.294          0.441               0.412
  Inappropriate content                      -0.652         -0.580              -0.210
  Low difficulty                              0.023          0.167              -0.096
  Light workload                             -0.185         -0.152              -0.592
Lecturing skills
  Organization                                0.805          0.681               0.288
  Clarity                                     0.895          0.749               0.329
  Intellectual stimulation                    0.788          0.712               0.460
  Fast pace                                   0.257          0.193               0.230
  Serious attitude                            0.795          0.634               0.314
Student backgrounds
  % freshman                                 -0.156         -0.198              -0.481
  % major in same subject                     0.197          0.276               0.620
Teacher attributes
  Administrator                              -0.070         -0.092              -0.082
  Publications                               -0.013         -0.036              -0.079
  Senior rank                                 0.022         -0.036              -0.039
  Lenient grading                            -0.178         -0.177              -0.058
Class characteristics
  Class size                                 -0.124         -0.196              -0.506
  Heterogeneity                              -0.039         -0.058              -0.147
  Attendance rate                             0.158          0.226               0.298
Course type
  Level of difficulty                         0.107          0.194               0.489
  General education course                   -0.133         -0.239              -0.505
  Required course                            -0.069         -0.049               0.078

Table 4: Correlation coefficients between overall course ratings and related items
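The observation that satisfaction with course design tracks lecturing skills more closely than its own components can be made concrete with a simple summary of Table 4. The sketch below averages the absolute correlations by domain for the course-design satisfaction column. The mean absolute correlation is an illustrative summary chosen here, not a statistic used in the paper, and the names are hypothetical.

```python
# Correlations with overall satisfaction with course design, copied from Table 4.
course_design_satisfaction = {
    "course design items": {
        "Narrow scope": 0.441,
        "Inappropriate content": -0.580,
        "Low difficulty": 0.167,
        "Light workload": -0.152,
    },
    "lecturing skill items": {
        "Organization": 0.681,
        "Clarity": 0.749,
        "Intellectual stimulation": 0.712,
        "Fast pace": 0.193,
        "Serious attitude": 0.634,
    },
}


def mean_abs(values):
    """Mean absolute value, so negative and positive associations count alike."""
    values = list(values)
    return sum(abs(v) for v in values) / len(values)


summary = {
    domain: round(mean_abs(items.values()), 3)
    for domain, items in course_design_satisfaction.items()
}
for domain, value in summary.items():
    print(domain, value)
```

On this summary the lecturing-skill items show the larger average association with course-design satisfaction, consistent with the teacher-oriented pattern described in the text.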

5. Summary and Discussion

Course evaluation has recently been adopted as a formal routine to monitor teaching quality in Hong Kong’s higher education with the purpose of assisting teachers to enhance teaching strategies based on students’ evaluation of their performance. The underlying assumption is that students are capable judges in evaluating the quality of their learning conditions and that their opinions will provide valuable hints for a better quality in teaching. Many teachers, however, are skeptical of the new system, for there is no proof that this assumption is valid. Indeed, without a clear understanding of what the student ratings actually mean, there is a danger of misunderstanding their implications. To resolve this controversy, this study uses a large data set collected recently to assess the usefulness of student ratings in Hong Kong.

Reliability and validity are a matter of degree. Good measures of teaching quality can also be contaminated by measurement errors and biased by factors irrelevant to the objectives of the measure. Overall, Hong Kong students can be considered as reliable raters; findings indicate a moderate-to-high degree of stability in student ratings between classes across time and between students within the same classes. By and large, student ratings are valid measures of teaching quality although ratings on the satisfaction with course design are biased by their perceptions of teachers’ attributes. The three overall ratings do correlate considerably with elements from related domains in the expected directions. In short, the results are consistent with those found in the United States where course evaluation has a long history. This evidence should help to clear up the common skepticism that students’ opinions cannot be trusted and have little value for guiding teaching quality.

Although the findings affirm the utility of student ratings, they also show that the ratings are far from perfect compared with the results reported by Marsh (1982). There is plenty of room for improving the measurement of teaching quality based on students’ opinions. The measurement instrument for the present study, like many others, was designed by an ad hoc committee with little expertise in educational measurement. Thus, the quality of the measurement varies between departments, inviting skepticism about the usefulness of the results. Indeed, there is an urgent need for a better-tested instrument like the SEEQ (Marsh, 1987) and the Endeavor questionnaires (Frey, Leonard & Beatty, 1975), which have been developed in the West to capture different components of teaching effectiveness. It is important to bear in mind that course evaluation can serve its purpose only when teachers trust what it has measured. A Chinese version of the SEEQ has recently been tested in Hong Kong, and, with some modification, it has the potential for local adoption (Marsh et al., 1997).

Measurement issues aside, teaching quality is an elusive concept that has diverse meanings. Even if a single set of criteria for “quality teaching” can be defined and measured unambiguously, it may not be a desirable thing to do. One must ask who decides these qualities. Once these qualities are accepted as universal, they will be imposed indiscriminately on all teachers and dictate how classes should be conducted and managed. As no two students learn in exactly the same way, a single teaching model deprives students of a diversified learning environment. Similarly, no single teaching method fits the objectives of all subject matter. For these reasons, student ratings, or any other evaluation methods, should not be treated as the sole means of measuring teaching quality.

6. References

[Asher 1997] Asher, J.W. (1997), “The role of measurement, some statistics, and some factor analysis in family psychology research,” Journal of Family Psychology, 11(3), pp. 351-360.

[Centra 1979] Centra, J.A. (1979), Determining Faculty Effectiveness. San Francisco: Jossey-Bass.

[Costin, Greenough & Menges 1971] Costin, F., Greenough, W.T., & Menges, R.J. (1971), “Student ratings of college teaching: Reliability, validity, and usefulness,” Review of Educational Research, 41(5), pp. 511-535.

[Dowd 1988] Dowd, J. (1988), “Sociology is different: The misevaluation of teaching effectiveness,” Sociological Inquiry, 58(4), pp. 393-412.

[Frey, Leonard & Beatty 1975] Frey, P.W., Leonard, D.W. & Beatty, W.W. (1975), “Student ratings of instruction: Validation research,” American Educational Research Journal, 12(4), pp. 327-336.

[Hamilton 1980] Hamilton, L.C. (1980), “Grades, class size, and faculty status predict teaching evaluations,” Teaching Sociology, 8(1), pp. 47-62.

[Hildebrand & Wilson 1970] Hildebrand, M. & Wilson, R.C. (1970), Effective University Teaching and Its Evaluation, Berkeley: University of California, Center for Research and Development in Higher Education.

[Hildebrand, Wilson & Dienst 1971] Hildebrand, M., Wilson, R.C., & Dienst, E.R. (1971), Evaluating University Teaching, Berkeley: University of California, Center for Research and Development in Higher Education.

[Marsh 1977] Marsh, H.W. (1977), “The validity of student’s evaluations: Classroom evaluations of instructors independently nominated as best and worst instructors by graduating seniors,” American Educational Research Journal, 14(4), pp. 441-447.

[Marsh 1980a] Marsh, H.W. (1980a), “The influence of student, course, and instructor characteristics in evaluations of university teaching,” American Educational Research Journal, 17(2), pp. 219-237.

[Marsh 1980b] Marsh, H.W. (1980b), “Research on students’ evaluations of teaching effectiveness,” Instructional Evaluation, 4, pp. 5-13.

[Marsh 1982] Marsh, H.W. (1982), “The use of path analysis to estimate teacher and course effects in student ratings of instructional effectiveness,” Applied Psychological Measurement, 6(1), pp. 47-59.


[Marsh et al. 1997] Marsh, H.W., Hau, K.T., Chung, C.M. & Siu, T. (1997), “Students’ evaluation of university teaching: Chinese version of the Students’ Evaluations of Educational Quality instrument,” Journal of Educational Psychology, 89(3), pp. 568-572.

[Marsh & Overall 1979] Marsh, H.W. & Overall, J.U. (1979), “Long-term stability of students’ evaluations: A note on Feldman’s consistency and variability among college students in rating their instructors and course,” Research in Higher Education, 10(2), pp. 139-147.

[Marsh & Overall 1980] Marsh, H.W. & Overall, J.U. (1980), “Validity of students’ evaluations of teaching effectiveness: Cognitive and affective criteria,” Journal of Educational Psychology, 72(4), pp. 468-475.

[McKeachie 1969] McKeachie, W.J. (1969), Teaching Tips – A Guidebook for the Beginning College Teacher (6th ed.), Lexington, Mass.: Heath & Company.

[McKeachie 1973] McKeachie, W.J. (1973), “Correlates of student ratings,” in A.L. Sockloff (Ed.), Proceedings: The First Invitational Conference on Faculty Effectiveness as Evaluated by Students. Philadelphia: Temple University, Measurement and Research Center.

[McKeachie 1979] McKeachie, W.J. (1979), “Student ratings of faculty: A reprise,” Academe, 65(6), pp. 384-397.

[Mitzel 1960] Mitzel, H.E. (1960), “Teacher effectiveness,” in C.W. Harris (Ed.), Encyclopedia of Educational Research. New York: Macmillan.

[Murray & Lawrence 1980] Murray, H. G. & Lawrence, C. (1980), “Speech and drama training for lecturers as a means of improving university teaching,” Research in Higher Education, 13(2), pp. 73-90.

[Overall & Marsh 1980] Overall, J.U. & Marsh, H.W. (1980), “Students’ evaluations of instruction: A longitudinal study of their stability,” Journal of Educational Psychology, 72(3), pp. 321-325.

[Sherman & Blackburn 1975] Sherman, B.R. & Blackburn, R.T. (1975), “Personal characteristics and teaching effectiveness of college faculty,” Journal of Educational Psychology, 67(1), pp. 124-131.

[Ware & Williams 1979] Ware, J.E. & Williams, R.G. (1979), “Seeing through the Dr. Fox effect: A response to Frey,” Instructional Evaluation, 3, pp. 6-10.

[Ware & Williams 1980] Ware, J.E. & Williams, R.G. (1980), “A reanalysis of the Doctor Fox experiments,” Instructional Evaluation, 4, pp. 15-18.

[Watkins 1994] Watkins, D. (1994), “Student evaluations of university teaching: A cross-cultural perspective,” Research in Higher Education, 35(2), pp. 251-266.

TEHE Ref.: R51/t7a3

Publication Details:

Ting, K.F. (1998). Measuring teaching quality in Hong Kong’s higher education: Reliability and validity of student ratings. In J. James (Ed.) Quality in Teaching and Learning in Higher Education: A collection of refereed papers from the first conference on Quality in Teaching and Learning in Higher Education (pp. 46-54). Hong Kong Polytechnic University, Educational Development Centre.