2.10 Vocabulary testing
2.10.4 Multiple-choice tests
This type of test is one of the most frequently used assessment formats in educational testing. Each item offers a small number of options, usually four, and respondents must choose the one option that fills a gap in a sentence, answers a question, defines a word, and so on.
There are several advantages to this type of testing. Answering and marking a multiple-choice test takes less time than most other test types, with the exception of yes/no tests. With the development of machines that can scan and process answer sheets, marking can be automated and completed quickly, and the results can be available instantly if the test is computerized. Lastly, marking is objective: even when done by hand, it usually involves only checking the answers against an answer key, so almost anyone can mark the test. Two examples of this type of vocabulary test are presented in the following sections.
2.10.4.1 Vocabulary Levels Test (VLT)
The Vocabulary Levels Test (Schmitt, Schmitt & Clapham, 2001) is perhaps the most widely known and used vocabulary size test (Read, 2007; Webb & Sasao, 2013). The Vocabulary Levels Test (VLT) was first designed by Nation (1983, 1990) and has since gone through several improvements (Beglar & Hunt, 1999; Schmitt et al., 2001). Although the test was initially designed to help teachers develop suitable vocabulary learning materials, once published it became widely used in English-speaking countries to test the vocabulary size of international students and migrants at a range of vocabulary frequency levels (Xing & Fulcher, 2007).
The VLT is designed to produce a vocabulary profile of learners at five levels (i.e., the 2,000-word level, 3,000-word level, 5,000-word level, Academic Word Level and 10,000-word level). The word frequency levels are based on Thorndike and Lorge’s word list (1944), the General Service List (West, 1953) and Kučera and Francis’s word list (1967).
The test is a type of multiple-choice test and requires respondents to choose the meaning of a written English word from a list of possible answers, also in English. In its latest version (Schmitt et al., 2001), the VLT contains 30 items at each frequency level and comes in three versions (B, C and D). Unlike traditional multiple-choice tests, the VLT minimizes guessing by presenting the items in clusters of six words and three definitions. In addition, the words in each cluster are always listed in alphabetical order, as shown in the example below:
1. business
2. clock       _____ part of a house.
3. horse       _____ animal with four legs.
4. pencil      _____ something used for writing.
5. shoe
6. wall
The VLT has been shown to be a valid and reliable test (Beglar & Hunt, 1999; Schmitt et al., 2001). However, it might not be an adequate vocabulary size test for this study because of the following limitations. First, the frequency lists used to create the VLT are very old, and there might be “variation between the occurrence of words 50-70 years ago and today” (Webb & Sasao, 2013, p. 265). For example, the 2,000-word frequency level was created using the General Service List (West, 1953), whereas the 3,000-, 5,000-, and 10,000-word frequency levels were created using Thorndike and Lorge (1944) and Kučera and Francis (1967). Second, each question in the VLT consists of three test items and six possible choices. Thus, it is possible that “the learner’s knowledge of some of the items is likely to have an impact on the ability to work out the answers to other items where these are not known” (Milton, 2009, p. 75). Lastly, the VLT “is not really designed to provide an estimate of a person’s overall vocabulary size [...] the test is better used to supply a profile of learner’s vocabulary, which is particularly useful for placement and diagnostic purposes” (Schmitt, 2010, p. 198). Furthermore, if the test is used to give a total vocabulary size, the estimate rests on only four word frequency levels (i.e., the 2nd 1,000, 3rd 1,000, 5th 1,000 and 10th 1,000 frequency levels) together with the Academic Word Level. This means that the test uses knowledge of the 2nd 1,000-word frequency level to estimate knowledge of the 1st 1,000-word frequency level, and knowledge of the 5th 1,000-word frequency level to estimate knowledge of the 4th 1,000-word frequency level.
Al Fotais (2012) reported that the vocabulary size of Saudi EFL learners was 1,447 words across the 1,000- to 3,000-word frequency levels, and that around 48% of the known words came from the 1,000-word frequency level. The participants in this study (i.e., Saudi first-year English major students) are expected to have a similar vocabulary size. Therefore, the VLT might not be suitable for measuring the vocabulary size of the participants in this study, as it does not directly examine vocabulary knowledge at the 1,000-word frequency level.
2.10.4.2 Vocabulary Size Test (VST)
The Vocabulary Size Test (Nation & Beglar, 2007) is a written four-option multiple-choice receptive vocabulary size test. It consists of 140 items measuring vocabulary knowledge at fourteen 1,000 spoken word family bands in the British National Corpus (BNC) (Nation, 2006), from the first 1,000 to the 14th 1,000-word frequency level. At each frequency level, ten items were randomly selected, so each item represents 100 word families within the same word frequency level. The following is an example from the first 1,000-word family frequency level:
period: It was a difficult period.
a. question
b. time
c. thing to do
d. book
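The sampling arithmetic described above (ten items per 1,000-word band, each item standing for 100 word families) can be illustrated with a minimal sketch. The function name, the constants and the sample scores below are hypothetical and serve only to show how raw scores would translate into a size estimate under that sampling ratio; this is not an official scoring tool.

```python
# Illustrative sketch only: names and sample data are hypothetical.
FAMILIES_PER_ITEM = 100  # each item represents 100 word families
ITEMS_PER_BAND = 10      # ten items sampled from each 1,000-word band
NUM_BANDS = 14           # 1st 1,000 to 14th 1,000 frequency level


def estimate_vocabulary_size(correct_per_band):
    """Return (per-band estimates, total estimate) in word families."""
    if len(correct_per_band) != NUM_BANDS:
        raise ValueError(f"expected {NUM_BANDS} band scores")
    for score in correct_per_band:
        if not 0 <= score <= ITEMS_PER_BAND:
            raise ValueError("each band score must be between 0 and 10")
    # Scale every raw band score by the 1:100 sampling ratio.
    per_band = [score * FAMILIES_PER_ITEM for score in correct_per_band]
    return per_band, sum(per_band)


# Hypothetical learner: 9, 6 and 3 correct in the first three bands, 0 elsewhere.
scores = [9, 6, 3] + [0] * 11
per_band, total = estimate_vocabulary_size(scores)
print(per_band[:3], total)  # [900, 600, 300] 1800
```

Under this ratio, a single additional correct answer raises the estimate by 100 word families, which is why the quality of each individual item matters for the overall estimate.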
The VST is claimed to “provide a reliable, accurate, and comprehensive measure of a learner’s vocabulary size from the 1st 1,000 to the 14th 1,000-word families of English” (Nation & Beglar, 2007, p. 9). However, there are a few possible shortcomings that should be noted. In a recent case study, Gyllstad, Vilkaite and Schmitt (2015) examined the effect of guessing and sampling rate on data from the VST. The study compared test-takers’ performance on three sections of the VST (i.e., the 3K, 6K, and 9K sections) with their performance in follow-up interviews, in which participants were shown a list of words from each section without the multiple-choice options and asked to describe the meaning of those words (e.g., L1 translation equivalent, L1 or L2 definition, L2 synonym). The results suggested that there was a significant difference between the participants’ scores on the VST and in the interview for the 3K and 9K sections: the VST tended to overestimate the participants’ vocabulary size at these frequency bands.
The researchers provided two possible explanations for the difference between test-takers’ scores on the VST and in the follow-up interviews, where scores were lower. First, the VST requires test-takers to demonstrate knowledge at a less demanding level (i.e., receptive meaning recognition), whereas the oral interviews elicited knowledge at a more demanding level (i.e., productive meaning recall). This explanation is congruent with Laufer and Goldstein's (2004) finding that meaning recall tasks are more difficult than meaning recognition tasks. It does not represent a deficiency in the VST; it means only that the VST was not testing the same depth of knowledge as the interview.
The second explanation is that the discrepancy in scores between the VST and the interviews could result from overestimation due to guessing. The researchers therefore highlighted the clear tendency of multiple-choice tests to overestimate, and called for careful consideration when choosing a multiple-choice test, such as the VST, as a vocabulary measurement instrument for pedagogical and research purposes.
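To make the guessing concern concrete, the following minimal sketch shows how blind guessing could inflate a score on four-option items. It rests on a simplifying assumption that is not taken from Gyllstad et al. (2015): that a test-taker answers every known item correctly and guesses at random, with equal probability for each option, on every unknown item. The function name and the figures are hypothetical.

```python
# Illustrative sketch only, assuming purely random guessing on unknown items.
def expected_score_with_guessing(known, total_items=10, options=4):
    """Expected number of correct answers in one frequency band."""
    if not 0 <= known <= total_items:
        raise ValueError("known must be between 0 and total_items")
    unknown = total_items - known
    # Each blind guess has a 1-in-`options` chance of being correct.
    return known + unknown / options


# Hypothetical learner who truly knows 2 of the 10 sampled items in a band.
expected = expected_score_with_guessing(2)
print(expected)        # 4.0 items expected to be marked correct
print(expected * 100)  # 400.0 word families estimated, against 200 actually known
```

Real test-takers are unlikely to guess purely at random, since partial knowledge and test-taking strategies also play a role, so the sketch indicates only the direction of the bias and not its actual size.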
In addition, with respect to the sampling rate, the VST includes ten target items for each word frequency level, a ratio of one target item to 100 word families at the corresponding level. Therefore, each target item’s “characteristics (e.g., cognate, or false friend or not) and each test item’s efficacy (strong or weak item) has a disproportionate effect on the overall vocabulary size estimate” (Gyllstad et al., 2015, p. 280). The results suggested that the VST sampling rate of one target item to 100 word families is roughly sufficient. However, the researchers suggested that, because of the possibility of overestimation, the VST might not be suitable in situations where an accurate vocabulary size estimate is required (e.g., estimating graded reader levels). It should be noted that Gyllstad et al. (2015) did not arrive at a conclusive explanation for the difference between scores on the VST and the oral interviews. In my view, either or both of the above-mentioned explanations could be possible, and the effect of guessing on multiple-choice vocabulary size tests should be further examined.
Despite its possible limitations, the VST is the most suitable vocabulary size test for the participants in the current study for several reasons. Most importantly, the VST provides an estimate of vocabulary size at each word frequency level from the 1st 1,000-word frequency level to the 14th 1,000-word frequency level. The range of word frequency levels that the VST covers is important in this study for two main reasons. First, the participants in this study are expected to have a low vocabulary level that mostly consists of words from the highest word frequency levels (i.e., the 1,000-word and 2,000-word frequency levels). Second, the target items in the study were selected from a lower word frequency level (i.e., the 5,000-word frequency level) to minimize the chances of these items being known by the participants. Therefore, examining the participants’ vocabulary size at the same word frequency level as that of the target items, as well as at lower word frequency levels, would further confirm the expected vocabulary size of the participants and support the decision to choose the items from a low word frequency level. In addition, in terms of classroom time constraints, the test format is practical in the target context, as it takes a relatively short time to administer.