Humanising Language Teaching
MAJOR ARTICLES

The Deleted Essentials Test: an effective affective compromise

Secondary and adult
June Eyckmans, Frank Boers, Murielle Demecheleer, Erasmus College of Brussels, University of Antwerp, Université Libre de Bruxelles

Menu

Introduction
Discrete-point tests, holistic tests and text-based integrative tests
Transcription as a test
Cloze and C Tests
Rational Cloze Test (test designer chooses which words are deleted)
The Deleted Essentials Test
Conclusion

Introduction

In this article we propose a user-friendly test format that could offer a compromise between (i) proficiency testing and achievement testing, and (ii) between holistic testing and discrete-point testing, while meeting common criteria for test design, such as:

  1. Validity: Does the test really test what it is meant to test? For example, testing a learner's explicit knowledge of grammar rules may be a poor measure of this learner's oral proficiency. Is the task varied enough to give the learner sufficient opportunity to display the skills that are meant to be measured? For example, oral proficiency tests that consist entirely of a learner's prepared discourse may be a poor measure of this learner's 'real-time' communicative skills.
  2. Reliability: Does the test produce consistent data? Would the results be similar if the candidates' performance were rated by different scorers (i.e., so-called inter-rater reliability)? Does the scoring method offer enough guidance for the assessor to be consistent in her or his assessment of linguistic performance over time (i.e., so-called intra-rater reliability)?
  3. Positive affect: Do the learners feel that the test gives them a fair chance to show what they are capable of? Does the test encourage them to try and improve their performance in future tests (i.e., does it create a positive backwash effect)?

The test format we propose could be used as a general proficiency test, i.e., a test that is meant to measure a learner's language ability regardless of the specific content and objectives of the language courses she or he may have taken. It goes without saying that proficiency is the ultimate objective in language learning. In educational practice, however, most tests commonly administered at (Belgian) schools, colleges and universities are so-called achievement tests, which are directly related to the specific content and objectives of the course that the learners are taking or have taken. The goal of such tests (which may be either progress achievement tests or end-of-term achievement tests) is to determine whether learners have achieved specific course objectives. Many teachers stick to achievement tests for three reasons. Firstly, such tests often correspond to a view of language learning as a linear, step-by-step mastery of distinct segments of the language, a view which gives teachers (and learners) a (perhaps naïve) sense of control over the language learning process. Secondly, achievement testing can be used by teachers as a (short-term) carrot-and-stick method to 'motivate' their pupils or students to actually study the segments of language they are being taught. Thirdly, achievement tests are quite simply an integral part of a longstanding corporate culture and tradition at many schools, and imitation is obviously an easier option than innovation.

Discrete-point tests, holistic tests and text-based integrative tests

In practice, achievement testing is often carried out by means of so-called discrete-point tests. These are tests that measure mastery of distinct language elements item by item (e.g., a traditional grammar test targeting the form of a specific tense). Sometimes, they are purely forms-focused, as in the following example:

Turn the infinitives into the past tense

1) It was getting late when we....(to drive) back home last night.
2) Last Christmas, I....(to give) you my heart.
Etc.

Discrete-point tests typically tap into explicitly studied knowledge of separate segments of language. Sometimes the learner is even asked to reproduce that explicit knowledge (of grammar 'rules', etc.), resulting in tests on which the majority of native speakers would actually obtain rather disappointing scores.

For example:

Put the verbs in the simple past or the present perfect and motivate your choice

1) She (break) ____________ her leg in an accident last year.
2) So far, this teacher _____________ (not teach) us anything new.
Etc.

Discrete-point tests are usually easy to mark as there is often just one 'correct' response and thus different teachers are likely to agree when scoring them (i.e., the inter-rater reliability tends to be high), although they will certainly not always agree on the choice of test items (i.e., the perceived relevance or importance of the chosen language focus). The problem, however, lies with their lack of validity when it comes to measuring general proficiency. There is serious doubt about the transferability of explicit knowledge measures to learners' ability to actually use the language in real communication, especially, real-time, spontaneous communication (Skehan, 1998). In fact, when we ourselves compare our college students' scores on such discrete-point English grammar tests with their scores obtained in oral proficiency interviews, we very seldom find statistically significant correlations at the p < .001 level.

Even text-based - and in that sense more meaning-focused - exercises like the following, in which the 'reflective' student embeds the appropriate types of conditional sentences in a stretch of discourse that is not her own, cannot predict with any certainty that this student will also use these structures successfully when she engages in (spontaneous) self-expression.

Put the verbs in the appropriate tense.

Economics is not a science of precise laws. Tendencies, maybe. Stock prices _____________ (tend) to reflect rational predictions, unless stock brokers _______________ (start) to panic. Even if economists ____________ (be) able to get all possible information about a given market, they ____________________ (still be) unable to predict future developments with complete certainty. Economic theories are influenced by the historical context in which they are constructed. If Adam Smith ________________ (live) at the beginning of the 20th century, he ________________ (have) different ideas about division of labour. Etc.

The following kind of discrete-point test, however, allows for slightly more self-expression. The contextual clues provide a framework for the appropriate use of discrete grammatical structures, but within those confines there is opportunity for students to produce language in a more creative way:

Complete the blanks (min. 4 words), making use of the contextual cues.

1. _______________________________________ for 2 years now.
2. We made a long shopping list in order __________________________. But even so we did.
3. In spite ______________________________she refused to come.
4. Would you consider ____________________________with me tomorrow?
Etc.

An advantage of this kind of test lies in the absence of explicit language focus instructions per test item. The learners are not informed explicitly that they should pay special attention to tenses in item one, to word order in item two, etc. In other words, this test is much less overtly directive than most other discrete-point tests. A clever choice and combination of test items of this sort can allow for 'coverage' of many segments of a language's grammar without the learner being constantly aware of what grammatical structures are being elicited. In fact, scores obtained on this kind of test, which we shall henceforth call a 'structure test', do seem to correlate well with scores obtained on general proficiency tests, such as interviews (see below), and therefore seem to live up to the criterion of validity. However, a disadvantage of such 'structure tests' is certainly the difficulty (and time) involved in marking them, as learners will come up with a variety of responses, some of which may be counted as acceptable by one teacher but erroneous by another. In other words, inter-rater reliability may be at risk.

Tests that go straight to the heart of general proficiency tend to consist of 'holistic' tasks. These are tasks in which mastery of various language elements is required and different skills are intertwined. This 'holistic' approach to language testing is based on the notion that language proficiency constitutes much more than the sum of grammatical and lexical knowledge. The way in which different linguistic skills are integrated in authentic language use makes up the essence of language proficiency. This approach is a reaction to Cartesian views or methodologies in which separate elements of a phenomenon are investigated without paying attention to how these elements interlock or interrelate. Typical examples of 'holistic' language tasks are essays (for 'self-monitored' proficiency in writing) and interviews (for real-time oral proficiency). While such holistic tests are obviously more valid than discrete-point tests for estimating a learner's general proficiency, giving a score to the learner's performance on a holistic task is a much more complicated issue. As a result, one and the same essay may be awarded different marks by different assessors, and an interviewee may be given different scores by different interviewers. A lot depends on the priorities set by the individual assessor. Not only is holistic testing a very time-consuming business; second opinions ('blind judges') may also be required in order to guarantee reliable scores. Interview scores may be especially problematic if they are awarded by just one interviewer. After all, even teachers have a limited processing capacity, and so even they must find it hard to focus simultaneously on meaning and form while on top of that also contributing to the face-to-face interaction. Given the amount of work required to safeguard reliability, it is not so surprising that holistic testing has had comparatively little popular appeal among groups of overworked teachers and examiners (or at least the self-questioning ones).

In a quest for proficiency tests that combine the characteristics of validity, reliability, and teacher-friendliness (i.e., easy to develop, administer and mark), various formats have been proposed over the years. These proficiency tests are based on a sample of authentic language, in accordance with the view that the performance of the language learner in the test should be as representative as possible of his or her ability to use the target language in non-test situations. They also allow for a more consistent scoring method (and thus for high inter-rater reliability), because the authentic text that serves as their basis also serves as the key for marking. We shall call this type of test 'text-based integrative tests'.

Transcription as a test

One example is the transcription test, where students listen to an authentic audio-recording once for the gist and are then asked to transcribe the text while the recording is played again, now with pauses between semantic segments. While such a transcription task has been shown to be a reliable measure of proficiency, the need for technical equipment (both to prepare the segmented recording and to administer the actual test) may undermine its popular appeal to teachers. In our experience with this test format, students also tend to perceive transcriptions as particularly merciless.

Cloze and C tests

Most text-based integrative tests start off from a printed text. In an effort to minimise the impact of the test maker's (or the teacher's) individual choices with regard to linguistic elements she might find especially relevant or important, but which a colleague might not give the same priority to, the so-called Cloze Test (for further reading see Alderson, 1983; Bachman, 1985) and the C-test (Klein-Braley, 1985; Raatz & Klein-Braley, 1985) were introduced. Here is an example of each:

Cloze test (here every fifth word is deleted systematically)

BODY LANGUAGE
People convey messages not only by using words, but also through what is known as body language. Communication through body language ………. been going on for ………. of years, but has ………. been studied during the ………. 30 years or so. ………. more human communication takes ………. by the use of ………. and postures than most ………. us realize. For instance, body ………. can offer clues to ………. you tell when someone ………. lying. If a young ………. tells a lie, he ………. covers his mouth with ………. hands in an attempt ………. stop the deceitful words ………. coming out. If he ………. not wish to listen ………. a reprimanding parent, he ………. simply cover his ears ………. his hands. Similarly, when ………. sees something he doesn't ………. to look at, he ………. his eyes with his ………. or arms. As a ………. becomes older, the hand-to-face ………. become more refined and ………. obvious, but they still exist.
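(For teachers who would rather not count and delete words by hand, the fixed-ratio deletion is easy to automate. The following short Python sketch is our own illustration, not part of any published test procedure; the function name, the blank marker and the length of the intact lead-in are arbitrary choices.)

import re

def make_cloze(text, n=5, blank="……….", lead_in=20):
    """Delete every nth word after an intact lead-in of `lead_in` words.
    Returns the gapped text and the answer key (the deleted words)."""
    tokens = text.split()
    gapped, key = [], []
    for i, token in enumerate(tokens):
        if i >= lead_in and (i - lead_in + 1) % n == 0:
            word = re.match(r"[\w'-]+", token)  # keep trailing punctuation visible
            if word:
                key.append(word.group())
                gapped.append(blank + token[word.end():])
                continue
        gapped.append(token)
    return " ".join(gapped), key

Calling make_cloze() on the 'Body Language' passage with n=5 produces a gapped text of the kind shown above, together with a marking key; the first sentence is left intact so that readers get some initial context.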

C- test (here the second half of every second word is deleted systematically)

BODY LANGUAGE
People convey messages not only by using words, but also through what is known as body language. Communication through body language has been going on for thousands of years, but has only be…… studied du…… the pa…… 30 ye…… or s……. Yet mo….. human commun ….. takes pl ….. by t ….. use o ….. gestures a ….. postures th ….. most o ….. us rea …… . For inst….., body lang ….. can of ….. clues t ….. help y ….. tell wh ….. someone i ….. lying. I ….. a yo ….. child te ….. a l ….., he frequ ….. covers h ….. mouth wi ….. his ha ….. in a ….. effort t ….. stop t ….. deceitful wo ….. from com ….. out. I ….. he do ….. not wi ….. to lis ….. to a reprim ….. parent, h ….. may sim ….. cover h ….. ears wi ….. his ha ….. . Similarly, wh ….. he se ….. something h ….. doesn't wa ….. to lo ….. at, h ….. covers h ….. eyes wi ….. his ha ….. or ar …… As a person becomes older the hand-to- face gestures become more refined and less obvious, but they still exist.
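(The C-test deletion rule can be automated in the same spirit. Again a minimal sketch of our own; real C-tests usually leave the opening and closing sentences intact, which we approximate here with a simple word-count lead-in.)

def make_c_test(text, lead_in=20, gap="….."):
    """Truncate every second word after an intact lead-in: keep the first
    half of each targeted word (rounded up), replace the rest with a gap."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        # Damage every second word past the lead-in; leave one-letter words alone.
        # (Punctuation attached to a token counts toward its length - a simplification.)
        if i >= lead_in and (i - lead_in) % 2 == 1 and len(token) > 1:
            keep = (len(token) + 1) // 2
            out.append(token[:keep] + gap)
        else:
            out.append(token)
    return " ".join(out)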

Although these tests have been shown to be both valid and reliable measures of proficiency, to our knowledge neither the Cloze Test nor the C-test has ever made it into mainstream foreign language testing in Belgium. One reason for this may have been the common lack of information sharing between FLT researchers on the one hand and FLT practitioners on the other. Another reason, we suspect on the basis of people's reactions when we host workshops about language testing, is quite simply that teachers do not much like the idea of not having any say in the choice of test items. They may worry that potentially 'interesting' language in a given text might not be 'covered' because of the 'blind' deletion of every nth word. They may also worry that they cannot really prepare their students or pupils for such an integrative test by asking them to study particular segments of the language in a linear, step-by-step syllabus. Hardworking pupils may feel disappointed if they fail to get the rewards they think they deserve for studying their course notes, and so negative affect may set in. In other words, teachers may be anxious about jeopardising the carrot-and-stick effect of testing.

The teacher's autonomy in selecting test items in authentic texts has been restored in the so-called Rational Cloze test. Here it is the teacher who decides which words are deleted for the students to reproduce. For example:

Rational Cloze test (the test designer chooses which words are deleted)

BODY LANGUAGE
Communication through body language has been going on for thousands of years, but has only ………. studied during the past 30 years or so. Yet more human communication takes place by the use of gestures and postures ………. most of us realize. For instance, body language can offer clues to help you tell ………. someone is lying. If a young child tells a lie, he frequently ………. his mouth with his hands in an attempt to stop the deceitful words ………. coming out. If he ………. not wish to listen to a reprimanding parent, he may simply cover his ears with his hands. Similarly, when he sees something he doesn't want to look ………., he covers his eyes with his hands or arms. ………. a person becomes older, the hand-to-face gestures become more refined and ………. obvious, but they still exist.
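(If the teacher keeps a list of the words she wants to target, producing a Rational Cloze version can also be automated. The sketch below is again our own illustration; it blanks the first occurrence of each chosen word, whereas a real test maker would of course pick the exact occurrences by hand.)

import re

def make_rational_cloze(text, targets, blank="………."):
    """Blank out the first occurrence of each teacher-chosen target word."""
    key = []
    for word in targets:
        pattern = re.compile(r"\b%s\b" % re.escape(word))
        text, hits = pattern.subn(blank, text, count=1)
        if hits:
            key.append(word)
    return text, key

# e.g. the deletions in the example above:
# make_rational_cloze(passage, ["been", "than", "when", "covers",
#                               "from", "does", "at", "As", "less"])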

While the Rational Cloze test may satisfy teachers' desires to prioritise certain language elements over others, it risks sacrificing some validity as a proficiency test. After all, some teachers may target items they consider to be of particular grammatical importance (e.g., elements recently 'covered' in the syllabus). Others may target vocabulary belonging to a particular lexical field (e.g., about a theme recently explored in class). And so on. As such, the Rational Cloze test may become more suitable as an achievement test than as a proficiency test. The more idiosyncratic the taste of the teacher, the less valid the test risks becoming as a measure of general proficiency. In fact, the Rational Cloze test carries an invitation to return to discrete-point achievement testing.

Summing up so far, we have briefly discussed three types of tests. Discrete-point tests tend to display high inter-rater reliability (as the scoring tends to be simple and unambiguous), but they are typically achievement tests that tend to lack validity as proficiency measures. Holistic tests (e.g., interviews) are valid as proficiency tests, but inter-rater reliability may be poor. In addition, they tend to be hard work and time-consuming. Text-based integrative tests (e.g. Cloze) appear both reliable and valid as proficiency measures, but they tend to provoke negative affect. We shall now propose a text-based integrative test that is meant to combine the best of three worlds.

The Deleted Essentials Test

The Deleted Essentials test is based on an exercise proposed by Weir (1990) as a reading comprehension task, but which can easily be adapted for purposes of proficiency testing. It is a text-based integrative test that has some properties of the Rational Cloze test, but in which the teacher's choice of target items is inherently limited and in which the learner's task fulfilment requires more than merely filling in words in indicated spaces. Here is an example:

Deleted Essentials Test

Read the following text. In every numbered line ONE word is missing. Indicate by means of a slash (/) where a word is missing and then write down the missing word in the margin.

BODY LANGUAGE
People convey messages not only by using words but also through what is known as body language.
Communication through body language has going 1.
on thousands of years, but has only been studied 2.
during the past 30 years so. Yet more human 3.
communication takes by the use of gestures 4.
and postures than most of us realize. For, body 5.
language can clues to help you tell when someone is lying. 6.
If a young child a lie, he frequently covers his mouth 7.
with his hands in an to stop the deceitful words from 8.
coming out. If he does wish to listen to a reprimanding 9.
parent, he may simply cover his with his hands. 10.
When he sees something he doesn't want to look, 11.
he his eyes with his hands or arms. 12.

As the number of words that are both essential and predictable is quite limited in each line of the text, any teacher's potential bias in the choice of test items will by definition be dampened. Correct responses are typically limited to one or very few possibilities, which warrants comfortable inter-rater reliability. Not giving away the location where a word is missing, but instead making the identification of that location part of the task, has a number of advantages, too. Firstly, the test taker is obliged to take a more active role to fulfil the task, as she needs to recognise 'oddities' in the text (and, by doing so, show familiarity with the target language). Secondly, text comprehension and familiarity with rhetorical structure can be tested, for example by deleting a discourse connector (e.g., line five) or a negation marker (e.g., line nine), which would be too transparent in Cloze tests or C-tests. Thirdly, students can already be given partial credit for recognising where the word is missing in a line, and full credit if they also manage to come up with the appropriate lexis. This reduces the risk of negative affect, since the test appears less rigid and merciless than the more black-and-white types of text-based integrative tests. It may also contribute to the discriminating power of the test in rather homogeneous student populations. Levels of difficulty can be determined by the choice of text (genre) and by manipulating the length of the lines from which one word is deleted.

The use of the Deleted Essentials Test is certainly not confined to English. In fact, its use for languages that display a higher degree of inflection than English may provide a third level of task fulfilment and hence a third variable for marking: not only may students be rewarded for locating the place where something is missing and providing the missing word, they may also be given credit for providing the correct word ending (as in line 12 of our English example, where the -s ending is required in the missing word 'covers'). This third level of task fulfilment and marking may accommodate teachers with a more forms-focused view of language and it may give a sense of gratification to students who have taken the trouble of upgrading the accuracy of their language production. The following is an example for French, in which mainly conjugation and the distinction between feminine and masculine words make up this third level of task fulfilment.

Deleted Essentials Test

L'histoire de la bière
L'histoire de la bière semble remonter à la nuit des temps. Les signes les plus
anciens de existence datent de près de 8000 ans avant J.C., dans un 1)...
village de Sumer (entre le Tigre et l'Euphrate). On y a des preuves 2)...
de la culture de l'orge et froment, des traces de fours et des meules, 3)...
bref de tout ce qui était nécessaire à la d'un "vin de grains". La stèle 4)...
du célèbre "code d'Hammourabi" décrit la bière seulement comme 5)...
une boisson, mais également comme / médicament et un moyen de 6)...
paiement en. La plus ancienne recette, gravée en hiéroglyphes sur la 7)...
pierre, est précieusement au Metropolitan Museum de New York. 8)...
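(To make the three-level marking scheme concrete, the sketch below shows one way the credit might be computed. The weights, the data structures and the crude trailing-'s' stem comparison are our own assumptions, not the authors' actual marking procedure.)

from dataclasses import dataclass

@dataclass
class Item:
    missing_word: str   # the word that was deleted from this line
    position: int       # word index of the gap within the line

@dataclass
class Response:
    marked_position: int  # where the student placed the slash
    supplied_word: str    # what the student wrote in the margin

def score_item(item, resp, w_locate=1, w_lexis=1, w_form=1):
    """Credit at three levels: locating the gap, supplying the right
    lexis (even with a wrong ending), and supplying the exact form."""
    score = 0
    if resp.marked_position == item.position:
        score += w_locate                   # level 1: slash in the right place
    # level 2: right word even with a wrong ending; stripping a trailing 's'
    # is a crude stand-in for a real lemmatiser
    if resp.supplied_word.lower().rstrip("s") == item.missing_word.lower().rstrip("s"):
        score += w_lexis
        if resp.supplied_word.lower() == item.missing_word.lower():
            score += w_form                 # level 3: correct ending too
    return score

# Line 12 of the English example ('he / his eyes ...') is missing 'covers':
line12 = Item(missing_word="covers", position=1)
print(score_item(line12, Response(1, "cover")))   # 2 (located + lexis, wrong ending)
print(score_item(line12, Response(1, "covers")))  # 3 (full credit)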

Evidence of the validity of the Deleted Essentials Test as a measure of proficiency can be derived from retrospective analyses of end-of-term exam scores of student populations at the Université Libre de Bruxelles, where the test has been administered on a regular basis since the early 1990s. We looked at the exam results of students of modern languages at the end of their first year at university. These were students aged 19-20 who had just completed a sixty-hour English course aiming at general proficiency, but also containing a fair amount of task-based revision of grammar. We considered the exam data of four consecutive years, with student numbers ranging from 58 to 69. Each year the exam consisted of four parts: (i) a structure test (see above) targeting the grammar structures revised in the course, (ii) a transcription using a recording of a radio news item whose contents were related to one of the themes discussed in the course, (iii) a Deleted Essentials Test based on a text whose contents were related to one of the themes discussed in the course, and (iv) an interview using the topic of a text the students had read as a starting point for the interaction. New materials and new test versions were used each year. The structure test, the transcription and the Deleted Essentials Test were each marked by one teacher. For reliability's sake, the interviews were conducted by two teachers, who afterwards 'negotiated' the marks.

The table below sums up the coefficients we calculated, by means of the Spearman rank correlation formula, between the Deleted Essentials Test scores and the scores on the other parts of the four exam sessions. All correlations were (fortunately) significant at p < .001.

Spearman rank correlations with Deleted Essentials Tests

Exam session          Structure test   Transcription   Interview
Year one (N = 69)          .670             .721          .656
Year two (N = 59)          .778             .791          .551
Year three (N = 64)        .774             .722          .629
Year four (N = 62)         .724             .826          .606
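(Readers who want to run the same analysis on their own exam results can compute such coefficients with a few lines of Python; the score lists below are invented placeholders for illustration, not the study's data.)

from scipy.stats import spearmanr

# Invented placeholder scores, one value per student, in the same order.
deleted_essentials = [62, 71, 55, 80, 67, 74, 59, 88]
interview          = [60, 75, 50, 78, 70, 72, 61, 85]

rho, p = spearmanr(deleted_essentials, interview)
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")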

Overall, the Deleted Essentials Test scores turned out to correlate virtually as well with the interview scores as the transcription scores did (averages around .62 and .63, respectively). Since past research has pointed up the transcription as a valid measure of proficiency, we may assume that the Deleted Essentials Test is also a viable candidate, but one that is easier to develop and administer. It must be acknowledged, though, that the scores on the structure test (which leans towards discrete-point testing) also happened to correlate well with the interview scores in our data (around .61 on average). This correlation, however, may have been enhanced somewhat by the fact that the interviewing teachers knew that one of the objectives of the course had been to revise grammar, and so they may have paid special attention to the students' use of the studied grammatical structures. In addition, we mentioned earlier that the structure test is a rather atypical discrete-point test, as it is less overtly directive and invites the students to 'produce' a fair amount of language themselves.

Conclusion

We have proposed and advocated a text-based integrative test for measuring proficiency, the Deleted Essentials Test, which seems to live up to three crucial criteria of test design, i.e., validity (as a measure of proficiency), inter-rater reliability, and avoiding negative affect (by incorporating sufficient flexibility in the choice of test items and marking schemes).

Concurrent validity was established by calculating correlation coefficients between Deleted Essentials Test scores and scores on other integrative and holistic tests. Inter-rater reliability seems safeguarded by the fact that only one essential and predictable word is targeted per line, which limits the choice of test items and also severely limits the number of words that are appropriate in the given slot (usually there is only one 'correct' response).

At the same time, the test format allows for sufficient flexibility in the choice of test items (the choice can be widened by manipulating the length of the lines, for example) to accommodate individual teachers' wishes to establish a given language focus (e.g., to especially target prepositions) and thus add a dimension of achievement testing. Finally, the Deleted Essentials Test allows for a fairly flexible marking method in which students can be given credit at three levels: locating the missing word (showing familiarity with the target language), providing the missing word (e.g., showing knowledge of collocations, syntax and text comprehension), and providing the correct form of the word in the given context (showing knowledge of formal grammar and spelling conventions). It would obviously be up to teachers to decide on a marking scheme that is most appropriate for their particular students and for assessing whether those students have reached the objectives of their particular language course.

It goes without saying that students should be familiar with a novel test format before it is administered to them as part of an exam (Henning 1987). Fortunately, we have found that making Deleted Essentials Tests is fairly easy, and not as time-consuming as putting together a structure test, for example. However, experience also tells us that it is important to first try a newly developed version on colleagues who have not read the original text, in order to make sure the targeted words are really essential and predictable, and in order to use the colleagues' 'correct' responses as a key in the (rather rare) case that an item invites more than one appropriate response.

1 For those interested in the workshops on language testing and language teaching methodology of the L.E.E.R. guild (Language-Education-Experience-Research), please contact june.eyckmans@pandora.be

References

Alderson, J.C. (1983) The cloze procedure and proficiency in English as a foreign language. In Oller, J.W. Jr. (Ed.), Issues in Language Testing Research. Rowley, Mass.: Newbury House, pp. 205-217.
Bachman, L. (1985) Performance on cloze tests with fixed-ratio and rational deletions. TESOL Quarterly, 19, 535-556.
Henning, G. (1987) A Guide to Language Testing: Development, Evaluation, Research. Cambridge, Mass.: Newbury House.
Klein-Braley, C. (1985) A cloze-up on the C-test: a study in the construct validation of authentic tests. Language Testing, 2, 76-104.
Raatz, U. & Klein-Braley, C. (1985) How to develop a C-test. In Klein-Braley, C. & Raatz, U. (Eds.) Fremdsprache und Hochschule, 13/14: Thematischer Teil: C-tests in der Praxis. Bochum: AKS, pp. 20-22.
Skehan, P. (1998) A Cognitive Approach to Language Learning. Oxford/New York: OUP.
Weir, C. (1990) Communicative Language Testing. Hemel Hempstead: Prentice Hall International.

    © HLT Magazine and Pilgrims