What We Think of Tests: A Critical Evaluation
Ian Michael Robinson, Italy
The author is a researcher at the University of Calabria in Italy and has been involved in testing and test writing for many years in Italy and Japan. He enjoys life in Italy with his wife, children and cats. E-mail: ian.robinson@unical.it
Menu
Introduction
Brief literature review
Methods and participants
Results and comments
Conclusions
References
Introduction
At a time when testing is coming increasingly under scrutiny (for example, Paran and Sercu, 2010; Küçük and Walters, 2009), it becomes ever more important to get a clear idea of how testing affects teachers and what attitudes and beliefs they hold on the subject, with the aim of seeing whether they are critically aware of what they are doing.
The literature on testing refers to aspects that we must bear in mind when preparing methods of evaluation in the EFL field, such as validity, reliability, authenticity and the criteria to be used, as well as a test being user- and marker-friendly (Alderson et al., 1995; Bachman, 1990; Bachman and Palmer, 2010; Douglas, 2010; Fulcher and Davidson, 2007; Heaton, 1988; Hughes, 1989; McNamara, 2000; Weir, 1993; Weir, 2004). This work investigated three different groups of people involved in EFL in the south of Italy: a group of secondary school teachers, a group of aspiring secondary school teachers, and a group working in an Italian university who deal with testing as part of their work. A questionnaire was used to elicit the attitudes and beliefs of people in these groups concerning the problem of evaluation. Respondents answered questions about the types of test they use, their reasons for using them and their assessment of tests they are familiar with. They were also asked to reflect critically on what they do as test practitioners.
This article presents the results of this research and discusses some of the conclusions that can be drawn from it.
Brief literature review
The world of teaching English as an L2 is becoming more complex, with goals that are more wide-ranging than the general goal of language teaching in the past (simply to teach the language). Indeed, Tomlinson and Masuhara (2013) came up with a list of fifteen aims by which to judge courses:
- provide extensive exposure to English in use;
- engage the learners affectively;
- engage the learners cognitively;
- provide an achievable challenge;
- help learners to personalize their learning;
- help learners to make discoveries about how English is typically used;
- provide opportunities to use the target language for communication;
- help the learners to develop cultural awareness;
- help learners to make use of the English environment outside the classroom;
- cater for the needs of all the learners;
- provide the flexibility needed for effective localization;
- help the learners to continue to learn English after the course;
- help learners to use ELF;
- help learners to become effective communicators in English;
- achieve its stated objectives.
Presumably these goals should be reflected in the type of testing that occurs in this field. However, the field of language testing is not an easy one either.
Many language teachers harbour a deep mistrust of tests and testers … this mistrust is frequently well founded
(Hughes, 2003, p. 1)
This ... is not about tests, but rather about their uses, effects and consequences. It is about how tests cannot be viewed as isolated and neutral events but rather as embedded in educational, social, political and economic contexts. Tests, therefore, need to be interpreted and understood within this complex context.
(Shohamy, 2001, p. xvii)
These two quotes from two leaders in the field of L2 testing begin to demonstrate some of the problems involved in language testing. These difficulties are further illustrated by other authors. Luoma (2004) starts her book by looking at four different scenarios concerning the testing of speaking. From this it is clear that there cannot be just one way to assess a language skill, but that various modes could be used depending on the aspect to be tested and the specific context (as the quote from Shohamy above clearly notes). What Luoma has in common with the other authors of the same series of assessment books (Alderson 2000, Buck 2001, Cushing Weigle 2002, Purpura 2004, Read 2000) is that they all place modern L2 testing within specific theories of assessment, and these all refer back, at some point, to Bachman and Palmer (1996) and the idea of language ability: a step on from the general communicative competence approach, which had itself replaced a grammar-based approach.
The starting points of these various scholars are of interest. Cushing Weigle starts from Hughes's (1989, p. 75) dictum that 'the best way to test people's writing ability is to get them to write', but then goes on to ask what kind of writing, and writing for what purpose. Cushing Weigle also states (p. 244) that 'it is absolutely essential, particularly in high-stakes situations, that any assessment methods be carefully designed with due attention to the aspects of test usefulness outlined in Bachman and Palmer and in this book'. In his book on assessing vocabulary, Read (2000) notes a dichotomy between people especially interested in specific vocabulary testing and those in the more general language testing field, with the former looking more to discrete, independent word-unit testing and the latter adopting a communicative approach aligned with that of Bachman and Palmer. Purpura (2004, p. 4) concentrates on grammar assessment and notes that 'there is a glaring lack of information available on how the assessment of grammatical knowledge might be carried out' and that we should 'allow language educators to select the type of assessment that best match their assessment goals' (Purpura, 2004, p. 255). For reading, Alderson seems to have the opposite problem: he states that it is impossible to read and assimilate all the research into reading, or even to produce a single perfect synthesis of it that defines what reading actually is, and so it becomes impossible to write the construct of a test clearly. He does admit, though, that the 'consolation, however, is that by designing admittedly imperfect tests, we are then enabled to study the nature of the tests and the abilities that appear to be being measured by those tests' (Alderson, 2000, p. 2).
Paran (2010, p. 11), in agreement with Luoma about the many different contexts, notes that 'the specificity of each context and the specificity of the constructs involved means that large-scale validity of the instruments used is difficult'.
This picture of the state of the art allows us to see that testing is far from a perfect science: often our tests are simply the best we can do at the moment, and what makes sense in one context may be inappropriate in another. The literature tries to guide practitioners in the use of tests but does not offer a one-size-fits-all solution, and so many and various tests are available.
In addition, Shohamy (2013, pp. 18-19) states that
“as language testers seek to develop and design methods and procedures for assessment (the ‘how’) they also become mindful not only of the emerging insights regarding the trait (the ‘what’), and its multiple facets and dimensions, but also of the societal role that language tests play, the power they hold, and their central functions in education, politics and society”.
This pulls together many of the strands that make L2 testing such a difficult field to work within. How should classroom practitioners deal with this?
Methods and participants
To evaluate what people think of tests, a questionnaire was developed. The questions, and the rationale behind each, were as follows:
1) What is the best exam you have used, and why?
Many exams are indeed very good; this question tries to elicit what is perceived as making a good test.
2) What is the worst exam you have used, and why?
By examining the worst exam, respondents are prompted to begin to evaluate critically the exams they have used.
3) What is the strangest exam you have used, and why?
Another way to examine the non-good exams is to attempt to discover what is perceived as 'normal' in examining and what as 'strange'.
4) What is the use of giving tests and exams?
The literature informs us of the uses of tests; do the users of tests share the same perception?
5) What would your ideal test involve?
Having discussed negative aspects, the questionnaire returns to more positive points and encourages respondents to describe an ideal test.
6) Do you create your own tests? If yes, how and why?
7) Do you ever read books on testing? If yes, which has been of most use?
These two questions combine to investigate whether the growing resource of books on testing is being exploited by EFL practitioners who are involved in some form as test writers.
8) What do you look for in a good test?
The answers here should probably correspond to the ideal test described in question 5.
9) What alternative type of evaluation have you used or would you use?
Are traditional tests and exams the only forms of assessment being used in Italian contexts?
The questionnaire was administered to three different groups of people: colleagues at the university who are involved in teaching and testing EFL; school teachers attending a language course at the university; and students on a teacher training course at the university. The questionnaires for the first two groups were sent by email, as direct contact with these groups was difficult. The last group completed the questionnaire as part of a larger class discussion on the topics involved (the discussion took place only after they had completed the items, so it did not influence their answers).
Perhaps unsurprisingly, all nine of the students involved in the teacher training that day completed and handed in the questionnaire. For the questionnaires distributed by email the return rate was much lower: 10 out of 37 for the university colleagues and 3 out of 13 for the school teachers.
I had hoped to be able to compare the results of the three groups, but the number of replies made this statistically unrealistic.
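As an aside, the response-rate arithmetic above, and the reason the group comparison was abandoned, can be illustrated with a short script. This is a minimal sketch for illustration only (no such script was part of the study); the group names and counts come directly from the figures reported above.

```python
# Illustrative sketch only: computes the response rates reported in the text.
# Group names and (returned, distributed) counts are taken from the article.
groups = {
    "teacher trainees": (9, 9),
    "university colleagues": (10, 37),
    "school teachers": (3, 13),
}

for name, (returned, distributed) in groups.items():
    rate = 100 * returned / distributed
    print(f"{name}: {returned}/{distributed} returned ({rate:.0f}%)")

# With only three replies from one group, the expected cell counts for any
# between-group significance test (e.g. chi-square) would be far too small
# to be meaningful, which is why no statistical comparison was attempted.
```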
Results and comments
1) What is the best exam you have used, and why?
Five people replied that the best test was an 'oral test', four named the Cambridge suite of ESOL exams, and two said that they did not know.
Individual answers included:
- open questions, 'without the stress of an oral test';
- 'the one in which I felt very relaxed and self-confident';
- 'What do you mean by "best test"? I think that a best test would be one that is fair and tests the students' ability in what you have taught';
- 'they were clear on the goals of the exam and so the question did really test the candidates' skills'.
2) What is the worst exam you have used, and why?
Four people replied that the worst type of test was one with multiple-choice items, and two said it involved tests that had nothing to do with the subject. For example, one of them wrote:
[it] was not based on any books: the exam was written by others, and not knowing what went in the course, it was not at all surprising that the exam content had little relevance to the course content and, worse, was not structured so to give learners the opportunity to fully express their new knowledge.
Another two stated that the worst exams were those that were not clear in the language used or in what the candidate had to do. Other single replies included: a very boring oral exam; exams that have not been proofread properly ('if at all'); exams that are not user-friendly; a piano exam where the candidate became nervous at having to perform, knowing that he/she would be judged; exams with time problems; an exam in which the candidate was tired; and an exam about which the respondent says:
I had the role of substitute teacher, so I didn't know the students who were sitting in front of me. We didn't have the time to find a common ground in order to mediate what our role is. This situation deeply influenced that experience
Three people replied that they did not have a ‘worst’ exam.
3) What is the strangest exam you have used, and why?
Four people could not think of a 'strangest' exam.
Other replies included the following:
- exams that have been written badly by people whose English is a bit iffy;
- IELTS / Shenker British Institute orals
- the entry test to TFA (Italian teacher training courses), 'because many questions weren't formulated in a right form, so they have been deleted afterwards' (2). (Where numbers are shown in brackets, this refers to the number of respondents who replied in the same, or similar, fashion.)
- The strangest exam I have used was one of the SSIS (Teacher Training Course) exams, because I didn't understand what the teacher wanted to teach us and the lessons were without a logical sense.
- true, false and not given!
- Playing Enya’s music while doing a test on the Celts, or being asked about totally other subjects to, maybe, put you at ease.
4) What is the use of giving tests and exams?
Eighteen replies included the word 'assessment'. Other replies were as follows:
- Getting information about your SS at the beginning of a course; testing the progress the students make, and consequently questioning the efficacy of your teaching strategies; plus, obviously, the fact that every now and then we have to carry out formal assessment, and if we don’t want to be unfair we need something objective to rely on.
- To give the students the opportunity to demonstrate what they have learned in that particular subject.
- Also evaluate the effectiveness of the teacher, materials used and the test itself.
- Some tests aim to a standardization of CEFR levels, in order to simplify interpretation of curricula.
- To see if a person has reached certain skills. But anxiety and many other factors are to be kept in consideration, because sometimes it can happen that the best students aren't able to show their skills during examination. The types of tests do not always correspond to the students' "intelligence".
- At school, tests are used to frighten students and force them to study ☺.
- I don't think there is any use in giving one big exam at the end of the semester, and evaluation should be a continuous process during the school year.
5) What would your ideal test involve?
Eight people felt that the ideal test should test all of the four (or five) main skills, plus structures and lexis according to the level; five said it should include different activities; four stated it should allow learners to show what they know; three wanted a test that can assess learners' language skills objectively; two thought it must suit different types of people; two wrote that it must be clear; and two wanted it to allow students to express their creativity (one of these added that this was a utopia). Other single replies included:
- topics should arouse the students’ interest;
- real life tasks;
- fair;
- involves all the senses, during which the students can stand up, hear something, smell something else or watch something;
- a test where time is not so frenzied;
- an internal normalizing control mechanism that overrides variations in test difficulty and question inappropriacies;
- students' mind and heart.
Put together, these replies would form a general construct of what a good test should be, although using smell as part of a test could be problematic. It is, obviously, an idealised construct in which candidates are treated and respected as individuals, even to the extent of creating a test that takes into consideration the students' 'hearts and minds'!
6) Do you create your own tests?
7) Do you ever read books on testing?
- Yes: 9
- No: 12
The answers to questions 6 and 7 show that teachers and language instructors are very often involved in test development, but that they are not always reading the literature on testing. This raises the question of how critically engaged these people are in attempting to create better tests, and suggests there may be an acceptance of things as they are.
8) What do you look for in a good test?
- clear (7);
- reliability (6);
- objectivity (6);
- coherence (3);
- fairness (3);
- no overlapping answers;
- stimulating texts/activities (3);
- tests what it is supposed to (3);
- valid (2);
- variety of task types (2);
- comprehensible from a linguistic point of view and also in terms of content; a test must be done following precise criteria (2);
- NO surprises;
- flexibility;
- adaptability;
- straightforward and doesn’t use question types students are not familiar with;
- not a final goal, but just one of the steps of a lifelong learning process;
- the best simple way is to ask the question without pitfalls. A written test isn't a training camp of Marines in which one has to watch out for possible traps;
- test various items and assess students' knowledge;
- doesn’t test background knowledge;
- easy to mark to guarantee reliability;
- questions formulated in the correct form;
- a good test has to test all the skills of students and test their global performances;
- creates a scale of evaluation, with clarity in marking and assessment criteria;
- it has to test what has been taught;
- originality.
As with the idealised-test question (Q5), this question elicited replies that generally reflect what is written in many texts on testing, suggesting that these respondents are aware of what good testing involves.
9) What alternative type of evaluation have you used or would you use?
- group work and presentations (2)
- self-evaluation (2)
- Mainly oral interaction, plus observation of the SS’s involvement in the different activities or tasks performed; this is not enough, though, because it can only give an overall impression of the SS’s results.
- Work done, compositions, journals, research projects, films.
- Some type of test that acknowledges that language is not just words.
- I would like to evaluate coursework and maybe projects carried out in class.
- At school, I would substitute “interrogazioni” (oral exams) with presentations. The word itself (interrogazioni) reminds us of the very unpleasant concept of “interrogatorio”. I believe they “murder” student motivation, lower students’ self-esteem, enhance teachers’ omnipotence, and incentive a sterile and uncritical approach to learning. Presentations, on the other hand, allow students to express their creativity, they stimulate student autonomy and sense of initiative and are extremely stimulating for the whole class (including the teacher). They are a useful tool, allowing teachers to carry out formative assessment of students’ competences. They may be followed by more standard/objective forms of summative testing.
- using the language portfolio
- a personal diary, with notes taken before, during and after the course.
- projects with students from other countries but this is not very practical considering the high number of students.
- Pair work on a project or specific topic or even a web quest
- a dictation, a translation, comprehension of a song listened to or of a video watched together; and then students write down what they understand and their opinions about what they listen to or looked at.
- Oral examination
- Diagnostic evaluation - to understand the ability of students
Formative evaluation - two of the modules of the learning/teaching approach
Summative evaluation - to express a final judgement about the achievement of students
- test my students every day, just from random questions during a lesson and I try to avoid, for oral evaluation, a formal test setting. I also evaluate students on projects and personal initiatives during my lessons
- different types of tests that aid evaluation to keep the variety going: completion; multiple choice, brainstorming would be an alternative to see how the students link language aspects.
Conclusions
Although the sample is limited, these responses allow us to reach some tentative conclusions. Even if not all those who are involved in test use and, especially, test writing read the literature in this field, it does seem that people are thinking about the tests they are involved with. The literature points out the vast difficulties of language testing, and those who took part in the survey have demonstrated that they are aware of much of this. The fact that they do not always agree on what the best test would be indicates that they are thinking of their own context and the individuals they are dealing with. They also realise that not all the tests available, or in use, are suitable for them.
The fact that we need more context-specific tests is an invitation to us all to become more critically involved. We should not accept poor test constructs, and this needs to be heard by the various stakeholders involved. Being aware of the shortcomings of some of the 'imperfect' (not to say 'strange') tests in use should help us all reach for the 'ideal' test that we would like to use or create. The questions asked in this project, and in the literature, could help us all to cast a critical eye over what we are doing and aim to do better.
References
Alderson, J. C. (2000). Assessing Reading. Cambridge: Cambridge University Press.
Alderson, J. C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L. F. & Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Bachman, L. F. & Palmer, A. S. (2010). Language Assessment in Practice. Oxford: Oxford University Press.
Buck, G. (2001). Assessing Listening. Cambridge: Cambridge University Press.
Cushing Weigle, S. (2002). Assessing Writing. Cambridge: Cambridge University Press.
Douglas, D. (2010). Understanding Language Testing. London: Hodder Education.
Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An Advanced Resource Book. Abingdon: Routledge.
Heaton, J. B. (1988). Writing English Language Tests (2nd ed.). London: Longman.
Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge: Cambridge University Press.
Küçük, F. & Walters, J. (2009). 'How good is your test?' ELT Journal, 63/4: 332-341.
Luoma, S. (2004). Assessing Speaking. Cambridge: Cambridge University Press.
McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.
Paran, A. & Sercu, L. (eds.) (2010). Testing the Untestable in Language Education (New Perspectives on Language and Education). Multilingual Matters.
Purpura, J. E. (2004). Assessing Grammar. Cambridge: Cambridge University Press.
Read, J. (2000). Assessing Vocabulary. Cambridge: Cambridge University Press.
Shohamy, E. (2001) The power of tests: A critical perspective of the uses of language tests. Harlow: Longman.
Shohamy, E. (2013). Expanding the construct of language testing with regards to language varieties and multilingualism. In D. Tsagari, S. Papadima-Sophocleous & S. Ioannou-Georgiou (eds.), International Experiences in Language Testing and Assessment: Selected Papers in Memory of Pavlos Pavlou (pp. 17-32). Frankfurt am Main: Peter Lang. (Language Testing and Evaluation Series, Vol. 28)
Tomlinson, B. & Masuhara, H. (2013). Adult coursebooks. ELT Journal, 67/2: 233-249.
Weir, C. J. (1993). Understanding and developing language tests. New York: Prentice Hall.
Weir, C. J. (2004). Language Testing and Validation: An Evidence-based Approach (Research and Practice in Applied Linguistics). Basingstoke: Palgrave Macmillan.