 
Humanising Language Teaching
SHORT ARTICLES

How the Knowledge Base in Language Assessment is Measured: A Reverse Engineering Approach

Kioumars Razavipour and Tahereh Firoozi, Iran

Kioumars Razavipour is an assistant professor of Applied Linguistics at Shahid Chamran University of Ahvaz, Iran. His main research area of interest is language assessment. E-mail: razavipur57@gmail.com

Tahereh Firoozi is an M.A. student of TEFL at Valiasr University of Rafsanjan. Her primary research interest is language testing. E-mail: firoozi.chamran@gmail.com


Abstract

While language testers are still in search of appropriate mechanisms for assessing language abilities, a new concern has emerged: the kind of knowledge and skills that language testers themselves should possess in order to assess test takers' communicative competence. This concern with language assessment literacy (LAL) has recently generated heated debate in the community. To contribute to this debate, we analyzed several retired versions of a test intended to measure the LAL of M.A. TEFL candidates seeking entry to Iranian tertiary institutions. Results show that the test is designed in a predominantly psychometric spirit, with heavy emphasis on standardized testing, standard practice in test construction and design, and reliability and validity issues within a componential conceptualization of test validation. As such, the test pays little attention to performance and classroom language assessment. Also missing from the test content is adequate coverage of the principles of language assessment. The test therefore appears to suffer from construct underrepresentation. We suggest that, to be more valid for its target audience and intended purpose, the test must undergo serious changes in its content coverage so as to better represent the construct of language assessment literacy.

Introduction

It is commonplace to point to the importance of assessment in education (Alderson, 2005; Popham, 2006, among others). Anyone directly involved with education is aware of the power that tests exert over the quantity and quality of education (see Alderson and Wall, 1993; Wall, 1996). While most research studies seem to have focused on the adverse impacts of assessment on education (see the special issue of Language Testing, 1996), it is perfectly possible, and crucial, for assessment to be utilized for the betterment of education. In fact, good education cannot take place without good assessment (see Popham, 2009). There is even evidence that, under otherwise ideal conditions, implementing good assessments can spark positive changes in bad teaching (Davies, 1990).

Implementing good assessments, however, does not come naturally as a byproduct of teaching. It takes expertise, reflection, and constant professional development. It also requires knowledge and awareness of the principles and procedures of assessment. This knowledge of the fundamentals of assessment has come to be known as assessment literacy (AL) in general education and as language assessment literacy (LAL) in the field of language testing (see Inbar-Lourie, 2008, 2013).

The last decade has witnessed a gradual move toward recognizing the critical role that LAL plays in language teaching and testing practices. A number of studies have been conducted since Brindley's (2001) pioneering work, including a recent special issue of Language Testing, guest-edited by Inbar-Lourie (2013). Despite this, compared to many other areas of language education, the literature on language assessment literacy remains disproportionate to its vital importance in language education and assessment. Furthermore, despite attempts at diffusing assessment literacy among professionals and teachers, test takers seem to have been more or less overlooked (Watanabe, 2011). To our knowledge, no study on LAL has been carried out that is based on a test specifically designed to measure examinees' knowledge base in language assessment. This study examines the content structure of a high-stakes test of language assessment literacy with the aim of discovering its underlying constructs and domains. To this end, the present study seeks to address the following questions:

    1. How is the LAL construct operationalized in the LAL test?
    2. Which content domains of the LAL construct are given priority, and which areas are underrepresented in the LAL test?

Method

Whether the test we have called a LAL test is designed from a test specification remains unclear, given our lack of access to the test designers or to the institution in charge of designing and administering the test. We therefore adopted a reverse engineering approach to uncover the content structure of the test. According to Davidson and Lynch (2002), reverse engineering is a mechanism by which the components of a test specification are reconstructed from the actual test. It should be admitted that not all constituents of a test specification can be recovered through our analysis. Rather, we are mainly concerned with arriving at the underlying constructs that inform item and task writing. The materials used in our reverse engineering approach were a sample of previous test papers given annually by the Organization of Educational Measurement (OEM), locally known as Sazeman Sanjesh. Failing to derive a coherent framework from an inductive analysis of the test papers, we chose to use an existing framework from the field, namely the one designed by Fulcher (2012). Fulcher used a factor analytic approach to find the underlying factors in data collected through a survey administered to language testing instructors across the globe. Subjecting the data to factor analysis, Fulcher concluded that language assessment literacy entails the following components:

  1. Test design and development (TDD)
  2. Large-scale standardized testing (LST)
  3. Classroom testing and washback (CTW)
  4. Validity and reliability (VR)

For the purpose of this study, seven retired versions of the LAL test were selected for analysis. The criterion for including this set of test papers was simply that they were the ones the researchers could access. However, the conveniently chosen sample happened to cover a wide range of test papers, spanning more than a decade of the test's use. The tests were analyzed using the framework mentioned above. More specifically, each test item was assigned to one of the four major categories in the framework (see attachment A). To reduce the subjectivity involved in deciding which test item goes into which category, a second rater was invited to carry out the content analysis. The kappa coefficient showed a moderately high level of agreement between the two raters.
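For readers unfamiliar with the statistic, the sketch below shows how Cohen's kappa can be computed for two raters who each assign test items to the four framework categories. The item codings are invented for illustration; this is not the coding data from the study.

```python
# Illustrative sketch only: Cohen's kappa for two raters categorizing items.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two equal-length lists of category labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labelled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance agreement from each rater's marginal distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codings of ten items into the four framework categories.
rater_1 = ["LST", "LST", "VR", "TDD", "VR", "LST", "CTW", "VR", "TDD", "LST"]
rater_2 = ["LST", "VR",  "VR", "TDD", "VR", "LST", "CTW", "VR", "TDD", "LST"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # -> 0.86 for these invented codings
```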

Among the few frameworks proposed recently to uncover the constituents of LAL for various groups of stakeholders (e.g., Davies, 2008; Inbar-Lourie, 2008; Brindley, 2001), the one by Fulcher (2012) was preferred because 1) it derives its categories from a factor analysis based on a survey of a relatively large population of participants, 2) it is elaborate and comprehensive enough to yield categories that tap various areas of language assessment knowledge in a good degree of detail, and 3) it was aimed at identifying those areas of LAL most immediately needed by classroom teachers.
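To illustrate the kind of analysis underlying Fulcher's framework, the following sketch runs an exploratory factor analysis on simulated Likert-type survey responses. The sample size, item count, and data are invented; this is not a reproduction of Fulcher's (2012) survey or results.

```python
# Illustrative sketch only: exploratory factor analysis on simulated survey data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents, n_survey_items = 278, 20  # invented figures

# Simulate responses driven by four latent traits (loosely: TDD, LST, CTW, VR).
latent = rng.normal(size=(n_respondents, 4))
loadings = rng.normal(scale=0.8, size=(4, n_survey_items))
responses = latent @ loadings + rng.normal(scale=0.5, size=(n_respondents, n_survey_items))

# Extract four factors with a varimax rotation (requires scikit-learn >= 0.24).
fa = FactorAnalysis(n_components=4, rotation="varimax")
fa.fit(responses)

# Inspect which survey items load most strongly on each extracted factor.
for k, factor in enumerate(fa.components_):
    top_items = np.argsort(np.abs(factor))[::-1][:3]
    print(f"Factor {k + 1}: strongest survey items {top_items.tolist()}")
```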

Results

It was indicated above that the content analysis of the test papers was based on a framework developed by Fulcher (2012). The framework, arrived at through a factor analysis of a survey given to language testing instructors across the globe, consists of the following categories:

  1. Test design and development (TDD)
  2. Large-scale standardized testing (LST)
  3. Classroom testing and washback (CTW)
  4. Validity and reliability (VR)

Table 1 gives the number of test items in each of the four categories along with items which did not fit any of the above categories.

Table 1. Number of items allocated to LAL Test modules.

Year     Number of items    LST     VR      TDD     CTW      Other
2000     10                 4       4       1       1        0
2001     10                 2       4       3       1        0
2002     10                 3       3       3       0        1
2003     10                 4       2       4       0        0
2004     20                 7       3       6       1        3
2010     15                 4       5       2       1        3
2011     15                 6       4       2       1        2
Total    90                 30      25      21      5        9
                            (33%)   (28%)   (23%)   (5.5%)   (10%)

The above table shows that the area of language assessment most frequently tested is large-scale standardized testing, accounting for 33% of all the items in the test papers examined. The second area given priority in the test is validity and reliability, with 28% of all items in the sample papers. Test design and development comes third, with 23% of the items. The area least frequently tested is classroom testing and washback: out of a total of ninety test items in the pool, only five relate to this area of LAL. Finally, one tenth of the items could not be assigned to any of the categories in the framework and were placed in a separate category labelled 'other'.
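The category shares reported above follow directly from the counts in Table 1; the short check below recomputes them (the figures in the text are rounded).

```python
# Recomputing the category shares in Table 1 from the raw item counts.
counts = {"LST": 30, "VR": 25, "TDD": 21, "CTW": 5, "Other": 9}
total = sum(counts.values())  # 90 items across the seven test papers

for category, n in counts.items():
    print(f"{category}: {n} items ({n / total:.1%})")
# -> LST 33.3%, VR 27.8%, TDD 23.3%, CTW 5.6%, Other 10.0%
```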

The distribution of test items across the different modules indicates that the LAL test is not 'stakeholder-friendly', in the sense that it does not appear to be based on the immediate needs of its stakeholders. We observe that the largest number of items falls under LST, the part of the language assessment knowledge base least immediately needed by classroom teachers, all the more so in countries with a centralized educational system, where teachers are rarely, if ever, involved in large-scale testing. Had the test been based on a sound needs analysis of test takers, more items would have been allocated to language assessment in the classroom, criterion-referenced language testing, alternative assessments, and so on.

In terms of different schools of language testing, we find that all test items in the sample are founded on classical test theory. In fact, there is not a single item dealing with item response theory or Rasch models in the sample of tests reviewed. The importance of IRT in measurement cannot be overestimated, given that some scholars liken the centrality of Rasch models to measurement to that of quantum physics to the world of physics (McNamara, 1998). As McNamara asserts, knowledge of multi-faceted measurement "is now a standard part of the training of language testers" (2011, p. 436). Multiple arguments can be advanced against this complete absence of IRT from the test. One is that the test designers perhaps deem IRT principles and procedures, which require candidates to be more computer- and software-savvy, beyond the capability of the candidates. The counter-argument is that today's generation is in fact far more digitally literate than its predecessors, and that such tacit assumptions about their inability to cope with the demands of IRT are, at best, ill-founded.
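For readers unfamiliar with IRT, the minimal sketch below illustrates the basic dichotomous Rasch model referred to above, in which the probability of a correct response depends only on the difference between person ability and item difficulty; the ability and difficulty values used are purely illustrative.

```python
# Illustrative sketch of the dichotomous Rasch model: P(correct) depends only on
# the difference between person ability (theta) and item difficulty (b).
import math

def rasch_probability(theta: float, difficulty: float) -> float:
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# A candidate of average ability (theta = 0) facing easy, matched, and hard items.
for b in (-1.0, 0.0, 1.0):
    print(f"item difficulty {b:+.1f}: P(correct) = {rasch_probability(0.0, b):.2f}")
# -> 0.73, 0.50, 0.27
```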

Inbar-Lourie (2013b) recognizes that the ingredients of LAL are either generic, common to measurement in general education, or language-specific. Analyzing the data of the current study from this standpoint, we found that out of a total of 15 items in the LAL test, only three addressed language-specific issues; the remaining 12 could appear on a general test of assessment literacy in any other area of education. The three language-specific items were about the components of communicative competence, cloze elide, and rating oral performance. What proportion of a LAL test should go to the generic ingredient and what share should be reserved for the language-specific part remains an open question, as no consensus on the appropriate weight of each module yet exists in the literature. Much, of course, depends on the purpose, the population of test takers, their needs, and the local demands of the context of test use, if a test is to be 'stakeholder-friendly', as Brown (2008) puts it. Despite this lack of agreement on how the construct of LAL should be realized in its generic and language-specific components, we might still be on safe ground in arguing that allocating only one-fifth of a LAL test to those areas which relate exclusively to language assessment is a case of construct underrepresentation.

Final remarks

The analysis of the LAL test we examined reveals that, for its designers, the knowledge base required in language assessment is psychometric in nature, mainly concerned with high-stakes tests, with priority given to validity and reliability criteria in their traditional semantics (see Hathcoat, 2013), where a componential approach is taken toward test validity. This is evidenced by items related to large-scale standardized testing being the major concern of the examination.

Less tested are the assessment knowledge and skills required for conducting formative, informal, criterion-referenced assessments. Given that the target population of test takers are either already English teachers or would-be language teachers, more regard should be paid to classroom-based language assessment. This would in turn have a positive impact on the content of the language testing courses offered in undergraduate programs.

Finally, almost entirely absent from the test is a concern with the principles of language assessment, which, in the words of Davies, help guard the guards. A well-rounded education in language assessment requires that language testers be able to identify abuses of assessment and act responsibly to counter unethical practices. For language assessment education to be both efficient and ethical, a healthy balance of skills, knowledge, and principles is required (Davies, 2008).

Another implication of the findings of this study for language assessment relates to the impact that a high-stakes test of LAL has on the language testing courses offered in university language programs and on the test preparation courses directed at preparing candidates for the test. Construct underrepresentation in the design of the test results in a narrowing of the syllabi in both of the above-mentioned programs. Given the importance of the decisions commonly made on the basis of language test results, it is of prime importance that decision makers enjoy an adequate knowledge base for conducting fair language assessments.

This study was based on a content analysis of a sample of retired LAL tests, an approach which has its own shortcomings, the most important of which is that the construct validity of the LAL test could not be examined. Carefully designed validation studies are needed to further investigate the construct validity of the test. Moreover, the impact of this test on language testing education remains uncharted territory that merits serious investigation.

References

Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London: Continuum.

Brindley, G. (2001). Language assessment and professional development. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara & K. O'Loughlin (Eds.), Experimenting with uncertainty: Essays in honor of Alan Davies (Vol. 11, pp. 137-143). Cambridge: Cambridge University Press.

Brown, J. D. (2008). Testing context analysis: Assessment is just another part of language curriculum development. Language Assessment Quarterly, 5(4), 275-312.

Davies, A. (2008). Textbook trends in teaching language testing. Language Testing, 25(3), 327-347.

Davidson, F., & Lynch, B. (2002). Testcraft: A teacher's guide to writing and using language test specifications. New Haven: Yale University Press.

Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment Quarterly, 9(2), 113-132.

Hathcoat, J. (2013). Validity semantics in educational and psychological assessment. Practical Assessment, Research, and Evaluation, 18(9), 1-14.

Inbar-Lourie, O. (2008). Constructing a language assessment knowledge base. Language Testing, 25(3), 385-402.

Inbar-Lourie, O. (2013b). Language assessment literacy: What are the ingredients? Plenary speech at the 4th CBLA SIG Symposium 'Language Assessment Literacy – LAL', November, Cyprus.

Kehoe, J. (1995). Basic item analysis for multiple-choice tests. ERIC Digest.

McNamara, T. (1998). Measuring second language performance. Oxford: Oxford University Press.

McNamara, T. (2011). Applied linguistics and measurement: A dialogue. Language Testing, 28(4), 435-440.

Popham, W. J. (2006). Needed: A dose of assessment literacy. Educational Leadership, 63, 84-85.

Popham, W. J. (2009). Assessment literacy for teachers: Faddish or fundamental? Theory Into Practice, 48, 4-11.

Watanabe, Y. (2011). Teaching a course in assessment literacy to test takers: Its rationale, procedure, and effectiveness. Research Notes, 46, 29-34.
