How the Cambridge Learner Corpus helps with Materials Writing
Felicity O'Dell, Cambridge
As someone who writes materials for Cambridge University Press, I am able to use their enormous English language corpus. This continuously expanding corpus is, I believe, already the largest in existence as it currently has roughly 700 million words of written English and 40 million words of spoken English. It comes to users together with CUP's in-house concordancing software to enable lexicographers, academics and materials writers to process this mass of words in the ways they would like to. One of the most unusual and most powerful features of CUP's corpus is the fact that there is also a separate corpus of about 20 million words of learner English based on student scripts from Cambridge ESOL examinations. The number of words in this corpus - known as the Cambridge Learner Corpus (CLC) - also grows considerably each year.
The CLC is invaluable for the materials writer. Even if we teach multilingual classes, we inevitably have a restricted knowledge of the kind of errors that learners are likely to come up with. My own classroom experience has been predominantly with students whose first languages are Romance, Germanic or Slavonic. So I am used to the overuse of the present perfect and to the way 'sympathetic' is used to mean 'nice'. I am much less able to anticipate the things that, say, Chinese or Arabic learners are likely to find difficult. The learner corpus provides me with examples of language from students with a hundred different L1s in one hundred and fifty different countries. There are examples of learner English at all levels as extended pieces of student writing from KET, PET, FCE, CAE, CPE, BEC, IELTS and CELS - how Cambridge ESOL love acronyms! - have all been used.
So, how is it all done? The scripts are word-processed by a team at CUP. These keyers have to be very accurate as well as fast! They must make sure they key the mistakes the candidates have made and don't correct the errors. For example, when a student has written 'finaly' the keyer must not type 'finally'! Quite difficult sometimes! Then the computer files with the scripts are taken by a specialist team who deal with error coding. They find and code all the errors according to type - wrong preposition, unnecessary verb, missing determiner, for example - and also indicate how the error should be corrected. The people who do this are linguists with TEFL experience and they need to be highly trained to do this work.
The coding is, of course, crucial and one of the most complex aspects of developing CLC. The codes are a set of nearly a hundred acronyms such as MP for missing punctuation or W for word order error. Much of the classification hinges on F (standing for wrong Form used), M (something is missing), R (a word or phrase needs replacing) and U (an unnecessary word or phrase has been used). To these basic F, M, R, U codes are added codes which indicate the part of speech involved. When the person doing the error coding comes across an error, they must insert the error code followed by the wrong word followed by the corrected word followed by the error code again. So if they are typing the sentence 'It depends of the weather', they will actually type:
It depends RP of/|on RP the weather.
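To make the scheme concrete, here is a minimal Python sketch of how a sentence annotated in this way could be read back by a program. It assumes the layout exactly as printed in the example above (error code, wrong form, the separator /|, corrected form, error code again) and handles only single-word replacements. The function name parse_coded and the pattern are illustrative assumptions and are not taken from CUP's own software.

```python
import re

# Assumed layout, copied from the example sentence: CODE wrong/|right CODE
# The code is a run of capital letters; both forms are single words here.
CODED = re.compile(r"\b([A-Z]+) (\S+)/\|(\S+) \1\b")

def parse_coded(sentence):
    """Return (errors, original, corrected) for one annotated sentence."""
    errors = [
        {"code": m.group(1), "wrong": m.group(2), "right": m.group(3)}
        for m in CODED.finditer(sentence)
    ]
    # Keep the learner's own word to recover what was actually written.
    original = CODED.sub(lambda m: m.group(2), sentence)
    # Keep the coder's correction to recover the intended sentence.
    corrected = CODED.sub(lambda m: m.group(3), sentence)
    return errors, original, corrected

errors, original, corrected = parse_coded("It depends RP of/|on RP the weather.")
print(errors)     # the coded error with its wrong and corrected forms
print(original)   # It depends of the weather.
print(corrected)  # It depends on the weather.
```

Storing the wrong and the corrected form side by side is what allows a search either on what learners wrote or on what they should have written.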
If you ever feel fed up with the pile of marking that you have to do, just imagine what it must be like to spend your days preparing texts in this way for the Cambridge Learner Corpus! I hope that the people who do this very concentrated and, I would imagine, often tedious task are at least aware of how enormously useful what they are doing is for those who will ultimately use the corpus.
Each student text is, of course, anonymous but basic data about the writer is recorded - L1, of course, nationality, gender, age, educational background, the exam session from which the script comes. Having found a page of cites that you are interested in, you can find out all these biographical details about the original writer of any individual citation. You are also able to double click on any one line of text you have found in order to be able to see it in its bigger context.
How, then, can all this wealth of information be used by the materials writer? I myself have used it for preparing materials for a CAE course and for a CAE writing skills book. I select just the CAE part of the learner corpus and I find what errors are typical for exam candidates at this level. I can identify, say, the twenty most frequent spelling errors at this level and focus on those rather than wasting time dealing with spelling errors that may be common at FCE level but have been mastered by CAE.
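The kind of query described here, restricting the corpus to one exam and ranking one type of error by frequency, can be sketched in a few lines of Python. This is only an illustration of the idea: the record layout, the function name top_errors, the use of "S" as a spelling code and every sample row except 'finaly' are invented for the example and do not come from the CLC itself.

```python
from collections import Counter

# Hypothetical records: (exam, error code, wrong form, corrected form).
# "S" as the spelling code and these rows are assumptions for illustration.
records = [
    ("CAE", "S", "accomodation", "accommodation"),
    ("CAE", "S", "finaly", "finally"),
    ("CAE", "S", "accomodation", "accommodation"),
    ("FCE", "S", "wich", "which"),
    ("CAE", "RP", "of", "on"),
    ("CAE", "S", "accomodation", "accommodation"),
    ("CAE", "S", "finaly", "finally"),
    ("CAE", "S", "recieve", "receive"),
]

def top_errors(records, exam, code, n=20):
    """Rank (wrong, corrected) pairs for one exam and one error code."""
    counts = Counter(
        (wrong, right)
        for rec_exam, rec_code, wrong, right in records
        if rec_exam == exam and rec_code == code
    )
    return counts.most_common(n)

# The two most frequent CAE spelling errors in the sample data.
for (wrong, right), count in top_errors(records, "CAE", "S", n=2):
    print(wrong, "->", right, count)
```

Changing the exam argument to "FCE" would give the equivalent list for the lower level, which is how errors already mastered by CAE could be set aside.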
I know from my own experience as a teacher with students at this level that CAE students need to work on verbs and on prepositions; so I can use the corpus to search on the code for, say, Incorrect tense of verb or Replace preposition and I can see exactly which aspects of tense and which examples of preposition use it will be most useful to address for students working towards CAE.
I may suspect that candidates at CAE level have difficulties with certain lexical items - with 'eventually', say, or with 'nature' or with 'travel'. I can use the corpus to search on these words and to see whether candidates are generally using them correctly or incorrectly - in fact, many do turn out to have problems with these words. The corpus lets me see exactly what kinds of mistakes students are making with them. This allows me - I hope - to choose examples for my materials that have more authenticity and relevance than would otherwise be the case.
The CLC is based on exam scripts but it is, of course, not just useful when working on exam preparation materials. When preparing, say, advanced vocabulary materials, I turn on the sections of CLC that relate to all the advanced exams - CAE, CPE, IELTS and the top levels of BEC and CELS - and then search on those words that I am interested in presenting and practising. Do learners at this level already use these words freely? If so, there may be no point in dealing with them in my materials. If not, what kinds of mistakes do they tend to make with these words? This information helps me to focus tasks more precisely.
The other way in which I personally have found the Cambridge Learner Corpus particularly useful is when giving a talk or seminar to a teachers' group in one specific country. I can select the L1 I am particularly interested in - Japanese, say, or Czech - and can then see what kinds of errors are most typical for these learners either across the whole range of levels or for one specific level as preferred. On the whole, my experience is that the information provided by the corpus here simply backs up teachers' own intuitions and knowledge about what errors learners of a particular L1 tend to make; nevertheless, teachers appreciate the reassurance provided by the corpus's more extensive and rigorous approach.
The Cambridge Learner Corpus also has the potential to do a range of wonderful and slightly weird things which I personally have not yet been able to make practical use of. It can check whether male or female learners are more likely to use a particular word or make a specific error. It can compare the typical errors of 15-year-olds with those of 23-year-olds. It can show how errors made in 1999 differ from those made in 2001. If anyone can think of a purposeful way of exploiting these features, do please let me know.
A learner corpus is a wonderful resource for materials writers. However, if you do not have the good fortune to have direct access to this material, there is no need to feel deprived. Much of the practically useful information to be gained from it has been analysed for us and is available to everyone through modern learners' dictionaries, grammars and other study materials. It is fun playing with CLC's many different tools and seeing what you can discover. But it does have the potential to be almost as time-wasting to have on your computer as Solitaire or DX-Ball. It is much more appealing to see what the most frequent errors for Albanian learners are than to get on with writing that next exercise. As a user, my only real frustration with the CLC is that it is not available for use on a Mac.