Humanising Language Teaching
Year 1; Issue 2; April 1999

Ideas from the Corpora

THE MICHAEL RUNDELL COLUMN

It's no coincidence that two of the great 19th-century lexicographers, the philologists Jakob and Wilhelm Grimm, were also writers of fairy tales. Even now, though dictionary-writers can access huge volumes of linguistic data from computerized databases, the job of describing a language involves a substantial element of storytelling - of first analyzing and then repackaging the information found in a corpus or in any other source of authentic language. Not for nothing is this material called "raw data", and the process of converting linguistic evidence into appropriate learning resources is quite a complex one. The data from corpora is merely the starting point: it has to be interpreted and its implications explored, its relevance to a particular user-group must be assessed, and then (the really hard work) the "story" has to be presented in ways that those users can readily understand.

Lexicographers have been using corpus data for almost 20 years, but they are not, of course, the only people whose job entails describing language: anyone involved in any aspect of the language-teaching business spends at least some of their time trying to explain the language they teach. People in the dictionary trade therefore find it mystifying that - leaving aside a few pockets of genuine interest and real expertise - the "mainstream" ELT profession has been rather slow to embrace corpus linguistics as a major feeder discipline. What I want to do in this column is try to convince the sceptical of the enormous value of corpus data for teachers and learners of English. Over the coming months we'll be looking both at practical issues (what is a corpus, what can it tell you that you couldn't work out for yourself, how can you get hold of one, and how might you use it for teaching?) and at questions of a more philosophical nature, such as what the status of corpus data really is (holy writ, or just one form of evidence among many?) and how far the process of language analysis can - or should - be left to machines.

First, for the benefit of anyone who hasn't yet strayed into this electronic world, we need to define the basic terms. Who better to ask than John Sinclair, one of the most influential figures in modern corpus linguistics? Sinclair's definition of a corpus is "a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language". Add to this the fact that the texts making up a contemporary corpus are stored in digital form and can therefore be processed and analyzed using a range of increasingly sophisticated software tools (of which more later…). Corpus linguistics is a growth area, and there are several good online introductions to the field, including a freely available course from the excellent team at the University of Lancaster and a tutorial from Cathy Ball at Georgetown University. If you want to try searching corpora from your own computer, the British National Corpus (BNC) and the Bank of English both offer a free online trial.
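To give readers who haven't yet strayed into this world a concrete sense of what those "software tools" actually do, here is a minimal sketch (my illustration, not a tool mentioned in the article) of the most basic corpus operation of all: a key-word-in-context (KWIC) concordance, which lists every occurrence of a search word together with a window of surrounding context.

```python
def kwic(tokens, keyword, window=3):
    """Return (left-context, keyword, right-context) tuples for
    every occurrence of `keyword` in a tokenized corpus."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# A toy "corpus" - real concordancers run the same idea over
# millions of words of text.
corpus = "the cat sat on the mat and the dog sat by the door".split()
for left, kw, right in kwic(corpus, "sat"):
    print(f"{left:>20}  [{kw}]  {right}")
```

Concordancing software used with corpora like the BNC is, at heart, an industrial-strength version of this loop, with indexing added so that searches over millions of words return instantly.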

Four or five years ago we first started hearing about the CANCODE corpus of spoken English being assembled at Nottingham University. CANCODE is now a respectably large 5-million-word databank, but it started life more modestly, and what was so impressive in its early days was the way that Mike McCarthy or Ron Carter could take a tiny fragment of dialogue and derive from it often quite profound insights into interpersonal relations and the way these are encoded lexically or grammatically. This is corpus exploration at the microcosmic level. At the other end of the scale, people working in AI (Artificial Intelligence) use statistical programs to analyze vast corpora of a billion words or more, an approach that yields information (for example, about syntactic and collocational behaviour) which would be impossible to get at "manually". There is an obvious analogy here with the way that scientists try to explain the universe, either by looking at subatomic particles or by studying unimaginably large and distant objects. We will be doing a bit of both in the coming months, so watch this space.
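The collocational analysis mentioned above can be sketched very simply. The following toy example (mine, not from any of the projects named here) counts the words that co-occur with a "node" word within a fixed span - the raw counts that large-scale statistical collocation measures are built on.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within +/- `span` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            for j in range(lo, hi):
                if j != i:                  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Tiny illustration of the classic strong/powerful contrast:
tokens = "strong tea and strong coffee but powerful engine".split()
print(collocates(tokens, "strong").most_common(3))
```

Run over a billion-word corpus rather than eight tokens, counts like these (combined with statistics such as mutual information) reveal patterns - that we say "strong tea" but "powerful engine" - which no amount of manual reading could establish reliably.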


Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins) runs the "Lexicography MasterClass", providing training courses in all aspects of dictionary development and dictionary use (see http://ds.dial.pipex.com/town/lane/ae345).
