HLT Magazine, May 02: Ideas from the Corpora

Humanising Language Teaching
Year 4; Issue 3; May 02

Metatarsal mania grips England – the value (and limitations) of a news corpus

Mike Rundell

There is an old saying about London buses, that you wait for one for ages then three come along at once. Almost the same thing happened last month with metatarsal bone injuries to footballers' feet – there has suddenly been a rash of them but none, of course, more "tragic" (to use the sports commentators' favourite adjective) than that to the left foot of England's favourite player, David Beckham. And now I notice that – aaargh! – there is no entry for metatarsal bone in my new dictionary. This is just one of those things over which one has no control. The word in question has lurked quietly for years in the specialized discourse of the medical profession, with no better claim for inclusion in a learners' dictionary than any other "minor" bone (such as the ilium or ischium). Then a freak accident propels it into the limelight, and half the British population are now experts on the functions of this particular bone and on the treatment of metatarsal injuries. As one headline writer put it: "Metatarsal mania grips England".

Apparently something similar happened, at the turn of the last century, with the word appendicitis. King Edward VII was rushed to hospital to have his appendix removed, and the word was suddenly on everyone's lips. But the great Oxford English Dictionary, to the embarrassment of its publishers, contained no mention of it. The word had been around for twenty years or so, but the dictionary's Editor, James Murray, had been foolishly advised by the powers in Oxford to include only those words that were mentioned in "reputable" sources (i.e. not newspapers or scientific journals). Following this debacle, the OED's elitist policy towards its source material was quickly abandoned.

All of which illustrates some of the issues associated with using newspaper text as a source of linguistic data. As corpus-developers have long been aware, journalistic text alone is not a reliable guide either to word frequency or, more generally, to the way words behave in the language as a whole. For one thing, journalism has its own special, somewhat archaic dialect (Actor's love child slams errant dad, Soccer supremo in four-in-a-bed romp, etc.). But an even more serious weakness is the fact that the subject matter of newspapers is to a large extent determined by events (many of them quite random), and this can give a very distorted picture of the relative centrality and importance of particular items of vocabulary. To give an obvious example, an English newspaper corpus collected in the second half of September last year would probably indicate a tenfold (maybe even a hundredfold) increase in the frequency of the word bin – but in most cases this would not reflect a sudden upsurge of interest in garbage containers, but the temporary ubiquity of the word that comes between "Osama" and "Laden". As a result of specific events, words like this can gain a profile out of all proportion to their "importance" – if we define importance in terms of a word's relative frequency and widespread "dispersion" through a range of text-types over an extended period.
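
That working definition of "importance" (relative frequency plus dispersion across text-types) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `per_million` and `text_type_range` are hypothetical, the corpus is assumed to be a non-empty dictionary mapping each text-type to a non-empty list of tokens, and "dispersion" is reduced here to the crudest possible measure, the share of text-types in which the word appears at all.

```python
from collections import Counter


def per_million(word, tokens):
    """Relative frequency of `word`, normalized to occurrences per million tokens."""
    return 1_000_000 * Counter(tokens)[word] / len(tokens)


def text_type_range(word, corpus):
    """Share of text-types (0.0 to 1.0) in which `word` occurs at least once.

    `corpus` maps a text-type name to a list of tokens (an assumed layout).
    """
    hits = sum(1 for tokens in corpus.values() if word in tokens)
    return hits / len(corpus)
```

On this view, a word that scores very high on `per_million` in news text but has a low `text_type_range` across a balanced corpus is a candidate for the kind of event-driven distortion described above.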

But while recognizing the limitations of news text as a data source for linguists and dictionary writers, one could look at this another way: a news-based corpus can also be a revealing record of what is going on and what is important in a society at any given time. This thought is prompted by what I learned at a recent workshop on the development of the "SLICE" Corpus. SLICE is the Sri Lankan English component of the ICE corpus collection. To quote from its website http://www.hku.hk/english/research/ice/index.htm, ICE (the International Corpus of English) is a global initiative that started in 1990 "with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English". The objective is to collect, in each participating country or region, a million words of spoken and written text from a range of sources, as a basis for research into the syntactic, lexical, discourse and phonological features of that specific type of English. And because the design of each corpus follows a common structure, we will eventually have a good basis for making comparisons across all the main varieties of English. Some of the ICE corpora, including those for Australian, East African, Singaporean, and Indian English, are already complete, with the rest at various stages of development. You can, by the way, download sound files for some of these – so for example you can hear people speaking Indian English or Australian English.

The Sri Lankan project is still at an early stage, but is progressing with great enthusiasm. So far, the team has collected around 200,000 words, and at this stage a high proportion of the material comes from Sri Lanka's English-language press – though this imbalance will be redressed as text-collection continues. My friend Chris Tribble, who organized the SLICE workshop, has made some interesting discoveries about the kinds of vocabulary that appear with unexpected frequency in Sri Lankan newspapers. The excellent WordSmith Tools corpus-querying software, available from Oxford University Press (http://www1.oup.co.uk/.../download.html), has a function called "Keywords", which identifies any words in a text (or set of texts) that occur with greater-than-usual frequency. First you have to establish what "expected frequency" is, but this is easy enough: you can use the word-frequency information from a large general corpus like the BNC as a benchmark or "control" list, against which you can then assess frequency data from a smaller, more specialized corpus. The Keywords program compares wordlists from the two sources (control list, and corpus being studied) and spits out any items that occur significantly more often in the latter. These are the "keywords" and they can be a helpful way of characterizing a text. (This makes the program a valuable tool for genre analysis – which must have some applications in language-teaching.)
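
For readers who like to see the mechanics, here is a minimal sketch in Python of the kind of comparison just described. It is not WordSmith's own code, and the article does not say which statistic the program uses: the log-likelihood measure, the function names `log_likelihood` and `keywords`, and the 6.63 cut-off (the p < 0.01 level for chi-square with one degree of freedom) are all assumptions chosen for illustration. Both corpora are assumed to arrive as plain lists of tokens.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for a word seen `a` times in a study corpus of
    `c` tokens and `b` times in a control corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the control corpus
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def keywords(study_tokens, control_tokens, threshold=6.63):
    """Return (word, score) pairs over-represented in the study corpus,
    highest score first. The threshold is an assumed significance cut-off."""
    study = Counter(study_tokens)
    control = Counter(control_tokens)
    c, d = sum(study.values()), sum(control.values())
    scored = []
    for word, a in study.items():
        b = control[word]
        # Keep only words that are relatively MORE frequent in the study corpus.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `keywords(study_tokens, control_tokens)` with a tokenized news sample and a tokenized general corpus would give a ranked list of candidate keywords. A real tool would also need sensible tokenization and a minimum-frequency filter, both of which are left out here.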

What do we find in the Sri Lankan data? Some of the unusually frequent vocabulary reflects social and cultural features that have special importance in that particular country: words like Sinhalese, Buddhist, Tamil and island come high in the list, and it is a reasonable prediction that all these words will continue to figure highly once the full SLICE corpus has been collected. But some of the other "keywords" in Sri Lankan English newspapers give us clues about the island's current preoccupations: the unusually high frequency of words like peace, peacemaker, security, and terrorist is a poignant reminder of the efforts now being made to bring an end to a decade or more of bloody civil war. (Michael Ondaatje's novel, Anil's Ghost, gives a chilling account of what the island suffered during the '80s and '90s.) Even more revealingly, we find that one of the keywords here is suicide, which appears far more often in the Sri Lankan data than in the BNC. Further investigation shows that, while the main collocates of this word in a general English corpus are items like commit, rate, attempt and pact, in the Sri Lankan news corpus suicide collocates more often with words like mission, bomber, and – most gruesomely – jacket.
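
The collocate comparison can be sketched in the same spirit. The snippet below simply counts which words fall within a few tokens of a chosen "node" word. The function name `collocates` and the default window of four words either side are assumptions for illustration, not a description of how WordSmith Tools or any other package computes collocation, and serious work would rank the results with an association measure rather than raw counts.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count the words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts
```

Running `collocates(tokens, "suicide").most_common(10)` on each corpus in turn and comparing the two lists is the quickest way to see the kind of contrast described above.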

Newspaper text is probably the easiest type of corpus data to get hold of. Its limitations as a basis for general descriptions of the language (as in a dictionary or reference grammar) are well known. But it is well worth investigating as a genre in its own right, and it can provide a fascinating and revealing snapshot of the cultural and political issues that have high currency at any given moment – whether these are as trivial as the injury to David Beckham's foot, or as serious as the efforts to end the war in Sri Lanka.

Michael Rundell (michael.rundell@dial.pipex.com) is a lexicographer who has been using corpora since the early 1980s and has been involved in the design and development of corpora of various types. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English and the Longman Language Activator. He is now an independent consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the Lexicography MasterClass, whose activities include workshops in corpus lexicography (www.itri.brighton.ac.uk/lexicom) and a new MSc course in Lexical Computing and Lexicography (www.itri.brighton.ac.uk/MscLex). He is also Editor-in-Chief of the new Macmillan English Dictionary for Advanced Learners (www.macmillandictionary.com).