
Humanising Language Teaching
Year 3; Issue 4; July 2001

Ideas from the Corpora

My one's bigger than your one

Michael Rundell column

Dictionaries used to boast on their covers about how many words they defined. Nowadays they try to outbid each other in the size of the corpus resources at their disposal:

–Our corpus has 25 million words!

–That's nothing - ours has 100 million.

–Pathetic! Ours has 200 million.

And so on. The current "leader" is COBUILD's Bank of English – now well on the way to half a billion words. But should the dictionary-buyer be impressed by all this bragging? Or, to put it another way, does it actually matter whether a corpus contains 5 million words or 500 million?

In one sense these numbers have become somewhat irrelevant. Anyone – lexicographers included – can get access to the Internet these days, where many billions of words of text are waiting to be queried (including of course the contents of this website). And people in the dictionary business certainly make use of this material from time to time – for example in order to find more instances of a recently coined word or phrase that they have encountered somewhere else. Furthermore, some genres of text (most notably news journalism) are readily available in digital form: a CD-ROM of the whole of last year's Washington Post or Daily Telegraph can be bought relatively cheaply, and supplies enormous volumes of English text. So collecting large quantities of text (if you are not too fussy about what kind of text it is) is not especially difficult or impressive, and nowadays even an inexpensive entry-level PC will have enough hard-disk space to store many hundreds of millions of words.

A more interesting question, therefore, is not about the sheer quantity of text, but about its quality. A 100-million-word sample of the Wall Street Journal is no doubt a large, high-quality corpus – but only if the language you are interested in is contemporary financial journalism written in American English. It is well established that genre has a profound and often systematic influence on linguistic behaviour and vocabulary selection, so it is a fair bet that a corpus of financial journalism will be somewhat deficient in, for example, descriptive imaginative prose or the language used in fiction to describe intimate personal relationships. This, essentially, is the thinking behind corpus collections like the BNC and the forthcoming American National Corpus: to supply the raw materials for a wide-ranging description of English in all its many forms, the corpus must include examples of speech and writing from the broadest possible repertoire of text types.

But leaving aside this issue of "balance" (the term used by corpus linguists to characterize a well-formed and diverse collection of carefully selected texts), we come back to that question of numbers: is bigger necessarily better?

The short answer is yes. Back in the 1930s, the Harvard linguist G.K. Zipf discovered a strange and powerful fact about the way words are distributed in English and indeed in many other languages: essentially, a small number of words occur very very frequently, and a large number of words occur very very rarely. The seven most common words in written English (the, of, and, a, in, to, and it) make up almost 20% of every text you read, which is astonishing when you consider how many tens of thousands of English words are available. A corpus of one million words would probably have over 60,000 instances of the word the but is unlikely to include any of the following:

gastronomic, plagiarism, incoherent, reassuring, preach

… all of which have a frequency of well under one hit per million words, yet could hardly be described as obscure. (For more information on "Zipf's law", try http://linkage.rockefeller.edu/wli/zipf/).
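You can see this skewed distribution for yourself with a few lines of code. The sketch below (in Python; the file name is just a placeholder for any plain-text file you have to hand) counts word frequencies and prints the top of the list, where Zipf's pattern of rank times frequency staying roughly constant usually shows up clearly:

# A quick illustration of Zipf's skewed distribution: count the words in
# any plain-text file and print the most frequent ones, together with
# rank * frequency. The file name below is just a placeholder.
import re
from collections import Counter

with open("any_text_file.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
total = len(words)

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    share = 100 * freq / total
    # Under Zipf's law, rank * frequency stays roughly constant.
    print(f"{rank:2d}. {word:<8} {freq:7d}  ({share:4.1f}% of all tokens)  rank*freq = {rank * freq}")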

This asymmetric distribution of vocabulary in the language creates something of a problem for people writing dictionaries. In order to get enough information to give a reliable description of "rarer" words like incoherent, you need to have a very large corpus. But this in turn creates a serious case of data overload when it comes to looking at the more central vocabulary – you get so many instances of frequent words that it is almost impossible to handle them. Take for example the two words imbue and accept. The first is a rather rare word, but a good dictionary would aim to describe not only what it means, but what kinds of context it appears in: for example, what is it that things are typically imbued with, what sorts of things are imbued, and for that matter am I right in the assumption that this verb is mostly (or always?) used in the passive? There are just about enough instances of imbue in the 100-million-word BNC to provide a reasonable basis for answering these questions, but a smaller corpus than this would leave us relying on our own instincts to an unhealthy degree. However, in order to supply an adequate amount of data for imbue, the corpus has to be so large that it also supplies vast and unwieldy amounts of data for the more central vocabulary. Never mind the extreme case of the (which appears well over 6 million times in the BNC): words like accept, agree and begin weigh in with anywhere between 20,000 and 50,000 occurrences, which no normal person could analyze "manually" without going crazy.

How can we get around this? Fortunately, technology can help. For a while now there have been various types of software that can analyze large amounts of data and extract statistically significant facts from it. One well-known program measures the likelihood of two words occurring together within a particular span. According to what statisticians call the "null hypothesis", you can calculate the probability of Word A appearing in the same sentence as Word B simply on the basis of how frequent each word is (and therefore how likely it is that the two words will come together in a text at some point). Then, if the software finds that Word A and Word B appear together much more often than this null hypothesis suggests, the inference is that the two words are in some way closely associated. Suppose, for example, that the null hypothesis says commit and suicide "should" appear together once in every 500,000 words, but we find that they actually appear together once in every 50,000 words: then we have a pretty clear case of significant co-occurrence, and it is likely that we are looking at a collocation.
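To make the arithmetic concrete, here is a toy version of that calculation in Python. All the figures are invented for the purpose of illustration (they are not real corpus counts), and real collocation software uses more refined statistics such as mutual information or log-likelihood:

# A toy illustration of the null-hypothesis comparison described above.
# All figures are hypothetical, not real corpus counts.

corpus_size = 100_000_000        # total words in the corpus
freq_commit = 20_000             # occurrences of "commit"  (hypothetical)
freq_suicide = 10_000            # occurrences of "suicide" (hypothetical)
observed_together = 2_000        # windows containing both  (hypothetical)
window = 10                      # span (in words) we count as "together"

# Expected co-occurrences if the two words were independent:
# the chance of each word turning up at any given position, scaled up
# to the number of word-windows in the corpus.
p_commit = freq_commit / corpus_size
p_suicide = freq_suicide / corpus_size
expected_together = p_commit * p_suicide * window * corpus_size

ratio = observed_together / expected_together
print(f"expected: {expected_together:.1f}, observed: {observed_together}")
print(f"observed/expected ratio: {ratio:.0f}x")  # a large ratio suggests a collocation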

This can be quite useful for dictionary people, who can look at lists of automatically generated collocates and get an idea of which are the most important. But the results are often messy and unreliable, and take no account of the grammatical function of the words that are identified as significant collocates: they may be subjects or objects of a verb, or adverbs or prepositions that often go with it. The next exciting stage in this process is already well advanced, however, as computer scientists develop more sophisticated ways of analyzing large corpora. A very impressive example of this technology is what are called "Word Sketches", which analyze corpus data in a variety of ways and end up with a detailed description of the way a word behaves. A typical Word Sketch for a verb, for example, will generate separate lists (each arranged in order of significance) showing the typical subjects of the verb, its most frequent objects, the adverbs that most commonly modify it, and much else besides. You can find examples of these on the website of my colleague Adam Kilgarriff, who developed this approach (go to: http://www.itri.bton.ac.uk/~Adam.Kilgarriff/wordsketches.html).
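As a rough illustration of the general idea (not of the actual Word Sketch software, whose grammatical rules are its own), the Python sketch below pulls out verb-object pairs for a given noun from a small sample of text, assuming the spaCy library and its small English model are installed:

# A minimal sketch: count the verbs whose direct object is "conversation",
# using spaCy's dependency parse. Assumes:
#   pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def verb_object_collocates(texts, noun="conversation"):
    """Count the verbs that govern `noun` as a direct object."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        for token in doc:
            if token.dep_ == "dobj" and token.lemma_ == noun:
                counts[token.head.lemma_] += 1
    return counts

sample = [
    "She overheard a conversation in the corridor.",
    "He tried to steer the conversation towards money.",
]
print(verb_object_collocates(sample).most_common(10))

Run over a whole corpus rather than two invented sentences, counts of this kind are the raw material for lists like the ones that follow.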

But to give a flavour, here is just a small extract of the information in the Word Sketch for the noun conversation (the figures show how many times each word occurs with conversation in the BNC).

Transitive verbs whose object is conversation

overhear 33
steer 25
record 46
tap 16
resume 14
hold 63
interrupt 14
continue 42
prolong 8
make 120
hear 38
conduct 13
finish 19
end 20
start 34
remember 27
recall 12
dominate 10

Words that occur in the position WORD + prep + conversation

topic 93
snatch 16
lull 14
deep 34
listen 57
eavesdrop 9
engage 30
hum 11
buzz 10
babble 7
gist 6
murmur 7
fragment 11

Adjectives that typically modify conversation

polite 32
casual 36
animated 18
everyday 28
private 64
whispered 11
desultory 9
face-to-face 10
intimate 15
normal 40
earnest 12
overheard 5
murmured 6
interesting 29
after-dinner 6
informal 16

This is just the start of what promises to be an interesting new phase in corpus exploration. As this approach develops, we should start to hear less macho posturing about how big everyone's corpus is, and a little more about which smart techniques people are using to extract useful and relevant information.

***********************

Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types. He is now a freelance consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the "Lexicography MasterClass", whose training activities include an intensive workshop in corpus lexicography at the University of Brighton (www.itri.brighton.ac.uk/lexicom). He is also Editor-in-Chief of the forthcoming Macmillan English Dictionary for Advanced Learners.

