Humanising Language Teaching
Year 3; Issue 3; May 2001

Ideas from the Corpora

Things can only get fuzzier

Michael Rundell column

The Swedish scientist Carl von Linné, better known as Linnaeus, was the great taxonomist of the living world. Working a century or so before Darwin, he aimed to classify what he (and his contemporaries) perceived as a fixed biological universe. His work, therefore, entailed elaborating categories, into which plants and animals were neatly slotted. Since each living entity had been created by God, it followed that it could be assigned unambiguously to a particular category, and that membership of one category automatically precluded membership of any other. Two hundred and fifty years later, modern genetic science paints a rather fuzzier picture. In his fascinating update of On the Origin of Species, the geneticist Steve Jones shows how access to more detailed information, at the genetic level, has forced scientists to modify this idea of discrete, mutually exclusive groupings with clear-cut boundaries. In reality, things are not quite so simple: "species can, in the new world of molecules, no longer be seen as absolutes". They are not so much distinct units as rough groupings of individuals, each with its own unique attributes.

It is easy to see where this line of thought is headed with regard to human language. Anyone familiar with the work of Eleanor Rosch will remember that, as in the natural world, so in the world of language, there are birds and there are birds: some are "good" category members (the "prototypical" birds like – in northern Europe at least – the robin and the blackbird) but others are much less birdy, like the ostrich or the penguin. Prototype theory, and the notion of "degrees of category membership", was already well established long before the corpus revolution. But it's interesting to think of corpus data as being analogous to genetic code: both give us direct access to the raw data, and thus allow a "bottom-up" approach to analyzing reality and organizing it in some way – or rather (whether we are linguists or geneticists) to re-assessing established categories to see whether they really stand up to scrutiny. And unsurprisingly, many of them do not.

Take for example word-class (or "part-of-speech") labels. The basic categories are serviceable enough in many cases, and many words can be tagged unambiguously as nouns, verbs or adjectives. But there are plenty of exceptions. Take a word like nuisance. On the face of it, this looks like a good old countable noun, for example:

    "God that man's a nuisance," Fiona said, as they turned the corner.
    The new public school is a bastion of free thinking, with pupils seen as consumers not nuisances by school managements.

But what about the many cases where the referent is plural but the noun remains singular:

    Youngsters on motorbikes have been something of a nuisance in the past
    Don't you find these long journeys are a nuisance? I can't settle to anything.

Here nuisance evades straightforward classification. And this is without mentioning all those cases where nuisance modifies another noun. Nuisance calls and nuisance callers are familiar enough, but many similar cases suggest that the word is developing another personality as something closer to an adjective:

    Officers had responded to calls reporting nuisance motorcycle riders.
    "Bad news?" "No, just a nuisance interruption"
    Police say they'll act on nuisance scroungers.
    MOVE TO CULL NUISANCE BIRDS
    and so on.

This boundary between adjective and noun is often somewhat negotiable. Dictionaries usually classify words like summer as nouns, while observing their frequent use as modifiers. Summer clothes, weather, concerts, breezes and jobs all more or less fit this model, but when well over 25% of all instances of summer occur in this type of structure, one begins to wonder if the classification "noun" really does justice to the word's behaviour:

    Meanwhile, City confirmed that their summer interest in Middlesbrough midfielder Phil Stamp was over.
    the summer confirmation that Philip Anderson is to succeed Terry Gregg…
    ..in the hope that the industry will escape its typical summer downturn
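A proportion like the one quoted for summer can be estimated by counting how often the word is immediately followed by a noun in part-of-speech tagged text. The following is only a rough sketch of that count, not the method used for the figure above: the coarse tags are assigned by hand, the last four tokens are an invented non-modifier instance added for contrast, and the function name modifier_share is hypothetical. On so small a sample the resulting ratio says nothing about the real corpus figure.

```python
# Hand-tagged toy sample as (word, coarse tag) pairs.
# The first three fragments come from the examples above; the tags are
# assigned by hand for illustration and are an assumption, not corpus output.
tagged = [
    ("their", "DET"), ("summer", "NOUN"), ("interest", "NOUN"), ("in", "PREP"),
    ("the", "DET"), ("summer", "NOUN"), ("confirmation", "NOUN"), ("that", "CONJ"),
    ("its", "DET"), ("typical", "ADJ"), ("summer", "NOUN"), ("downturn", "NOUN"),
    # Invented non-modifier instance, included only so the ratio is not trivially 1.0
    ("last", "ADJ"), ("summer", "NOUN"), ("was", "VERB"), ("hot", "ADJ"),
]


def modifier_share(tagged_tokens, target):
    """Return the fraction of instances of `target` directly followed by a noun."""
    total = 0
    modifying = 0
    for i, (word, _tag) in enumerate(tagged_tokens):
        if word.lower() != target:
            continue
        total += 1
        # Crude test for "modifier" use: the very next token is tagged NOUN.
        if i + 1 < len(tagged_tokens) and tagged_tokens[i + 1][1] == "NOUN":
            modifying += 1
    return modifying / total if total else 0.0


share = modifier_share(tagged, "summer")
print(f"{share:.0%} of instances of 'summer' are followed by a noun")
```

A real count would need a properly tagged corpus and a less naive test, since a following noun does not always mean that the word is acting as a modifier.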

An even more adjectival noun is core: management gurus forever talk about core values, core competencies, and core business activities, and in contexts like these core is almost always used to modify other nouns. But there are signs now of it finally crossing the species barrier into true adjectival territory:

    I don't think there will be a lot of people buying big mainframes, but they are core to the business for the people who have them.
    Core to the design is the provision of three rows of seats with places for seven adults.

The point is not that traditional word classes are suddenly exposed as faulty categories, but simply that – just like Linnaeus' species – they can no longer be seen as watertight groupings to which items can be assigned with absolute confidence, and with no "leakage".

One could say the same about established criteria for classifying forms of discourse. One of these, usually described as "mode", assumes a binary choice between "spoken" and "written" text. We have always known that this isn't quite as straightforward as it looks: many types of spoken text have in fact been written in advance and are spoken without improvisation: most drama, the Queen's Speech to Parliament, and many allegedly spontaneous exchanges between politicians. But then along comes email to blow the whole categorization apart. Many email users, especially if they have good keyboarding skills, can write almost as fast as they speak, and their messages exhibit many of the characteristics of unscripted conversation. And just as we are getting used to this hybrid text-type, we are confronted by an even newer form of communication, the suddenly ubiquitous "text message", beside which the average casual email exchange reads like a passage from Demosthenes.

Which brings us, finally, to the issue of word meanings. There is general agreement that the same word can mean different things in different contexts. So far, so good. But it is a very long way from accepting this premise to the notion that, for any given word, there is a well-established and generally agreed set of distinct senses – the idea that, for example, word x has three meanings and word y has six, and any given instance of the word in use can be slotted unambiguously into one of these categories. If such a state of affairs was ever believed to exist, the belief has been completely laid to rest by the advent of corpus data. Just as access to the molecular level of living things has forced scientists to re-evaluate Linnaean categories, so our new exposure to corpus data has shown that assigning words to meanings is to some extent an artificial exercise. To give a "simple" example, take a look at these corpus lines for the verb absorb (a tiny fraction of the 3,000 or so occurrences in the BNC).

d was used.  With imported oil absorbing 40% of foreign exchange,  ener	     
nd the South-east will have to absorb 900,000 extra households over the	     
isbelief,  accept that you are absorbing a different way of looking at 	     
he ability of a bigger bank to absorb a smaller one into its systems is	     
 is made as it is hard and has absorbed a tangy flavour.  The water	     
h 20 minutes.  The rice should absorb all the water and the chicken wil	     
may,  through overfunding,  be absorbing an unfair amount of scarce res	     
le differences in the way they absorb and emit light of specific wavele	     
d by the Russian 's ability to absorb and hold his drink without losing	     
information than one brain can absorb and you just do n't know how you 	     
 has occurred,  such as impact-absorbing  bumpers,  recessed windscreen	     
uncan; it gave him a chance to absorb his new surroundings,  to get use	     
which falls on the Pennines is absorbed into layers of squelchy peat on	     
. The banks would be forced to absorb  large losses.  Mr Goria wants ba	     
hop floor had sawdust on it to absorb the dirt and bits of squashed fru	     

How many different meanings are there here? The only honest answer is "anything from one to fifteen" because what we see here (as so often) is not a set of discrete word senses, but rather a continuum. And if you look this word up in five different dictionaries, you will find five different explanations. One of the established conventions of European lexicography is that meaning differences are presented as separate senses, and users approach dictionaries with this expectation. But the more we look at real language in use, the clearer it becomes that this convention is just a pragmatically useful device, rather than a true reflection of how words work.
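For readers who have not met this kind of display before, concordance lines like those for absorb above (often called keyword-in-context, or KWIC, lines) can be produced with very little code. What follows is a minimal sketch only, and not the software used to query the BNC: the function name kwic, the 30-character window and the three-sentence sample (adapted from fragments quoted above) are all assumptions made for illustration. The lines are sorted by whatever follows the keyword, which is how the display above appears to be ordered.

```python
import re


def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every match of `pattern` in `text`.

    Each line has `width` characters of left context (padded if the match is
    near the start of the text) followed by the keyword and its right context,
    so the keyword always starts in the same column.
    """
    rows = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        # Keyword plus following context, cut off at `width` characters
        right = text[m.start():m.start() + width]
        # Sort key: the text immediately after the keyword, lower-cased
        after = text[m.end():m.end() + width].strip().lower()
        rows.append((after, left.rjust(width) + right))
    rows.sort(key=lambda row: row[0])
    return [line for _, line in rows]


# Tiny sample adapted from fragments quoted in the concordance above;
# it stands in for a real corpus purely for illustration.
sample = (
    "The rice should absorb all the water. "
    "The banks would be forced to absorb large losses. "
    "It is hard and has absorbed a tangy flavour."
)

for line in kwic(sample, r"\babsorb\w*"):
    print(line)
```

Sorting on the right-hand context brings similar patterns next to one another, and it is this that makes the gradual shading of one use into another so visible.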

Steve Jones concludes that "Whatever species may be … they are not fixed. Instead, their boundaries change before our eyes … differences blend into one another in an insensible series". It would be difficult to find a better description of how word meaning works. The interesting challenge for those of us engaged in describing English will be to find ways of presenting information that reflect this new, fuzzier way of classifying linguistic phenomena.

***********************
Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the "Lexicography MasterClass", whose training activities include an intensive workshop in corpus lexicography at the University of Brighton (www.itri.brighton.ac.uk/lexicom).


