Humanising Language Teaching
Year 3; Issue 1; January 2001

Ideas from the Corpora

Call me old-fashioned, but

Michael Rundell 20 January 2001

Mature readers of this column will remember how, when the year 1984 came around, there was much discussion about the degree to which George Orwell's gloomy vision had or had not been realized. Now here we are in 2001, and the debate centres on the power of computers. Arthur C. Clarke's futuristic novel (and the Stanley Kubrick film that it spawned) imagined a machine that managed the running of the spaceship and interacted freely with the human astronauts. It had not only "a mind of its own" but, in the end, an agenda that was spookily at odds with that of the ship's human crew.

How close are contemporary computers to replicating the power of 2001's "HAL"? (Each of the initials, by the way, is one letter ahead of "IBM".) Within the AI (artificial intelligence) community, there is an ongoing argument about the extent to which computers might be able to achieve (or at least successfully simulate) consciousness. Some take the view that all of the brain's operations can be reduced to binary processes, and the corollary of this is that given massively increased processing power and a better understanding of what actually happens in the brain it will ultimately be possible to produce conscious machines: computers that can learn, form their own opinions, be creative, and perhaps eventually take over as the dominant species. Others are convinced that consciousness is more complex than this and that a computer like HAL is a theoretical impossibility. (A good place to read about all this is Shadows of the Mind, by Oxford mathematician Roger Penrose.)

What does all this have to do with language corpora, you might by now be asking. Well, 14 years ago (which is several generations in computing terms) John Sinclair, the brains behind the groundbreaking COBUILD project and one of the most influential figures in corpus linguistics, made the interesting claim that "a fully automatic dictionary is [now] at the design stage" (Sinclair 1987: 152). This was a general observation, with little in the way of detailed explanation, but my understanding is that Sinclair envisaged a world of "dictionaries without lexicographers", where computers would analyze massive volumes of corpus data and form their own generalizations regarding which linguistic features were most typical of English (or whatever language was being studied). Dictionaries would then be automatically generated on the basis of this. Illustrative examples, for instance, could be selected by the computer, using smart programs to identify the most significant collocational and syntactic behaviour-patterns of given headwords. The great advantage here is that the process would not be affected by the corrupting influence of human intervention, since individual speakers' intuitions about language are notoriously partial, subjective, and unreliable.

So how plausible is this idea in practice? As one who makes a living from lexicography I have, of course, a vested interest in there being a continued role for humans in the production of dictionaries (and preferably not a subservient role either). So take that as a caveat to anything that follows. There is no question that, since 1987, the whole business of analyzing language (whether for dictionary-making or other purposes) has been supported by increasingly large and representative corpora, by ever more powerful machines, and by smarter and smarter corpus-querying software. I would argue, though, that the key word here is "supported": it seems to me that as we have more and better evidence for the way languages work, the human element in the process (interpreting and making sense of all this data) has, paradoxically perhaps, become more crucial than ever.

Take for example issues such as metaphor or speaker-attitude, both of which permeate many (if not most) types of text: in recognizing and describing data like this, what are the respective roles of human and machine? If you look at corpus data for a word like old-fashioned, you find a range of attitudes going all the way from negative, through neutral, to very positive, thus:

    "I'm ready now, darling, I'll just put my scarf on." Sousan looked pained. "No one wears headscarves in London, Mummy. It's very old-fashioned."
    Fortunately, the number of these old-fashioned classes seems to be gradually falling.
    Paula had no patience for making conversation with Gran, who tended to have very old-fashioned, dyed-in-the-wool ideas.

Here, old-fashionedness has connotations of outdatedness and irrelevance, and it arouses irritation or disapproval. In many cases, though, the word is used as a value-free descriptive adjective:
    At the back of the house the big, old fashioned bathroom with its noisy pipes and its huge wood-surrounded bath.
    On its top was a simple oak cross and an old fashioned black telephone, the receiver off the rest and lying on its side.
    Hilbert and Lewis and Beryl sat in old-fashioned deck chairs with striped canvas seats.

Finally, there are many situations where being old-fashioned is seen as a virtue, a reminder of "the good old days":
    We're a very small, old-fashioned type of club [with the subtext: "And that's the way we like it."]
    The reception area [is] decorated to conform to the same image, conveying an image of discreet, old-fashioned comfort and luxury.
    The real way to improve the health of the capital city's people lies with such old-fashioned concepts as full employment, decent housing and good education.
    Whatever happened to good old-fashioned values?

Getting a computer program to make sense of data like this would be daunting enough, but something even more interesting happens when we narrow our search to the expression good old-fashioned. There are of course plenty of corpus instances of this being used in the (expected) very positive sense:
    Mine's [=my watch] a good old-fashioned proper mechanical wind-up job.
    Domestic security is simply a matter of good old-fashioned common sense.
    Selby fought aerodynamic mediocrity with good old-fashioned cubic inches.

But that's not all.
    businesses that thrive on paranoia .. and good old-fashioned nosiness
    There remain good old-fashioned nationalist dictators

We also hear about good old-fashioned guilt, greed, and self-interest. Exactly what is happening here is a matter for interpretation, but there is an observable tendency to use this phrase in an ironic way, when talking about things which, though regrettable, are very familiar and unlikely to go away. Bill Louw, one of the first people to notice and define the phenomenon of "semantic prosody" (to which this column has referred several times), noted the way that irony "relies on a collocative clash" (Louw 1993: 157). The effect, in other words, arises from the fact that the expected collocates (like "common sense" or "patriotism") are replaced by something much less edifying.
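For readers who like to see the mechanics, here is a rough sketch in Python of the kind of search involved. It is only an illustration, not the software actually used for this column: the function names are invented, the "corpus" is just the five citations quoted above, and (crucially) the list of "unedifying" words is a hypothetical one drawn up by hand. The machine finds what follows good old-fashioned; a human still had to decide which collocates count as regrettable.

```python
import re

# Toy stand-in for a corpus: the five citations quoted in this column.
SAMPLE_LINES = [
    "Mine's a good old-fashioned proper mechanical wind-up job.",
    "Domestic security is simply a matter of good old-fashioned common sense.",
    "Selby fought aerodynamic mediocrity with good old-fashioned cubic inches.",
    "businesses that thrive on paranoia .. and good old-fashioned nosiness",
    "There remain good old-fashioned nationalist dictators",
]

# Hypothetical, hand-compiled list: a person, not the program, judged these
# words to be "unedifying" collocates.
NEGATIVE_WORDS = {"nosiness", "dictators", "guilt", "greed", "self-interest"}


def right_context(line, phrase="good old-fashioned", span=3):
    """Return up to `span` lower-cased words following `phrase`, or None."""
    idx = line.lower().find(phrase)
    if idx == -1:
        return None
    tail = line[idx + len(phrase):]
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", tail)
    return [w.lower() for w in words[:span]]


def flag_possible_irony(lines):
    """Return lines where the phrase is followed by a hand-labelled negative word."""
    flagged = []
    for line in lines:
        context = right_context(line)
        if context and NEGATIVE_WORDS.intersection(context):
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    for hit in flag_possible_irony(SAMPLE_LINES):
        print(hit)
```

On this tiny sample the sketch picks out the nosiness and dictators lines and leaves the watch, the common sense, and the cubic inches alone. Everything of interest, though, sits in the hand-made word list, which is exactly where the human judgement comes in.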

The great thing about technology is that it can supply us with the volume of data (and the software for analyzing it) that we need in order to uncover and describe linguistic behaviour of this type. And perhaps in the end we will be able to program computers to spot things like this on their own, but don't hold your breath. The humanistic component in interpreting and making sense of corpus data is likely to be important for many years to come.

John Sinclair Looking Up (London: Collins 1987)

Bill Louw "Irony in the Text or Insincerity in the Writer?" in M. Baker et al. (Eds) Text and Technology (Amsterdam: Benjamins 1993)

Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the "Lexicography MasterClass" (http://www.lexmasterclass.com), providing training courses in all aspects of dictionary development and dictionary use.
