HLT Magazine (January 2001) - Ideas from the Corpora

Humanising Language Teaching
Year 3; Issue 1; January 2001

Call me old-fashioned, but …

Michael Rundell 20 January 2001

Mature readers of this column will remember how, when the year 1984 came around, there was much discussion about the degree to which George Orwell's gloomy vision had or had not been realized. Now here we are in 2001, and the debate centres on the power of computers. Arthur C. Clarke's futuristic novel (and the Stanley Kubrick film that it spawned) imagined a machine that managed the running of the spaceship and interacted freely with the human astronauts. It had not only "a mind of its own" but, in the end, an agenda that was spookily at odds with that of the ship's human crew.

How close are contemporary computers to replicating the power of 2001's "HAL"? (Each of the initials, by the way, comes one letter before the corresponding letter of "IBM".) Within the AI (artificial intelligence) community, there is an ongoing argument about the extent to which computers might be able to achieve (or at least successfully simulate) consciousness. Some take the view that all of the brain's operations can be reduced to binary processes, and the corollary of this is that – given massively increased processing power and a better understanding of what actually happens in the brain – it will ultimately be possible to produce conscious machines: computers that can learn, form their own opinions, be creative, and perhaps eventually take over as the dominant species. Others are convinced that consciousness is more complex than this and that a computer like HAL is a theoretical impossibility. (A good place to read about all this is Shadows of the Mind, by Oxford mathematician Roger Penrose.)

What does all this have to do with language corpora, you might by now be asking. Well, 14 years ago (which is several generations in computing terms) John Sinclair – the brains behind the groundbreaking COBUILD project and one of the most influential figures in corpus linguistics – made the interesting claim that "a fully automatic dictionary is [now] at the design stage" (Sinclair 1987: 152). This was a general observation, with little in the way of detailed explanation, but my understanding is that Sinclair envisaged a world of "dictionaries without lexicographers", where computers would analyze massive volumes of corpus data and form their own generalizations regarding which linguistic features were most typical of English (or whatever language was being studied). Dictionaries would then be automatically generated on the basis of this. Illustrative examples, for instance, could be selected by the computer, using smart programs to identify the most significant collocational and syntactic behaviour-patterns of given headwords. The great advantage here is that the process would not be affected by the corrupting influence of human intervention, since individual speakers' intuitions about language are notoriously partial, subjective, and unreliable.

So how plausible is this idea in practice? As one who makes a living from lexicography I have, of course, a vested interest in there being a continued role for humans in the production of dictionaries (and preferably not a subservient role either). So take that as a caveat to anything that follows. There is no question that, since 1987, the whole business of analyzing language (whether for dictionary-making or other purposes) has been supported by increasingly large and representative corpora, by ever more powerful machines, and by smarter and smarter corpus-querying software. I would argue, though, that the key word here is "supported": it seems to me that as we have more and better evidence for the way languages work, the human element in the process – interpreting and making sense of all this data – has, paradoxically perhaps, become more crucial than ever.

Take for example issues such as metaphor or speaker-attitude, both of which permeate many (if not most) types of text: in recognizing and describing data like this, what are the respective roles of human and machine? If you look at corpus data for a word like old-fashioned, you find a range of attitudes going all the way from negative, through neutral, to very positive, thus:

"I'm ready now, darling, I'll just put my scarf on." Sousan looked pained. "No one wears headscarves in London, Mummy. It's very old-fashioned."
Fortunately, the number of these old-fashioned classes seems to be gradually falling.
Paula had no patience for making conversation with Gran, who tended to have very old-fashioned, dyed-in-the-wool ideas.

Here, old-fashionedness has connotations of outdatedness and irrelevance, and it arouses irritation or disapproval. In many cases, though, the word is used as a value-free descriptive adjective:

[Corpus examples of "old-fashioned" used as a neutral, descriptive adjective are not reproduced here.]

Finally, there are many situations where being old-fashioned is seen as a virtue – a reminder of "the good old days":

[Corpus examples of "old-fashioned" used with positive connotations are not reproduced here.]

Getting a computer program to make sense of data like this would be daunting enough, but something even more interesting happens when we narrow our search to the expression good old-fashioned. There are of course plenty of corpus instances of this being used in the (expected) very positive sense:

[Corpus examples of "good old-fashioned" in its very positive sense are not reproduced here.]

But that's not all.

[Further corpus examples of "good old-fashioned" are not reproduced here.]

We also hear about good old-fashioned guilt, greed, and self-interest. Exactly what is happening here is a matter for interpretation, but there is an observable tendency to use this phrase in an ironic way, when talking about things which – though regrettable – are very familiar and unlikely to go away. Bill Louw, one of the first people to notice and define the phenomenon of "semantic prosody" (to which this column has referred several times), noted the way that irony "relies on a collocative clash" (Louw 1993: 157). The effect, in other words, arises from the fact that the expected collocates (like "common sense" or "patriotism") are replaced by something much less edifying.
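For readers who like to see the mechanics, here is a rough sketch in Python of the machine's side of this job: finding every instance of good old-fashioned and counting what comes next. It is an illustration only, not the software used for this column. The sentences in TOY_CORPUS are invented for the purpose (they simply echo the collocates mentioned above and are not real corpus data), and the function names right_collocates and kwic are my own labels.

```python
import re
from collections import Counter

# Invented toy sentences, NOT real corpus data; they only echo
# the collocates named in the column.
TOY_CORPUS = [
    "What this job needs is good old-fashioned common sense.",
    "He put it down to good old-fashioned patriotism.",
    "It was a case of good old-fashioned greed.",
    "She was driven by good old-fashioned guilt.",
    "Good old-fashioned greed explains most of it.",
    "The room had old-fashioned wallpaper.",
]

# The node phrase, written with either a hyphen or a space.
PHRASE = re.compile(r"\bgood old[- ]fashioned\b", re.IGNORECASE)
# A crude word pattern: letters, with internal hyphens or apostrophes.
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")


def right_collocates(lines, span=1):
    """Count the `span` word(s) that immediately follow the node phrase."""
    counts = Counter()
    for line in lines:
        for match in PHRASE.finditer(line):
            following = WORD.findall(line[match.end():])
            if following:
                counts[" ".join(following[:span]).lower()] += 1
    return counts


def kwic(lines, width=30):
    """Return simple keyword-in-context rows with the node in brackets."""
    rows = []
    for line in lines:
        for match in PHRASE.finditer(line):
            left = line[max(0, match.start() - width):match.start()]
            right = line[match.end():match.end() + width]
            rows.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return rows


if __name__ == "__main__":
    for row in kwic(TOY_CORPUS):
        print(row)
    print(right_collocates(TOY_CORPUS).most_common())
```

The program can tell us that greed follows the phrase more often than patriotism does in this toy sample. What it cannot tell us is that one of these is edifying and the other is not, or that the clash between them is where the irony lies. That judgement is still left to the person reading the output.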

The great thing about technology is that it can supply us with the volume of data (and the software for analyzing it) that we need in order to uncover and describe linguistic behaviour of this type. And perhaps in the end we will be able to program computers to spot things like this on their own – but don't hold your breath. The humanistic component in interpreting and making sense of corpus data is likely to be important for many years to come.

John Sinclair, Looking Up (London: Collins, 1987)

Bill Louw, "Irony in the Text or Insincerity in the Writer?", in M. Baker et al. (eds), Text and Technology (Amsterdam: Benjamins, 1993)

************************
Michael Rundell is a lexicographer, and has been using corpora since the early 1980s. As Managing Editor of Longman Dictionaries for ten years (1984-94) he edited the Longman Dictionary of Contemporary English (1987, 1995) and the Longman Language Activator (1993). He has been involved in the design and development of corpus materials of various types, including the BNC and the Longman Learner Corpus. He is now a freelance consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the "Lexicography MasterClass" (http://www.lexmasterclass.com), providing training courses in all aspects of dictionary development and dictionary use.