HLT Magazine, March 03: Ideas from the Corpora

Humanising Language Teaching
Year 5; Issue 2; March 03

Unnatural language processing

Michael Rundell

Most devotees of Pilgrims and its publications will be familiar with the abbreviation 'NLP'. The same three letters are also used by computational linguists, though in those circles they stand not for 'neurolinguistic programming' but for 'natural language processing'. This version of NLP refers to the technologies used for analyzing human language in order to produce applications such as concordancers, automated translation systems, speech-recognition devices, and indeed dictionaries. The term 'natural language' is used to distinguish human languages, which have evolved 'naturally', from languages (or at least, systems that use words and symbols) that have been specially constructed for the purpose of communicating with computers.

But how natural are natural languages? The influences that shape the ways in which human languages develop and change are very complex and very varied. One of the best descriptions of this process is Michael Halliday's analogy with the climate and the weather: the language as a system, changing very gradually over time, is like the climate, whereas individual instances of language in use may – like the weather – be highly idiosyncratic. Halliday continues: 'Each instance redefines the system, however infinitesimally, maintaining its present state or shifting its probabilities in one direction or another'1. When a significant proportion of the people in a given speech community start behaving in a particular way (for example, in terms of the phrases they use or the pronunciations they favour), then instance begins to be transformed into system.

To give a couple of examples:
(1) rising intonation at the end of a sentence
This has long been noticed as a characteristic feature of Australian English, and is also favoured by some speakers of American English. To British listeners, a sentence ending on a rise sounds like a question – as if the speaker is saying 'She comes from Sydney?', rather than simply making a declarative statement. But in the last ten years or so, the popularity of Australian soap operas among British teenagers has led to the widespread adoption of this feature among younger people in the UK. It is too early to say whether this is a short-term blip or whether rising intonation will become standard practice for a significant (and possibly growing) number of British speakers.
(2) bored of
In all the pedagogical dictionaries I know – up to and including the brand new Longman Dictionary of Contemporary English (2003) – we are told that 'bored' is used with the preposition 'with'. In reality, however, more and more speakers say 'bored of'. Try searching with the 'Webcorp' software, which trawls the internet and finds sentences containing a particular word or expression (www.webcorp.org.uk). You will find things like this:

[Webcorp concordance lines for 'bored of']

When the British National Corpus (BNC) was assembled in the early 1990s, it contained 246 instances of 'bored with', but only 10 of 'bored of' – and most of these came from recorded conversations rather than from written texts. The 'bored of' variant would still, I suspect, be regarded as incorrect by most teachers, but a search on Google finds 112,000 instances of this pairing, as against 340,000 examples of 'bored with'. It is always a bad idea to make predictions about language, but 'bored of' seems to be catching up with 'bored with', and may well end up being recognized as an acceptable alternative.
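For readers who like to see the arithmetic, the short Python sketch below works out what share of the two patterns is taken by 'bored of' in each source. The counts are the ones quoted above; the variable and function names are invented for this illustration. The two sources are not strictly comparable (a balanced corpus on the one hand, search-engine hit counts on the other), so the result is a rough indication and not a measurement.

```python
# Counts as quoted in the article: the BNC (early 1990s) and a Google search.
counts = {
    "BNC": {"bored with": 246, "bored of": 10},
    "Google": {"bored with": 340_000, "bored of": 112_000},
}


def share_of_variant(source_counts, variant="bored of"):
    """Return the variant's share of all hits for the two patterns."""
    total = sum(source_counts.values())
    return source_counts[variant] / total


# Share of 'bored of' in each source (hypothetical helper names).
shares = {name: share_of_variant(c) for name, c in counts.items()}

for name, share in shares.items():
    print(f"{name}: 'bored of' makes up {share:.1%} of the two patterns")
```

On these figures, 'bored of' accounts for about 4% of the two patterns in the BNC and about 25% on Google.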

In both these cases, the gradual take-up of a new feature eventually reaches a point where it is favoured by quite significant numbers of speakers: that is, it begins to affect the 'climate' – the long-term system of the language – rather than merely being an isolated example of the 'weather'.

How do we explain these changes? The move towards 'bored of' may just be a case of speakers opting for a phonologically less awkward expression, but the increasing popularity of the rising intonation can be plausibly traced back to a specific event: the arrival in the UK of Australian programmes like Neighbours and Home and Away. Here, our exposure to repeated instances of this feature has had measurable (though not necessarily long-term) effects on the linguistic behaviour of quite large numbers of British speakers.

For the most part, this process seems reasonably benign: we can sometimes identify the influences driving language change, but there is no sense of this process being manipulated by any group or individual. But we live in a very 'connected' world, and there are some signs that certain forms of linguistic behaviour – certain 'ways of saying', if you like – may be being foisted on us. Take for example the language of call centres. Call centre operatives work with a very carefully written set of questions and answers (tellingly called a 'script') and are actively discouraged from deviating from this prescribed text. (There are even software companies that provide programs for the automatic generation of 'call centre scripts'.) Thus what may seem at first sight to be a spontaneous dialogue is actually anything but 'natural language'. Even the familiar greeting 'How may I help you?' sounds wrong, for two reasons: first, the fact that it is absolutely ubiquitous – no one in a call centre ever uses any other form of greeting – and second, the fact that it bears no relation to what people say in normal 'uncontrolled' discourse. Corpus data shows that this interrogative use of 'may' had almost died out by the early 1990s, yet it could now be revived through the influence of call-centre-speak. Constant exposure to a particular mode of expression tends to lead to it becoming embedded in the language.

The world of business, and especially of 'human resource management', is another rich source of new vocabulary that has spilled over into fields such as social work and politics. Words like 'synergy', 'facilitator', and 'benchmarking' have spread like a virus in the last decade – and none more so than the dreaded 'stakeholder' (for which there are now almost a million hits on Google). These words may have started out meaning something, but they are enthusiastically bandied about by so many charlatans that most sensible people now react to them with either suspicion or scorn. But here too we see a process where natural linguistic choice and variation seem to be stifled: a concept that could easily be articulated in a variety of ways is in practice always lexicalized by a single 'approved' expression.

Finally, to politics, and the distorting effect of the 'spin doctors'. In the week before the start of the war on Iraq, we were told that Tony Blair was 'working flat out to achieve a second UN resolution'. Now, in natural languages, there is an almost infinite number of ways in which a proposition like this could be expressed. It might therefore be reasonable to expect that different people would describe the same event in different ways. For example, you might expect one person to say 'Mr Blair is doing everything possible to get a second resolution', while another opines that he 'is making tremendous efforts', and yet another that he 'has been beavering away behind the scenes'. These are all perfectly natural and common ways of making the same point. But this is not what happened at all: with mind-numbing repetitiveness, every politician and every commentator used the identical 'working flat out' formulation, as if they had all been programmed like robots. You can check the story in any English-language newspaper (see www.onlinenewspapers.com), and you will find the same phrases used again and again. The next stage in the lead-up to war was that we had 'reached the diplomatic endgame', and this phrase too was recycled ad nauseam, as were several other key expressions in this gloomy drama. Presumably the way this works is that the Downing Street press office issues a statement using a particular form of words, and the politicians whose job it is to spread the message are expected to adhere closely to whatever formula has been decided on: it is, in other words, a top-down rather than a bottom-up process, and that is the opposite of the way that languages usually evolve.

Does it matter? Well, I suspect George Orwell might think so. His essay on 'Politics and the English Language' (www.resort.com/~prime8/Orwell/patee.html) has an uncanny resonance as we listen to the politicians explaining why we have to go to war. Take this extract, for example:

Modern writing at its worst does not consist in picking out words for the sake of their meaning and inventing images in order to make the meaning clearer. It consists in gumming together long strips of words which have already been set in order by someone else, and making the results presentable by sheer humbug. ...When one watches some tired hack on the platform mechanically repeating the familiar phrases, one often has a curious feeling that one is not watching a live human being but some kind of dummy.

Sounds familiar? But Orwell's essay is not simply a rant against 'falling standards' in language. His central point is that if we don't make an effort to find the most appropriate and effective way to express what we mean, and instead simply recycle formulae passed down to us by others, then our utterances themselves are more likely to be banal, lazy, or at worst duplicitous:

It [our language] becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

Michael Rundell (michael.rundell@dial.pipex.com) is a lexicographer, has been using corpora since the early 1980s, and has been involved in the design and development of corpora of various types. As Managing Editor of Longman Dictionaries (1984-94) he edited the Longman Dictionary of Contemporary English and the Longman Language Activator. He is now an independent consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the Lexicography MasterClass Ltd (www.lexmasterclass.com), whose activities include workshops in corpus lexicography and an MSc course in Lexical Computing and Lexicography at the University of Brighton (www.itri.brighton.ac.uk/MScLex). He is also Editor-in-Chief of the Macmillan English Dictionary for Advanced Learners (www.macmillandictionary.com).