
Humanising Language Teaching
Year 2; Issue 5; September 2000

Ideas from the Corpora


American corporate issues

Mike Rundell 24 September 2000

The US government is currently funding a corpus-based research programme called "FrameNet" at the University of California at Berkeley. In this exciting project, led by the eminent linguist Charles Fillmore, large volumes of corpus data are being annotated in a way that will enable corpus users (whether human or machine) to see the links between semantic roles and the various ways in which they are expressed, lexically and syntactically, in real text. (For background on the project, visit its homepage at http://www.icsi.berkeley.edu/~framenet/index.html.) The FrameNet database – though still far from complete – can already be queried live, at no cost, through an Internet connection (http://163.136.182.112/fnsearch/), and its eventual value for linguists, lexicographers, and computer scientists can only be guessed at.
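To give a flavour of what this kind of annotation involves, here is a minimal Python sketch of how a single corpus sentence might be marked up FrameNet-style. The frame name, role labels, and data layout are illustrative assumptions made for this sketch, not the project's actual database schema.

```python
# Illustrative only: a FrameNet-style annotation of one corpus sentence.
# The frame name, role labels and layout are assumptions made for this
# sketch; they do not reproduce the project's real schema.

annotation = {
    "sentence": "She bought a second-hand dictionary from a market stall.",
    "target": "bought",           # the word that evokes the frame
    "frame": "Commerce_buy",      # hypothetical frame label
    "roles": [
        {"role": "Buyer",  "text": "She",                      "phrase_type": "NP", "function": "subject"},
        {"role": "Goods",  "text": "a second-hand dictionary", "phrase_type": "NP", "function": "object"},
        {"role": "Seller", "text": "from a market stall",      "phrase_type": "PP", "function": "oblique"},
    ],
}

# With many such annotations, a corpus user (human or machine) could ask,
# for example, how often the Seller role of "buy" surfaces as a "from"-phrase.
seller_forms = [r["phrase_type"] for r in annotation["roles"] if r["role"] == "Seller"]
print(seller_forms)   # ['PP']
```

The point is that such annotation records not just what a word means in context, but how each semantic role is realised grammatically – exactly the kind of link the database makes searchable.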

Bizarrely, the corpus which this team of American researchers has been working with is the good old BNC – yes, the British National Corpus, a large and carefully selected sample of British English. Why so? The simple answer is that no comparable corpus of American English currently exists. There is, in fact, a thriving research industry in the US based around the analysis of large collections of digitized text. This work entails crunching huge volumes of corpus data with the goal of enabling computers to "understand" natural language. The idea here is that when you ask a search engine (like Yahoo! or AskJeeves) a query such as "tell me about oral examinations", you should get back just the information that you require, and not a mass of irrelevant material (known in the trade as "noise") about dentistry or pornography. To teach computers to process our queries more intelligently, linguists and computer scientists are using corpus data to create sophisticated, context-sensitive models of the language. The corpora used in this field of research are typically huge and monolithic – a billion words of Associated Press newswires, for example. This is on the one hand an impressively large amount of text, but on the other hand a very limited variety of text in terms of both its content and its style.
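As a rough illustration of the "context-sensitive" modelling being described, the Python sketch below uses co-occurring words to guess whether a document mentioning "oral examination" is about education or dentistry. The texts and cue-word lists are invented for the example; real systems derive such evidence statistically from corpora rather than from hand-written lists.

```python
# Toy word-sense disambiguation: decide whether a document that mentions
# "oral examination" concerns education or dentistry by counting
# sense-indicative co-occurring words. The cue lists and sample texts
# are invented for this sketch.

EDUCATION_CUES = {"student", "examiner", "grade", "viva", "course"}
DENTISTRY_CUES = {"teeth", "dentist", "cavity", "gum", "x-ray"}

def guess_topic(text: str) -> str:
    words = set(text.lower().split())
    education = len(words & EDUCATION_CUES)
    dentistry = len(words & DENTISTRY_CUES)
    return "education" if education >= dentistry else "dentistry"

documents = [
    "the student faced an oral examination and passed the viva",
    "the dentist recommended an oral examination after the x-ray showed a cavity",
]

for doc in documents:
    print(guess_topic(doc), "<-", doc)
```

A real search engine would, of course, learn which words signal which sense from corpus statistics rather than hand-built lists – which is precisely why such large volumes of text are needed.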

Hence the unsuitability of such text collections for projects like FrameNet, which require more balanced and broadly representative corpora of the type used by British lexicographers. So although corpus work of a kind is flourishing in the US, the corpus revolution that swept through the UK's dictionary-making business in the 1980s has largely passed the Americans by. The household names of US dictionary publishing (Merriam-Webster, American Heritage, and Random House, for example) have yet to produce a single corpus-based dictionary. US publishers, too, have recently dipped a toe into the pool of pedagogical lexicography, with several new learner's dictionaries of American English being published during the last two or three years. But while their British counterparts like Oxford, Longman, Cambridge, and COBUILD all use corpora as the primary data-source for their dictionaries, for publications like the Newbury House Dictionary of English and the learner's dictionary produced by NTC (the National Textbook Company), the corpus revolution might never have happened.

While American lexicography holds out, Canute-like, against the tide of empirical language study, corpus-building in other languages is progressing rapidly. A team at the University of Pretoria, for example, is busy collecting data for new dictionaries of the formerly marginalized "Bantu" languages of South Africa: one of these, Tsonga, a rural language with fewer than a million mother-tongue speakers, already has a respectable lexicographic corpus of over 3 million words. It is an extraordinary situation: American English is the world's dominant language, whose home is the world's richest economy, yet dozens of "minor" languages, many of them based in countries that are far from wealthy, have superior linguistic resources on which to base their dictionaries.

How has this come about? It is difficult to avoid pointing the finger at Chomsky. During the first part of the 20th century, American linguistics boasted a thriving research community whose methodology was broadly comparable to that of the modern-day corpus linguist: these "empirical" linguists used field studies and frequency data to help establish and record the typical behaviour of languages. But Chomsky's work, with its emphasis on the linguist's own introspective judgements, rather than on external data, "changed the direction of [American] linguistics away from empiricism and towards rationalism in a remarkably short period of time" (Corpus Linguistics, Tony McEnery and Andrew Wilson, Edinburgh University Press 1996: p4). In Syntactic Structures (1957) Chomsky wrote:

Despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of characterizing the set of grammatical utterances.

For Chomsky, the role of the linguist was to model language users' "competence" – the set of internalized rules and exceptions that underlies the utterances they produce in real communicative situations (their "performance").

Following Chomsky, empirical linguistics became something of a minority sport in the US, and all the major lexicographic corpora of English (starting with COBUILD's in the early 1980s) have been developed in the UK. Meanwhile, American dictionary editors stick to the approach pioneered by Dr Johnson in the 18th century, basing their description of the language on large collections of "citations" – short extracts from texts, collected by human readers and showing a particular word or phrase in context. The limitations of this approach have been widely discussed by linguists such as John Sinclair, and for British corpus lexicographers the citation banks on which American dictionaries depend can only form one part (and no longer the main part) of the lexicographer's data-set.
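To make the contrast concrete, the sketch below is a minimal keyword-in-context (KWIC) concordancer of the kind corpus lexicographers rely on: instead of a handful of hand-picked citations, it shows every occurrence of a word in its surrounding context. The sample text is invented for the illustration.

```python
# A minimal keyword-in-context (KWIC) concordancer: the basic working view
# of a corpus lexicographer, as opposed to a hand-collected citation file.
# The sample text below is invented for this sketch.

def concordance(text: str, keyword: str, width: int = 30) -> None:
    """Print every occurrence of `keyword` with `width` characters of context."""
    lowered, target = text.lower(), keyword.lower()
    start = 0
    while True:
        i = lowered.find(target, start)
        if i == -1:
            break
        left = text[max(0, i - width):i]
        right = text[i + len(target):i + len(target) + width]
        print(f"{left:>{width}} [{keyword}] {right}")
        start = i + len(target)

sample = ("The new dictionary is corpus-based. A dictionary built from citations "
          "alone cannot show frequency. Every modern learner's dictionary now "
          "claims some corpus credentials.")

concordance(sample, "dictionary")
```

Even this toy version shows what a citation file cannot: exhaustive coverage, so that frequency and typicality become visible rather than a matter of the reader's luck.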

A recent article by Dr Barbara Agor, in the "Learning English" section of The Guardian Weekly (21-27 September 2000), discusses the huge divide that has opened up between the British and American dictionary-making traditions – the former now mainly corpus-based, the latter still largely averse to using corpus data. According to Agor, "some American editors do not believe that corpora will reveal more about words than is already known from other sources". This claim could, in my view, only have been made by someone who had never seriously used a corpus.

But it looks as if American lexicography is going to be dragged kicking and screaming into the 21st century after all: a consortium of academics, computer scientists, and a small band of pro-corpus American lexicographers is now at an advanced stage of planning an "American National Corpus" on the same lines as the BNC (see http://www.cs.vassar.edu/~ide/anc/). Some US dictionary publishers have already signed up as members of the consortium, leaving those outside it open to the charge that they are not keeping up to date. As Barbara Agor suggests, the market itself may ultimately determine the direction in which American dictionaries will go: with large corpus-based books like the New Oxford Dictionary of English available, how long can products based on inferior linguistic evidence remain competitive?

