
Humanising Language Teaching
Year 5; Issue 3; May 03

Ideas from the Corpora

Cut out the middleman: how to do it yourself with corpora

Michael Rundell

A recent conversation with a couple of teachers in Portugal made me aware that – while there is plenty of information around on the subject of using corpora as an aid to language teaching – tracking down the data you need isn't always that easy. This month's column, therefore, is a straightforward factual summary of easily available resources that should be useful to anyone who likes the idea of using corpus data but doesn't really know where to begin. It makes no claim to being definitive or exhaustive, but most of the material discussed here is stuff I have used myself at one time or another – and it works.

Introduction: what's the point of 'data-driven learning'?

Corpora, or collections of text in digital form, have been used for over 20 years as the raw materials for dictionaries and – to a lesser extent – for grammars too. As far as the English language goes, this all started in 1980 with John Sinclair's COBUILD project at Birmingham University in the UK, and nowadays no serious English dictionary is written without a basis in corpus data. As a result, the quality of these books has been transformed. The basic rationale here is that, if we are going to make generalizing statements about the way words, phrases and other linguistic phenomena behave – and a dictionary is essentially a comprehensive set of generalizations – then it makes no sense to do so without examining evidence of real language in use. This allows us (inter alia) to look at the relative frequencies of particular linguistic features; to explore variation across different registers; to formulate and test hypotheses; and above all to discover 'rules' (in the sense of systematic linguistic behaviour) that we would never have known about without the opportunity to examine large amounts of language data.

And if this is now the standard methodology for producing dictionaries and grammars, why shouldn't any other form of language-teaching material – whether a major multi-level coursebook or a one-off classroom exercise – be rooted in authentic data of language use in real communicative situations? In fact, there is a small but dedicated community of enthusiasts (teachers and academics, for the most part) who have been toiling away in this vineyard for two decades, without ever quite making it to the language-teaching mainstream (though far greater numbers of teachers are now aware of this methodological strand). During that time an enormous body of invaluable material has been developed for both teachers and learners of languages. More on this later.

Getting started: what you need to be a corpus linguist

All you need is some text to analyse and some software to analyse it with. The kinds of question you ask of the data, and the practical uses to which you put the answers, are limited only by your own imagination. Having said that, there are plenty of interesting and well-thought-out activities and exercises, and – since there is no point in reinventing the wheel – we will look at these a little later.

Finding texts to work with used to be quite a challenge. At a time when most books and newspapers were set using hot-metal printing, the only way to get 'corpus' data was to convert the text to digital form by typing it into a word-processor or by scanning it – both quite labour-intensive operations. The situation has now changed out of all recognition, and vast amounts of digital text are freely available.

What counts as a 'corpus' depends to some extent on what you are using the data for. Writing dictionaries, which aim to record the salient linguistic facts about a whole language, requires large and diverse text collections that are broadly representative of the main genres in which the language is found. This need has driven the development of the major English corpora such as the British National Corpus (BNC), the Bank of English, and the forthcoming American National Corpus (ANC). Of these, the 100-million-word BNC, probably the world's most carefully designed and scrupulously annotated corpus, is the easiest to get hold of. A single-user BNC licence is available for £50 from:
www.hcu.ox.ac.uk/BNC/getting/index.html

The BNC comes complete with 'SARA', its own corpus-querying software (the programs that enable you to produce concordances and analyse the text in various other ways), but the thousands of texts that make up the BNC exist as plain-text (.TXT) files, so you can use them with any other concordancing program too. To enable users to identify the genres and text-types they are interested in, David Lee has compiled a helpful index to the BNC files, available at:
http://clix.to/davidlee00
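
If you want to see what a concordancer actually does before committing to one, here is a minimal sketch in Python – purely an illustration, and nothing to do with SARA itself. It assumes you have a folder of plain-text files (the folder name below is invented) and prints each occurrence of a search word with a little context on either side:

    # kwic.py - a minimal keyword-in-context (KWIC) concordancer (illustrative only)
    import re
    from pathlib import Path

    def concordance(folder, word, width=40, max_hits=50):
        """Print up to max_hits concordance lines for 'word' from all .txt files."""
        pattern = re.compile(r'\b%s\b' % re.escape(word), re.IGNORECASE)
        hits = 0
        for path in Path(folder).glob('*.txt'):
            text = path.read_text(encoding='utf-8', errors='ignore')
            for match in pattern.finditer(text):
                # collapse newlines so each hit fits on a single line
                left = ' '.join(text[max(0, match.start() - width):match.start()].split())
                right = ' '.join(text[match.end():match.end() + width].split())
                print('%s [%s] %s' % (left.rjust(width), match.group(), right))
                hits += 1
                if hits >= max_hits:
                    return

    concordance('my_corpus', 'elementary')   # 'my_corpus' is an invented folder name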

There is also an excellent book giving practical advice on using the BNC and getting to know the (powerful, but rather difficult) SARA program: The BNC Handbook: Exploring the British National Corpus with SARA, Guy Aston and Lou Burnard, Edinburgh University Press 1998 (a second edition is rumoured to be in the pipeline).

Buying the BNC, and learning to install it and use it effectively, is a big commitment (though well worth the effort). If you don't yet feel ready for this, you can also query the BNC online in a more ad hoc way, at:
http://sara.natcorp.ox.ac.uk/lookup.html

Here you can enter a search term (any word or phrase you are interested in) and the service will return 50 concordance lines showing your search item with a reasonable amount of surrounding context. A similar service allows you to search the Bank of English, not only for words and phrases (the Cobuild Concordance Sampler), but also for frequent collocates (the Cobuild Collocation Sampler). Both are at:
http://titania.cobuild.collins.co.uk/form.html
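
If you are curious about what a collocation sampler is doing behind the scenes, the idea is simple enough to sketch in a few lines of Python. The sketch below (again purely illustrative – the file name is an invented example) counts the words that appear within four words of a chosen 'node' word and reports the commonest ones:

    # A rough-and-ready collocate counter (illustrative only)
    import re
    from collections import Counter

    def collocates(filename, node, window=4, top=20):
        text = open(filename, encoding='utf-8', errors='ignore').read().lower()
        tokens = re.findall(r"[a-z']+", text)
        counts = Counter()
        for i, token in enumerate(tokens):
            if token == node:
                # count the words within 'window' words either side of the node
                counts.update(tokens[max(0, i - window):i])
                counts.update(tokens[i + 1:i + 1 + window])
        return counts.most_common(top)

    print(collocates('my_corpus.txt', 'utterly'))   # file name is an invented example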

A rather different way of getting concordances via the Internet is the University of Liverpool's 'WebCorp' site, at:
http://www.webcorp.org.uk/

As with the BNC and COBUILD services, you can type in a word or phrase and wait for WebCorp to find instances of it and display them in the form of concordances. The big difference is that WebCorp is not searching through an existing, finite corpus: instead, it trawls the entire Internet – and is consequently not recommended if you are just looking for something simple and frequent. Conversely, it is an excellent way of searching for complex phrases and neologisms. To give an example, I recently read an article about the doomed Enron Corporation, which said that a culture of secrecy had 'become part of the company's DNA'. I was interested to see whether this 'genetic' metaphor was used more widely. A search on WebCorp for the string 'part of its DNA' returned dozens of instances of the phrase. Perhaps two-thirds of these were in the expected 'literal' meaning, thus:

    The colonizing bacterium transfers part of its DNA to its host's cells

But a substantial number exploited the same metaphor found in the Enron article, for example:

    Tradition, experience, credibility, solidity and innovation: these are the five identification values of the Pirelli brand, now part of its DNA.

    Self-established dreams and a strong resolve to achieve them have embodied Honda's corporate culture since its establishment. Continuing to challenge is part of its DNA.

    Award sponsor Mervyn Pedelty said that Scottish Power was awarded the title for making responsible practices "part of its DNA".

    The Body Shop board was forced to recognize that the company's values are part of its DNA – and subsequent restructurings have taken this into account.

    Independence from the ideologies and coteries that are so rife in Italy has always been a key value in Olivetti, part of its DNA.

    In some cases it's more painful to the ears then in others, but the screechy-ness is always there as part of its DNA. This has eventually made me stop listening to classical music altogether.

The data shows us not only that this expression is beginning to be used quite widely (to the point where it is likely to feature in future dictionaries), but that it has a marked tendency to appear in corporate 'vision statements' and similar texts. WebCorp is thus an extraordinarily powerful resource for anyone wanting to check out their hunches or search for more evidence of a rare or novel usage.

Going it alone: the throwaway corpus

The large, structured corpora developed in the 1990s reflected the technology available at the time. But it is now possible (and indeed, pretty straightforward) to build your own 'corpus', exploit it for whatever purpose you like, then throw it away and build another one. The reason is that text of all kinds is easy to find, and storage space on computers is more than adequate for almost any purpose you can imagine. (The hard disk on the cheapest entry-level PC would easily hold 1000 million words of text.)

A good place to start might be one of the major 'text archives'. These are repositories of text in digital form, and for the most part store books that are out of copyright. This makes them a good source of 19th century literature (such as the complete works of Jane Austen, George Eliot or Charles Dickens) and other major classics (like Darwin's On the Origin of Species). Using a broadband connection, you could download the complete Sherlock Holmes stories in a couple of minutes, and a quick corpus search will then reveal that Holmes never actually said 'Elementary, my dear Watson'. Two of the best known archives are the Oxford Text Archive and Project Gutenberg:
http://ota.ahds.ac.uk
http://www.gutenberg.net
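
To give a flavour of how quick this is, here is a rough Python sketch that fetches one of the Sherlock Holmes collections from Project Gutenberg and counts some phrases. The address shown is only an example of the kind of URL Gutenberg uses, so check the site for the current location of the file you want:

    # Fetch a public-domain text and search it for phrases (illustrative only)
    import urllib.request

    # assumed address for 'The Adventures of Sherlock Holmes'; check gutenberg.org
    URL = 'https://www.gutenberg.org/files/1661/1661-0.txt'
    text = urllib.request.urlopen(URL).read().decode('utf-8', errors='ignore')

    for phrase in ('Elementary, my dear Watson', 'my dear Watson', 'Elementary'):
        print('%-30s %d occurrence(s)' % (phrase, text.count(phrase)))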

Beyond that, there is the whole of the Web: massive, anarchic, and teeming with text of every possible type. Estimates of its size vary, but the likelihood is that there are almost 100,000 million words of English out there, and the number is growing by the day. Interested in Friends or Buffy the Vampire Slayer? Then you can download entire scripts of these or any other TV show of your choice. There are hundreds of amateur enthusiasts who have lovingly transcribed every episode of their favourite programme, and posted them on the web. To find them, try going to Google (www.google.com) and typing in something like:

    “The Simpsons scripts” (use double quotes)

This search yields several dozen likely websites, so you will have no problem finding what you want. (The same, of course, goes for movie scripts.)

If politics, sport, and current affairs are more your thing, the web is also awash with online versions of almost all the newspapers in the world. You can either go directly to a paper you know (such as www.guardian.co.uk or www.washingtonpost.com) or try a general global directory like the one at www.ims.uni-stuttgart.de/info/Newspapers.html. If you find yourself using news text on a regular basis, it may be worth investing in software such as Blue Squirrel's 'Grabasite', which makes the whole extraction process much more efficient. You can get a 30-day trial version of Grabasite at: http://www.bluesquirrel.com/products/grabasite/index.html and the full version costs about $60.00.
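
If you only need the odd page rather than a whole site, you can do a cruder version of the same job yourself. The following Python sketch (the address is just an example) fetches a single web page, strips out the HTML markup, and saves what is left as plain text ready for concordancing; real pages usually need more careful cleaning than this:

    # Grab one web page and reduce it to rough plain text (illustrative only)
    import re
    import urllib.request

    url = 'https://www.guardian.co.uk/'    # example only - use any page you like
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')

    # throw away scripts and stylesheets, then strip the remaining tags
    html = re.sub(r'(?s)<(script|style).*?</\1>', ' ', html)
    text = ' '.join(re.sub(r'(?s)<[^>]+>', ' ', html).split())

    open('news_page.txt', 'w', encoding='utf-8').write(text)
    print(text[:300])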

Meanwhile, 'biographical web logs' (daily diaries kept on the web, better known as 'blogs') are the new major growth area, and supply text of a more personal nature. Here is one posted on the day I am writing this piece, found at the www.blogger.com website:

    Saturday, May 17, 2003
    The moon represents emotions and the day was a wee bit too emotional for me (Full Moon Night today). Was restless since morning and felt the need for someone, someone who would understand me and give me company today but that's not destined. Was talking to Puja and all that she replied was Hmm, Haan. I know she is busy, I know she is tied up and I know her world and she is very different now. Don't know why I still expect from her which I should not, I still expect that she calls me up, I still expect that she keeps her promises to me when I know that its not going to be. Had a fight with her, sent her messages saying that she doesn't call me, she doesn't care for me and she ain't there when I need her well she doesn't need to she is in a different world now and I ain't there.

For more information about corpora and other sources of text, two websites are always worth a visit: Michael Barlow's Corpus Linguistics page:
http://www.ruf.rice.edu/~barlow/corpus.html

and David Lee's amazing page of 'bookmarks for corpus based linguists' at:
http://devoted.to/corpora

For anyone interested in investigating languages other than English, both these sites provide copious links to corpora in a wide range of languages, including Chinese, Spanish, Czech, and Hebrew. There is even a small corpus of Manx (the Gaelic language of the Isle of Man). The web, too, though dominated by English, includes large amounts of text in other languages. Recent estimates made by my University of Brighton colleague, Adam Kilgarriff, include the following:

    Albanian 10 million words on the web
    Basque 55 million
    Turkish 187 million
    Swedish 1000 million
    German 7000 million

Now you have your text ...

...You need software to analyse it with. There are two well-known programs that provide concordancing tools, wordlist builders (for looking at frequency data), and much else besides. These are WordSmith Tools, available from Oxford University Press (go to http://www.oup.co.uk/isbn/0-19-459286-3), and MonoConc Pro, sold by the U.S. publisher Athelstan (http://www.athel.com/). Both cost in the region of £50 sterling, and are well-tried, robust pieces of software. Which one you go for is largely a matter of taste, though it is probably fair to say that WordSmith has a more powerful range of functions, while MonoConc is – for the non-expert – friendlier and easier to use.

There is a useful comparative review of the two programs at:
http://llt.msu.edu/vol5num3/review4/default.html

Mike Scott, the author of WordSmith Tools, also has a number of smaller (and free) bits of software designed for teachers, at:
http://www.lexically.net/software/index.htm
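
To give an idea of what a wordlist builder does, here is a toy version in Python; it is no substitute for WordSmith or MonoConc, just an illustration of the principle. It assumes a folder of plain-text files (the folder name is invented) and prints the 25 most frequent words:

    # A toy wordlist builder: word frequencies across a folder of .txt files
    import re
    from collections import Counter
    from pathlib import Path

    counts = Counter()
    for path in Path('my_corpus').glob('*.txt'):    # 'my_corpus' is an invented name
        text = path.read_text(encoding='utf-8', errors='ignore').lower()
        counts.update(re.findall(r"[a-z']+", text))

    for word, freq in counts.most_common(25):
        print('%-15s %6d' % (word, freq))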

The Barlow and Lee websites (above) list several free concordancers and, though I haven't used any of these myself, ConcApp (at http://www.edict.com.hk/pub/concapp/) certainly looks an impressive package. The same sites also give information about a range of text-annotation tools, including part-of-speech taggers (which automatically assign a part-of-speech label to every word in a corpus), but this is perhaps further than most users are interested in going.
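
If you just want a quick look at what a tagger's output looks like, the following Python sketch does the job, assuming you have the free NLTK toolkit (www.nltk.org) installed; it is not one of the taggers listed on the Barlow or Lee pages, merely an easy way to see the kind of result they produce:

    # A quick look at part-of-speech tagging, assuming NLTK is installed
    import nltk
    # one-off model downloads, if you have not used NLTK before:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    sentence = "The colonizing bacterium transfers part of its DNA to its host's cells."
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
    # prints a list of (word, tag) pairs, e.g. ('bacterium', 'NN')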

Finally, if you are looking for ideas about using your corpus skills to produce teaching materials, a good place to start is the 'data-driven learning' page of Tim Johns of Birmingham University. Johns was writing programs to generate language-teaching exercises as far back as the 1970s, initially using a Sinclair Spectrum, and is (rightly) regarded as the founding father of this whole field. His web page includes a collection of fully road-tested activities that use corpus data to investigate grammar and vocabulary: http://web.bham.ac.uk/johnstf/ddl_lib.htm
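
The classic data-driven-learning exercise takes a set of concordance lines for a target word and blanks the word out, so that learners have to reconstruct it from the surrounding context. Here is a bare-bones sketch of the idea in Python (only an illustration, not Johns's own software), using some of the 'DNA' lines quoted earlier:

    # Turn concordance lines into a gap-fill exercise (illustrative only)
    import random
    import re

    concordance_lines = [
        "The colonizing bacterium transfers part of its DNA to its host's cells",
        "these are the five identification values of the Pirelli brand, now part of its DNA",
        "making responsible practices part of its DNA",
    ]

    def gap_fill(lines, target):
        pattern = re.compile(r'\b%s\b' % re.escape(target), re.IGNORECASE)
        random.shuffle(lines)
        for number, line in enumerate(lines, 1):
            print('%d. %s' % (number, pattern.sub('______', line)))

    gap_fill(concordance_lines, 'DNA')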

Another well-known figure in this field is Chris Tribble, who co-wrote a book called Concordances in the Classroom, which is packed with good ideas, and has also devised some simple macros for working with text, which can be downloaded from his website at: http://www.ctribble.co.uk/Langtch.htm

David Lee's 'Devoted to Corpora' webpages (above) give further clues about ideas for creating teaching materials.

Over to you....


Michael Rundell (michael.rundell@lexmasterclass.com) is a lexicographer, has been using corpora since the early 1980s, and has been involved in the design and development of corpora of various types. As Managing Editor of Longman Dictionaries (1984-94) he edited the Longman Dictionary of Contemporary English and the Longman Language Activator. He is now an independent consultant, and (with the lexicographer Sue Atkins and computational linguist Adam Kilgarriff) runs the Lexicography MasterClass Ltd (www.lexmasterclass.com), whose activities include workshops in corpus lexicography and an MSc course in Lexical Computing and Lexicography at the University of Brighton (www.itri.brighton.ac.uk/MscLex). He is also Editor-in-Chief of the Macmillan English Dictionary for Advanced Learners (www.macmillandictionary.com).

