Reviving Lexicography through Text Corpora
Paschalia Patsala, Greece
Dr Paschalia Patsala is the Head of the English Studies Department of The International Faculty of the University of Sheffield, CITY College and a Lecturer of English Language and Linguistics. Being a lexicographer and editor, she has participated in a number of dictionaries and volumes published by Greek and foreign publishing houses. Her research interests include: Corpus Linguistics, Lexicography, Terminology, Second Language Acquisition, as well as Educational Technologies. E-mail: ppatsala@citycollege.sheffield.eu
The point of departure for the present article was my doctoral thesis, in which I investigated the extent to which the structures that connectives/conjunctions occur in are predictable, and thus recordable in dictionaries. As I examined the frequency of specific patterns and the possible role of particular lexico-grammatical factors in a sample of written discourse, it was a Modern Greek corpus that enabled me to make valid claims about the role corpora can play in capturing native speakers’ linguistic preferences as well as naturally produced discourse. In an effort, firstly, to recognise the most frequent lexical and syntactic patterns in which connectives are realised in naturally-occurring sentences and, secondly, to identify the elements that should be included in the corresponding general dictionary entries, I delved into the ultimate value of corpora for lexicographical projects.
To begin with, a corpus is “a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair, 2005, p. 12). Even in an introductory exploration of corpora and corpus linguistics (or ‘corpus-driven/based/assisted’ analyses etc.), a reader may encounter a wide range of definitions of Corpus Linguistics (CL) and multiple approaches suggested by influential corpus theorists (summarised in Taylor, 2008). Yet, in broad terms, one may consider Corpus Linguistics to be “a whole system of methods and principles of how to apply corpora in language studies and teaching/learning” which undoubtedly carries a theoretical status, although it is not a theory in itself (McEnery, Xiao and Tono, 2006, pp. 7-8). The view that Corpus Linguistics is a methodology, although not one applied uniformly in all studies, has been widely supported in seminal works (see, for example, Bowker & Pearson, 2002; McEnery & Wilson, 1996; Meyer, 2002; Teubert, 2005; McEnery & Gabrielatos, 2006). The use of linguistic corpora in the field of Applied Linguistics has expanded rapidly over the past forty years, with the creation of easily accessible systems of electronic storage and analysis, and the ever-growing appreciation of the huge potential of corpora. As computer technology has advanced greatly since the 1960s, it has become feasible to store and access ever larger bodies of language data. Some of the fundamental applications of language corpora are summarised below:
- The use of corpora in translation, mainly in the form of parallel corpora in order to compare the use of apparent translation equivalents;
- Their contribution to literary studies, stylistics and forensic linguistics, since general corpora can be used to establish norms of frequency against which individual texts can be measured, or even to investigate cultural attitudes conveyed through linguistic choices or critical discourse studies (see Stubbs, 1996; Fairclough, 2000; Teubert, 2000; etc.);
- Their implementation in language teaching tasks and processes; corpora, for example, offer valuable information for language teaching about how a language works, such as its phraseology; students are also encouraged to explore corpora in order to observe the usage of words and make comparisons among languages (Aston, 2001) (for more details on the use of corpora in applied-linguistics research studies, see Meyer, 2002; Aijmer and Stenström, 2004; Barnbrook, Danielsson and Mahlberg, 2005; McEnery, Xiao and Tono, 2006; Hornero, Luzón and Murillo, 2008; Römer and Schulze, 2009; etc.); and,
- Lastly and most significantly in the present paper, the contribution of text corpora towards the compilation and production of dictionaries and grammar books.
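The idea of a frequency norm mentioned in the list above can be illustrated with a minimal Python sketch. It is not drawn from any of the cited studies: the function names, the crude tokeniser and the toy sentences are all assumptions made for illustration, and a real project would rely on a properly sampled reference corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokeniser (an assumption): lowercase, keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def per_million(count, total):
    """Normalised frequency: occurrences per million tokens."""
    if total == 0:
        return 0.0
    return count * 1_000_000 / total


def compare_to_norm(text, reference, word):
    """Return (frequency in text, frequency in reference), both per million tokens."""
    text_tokens = tokenize(text)
    ref_tokens = tokenize(reference)
    text_freq = per_million(Counter(text_tokens)[word], len(text_tokens))
    ref_freq = per_million(Counter(ref_tokens)[word], len(ref_tokens))
    return text_freq, ref_freq


# Hypothetical toy data standing in for an individual text and a general corpus.
individual_text = "although it rained we went out although late"
general_corpus = "it rained and we went out although late"
print(compare_to_norm(individual_text, general_corpus, "although"))
```

Normalising to a common base (here, per million tokens) is what makes a single text comparable with a much larger reference corpus; whether a difference is meaningful is a separate, statistical question that this sketch does not address.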
The present article is organised in the following way: in section 2, the benefits of a corpus-driven approach to lexicography will be demonstrated, discussing certain main assumptions and possible implications that emerge when a lexicographer makes use of corpus evidence. Moreover, a number of possible applications of corpora will be briefly illustrated in section 3 in relation to dictionaries, along with arguments about the usefulness of corpora in describing how a language works and what the context can reveal about the items depicted in a dictionary. Lastly, section 4 focuses on the limitations and potential demerits of corpus tools that may emerge from the use and analysis of a language corpus, while certain validity and reliability issues are also explored. It should also be clarified that data extraction and filtering processes applied to corpus data (in order to extract the necessary information for a dictionary) do not fall within the scope of the present article.
Corpora have revolutionised the writing of dictionaries so greatly that it is now virtually impossible for publishing houses to produce a (Learner’s) dictionary that does not claim to be based on a corpus. As a result, even people who have never heard of a corpus use the product of a corpus investigation. Obviously, no dictionary could claim full coverage of a natural language, as lexicographers have to compress and prioritise materials, especially in printed dictionaries, where space is limited. As Sinclair (1996) rightly claims, “whatever lexical information is retrieved from these processes and stored in a lexicon, it will never be capable of describing open text [...] because of the capacity of text to create syntagmatic meanings that cannot be precisely anticipated and may in some instances contrast sharply with stored lexical information” (p. 91). Thus, no lexicographical realisation of a language can fully capture its wealth and complexities. However, corpus-based lexicography attempts to address this weakness, which every dictionary in the past inevitably exhibited, demonstrating at the same time that there is a need for further linguistic research in order to reflect the language in its rich and ever-changing form.
In order to justify the need for employing corpora in contemporary Lexicography, one first needs to explain in what ways traditional lexicographical practices differ from corpus-based approaches. Firstly, the traditional practice in the field of lexicography is the de-contextualisation and isolation of single words; however, especially in the case of function words (as opposed to content words), which have constituted the focal point of my research, this process of describing the meaning of words ‘unadulterated’ by the context where they occur prevents the dictionary user from becoming familiar with the lexical and grammatical environment where such an item is usually encountered. Given that words do not appear isolated but in sentences which actually constitute their “initial, primary environment or context,” the role of any lexicographical team is to provide the user/reader with this type of linguistic environment, the so-called ‘co-text.’ At this point it must be clarified that, although context refers to “the situational context surrounding the speech event in which a word occurs” (Kitis, 2012, pp. 46-47) and is distinguished from the relevant linguistic co-text, in this article the two terms will be used interchangeably, especially as co-text can generate context.
Moreover, in the past, the standard practice was to develop dictionaries in the absence of data validation. However, during the last decades, corpora have proved to be a promising and efficient source for creating both monolingual and bilingual dictionaries with updated information that could never have been collected through traditional, standard methods. Some decades ago in the lexicographical community, the potential of corpus linguistics was exploited merely for the extraction of examples. Nowadays, evidence from text corpora is incorporated in lexicographical products both systematically and to a greater extent. Although the genus proximum and differentia specifica approach, which had been adopted in traditional lexicography to describe word meaning as a language-independent concept, has contributed a great deal to the semantic description of content words (see Teubert, 2007), defining grammatical terms, in particular, necessitates capturing different usages in correlation to the varied contexts in which these occur, or may occur. As a result, corpus linguistics has gradually become a subdiscipline in its own right, and the present paper aims to illustrate achievements and challenges offered by corpus tools. Undoubtedly, lexicography has been one of the major fields where adoption of a corpus-linguistics approach extended the scope of research and introduced new methods; indicatively, one could make reference to the works produced by Sinclair in 1987, Atkins et al. in 1993, Kjellmer in 1994, Herberg et al. in 1997, etc. Accounts of using corpora in a consistent manner to compile dictionaries can be encountered in numerous projects (see, for instance, Baugh et al., 1996; Clear et al., 1996; Summers, 1996; to mention only a few).
By contrast, the majority of Modern Greek dictionaries to date, irrespective of whether they (claim to) have made use of corpora or not, have not systematically or coherently exploited any corpora of the Modern Greek language available to lexicographical teams (Patsala, 2015).
The originality of the corpus-driven approach is that it attempts to make transparent in a number of dictionary entries the evidence found in large text corpora, not only in terms of illustrative examples as initially practised, but also in relation to: i) sense disambiguation, ii) definitions, iii) discourse function and iv) syntactic patterns the lemmas in question occur in. Among the areas in which corpus evidence has significantly supported lexicography as a discipline are frequency, collocation and phraseology, grammatical constructions, and authenticity. Although this new approach still involves a wealth of human judgement, it embodies a great amount of careful qualitative observation and quantitative assessment of corpus evidence, aiming at identifying both semantic distinctions and any idiosyncrasies that the lemmas examined (connectives, in my own research) exhibit in natural language discourse. According to Teubert (2007), it is widely acknowledged that “dictionaries, both monolingual and bilingual, need to be corpus-validated, if not entirely corpus-based” (p. 132). As a result, the author of this article views a corpus-linguistic approach not in opposition to traditional linguistics, but rather as complementary to it. Larger segments of text, and therefore more information about authentic context, can provide the dictionary compiler with recurrent co-occurrences of text elements that can be made visible and, consequently, further explored by quantitative devices.
However, what corpora have offered to dictionaries, both monolingual and bilingual, is not only statistically-defined analyses; they have predominantly enabled and enriched the examination of each lexical item with both required and optional features and constituents of the linguistic context that it occurs in. As a result, lexemes can now be defined and described not only at a lexical, but more significantly at a syntactic and grammatical level. This facility has greatly assisted lexicographers in reducing their reliance on subjective judgements and linguistic intuitions, as the meaning of the words involved can be further investigated within relevant contexts and text segments or clauses.
One major contribution of corpus linguistics to the area of lexicography is associated with (idiomatic) phraseology. The huge amounts of linguistic data collected in corpora have enriched the list of collocations found in previously existing dictionaries (Hanks, 2009, p. 420), either in the form of lists of highly frequent collocations or by bringing to the surface how phrases can appear in illustrative examples. In certain dictionaries, for example the Oxford Dictionary of English, phraseological units are identified in bold type and accompanied by relevant examples. Other Learner’s dictionaries (see, for instance, the Macmillan English Dictionary for Advanced Learners) even make explicit reference to frequency rates or measures of collocates as these have been derived from corpus databases.
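How collocates and a measure of their strength might be derived from corpus data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window size, the function names and the toy token list are hypothetical, and the score is a simplified pointwise mutual information (PMI) without any window-size correction, not the measure used by any particular dictionary publisher.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def pmi(tokens, node, collocate, window=2):
    """Simplified PMI (an assumption): log2 of observed over expected co-occurrence."""
    n = len(tokens)
    freq = Counter(tokens)
    joint = collocates(tokens, node, window)[collocate]
    if joint == 0:
        return float("-inf")  # the pair never co-occurs in the window
    return math.log2(joint * n / (freq[node] * freq[collocate]))


# Hypothetical toy data; a real corpus would contain millions of tokens.
toy = "strong tea and strong coffee but powerful engine and strong tea".split()
print(collocates(toy, "strong", window=1).most_common(3))
print(round(pmi(toy, "strong", "tea", window=1), 3))
```

Raw co-occurrence counts favour very frequent words, whereas association scores such as PMI favour rare ones; in practice lexicographers tend to consult both, and the final choice of which collocations to record remains an editorial decision.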
Another great innovation closely interwoven with the creation and use of corpora is the retrieval and introduction of full-sentence definitions in dictionaries, which was decisively put forward in the COBUILD dictionaries, adding to the user-friendliness of learners’ lexicographical products (Herbst, 1996, p. 322). Given the severe competition between publishing houses with respect to English learners' dictionaries, other publishers (namely, Longman and Oxford University Press) invested greatly in up-to-date editions and innovations that relied heavily on corpus projects.
Struggling to keep a balance between conventions and creativity, and within a paradoxical situation where “introspection is dangerous when used as a source of invented evidence, although necessary for the interpretation of real evidence” (Hanks, 2012, pp. 406-7), lexicographers are called upon to summarise the standard usage of any lexical item while also conveying the creative nature of language, mainly through the examples they include. In that direction, Corpus Linguistics has offered an endless source of examples, which has, on the one hand, considerably increased authenticity and, on the other, reduced lexicographers’ reliance on introspection and speculation when generating ‘invented’ examples. It has been extensively claimed that the most notable way in which corpora have influenced the discipline of Lexicography is “the corpus-derived dictionary example” (Rundell, 1998). Hornby (as cited in Cowie, 1978, p. 131) admitted that an ‘artificial’ example can carry a wide range of information, particularly when space limitations are imposed on lexicographic teams. By contrast, the value of ‘unnatural’ instances of discourse encountered in pre-corpus dictionaries has been strongly disputed by a great number of linguists and lexicographers alike (e.g. Laufer, 1992; Humble, 1998; etc.).
The enhancement of dictionary authenticity through the use of corpora is also mirrored in the analysis of meaning and the sense distinction/disambiguation in dictionaries. More specifically, corpus evidence has added significantly to the degree of detail and precision compared to previous lexicographical endeavours (see Moon, 1987; Rundell, 1998; as well as De Schryver & Prinsloo, 2000 on the microstructure of dictionaries). In the first place, both in widespread and less widely spoken languages, corpus evidence affects lexicographers’ decisions in lemmatising the various word classes of a lexical item and defining the macro-structure; on a second level, the internal structure of a lemma in a corpus-based dictionary is usually mapped onto the evidence retrieved with respect to “sense ordering, listing of collocational and colligational patterns” (De Schryver, 2008, as cited in Hanks, 2012, p. 422). If one juxtaposes, for example, definitions encountered in pre-corpus dictionaries with those that appear in corpus-based lexicography, which rest on explicit and rather objective criteria, the lexicographic definitions inherited from previous centuries seem rather ‘weak’ or ‘empty’:
- helpful:
- giving help; useful (ALD4)
- If you describe someone as helpful, you mean that they help you in some way, such as doing part of your job for you or giving you advice or information (COBUILD2)
- witness:
- be present at and see (ALD3)
- to see something happen, especially an accident, a crime, or an important event (LEA)
(as cited in Rundell, 1998, p. 326)
Another trend encouraged by the development of corpus tools has been the broader span that lexicographers tend to focus on nowadays, moving from the collocational approach (where they also examine issues of ‘semantic prosody’ and ‘valency’) to constructions, mainly at clause level. In order, for example, to be able to support claims about the extent to which structural frequency plays a role in the semantic and/or pragmatic meaning carried by a connective (while exploring the characteristics of a concessive versus a concessive conditional discourse marker, for instance), it is critical first to form an accurate and reliable estimate of the corresponding structural features of the clauses this connective usually occurs in (see Patsala, 2015). Thus, information emerging from large amounts of texts and data points towards the concept of “pre-assembled, more or less fixed, groups of words” or even clauses, which is in alignment with what Rundell (1998) has described as a “gradual shift in focus 'outwards' from the node” (p. 324).
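The notion of looking ‘outwards’ from a node can be made concrete with a key-word-in-context (KWIC) concordance, sketched below in Python. The function name, the `span` parameter and the toy sentence are assumptions introduced for illustration; the sketch does not reproduce the tools used in the studies cited above.

```python
def kwic(tokens, node, span=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy data: a connective as the node, viewed with two span sizes.
toy = "we went out although it rained and although it was late we stayed".split()
for span in (2, 5):
    print(f"span = {span}")
    for left, node, right in kwic(toy, "although", span):
        print(f"{left:>30} | {node} | {right}")
```

Widening `span` moves the view from immediate collocates towards the surrounding clause, which is the kind of evidence needed when the structural environment of a connective, rather than its neighbouring words alone, is under examination.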
Although issues of corpus design are extremely important, it should be clarified that the present article does not address issues related to corpus projects in general, such as the problems of corpus compilation, encoding, annotation, representativeness and validation as processes needed for transforming raw data into language processing software.
Having argued in the previous subsections for the benefits of corpora in lexicography and the study of language, in general, at this point some limitations that the majority of corpus studies may exhibit (following Hunston’s 2002 analysis) will be briefly discussed.
- To begin with, a corpus cannot reveal whether or not a pattern may be employed in a given language, but only whether it is frequent or not. In other words, the issue in focus is typicality, and not well-formedness, as Sinclair (1991) and Biber, Conrad & Reppen (1998) have rightly suggested.
- Along similar lines, a corpus offers evidence and information to its compilers or users; still, linguistic intuition and background knowledge are necessary to help the researcher interpret the rich pool of examples he/she has gained access to.
- Moreover, all conclusions drawn from corpus materials have to be treated as inferences, and not as facts; even when one claims that a specific corpus is representative, the generalisations made based on it are essentially extrapolations, and should be treated as such in any study.
- Furthermore, frequency measurements in Applied Linguistics may suggest that very infrequent uses can legitimately be ignored, but in the case of lexicography one should not be misled into assuming that these should be omitted. On the contrary, these marginal instances or patterns should also be described and included in the relevant (sub)entries whenever feasible.
- In a number of cases, the study of lexical or grammatical items suffers from a major drawback: the fact that the (sub)corpora compiled consist solely of written language. Undoubtedly, when judging the value of corpora, written corpora are not the only ones to be highly valued; spoken corpora can be considered of equal, if not higher, value. Even though a collection of carefully selected samples of spoken discourse by numerous different speakers under diverse acoustic conditions could offer valuable insights, this is not always feasible for certain languages or dialects due to the lack of spoken corpora.
- Lastly, it needs to be highlighted that any corpus reveals two types of linguistic information: a) “normal phraseology” that every lexicographer has a genuine interest in, and b) the “linguistic creativity” of language users (Hanks, 2012, p. 305); yet, there are no clear-cut lines between the literal or recurrent and the figurative use, and thus corpora cannot define the degree of ‘unusual’ or ‘abnormal’ nature of the usage of a lexical item (Hanks, 2006).
The present article has been an attempt to highlight the user-friendliness, usability and functionality of corpora in improving lexicographic products. More advanced features and tools (e.g. TickBox Lexicography) have dramatically improved the infusion of corpus information directly into lemma writing systems, indicating that technological progress can further enhance the application of corpus tools in lexicography. It needs to be noted that the major challenge in the future is to create and establish systems that can “support seamless data transfer”, which would bridge the gap between corpus data and dictionary content (Kilgarriff & Kosem, 2012, p. 54).
Moving from the use of the Bank of English (initiated in the COBUILD project) to the largest corpora built (e.g. the British National Corpus, the Corpus of Contemporary American English etc.), corpus analysis has aspired to become a platform that will allow (or that has already allowed) members of the lexicographical community to describe the meaning of words in a more informed and linguistically enlightened way. Undoubtedly, though, the formulation of a dictionary entry remains the responsibility of Lexicography, and not of Corpus Linguistics. And, although the collection and analysis of corpus data has resulted in radical changes in the field of lexicography, the pedagogical orientation of a Learner’s dictionary should always remain the main priority for a lexicographer.
Aijmer, K., & Stenström, A.B. (Eds.). (2004). Discourse Patterns in Spoken and Written Corpora. Amsterdam: John Benjamins.
Aston, G. (Ed.). (2001). Learning with Corpora. Houston, TX: Athelstan.
Atkins, B.S., Duval, A., & Milne, R.C. (Eds.). (1993). Le Robert & Collins: Dictionnaire Français-Anglais/Anglais-Français (4th ed). Paris: Le Robert.
Barnbrook, G., Danielsson, P., & Mahlberg, M. (Eds.). (2005). Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora. London: Continuum.
Baugh, S., Harley, A., & Jellis, S. (1996). The role of corpora in compiling the Cambridge International Dictionary of English. International Journal of Corpus Linguistics, 1, 39-59.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Bowker, L., & Pearson, J. (2002). Working with specialized language: A practical guide to using corpora. London: Routledge.
Clear, J., Fox, G., Francis, G., Krishnamurthy, R., & Moon, R. (1996). Cobuild: the state of the art. International Journal of Corpus Linguistics, 1, 303-314.
Cowie, A.P. (1978). The Place of Illustrative Material and Collocations in the Design of a Learner's Dictionary. In P. Strevens (Ed.), In Honour of A. S. Hornby (pp. 127-139). Oxford: Oxford University Press.
Cowie, A.P. (Ed.). (1989). Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press. (ALD4)
De Schryver, G.-M. (2008). Why Does Africa Need Sinclair? International Journal of Lexicography, 21(3), 267-291.
De Schryver, G.-M., & Prinsloo, D.J. (2000). Electronic Corpora as a Basis for the Compilation of African-language Dictionaries, Part 2: The microstructure. South African Journal of African Languages, 20(4), 310-330.
Fairclough, N. (2000). New Labour, New Language? London: Routledge.
Hanks, P. (2006). Metaphoricity is Gradable. In A. Stefanowitsch & S.T. Gries (Eds.), Corpus-based Approaches to Metaphor and Metonymy (Trends in Linguistics: Studies and Monographs 171) (pp. 17-35). Berlin: Mouton de Gruyter.
Hanks, P. (2009). The Impact of Corpora on Dictionaries. In P. Baker (Ed.), Contemporary Corpus Linguistics (pp. 214-236). London: Continuum.
Hanks, P. (2012). The Corpus Revolution in Lexicography. International Journal of Lexicography, 25(4), 398-436.
Herberg, D., Steffens, D., & Tellenbach, E. (1997). Schlüsselwörter der Wendezeit. Wörter-Buch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.
Herbst, Th. (1996). On the way to the perfect learners' dictionary: A first comparison of OALD5, LDOCE3, COBUILD2 and CIDE. International Journal of Lexicography, 9(4), 321-357.
Hornby, A.S., Cowie, A.P., & Windsor Lewis, J. (1974). Oxford Advanced Learner's Dictionary of Current English. London: Oxford University Press. (ALD3)
Hornero, A.M., Luzón, M.H., & Murillo, S. (Eds.). (2008). Corpus Linguistics: Applications for the Study of English. Studies in Language and Communication 25. Bern: Peter Lang.
Humble, P. (1998). The use of Authentic, Made-up, and "Controlled" Examples in Foreign Language Dictionaries. In T. Fontenelle et al. (Eds.), EURALEX '98 Proceedings (pp. 593-9). Liège: University of Liège.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Kilgarriff, A., & Kosem, I. (2012). Corpus tools for lexicographers. In S. Granger & M. Paquot (Eds.), Electronic Lexicography (pp. 31-55). Oxford: Oxford University Press.
Kitis, E. (2012). Semantics: Meaning in Language. Thessaloniki: University Studio Press.
Kjellmer, G. (1994). A Dictionary of English Collocations Based on the Brown Corpus. Oxford: Clarendon Press.
Laufer, B. (1992). The Effect of Dictionary Definitions and Examples on the Use and Comprehension of New L2 Words. Cahiers de lexicologie, 63, 131-42.
McEnery, T., & Gabrielatos, C. (2006). English corpus linguistics. In B. Aarts & A. McMahon (Eds.), The handbook of English linguistics (pp. 33-71). Oxford: Blackwell.
McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh: Edinburgh University Press.
McEnery, T., Xiao, R.Z., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge.
Meyer, Ch.F. (2002). English corpus linguistics: An introduction. Cambridge: Cambridge University Press.
Moon, R. (1987). The Analysis of Meaning. In J. Sinclair (Ed.), Looking Up: An Account of the COBUILD Project (pp. 86-103). London: Collins.
Patsala, P. (2015). Semantic and pragmatic aspects of lexicography: The case of concessive connectives in Modern Greek. Doctoral Dissertation. Aristotle University of Thessaloniki.
Römer, U., & Schulze, R. (Eds.). (2009). Exploring the Lexis-Grammar Interface. Studies in Corpus Linguistics 35. Amsterdam: John Benjamins.
Rundell, M. (1998). Recent trends in English pedagogical lexicography. International Journal of Lexicography, 11(4), 315-342.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J. (1996). The Search for Units of Meaning. TEXTUS, IX(1), 75-106.
Sinclair, J. (2005). Corpus and text—basic principles. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 1-16). Oxford: Oxbow Books.
Sinclair, J. (Ed.). (1987). Looking Up: An Account of the COBUILD Project. London: Collins.
Sinclair, J. et al. (Eds.). (1995). Collins COBUILD English Language Dictionary. London & Glasgow: Collins. (COBUILD2)
Stubbs, M. (1996). Text and Corpus Analysis. Oxford: Blackwell.
Summers, D. (1996). Computer lexicography: The importance of representativeness in relation to frequency. In J. Thomas, & M. Short (Eds.), Using Corpora for Language Research (pp. 260-265). London: Longman.
Summers, D. et al. (1997). Longman Essential Activator. London: Longman. (LEA)
Taylor, Ch. (2008). What is corpus linguistics? What the data says. ICAME Journal, 32, 179-200.
Teubert, W. (2000). A province of a federal superstate, ruled by an unelected bureaucracy: Keywords of the Eurosceptic discourse in Britain. In A. Musolff, C. Good, P. Points, & R. Wittlinger (Eds.), Attitudes towards Europe: Language in the Unification Process (pp. 45-88). Aldershot: Ashgate.
Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), 1-13.
Teubert, W. (Ed.) (2007). Text Corpora and Multilingual Lexicography. Amsterdam: John Benjamins.