January 1996 newsletter

Words, words, words: the OED's historical corpus

The idea for a full-text historical corpus for use in the revision of the OED has its roots in the old paper concordances that the editors would consult as part of their research on the words being drafted. The editors were constantly popping into different rooms of the old, sprawling library and reaching for these heavy old volumes - sometimes getting lucky, and sometimes not. It seemed a natural area for computers to make life a lot easier, and the way seemed straightforward enough.

In fact, it was not so simple. There was a lot of talk about collecting full-text corpora, and there were a lot of historical texts 'out there' in places like the Oxford Text Archive and Project Gutenberg, but making such a corpus practically available to the editors in Oxford was not so easy. Most of the corpora had been collected by synchronic linguists and represented exclusively modern texts, often without readily available bibliographical details of the sort needed by the OED. Such texts were fine if all you wanted to know was that a certain word appeared more than 50 times with a certain collocate, but not if you wanted to harvest citations from them.

Given the choice, paper was still faster and better. The historical full texts that did exist, on the other hand, had not been collected into a corpus (and so were not easily accessible), and they came in a bewildering variety of forms. In the early days of the Text Encoding Initiative, before the advent of the Web, few people bothered with SGML when creating their electronic texts. Once again, the tried and true 'paper' methods, with their origins as far back as Johnson, were not about to be displaced.

What we needed was a large collection of texts in homogeneous form that could be quickly scanned using a powerful search engine like Open Text's Pat. It was obvious that if we were going to acquire such a corpus, we would have to start building it ourselves more or less from scratch, and in April 1992 we began doing just that, starting with Hardy's Far From the Madding Crowd and Dickens's A Christmas Carol.

For the first six months, we muddled along 'in the dark' with a pseudo-SGML tagging system (still better than no tagging at all), collecting about 2 million words of text. Then, in the fall of that year, two events caused the Historical Corpus to catch fire.

In October, I attended a talk given by John Price Wilkin at a conference at the Centre for the New OED in Waterloo, Ontario. Price Wilkin described his own historical corpus, which was encoded in SGML and searchable with Pat. My reaction can perhaps best be summed up in Hamlet's breathless words: 'O my Propheticke soule'. Suddenly, we were no longer alone in this - and it was obviously doable. That conference marked the beginning of a friendship and collaboration that, I am pleased to report, continue to this day.

The second event occurred during my visit to Oxford in November of that year. We met with Lou Burnard of the Oxford Text Archive. Burnard, who was deeply involved at that time in preparing the TEI guidelines for SGML, had lots of texts, but very few marked up in TEI SGML. We wanted texts to mark up - and were happy to adopt the TEI standard. This meeting led to an important collaboration between OED and OTA. OTA began to supply us with 'raw' texts, and we began to mark them up.

We also converted our 2-million-word corpus to TEI SGML and deposited the texts, as we have done with all our texts, at OTA for use by the scholarly community in general.

In the years since then, the OED's Historical Corpus has grown to more than 45 million words, ranging from Beowulf to Virginia Woolf. Since the texts are kept in a homogeneous SGML form and are searchable as a corpus, it is possible for an editor to consult every text with sophisticated queries in a matter of seconds, and thus time is saved for the real, analytical work of lexicography.

The task is far from finished, however. We are continuing to build the Historical Corpus and eagerly seek collaborations with scholars who have, or are creating, electronic texts. It is not necessary that these texts be in SGML form already, though it is best if they are in some sort of public exchange format, such as ASCII or RTF. If you would like to be involved in such a collaboration, feel free to contact us.