Searchable lexicons: 176
Fully analyzed lexicons: 122
Total word entries: 588721
Fully analyzed word entries: 368372
Total English modern headwords: 60891
Introduction
Ian Lancashire, Editor
March 1, 2006
Lexical information takes many forms in this period because the dictionary was an emerging genre. The notion of an English-only, monolingual lexicon was late in coming. Only in 1623, with Henry Cockeram's hard-word lexicon, did the term "dictionary" (first employed in English by Sir Thomas Elyot in 1538 for a bilingual lexicon) acquire a sense like that we take for granted today. Historical lexicons also take many different forms. Most LEME lexical texts have word-entries that open with a headword and close with an explanation of that headword, but explanations of words also appear inside informative treatises and literary editions with marginal glosses or notes that explain terminology. Encyclopedic or topical works, such as herbals and books of reference in medicine or law, sometimes offer logical definitions of things in subject-complement ("is-a") form.
Why compile a database of old dictionaries when English has the great Oxford English Dictionary? Oxford lexicographers give a scientific account of the history and meaning of all English words, based on corpus-linguistic principles. That is, quotations support every definition. Now in its second edition, available online, and proceeding to a monumental third edition, the OED grows with the English language. Even a monumental work that covers 1500 years, however, necessarily selects lexical evidence. Jürgen Schäfer observed that Early Modern English quotations in the first edition of the OED predominantly come from major authors and overlook information in monolingual glossaries. Clarendon Press published Schäfer's Early Modern English Lexicography in 1989. It surveys 133 printed glossaries to 1640 and provides new evidence for 5,000 OED entries. The OED has expanded its coverage of authors, thanks to Schäfer's achievement. Yet he does not provide the electronic data on which his extracts are based; and any English lexical expression in the explanations of huge bilingual dictionaries by the likes of Cotgrave, Florio, Minsheu, and Thomas Thomas, is hard to find and thus easily overlooked.
The public version of LEME allows anyone, anywhere, to do simple searches on the multilingual lexical database -- which includes major monolingual English and bilingual English-French-Italian-Latin-Spanish dictionaries -- but lacks advanced retrieval options, such as proximity and Boolean queries, regular expressions, and searching by date, subject, language, position in the word-entry, etc. Bibliographical entries for only the searchable texts are available. Queries can be restricted to one or more individual works in LEME, and up to one-hundred word-entries in which a queried word occurs may be retrieved. Results of searches may be held in a notepad for printing or e-mailing. Context-sensitive help for searching is available.
The public version serves general readers and schools. The licensed version serves colleges and universities.
We welcome comments and suggestions but we do not have sufficient resources to answer all questions.
The primary LEME database of lexical works, in the licensed version, also offers simple and advanced searches, including regular-expression and sub-string queries, and proximity and Boolean searches. The size of search contexts is adjustable. Queries on the lexical database may be restricted by date, author, title, type of lexical work, and subject. A complete word-list of the lexical database may be browsed. An index to over 1,200 known lexical works in the period may be searched by date, author, title, subject, and genre. There is also a biographical index.
The licensed version offers, as well, a new documentary period database of over 10,000 works from the Early Modern era, generously donated by Early English Books Online/Text Creation Partnership. (This is available only to individuals or institutions already subscribing to EEBO/TCP.) Additional texts have been generously made available by the Women Writers Project, the Internet Shakespeare Editions, and Renascence Editions. Searches on the documentary database are not restrictable but there is a complete, browsable word-list of over 2.3 million strings for the period.
There are two kinds of work in the LEME lexical database: analyzed and unanalyzed.
The headwords of analyzed lexical works have been editorially lemmatized and segmented by headword, explanation, sub-headwords, sub-explanations, and cross-references. Lemmatization and segmentation enable us to restrict searches of analyzed lexical works by the usual modernized, lemmatized spelling of their headwords, and by the position of the words to be retrieved. Analyzed lexical works can also be displayed, page by page, entry by entry. Each analyzed word-entry has a permanent URL so that an online scholarly edition, dictionary, or critical work can cite a LEME word-entry in confidence that any online reader will be able to retrieve it.
Unanalyzed lexical works cannot be displayed, searches on them are global rather than restrictable, and these word-entries lack any permanent URL. LEME often includes a lexical work, in advance of its analysis, because more readers will want the opportunity to search a lexicon than to read one, entry by entry, given that EEBO/TCP and several well-known facsimile series make the texts of historical lexicons available as books or images of books.
There are advantages, however, in browsing a displayed and analyzed lexical text. A reader can select specific headwords and can bring up all other word-entries with the same headword and so compare how different lexicographers, over time, have commented on or used the word. Displays of printed books optionally give a link to Early English Books Online where an online facsimile of that text-page may be consulted. Readers can also list all normalized headwords that are unique to any one lexical work. Such lists identify headwords that either eluded most monolingual English lexicographers (perhaps they were an obvious term from the mother tongue) or that entered and left the language quickly.
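The list of headwords unique to one lexical work can be pictured as a set difference. The following is a minimal sketch under stated assumptions: the function name, the dictionary layout, and the sample headwords are all hypothetical and do not describe LEME's actual database or its contents.

```python
def unique_headwords(headwords_by_work):
    """Map each work to the normalized headwords found in no other work."""
    unique = {}
    for work, headwords in headwords_by_work.items():
        # Gather the headwords of every other work in the collection.
        others = set()
        for other_work, other_headwords in headwords_by_work.items():
            if other_work != work:
                others |= set(other_headwords)
        # Whatever remains after removing shared headwords is unique.
        unique[work] = sorted(set(headwords) - others)
    return unique


# Invented sample data, for illustration only.
sample = {
    "Work A": {"death", "castle", "abalienate"},
    "Work B": {"death", "castle"},
    "Work C": {"death", "familiar"},
}
result = unique_headwords(sample)
```

A work whose headwords all recur elsewhere, like "Work B" here, simply yields an empty list.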
Knowing the position of a retrieved term in any word-entry is advantageous. For example, any term retrieved from a headword position in a monolingual English dictionary or glossary will usually be the direct subject of that entry. In this instance, a lexicographer explains an English word forthrightly, as Henry Cockeram (1623) does in the word-entry, "Death. Mortality." Any English term retrieved from the explanation position in a bilingual or polyglot lexicon, on the other hand, generally translates a foreign-language word. Such English terms are not the raison-d'être of the word-entry. That an English word corresponds to a foreign-language term helps us normally only if we understand that other language. The entry, "Castillo a castell," in John Thorius' rendering of a Spanish lexicon by Anthony de Corro, for instance, only helps us understand English "castell" if we know the meaning of the Spanish headword independently. Other corresponding English terms in the explanation segment of a bilingual lexicon, of course, are sometimes synonyms for the English word that is the subject of the query. However, the multiple senses of foreign-language words may bring together English terms in an explanation that are not semantically related in English itself. Consider John Florio's word-entry in 1598:
Famigliare, to set vp houshold, to become familiar, to tame. Also familiar, tame, gentle, acquainted, conuersant, a houshold guest.
When two people decided to live together in Elizabethan England, they did not necessarily intend to "tame" one another, Shakespeare's The Taming of the Shrew notwithstanding. Collocational proximity of English words in the explanation segment of a foreign-language lexicon does not guarantee their synonymity.
For this reason, LEME enables readers to limit searches to different genres of lexical works, such as hard-word dictionaries on the one hand, and bilingual and polyglot dictionaries on the other. A word is marked differently by different lexical genres. Are you interested in new words introduced into English from other tongues? Look at hard-word lexicons. Do you want to investigate the mother tongue? Consider the English words employed in bilingual and polyglot lexicons. Foreign-language dictionaries normally use common, well-accepted English words to explain foreign vocabulary.
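Restricting a query by genre and by position in the word-entry might be sketched as follows. This is an illustration only: the WordEntry fields, the genre labels, and the search function are hypothetical names, not LEME's actual schema or query interface. The three sample entries are the ones quoted above.

```python
import re
from dataclasses import dataclass


@dataclass
class WordEntry:
    work: str         # short label for the lexical work
    genre: str        # hypothetical genre label: "hard-word" or "bilingual"
    headword: str     # text in headword position
    explanation: str  # text in explanation position


def words(text):
    """Split text into lower-case alphabetic words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def search(entries, term, genre=None, position=None):
    """Return entries containing term, optionally limited by genre and position."""
    term = term.lower()
    hits = []
    for entry in entries:
        if genre is not None and entry.genre != genre:
            continue
        in_headword = term in words(entry.headword)
        in_explanation = term in words(entry.explanation)
        if position == "headword":
            matched = in_headword
        elif position == "explanation":
            matched = in_explanation
        else:
            matched = in_headword or in_explanation
        if matched:
            hits.append(entry)
    return hits


ENTRIES = [
    WordEntry("Cockeram 1623", "hard-word", "Death", "Mortality"),
    WordEntry("Thorius", "bilingual", "Castillo", "a castell"),
    WordEntry("Florio 1598", "bilingual", "Famigliare",
              "to set vp houshold, to become familiar, to tame. Also familiar, "
              "tame, gentle, acquainted, conuersant, a houshold guest"),
]

headword_hits = search(ENTRIES, "death", position="headword")
explanation_hits = search(ENTRIES, "tame", genre="bilingual", position="explanation")
```

Here "death" is found as the direct subject of Cockeram's entry, while "tame" turns up only as one of several English equivalents in Florio's explanation.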
The licensed LEME database includes a large primary bibliography of over 1000 early works known to have lexical information about English. LEME at present thus holds somewhat under fifteen percent of the lexical works of the Early Modern period. This total does not include the multiple versions or editions through which any one of these texts went. The primary bibliography of all these lexical works -- not just lexical works in the searchable subset -- is indexed by author, title, date, subject, and genre. The bibliographical entries served by these indexes give an example of the lexical content of the work and as full a coverage of secondary literature about it as possible.
Search requests for a term often give a reader hundreds of word-entries. Terms may occur anywhere in an entry; and some occurrences do not help much with its meaning. The Modern Headwords word-index and search in LEME gives the reader-in-a-hurry only those word-entries that explicitly explain the search-term. Modern Headwords retrieves word-entries that include, in the case of English, a headword in multiple alternate spellings, forms, and inflections. Headwords may be selected manually from a ready-made list. English ones can be retrieved both in alphabetical order and by part of speech: an advantage, because many identically-spelled word-forms are both nouns and verbs, or adjectives and nouns. The analyzed part of the lexical database, in this way, gives readers a way of reducing the quantity of word-entries retrieved, and at the same time of increasing their relevance. The Modern Headwords word-list resembles the collected headwords in a regular dictionary. It irons out some differences in structure and spelling that characterize words in the dictionary world in the centuries before lexicographical standardization.
LEME processing of headwords for the Modern Headwords word-list has two stages: first we convert old-spelling forms into modern ones, and then we lemmatize modern spellings. LEME accepts the spelling of headwords in the Oxford English Dictionary as standard modern spellings because the Early Modern period lacks any accepted spelling system. This rule-of-thumb has some odd effects. It retains old spellings such as "murther" (which has a different OED headword from "murder") and abandons ones like "church-esset" (which becomes "church-scot"). Lemmatization reduces inflected forms to a standard inflection. For example, it converts all forms of a verb to the infinitive (except present participles that are used as nouns, and past participles that are used as adjectives), and the plural and genitive forms of a noun to the nominative case.
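The two stages can be sketched as two table look-ups applied in sequence. This is a simplified illustration, not LEME's editorial procedure: the tables and function names are hypothetical, the "castell" rows are assumptions added for the example, and a plain look-up ignores the part-of-speech exceptions (participles used as nouns or adjectives) noted above.

```python
# Stage 1 table: old spelling -> modern (OED-style) spelling.
# Hypothetical and tiny; the real mapping is editorial and far larger.
MODERN_SPELLING = {
    "church-esset": "church-scot",  # pair mentioned in the text
    "murther": "murther",           # retained: it has its own OED headword
    "castell": "castle",            # illustrative assumption
    "castells": "castles",          # illustrative assumption
}

# Stage 2 table: modern inflected form -> lemma.
LEMMA = {
    "castles": "castle",
    "abalienated": "abalienate",
    "abalienating": "abalienate",
}


def modernize(form):
    """Stage 1: convert an old-spelling form to its modern spelling."""
    key = form.lower()
    return MODERN_SPELLING.get(key, key)


def lemmatize(modern_form):
    """Stage 2: reduce a modern spelling to its standard inflection."""
    return LEMMA.get(modern_form, modern_form)


def modern_headword(form):
    """Apply both stages in order."""
    return lemmatize(modernize(form))


examples = [modern_headword(f) for f in
            ["Castells", "church-esset", "murther", "abalienating"]]
```

Keeping the stages separate means a spelling decision (such as retaining "murther") never has to be repeated for each inflected form.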
Entering a search term oneself involves some guesswork as to spelling and inflection. I may look for "abalienating" and come up empty, although the infinitive form, "abalienate," exists. To help readers find such elusive search terms, the licensed version of LEME offers three different word-lists for browsing and for selecting strings to search for: Modern Headwords (for English lemmas), a LEME word-list (for all words, in whatever language, from the lexical works), and a period word-list (for all strings in the EEBO/TCP corpus). Readers browsing for search terms in a modernized edition of an Early Modern English work should use Modern Headwords first. Readers using old-spelling editions should use the LEME or the period word-list to browse for search terms.
Although LEME lexical works offer explicit information about Early Modern English vocabulary, all English writings of the period give implicit evidence of vocabulary size and word-meaning. The LEME period database offers a word-list of over 10,000 of these non-lexical texts, courtesy of EEBO/TCP. By comparing the complete word-list of LEME lexical texts with the complete word-list of the documentary database, we can get a more complete picture of the state of the language. How big was it, say, in Shakespeare's lifetime? Several answers to that question can be imagined. One could lemmatize all printed and manuscript English works surviving from those years and then count the lemmas (i.e., count the infinitive forms of verbs, the nominative singular forms of nouns, etc.). Another answer might simply conflate all spellings of the same inflected word-form and then count the total number of word-forms (i.e., all forms of a verb, no matter how inflected, so that "abalienate," "abalienates," "abalienated," etc., would be different words). We index EEBO/TCP texts as they are. That many strings in the period word-list are not genuine words but errors or inappropriately-split words, that is, substrings of words, does not in any way lessen their value.
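The different answers to the size question correspond to different ways of counting the same tokens. The sketch below compares three counts: distinct strings as they stand, distinct word-forms after conflating spellings, and distinct lemmas. The function, the two mapping tables, and the sample tokens (including the variant spelling) are invented for illustration and are not drawn from LEME or EEBO/TCP data.

```python
def vocabulary_sizes(tokens, spelling, lemma):
    """Return (distinct strings, distinct word-forms, distinct lemmas)."""
    # Count 1: every distinct string, indexed as it is.
    strings = set(tokens)
    # Count 2: conflate variant spellings of the same inflected form.
    word_forms = {spelling.get(s, s) for s in strings}
    # Count 3: reduce each word-form to its lemma.
    lemmas = {lemma.get(w, w) for w in word_forms}
    return len(strings), len(word_forms), len(lemmas)


# Invented sample: one verb in several spellings and inflections.
tokens = ["abalienate", "abalienates", "abalianated", "abalienated", "abalienate"]
spelling = {"abalianated": "abalienated"}  # variant -> conflated spelling
lemma = {"abalienates": "abalienate", "abalienated": "abalienate"}

sizes = vocabulary_sizes(tokens, spelling, lemma)
```

For this sample the three counts are 4, 3, and 1, which shows how far an estimate of vocabulary size depends on the unit counted.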
Transcriptions retain original spelling, capitalization, and pointing or punctuation. For example, u and v, and i and j, are kept as they appear in the original. The following exceptions to this general practice hold: variant forms of s, r, / (virgule), end-of-line hyphen (= or -), and & (ampersand) are normalized to their modern forms; capital or long-I is represented as I, and capital ff as F; y with a superior dot becomes y; and all ligatures except æ and œ are separated. Capitalization varies widely from manuscript to manuscript. What, in general, may and may not be a capital (for example, the large form of A found in certain court hands) is determined by its presence or absence in medial position. Greek and Hebrew characters are reproduced as faithfully as possible. Transcriptions record, in tags found in the underlying text, what forms of script (Anglicana or court hand, secretary, italic, round, etc.) and typeface (black letter, italic, roman, lapidary or monumental, etc.) are used at any point in a text but do not render the original script or typeface in the displayed text itself.
Format is displayed only where it has functional significance. For example, paragraphing and lists are retained, but lineation is determined by the size of the display screen, and soft hyphens are not shown and do not cause a new line-break. Columns, hanging words, and the text of running titles, signatures, and catchwords are not retained, although folio and page numbers and signatures appear in a pagination tag. (We try to keep original lineation, however, in the underlying encoded text for ease of proofreading.) Tables present special problems: LEME reformats tabular word-entries so that they may be easily and faithfully read.
Some normalization occurs. Variation in the space separating two words is normalized to a single space. Sometimes faint printing has left certain pairs of characters confused (such as f and long-s, e, c, and t, and u and n), although in context only one is possible. In these instances, which can be very numerous in some printed works, the correct letter is silently preferred rather than an incorrect letter that would require emendation.
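Two of the normalizations described above, modernizing the long s and reducing spacing to a single space, can be sketched in a few lines. This is an assumption-laden illustration, not LEME's transcription software: the function name is hypothetical, and only the long s (U+017F) stands in for the "variant forms of s." Original u and v, and i and j, are deliberately left untouched.

```python
import re

LONG_S = "\u017f"  # the long s character


def normalize_transcription(text):
    """Apply two sample normalizations while retaining original spelling."""
    # Variant form of s is replaced by the modern s.
    text = text.replace(LONG_S, "s")
    # Variation in spacing between words is reduced to one space.
    text = re.sub(r" {2,}", " ", text)
    return text


sample = "to  \u017fet vp   hou\u017fhold"
normalized = normalize_transcription(sample)
```

The result keeps "vp" and "houshold" in their original spelling; only the letter-form and the spacing change.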
Abbreviations and contractions are expanded but are indicated in the text by coloured highlighting. A code for a brevigraph in the original appears in a tag surrounding the expansion in the underlying source text. The interpretation of a code for a brevigraph in the underlying encoded text may vary from one work to another because the practices of scribes and compositors vary. A few very common marks of abbreviation, particularly those still current now -- such as ampersand and types of money -- are unexpanded.
Editorial emendations and letters supplied for damaged text are noted in the underlying source text and marked by a colour change in the displayed text itself. Errata are treated as authorial emendations. A word incorrectly separated into two or more fragments, or two or more words incorrectly joined, are explicitly emended. The original erroneous reading and the source of the emendation, if any, appear in a tag surrounding the correction in the underlying source text.
Undated printed books are dated according to STC or Wing. Manuscript texts generally are dated by limits.
Identification of sources for quotations or citations, and of proper and place names, is not yet attempted. A word-search will bring up all word-entries from preceding lexical works from which a word-entry may have been copied. LEME does not translate words or passages from non-English languages into English.
A look at the source text underlying a word-entry will show the XML-like tagging by which segmentation and editorial functions are managed. The standard LEME model for word-entries imposes a simple structure: a form (with an editorially-assigned headword or lemma), an explanation, a cross-reference, a sub-form and a sub-explanation, a class type, and a term or an expression. Word-entries are gathered within word-groups, and (where useful) word-groups into sections. This model does not imply much about the lexical theory held by any Early Modern English lexicographer. The word-entry itself, in a discursive text, may have few explicit formal features. A LEME "word-entry" is a sometimes uncertainly demarcated passage that characterizes a word or a phrase. By "form" is meant the word or a phrase that may reference some thing or that is being translated (from another language), logically defined (as a thing), expounded (conceptually), or explained etymologically (as a word). Usually the form comes first, identified typographically by typeface or script. The explanation typically follows, marked often by a distinctive typeface or script, and characterizes the form in some way. The "explanation" segment in foreign-language dictionaries, clearly, gives corresponding expressions for, not analysis of, the headword. In prose treatises, the explanation may precede the form and may be typographically undistinguishable from it. Thus an "explanation" includes equivalents (or corresponding terms), synonyms, logical definitions (of things, not words), conceptualizations, etymologies, and even anecdotal digressions. It would be misleading to subdivide the explanatory, post-lemmatic segment of most early word-entries into different senses unless they are explicitly delimited by the lexicographer.
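The shape of this model can be suggested with a small parsing sketch. The tag and attribute names below (wordgroup, wordentry, form, lemma, explanation) are hypothetical stand-ins chosen for the example; they are not LEME's actual tag set, and the real encoding records much more (cross-references, sub-forms, emendations, and so on). The two entries are the Cockeram and Thorius examples quoted earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging, for illustration only.
SAMPLE = """
<wordgroup>
  <wordentry>
    <form lemma="death">Death.</form>
    <explanation>Mortality.</explanation>
  </wordentry>
  <wordentry>
    <form>Castillo</form>
    <explanation>a castell</explanation>
  </wordentry>
</wordgroup>
"""


def segment_text(node):
    """Return the stripped text of a segment, or None if it is absent."""
    if node is None or node.text is None:
        return None
    return node.text.strip()


def read_entries(xml_text):
    """Read each word-entry into a dictionary of its segments."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("wordentry"):
        form = entry.find("form")
        entries.append({
            "form": segment_text(form),
            # The lemma is editorially assigned, so it may be missing.
            "lemma": form.get("lemma") if form is not None else None,
            "explanation": segment_text(entry.find("explanation")),
        })
    return entries


entries = read_entries(SAMPLE)
```

Every segment is treated as optional, so an entry with no form or no explanation is still read rather than rejected.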
LEME interprets lexical texts when it imposes this minimalist model on them but does so only to the degree necessary for input into a relational database. LEME makes as few modern assumptions about lexical meaning as possible. By not encoding a passage in a text as a "word-entry," of course, LEME characterizes that passage as lexically unformed.
Proofreading is carried out during data-entry, encoding, and lemmatization. A policy of linking word-entries to EEBO images of the original text indicates that historical dictionaries, typeset in the English Renaissance, have many ways to resist accurate transcription. Correction of mistakes wherever they are found is an ongoing activity of LEME.
The Early Modern English Dictionaries Database (EMEDD), LEME's predecessor, was free. It enabled registered researchers to do single-word and proximity searches on one or all of these lexicons. Word-entries were retrieved from an Open Text Corporation Pat textbase of about 200,000 word-entries in all. Small to very large contexts were allowed, and a total of 100 hits was permitted for each search. Texts were segmented by word-entry in order to be processed by Pat and output with Patterweb, an interface developed by my former student Mark Catt.
The design of LEME texts has changed over time. Ten years ago, influenced by the Text Encoding Initiative, they were SGML-encoded diplomatic transcriptions of single copies of original editions. Printer's matter (bibliographical text such as running titles, signatures, and catchwords), for example, was retained. Problems attended this original encoding. TEI guidelines for encoding modern dictionaries, at that point, did not well serve the experimental structures employed in early lexicons. Headwords, for example, turn up embedded in marginal notes, sentences, and tables, as well as where we expect to find them: an initial word, highlighted by means of a font, and followed by fields of information like inflection, etymology, definition, and illustrative citations. Sometimes headwords look like Latin but are treated as English, so that even as simple a matter as the language of a headword is unclear. The order of segments in a word-entry is highly variable. The dictionary attached to the standard school grammar by William Lily and John Colet, for instance, creates word-entries with a Latin headword, a Latin explanation, and an English equivalent. The post-lemmatic segment of most Early Modern English lexical works seldom holds lexical definitions as we know them. Some lexicons, like Randle Cotgrave's great French-English dictionary (1611), have a well-articulated structure, including quotations in the original language and translations of them, and streams of sub-entries for phrasal constructions. Other texts fill the explanation with equivalents in several languages, discussion, historical, political, religious, and literary comments, helpful advice, personal remarks, and even (occasionally) nothing at all.
Eventually, resisting an impulse to encode every feature, recognizing that tagging involves a degree of interpretation, I segmented texts lexically by tags into alphabetical or topical word-groups, a sequence of word-entries inside them, and forms (headwords) and explanations within a word-entry. A word-group might have only one word-entry, and forms and explanations could occur anywhere, or nowhere, in a word-entry. The more lexical works placed in the database, the less prescriptive LEME's model had to become.
A four-year grant by the Social Sciences and Humanities Research Council of Canada enabled me to begin LEME in 2000. In the next few years, the University of Toronto Library and I started building the LEME database. Dr. Marc Plamondon acted as programmer, supervised by Sian Meikle and myself. By mid-2004, a very generous sub-grant from Geoffrey Rockwell's TAPoR project at McMaster University, part of a six-institution Canada Foundation for Innovation (CFI) grant, enabled us to produce the alpha version of LEME for release to that network of partners in November 2004. This became a beta-version in mid-2005. Another grant from SSHRC in early 2005 further increased the pace of LEME data-entry and programming. The University of Toronto Press and Library formally launched LEME on April 12, 2006.
Public funding, the strong support of colleagues in peer review, and partnership with the University of Toronto Library and Press (thanks to the unflagging support of Carole Moore, the Chief Librarian) have made LEME possible. Its public accessibility arises from the process of its making. Its licensing reflects the widely perceived need for its continued growth.
Special thanks are owed to the hard-working people who toiled part-time, with me, on what would be LEME for nearly two decades. While raising a young family, Gail Richardson did initial text-entry for Palsgrave, Thomas Thomas, and Cotgrave in the early 1990s: a remarkable achievement. In the mid-1990s, a SSHRC grant enabled Geoffrey Booth, Maria Dumity, Allison Hay, Sharine Leung, Rosemary Newman, Katharine Patterson, and Jonathan Warren to help enter many English hard-word lexicons. Dr. Jean F. Shaw proofread Cotgrave. Jeffery Triggs and the North American Reading Program for the Oxford English Dictionary donated Mulcaster. Computer Input Corporation entered Florio (1598) and Cowell. Raymond Siemens donated Cawdrey. The Perseus Project offered Cooper (1584) and Florio (1611), whose proofreading and correction (very sizable tasks) continue. John Blankenship, a graduate student in Classics here, proofread Cooper, and my able research associate and student Shannon Robinson undertook Florio (1611). Both Robinson and Jennifer Roberts-Smith, a pair of research associates whom it was my good fortune, in need, to find, transcribed still more lexicons and extended the encoding model in 2002-04. The Chicago Input Center processed several texts, including Levens. Sarah Greene, with LEME now, is the longest employed of my assistants: she has transcribed both major and minor texts, notably Withals and Kersey. Since 2004, Alexandra Pimenidis, a student of Spanish linguistics, has meticulously re-encoded and checked a group of very large bilingual dictionaries for SQL data-entry and entered Laurence Nowell's lexicon of Old English. Without EEBO/TCP versions of Phillips (1658), Wilkins and Lloyd (1668), and Elijah Coles (1676), those three important lexicons would not be part of LEME. Like many in Early Modern English studies, I owe Shawn Martin much for his generous help.
Of late, Jason Boyd, Cheryl Cabellero, Brianna Goldberg, Anna Guy, Eliza Radovici, Gabriela Mircea, Clare Orchard, Siri Paulson, Julie Penner, and Peggy Skhuda have contributed valuably to LEME. Nancy Prior, most recently, entered John Baret's formidable An Alveary or Triple Dictionary, in English, Latin, and French (1574).
My greatest debt of thanks is to Dr. Marc Plamondon, LEME's superb programmer, with whom it has been my privilege to work for eight years. Without his dedication, and without the great University of Toronto Library that gave him, LEME, and myself a home, LEME would have remained just a hope. Their expertise, their collegiality and faith in this project, the unflagging professional diligence with which they marshalled these lexical texts, and the heart-warming personal encouragement they gave me over some lean years have made me possibly the most fortunate computing humanist of my generation. My own personal efforts in all this, I dedicate to Anne.
With a laboratory in the Robarts Library, and with institutional sponsorship and publication, LEME now looks to its new Editorial Advisory Board, and to our community of readers, for guidance on how it will grow, season by season, from now on.