LEME Database Encoding Practice

Tagging in LEME adds information about the features of Early Modern English word-entries so as to support advanced searching. For example, we tag a word by its language so that researchers can restrict their searches to that language. We also encode texts so as to expand abbreviations and emend typos, thereby making word-entries easier to read.

LEME uses tags that are explicitly called for by lexicographical practices in this period. Early Modern lexicographers are seldom interested in variant spellings or syllabification; as a result, we do not have tags for these. LEME also does not require a conventional order of elements in a word-entry. If a lexicographer wants to reverse the order of headword and explanation, use sentences as headwords, or permit etymons inside an explanation, so be it. We do not use the word "definition" for an explanation because, until the first quarter of the eighteenth century, the word had only a non-lexical meaning: definitions applied to things, not words. An explanation may be idiosyncratic. It can include translated equivalents, synonyms, logical definitions (of things), conceptualizations, etymologies, and even anecdotal digressions. Word-entries may appear as running prose within paragraphs.

The simpler the LEME tag-set can be, the better. When you drill down into a word-entry, LEME lets you see this encoding. All word-entries have these three structural tags:

<wordentry type="a">
<form lang="en">
</form>
<xpln lang="en">
</xpln>
</wordentry>

Each word-entry contains a headword (form) and an explanation (xpln). Each tag has a small number of features, or attributes. The word-entry here is marked (type="a") as sorted alphabetically within the dictionary. Both form and xpln tags can carry an attribute, lang, that specifies their language. Most tags come in pairs: the opening tag begins the annotation of the text that follows, and the closing tag ends it.
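
The three-tag skeleton can be read mechanically. The following is a minimal Python sketch, not LEME's own software: it assumes the word-entry is well-formed XML (LEME's working tags are database-oriented and may not always be), and the sample text and the function name read_wordentry are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative word-entry in the three-tag shape (wordentry, form, xpln).
SAMPLE = """<wordentry type="a">
<form lang="en">
dittie</form>
<xpln lang="en">
the matter of a song.</xpln>
</wordentry>"""

def read_wordentry(markup):
    """Return the headword, the explanation, and their attributes as a dict."""
    entry = ET.fromstring(markup)
    form = entry.find("form")
    xpln = entry.find("xpln")
    return {
        "type": entry.get("type"),          # how the entry is sorted
        "form": form.text.strip(),          # headword text
        "form_lang": form.get("lang"),      # language of the headword
        "xpln": xpln.text.strip(),          # explanation text
        "xpln_lang": xpln.get("lang"),      # language of the explanation
    }

record = read_wordentry(SAMPLE)
print(record)
```

Restricting a search to one language then amounts to comparing form_lang or xpln_lang against the language code wanted.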

Tags also describe other structures inside word-entries. Both form and xpln may enclose other tags, some of which are textual: expan marks an expanded abbreviation, emend the LEME editor's emendation of a typo, correction the scribe's change of a word or passage in a manuscript text, and damage a section of the text that is obliterated. Inside a word-entry, term tags mark a change in language and xref marks a cross-reference. Each word-entry may also have, as an internal feature, an attribute that LEME calls a lexeme, which gives the modern spelling and part of speech of the corresponding headword in the Oxford English Dictionary.

<wordentry type="h">
<form lexeme="ditty(n)">
dittie</form>
<xpln>
the matter of a song.</xpln>
</wordentry>

Imagine tags as Russian dolls. The outermost two-part leme tag encloses a number of inner sections, which may enclose several smaller gatherings, such as different glossaries. Each glossary may consist of groups of word-entries, such as for words beginning with A, B, C, etc., or for word-entries about a series of topical subjects.
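
The nesting can be pictured by walking a small tree. This is a sketch under assumptions: the tag names are taken from the list of LEME tags and written in lowercase, the markup is treated as well-formed XML, no attributes are shown because their names are not given here, and entry_paths is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical nesting: one text, one section, one word-group, one word-entry.
NESTED = """<leme>
<section>
<wordgroup1>
<wordentry><form>dittie</form><xpln>the matter of a song.</xpln></wordentry>
</wordgroup1>
</section>
</leme>"""

def entry_paths(markup):
    """Return (chain of enclosing tags, headword) for every word-entry."""
    results = []

    def walk(node, path):
        here = path + [node.tag]
        if node.tag == "wordentry":
            results.append(("/".join(here), node.findtext("form").strip()))
            return  # do not descend further into the word-entry itself
        for child in node:
            walk(child, here)

    walk(ET.fromstring(markup), [])
    return results

paths = entry_paths(NESTED)
print(paths)
```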

The art of assigning lexemes to old-spelling words is called lemmatization. Lemmatization enables LEME to respond to search queries expressed in modern spellings by listing word-entries whose forms and explanations use old spellings and inflections of the search term.
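
A small index shows how a modern-spelling query can reach old-spelling headwords. This is only a sketch: it assumes that every lexeme value has the shape spelling(part-of-speech), as in "ditty(n)", and that the markup is well-formed XML. The second entry, with the spelling "ditie", is invented for illustration, and build_index is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

ENTRIES = [
    # Word-entry quoted above.
    '<wordentry type="h"><form lexeme="ditty(n)">dittie</form>'
    '<xpln>the matter of a song.</xpln></wordentry>',
    # Invented entry, included only to show two spellings under one lexeme.
    '<wordentry type="h"><form lexeme="ditty(n)">ditie</form>'
    '<xpln>a song.</xpln></wordentry>',
]

# Assumed lexeme shape: modern spelling, then part of speech in parentheses.
LEXEME = re.compile(r"^(?P<lemma>.+)\((?P<pos>[^()]+)\)$")

def build_index(entries):
    """Map each modern spelling to its (old spelling, part of speech) pairs."""
    index = defaultdict(list)
    for markup in entries:
        form = ET.fromstring(markup).find("form")
        match = LEXEME.match(form.get("lexeme", ""))
        if match:
            index[match.group("lemma")].append(
                (form.text.strip(), match.group("pos"))
            )
    return index

index = build_index(ENTRIES)
print(index["ditty"])
```

A query for "ditty" then returns both old-spelling headwords, although neither is spelled that way in its source text.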

In modern dictionaries, word-entries often have sub-entries that describe different senses of the headword. Many Early Modern lexical works do not clearly mark senses but all the same may have reams of sub-entries. Here is a sample from John Rider's English-Latin dictionary of 1589:

<wordentry type="h">
<form lang="en">
To Chaine, or tie in, or with chaines. </form>
<xpln lang="la">
1 Cateno.
<subform lang="en">To chaine together.</subform>
<subxpln lang="la">
1 Concateno.</subxpln>
<subform lang="en">
Chained.</subform>
<subxpln lang="la">
1 Catenatus, p. catenarius, ad</subxpln>
<subform lang="en">
A chaine.</subform>
<subxpln lang="la">
1 Catena, f.</subxpln>
<subform lang="en">
A little chaine.</subform>
<subxpln lang="la">
1 Catenula, catenna, f. catellum, n.</subxpln>
<subform lang="en">
A chaining.</subform>
<subxpln lang="la">
1 Catenatio, f.</subxpln>
</xpln>
</wordentry>

Rider's practice does not conform to modern expectations in some respects. Note how he places a miniature monolingual English word-entry within the headword form, and how his sub-entries sometimes give what resembles a different sense ("To chaine together") and at other times list various derivatives of the "Cateno" word-family (which all share the same root).
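
Because each subform is followed by its subxpln inside the xpln, the sub-entries can be paired by reading the children in order. The sketch below uses a shortened copy of the Rider entry and assumes well-formed XML; sub_entries is a hypothetical helper, not part of LEME.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Rider word-entry quoted above.
RIDER = """<wordentry type="h">
<form lang="en">
To Chaine, or tie in, or with chaines. </form>
<xpln lang="la">
1 Cateno.
<subform lang="en">To chaine together.</subform>
<subxpln lang="la">
1 Concateno.</subxpln>
<subform lang="en">
A chaine.</subform>
<subxpln lang="la">
1 Catena, f.</subxpln>
</xpln>
</wordentry>"""

def sub_entries(markup):
    """Return the main explanation and the (subform, subxpln) pairs."""
    xpln = ET.fromstring(markup).find("xpln")
    main = (xpln.text or "").strip()   # text before the first subform
    pairs = []
    pending = None                     # subform still waiting for its subxpln
    for child in xpln:
        text = (child.text or "").strip()
        if child.tag == "subform":
            pending = text
        elif child.tag == "subxpln" and pending is not None:
            pairs.append((pending, text))
            pending = None
    return main, pairs

main, pairs = sub_entries(RIDER)
print(main)
print(pairs)
```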

LEME aims at an intelligible transcription of its lexical texts. In the 1990s, I developed a more florid tag-set suited for diplomatic or critical editions. When I changed from making editions to developing a database, my tags contracted. Eventually, with the emergence of the Text Encoding Initiative guidelines, I disregarded most tags for traits of display (e.g., font) and bibliographical elements (e.g., signature, catchword, running-titles). Yet LEME often keeps to the original lineation of its texts because that promotes easier proofreading.

Encoding in LEME exists even at the level of the individual character. Unicode supplies us with what we need, although LEME also uses many popular entity references, which are easier to remember than Unicode code points. For example, &shy; (soft hyphen) joins words split by a line-end, &amp; is ampersand, and &oelig; and &aelig; are the two digraphs (œ and æ). In another respect, regrettably, LEME lacks a standard for reproducing special abbreviation characters in Renaissance texts. We use our own shorthand codes based loosely on the appearance of the character, but in very formal encoding for exchange purposes, we use BabelMap, an excellent tool, to input the actual Unicode characters.
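
The entity references named above are standard ones, so Python's html.unescape can decode them. The rest of this sketch is an assumption: it treats a soft hyphen at the end of a line as a signal to join the word with the start of the next line, dropping the hyphen. The sample lines and the function name decode_line_pair are invented.

```python
import html

SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to

def decode_line_pair(first, second):
    """Decode entity references, then rejoin a word split at a line-end."""
    first = html.unescape(first)
    second = html.unescape(second)
    if first.endswith(SOFT_HYPHEN):
        # Assumed behaviour: drop the soft hyphen and join without a space.
        return first[:-1] + second
    return first + " " + second

joined = decode_line_pair("to chaine to&shy;", "gether &amp; c&aelig;tera")
print(joined)
```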

Various tags are permitted and recognized within word-entries but, during processing, are silently ignored. They are <cit>, <correction>, <damage>, <doubtful>, <f>, <hungword>, <infl>, <lemeformat>, <pos>, <pronun>, and <s>. They can be seen but are not acted on by the database.
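
One way to picture "silently ignored" is a pass that drops these tags and leaves everything else alone. This is a sketch, not LEME's processing code: it assumes that the text enclosed by an ignored tag is kept and only the markup is discarded, and the type attribute on the <f> tag in the sample is hypothetical.

```python
import re

# Tags listed above as recognized but not acted on.
IGNORED = ["cit", "correction", "damage", "doubtful", "f", "hungword",
           "infl", "lemeformat", "pos", "pronun", "s"]

# Matches an opening or closing tag whose name is exactly one of IGNORED,
# with optional attributes; <form> and <subform> are left untouched.
IGNORED_TAG = re.compile(r"</?(?:%s)(?:\s[^<>]*)?>" % "|".join(IGNORED))

def drop_ignored_tags(markup):
    """Remove ignored tags, keeping the text they enclose (an assumption)."""
    return IGNORED_TAG.sub("", markup)

cleaned = drop_ignored_tags(
    '<xpln>the <f type="i">matter</f> of a <doubtful>song</doubtful>.</xpln>'
)
print(cleaned)
```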

The tag collection in LEME is a creature of SQL database technology. It does not pretend to be a recommended standard for other corpora. However, LEME is translating its SQL tags into two formal XML languages for purposes of exchange: one uses LEME's own tags, and the other uses TEI, an international standard. A discussion of that will coincide with the release of LEME's texts in XML format.

List of LEME Database Tags

STRUCTURAL

Leme  (text id)
Section  (such as title-page, preface, glossary, etc.)
Set  (for use at the start of an encoded text to set defaults)
Wordgroup1  (largest unit of grouping word-entries: e.g., alphabetical)
Wordgroup2  (sub-group within wordgroup1)
Wordgroup3  (sub-group within wordgroup2)
Wordentry  (complete word-entry)
Form  (headword)
Xpln  (explanation of headword)
Subform  (related sub-headword)
Subxpln  (explanation of sub-headword)
Heading  (titles of sections)
Closing  (colophon)
Alpha  (alphabetic letter for current wordgroup1)

TEXTUAL

Addition  (annotation by someone other than editor or author)
Blockquote  (equivalent to lemeformat)
Cit  (citation)
Class  (text-type, such as an etymon language)
Correction  (a scribe's correction in a manuscript)
Damage  (loss of text)
Doubtful  (uncertain reading)
Editoraddition  (addition of clarifying text by the editor)
Emend  (editor's correction)
Etym  (etymon)
Etymlang  (language of an etymon)
Expan  (expansion of a contraction)
Expression  (identical with term tag)
F  (font)
Hungword  (word or word-part shifted to adjacent line)
Infl  (inflection)
Joinnext  (display adjacent word-entries together)
Lemeformat  (passage set apart, such as verse)
Lemenote  (editorial comment in a word-entry)
Lemepagenote  (editorial note on a page but outside a word-entry)
Note  (marginal comment, item number, etc.)
Page  (page-break marker)
Pos  (part of speech)
Pronun  (pronunciation)
S  (scribe)
Term  (change of language)
Xref  (cross-reference)


Editor / 9 October 2018