TEI Simple

Sebastian Rahtz, Brian Pytlik Zillig, Martin Mueller

Version 0.1: 30th November 2014

Summary

The TEI Simple project aims to define four things: a highly constrained and prescriptive subset of the Text Encoding Initiative (TEI) Guidelines suited to the representation of early modern and modern books; a formally defined set of processing rules that permit modern web applications to present and analyze the encoded texts easily; mappings to other ontologies; and processes to describe the encoding status and richness of a TEI digital text. This document describes the constrained subset.

Background

The Text Encoding Initiative (TEI) has developed over 20 years into a key technology in text-centric humanities disciplines, with an extremely wide range of applications, from diplomatic editions to dictionaries, from prosopography to speech transcription and linguistic analysis. It has achieved this range of use by adopting a descriptive rather than prescriptive approach, by recommending customization to suit particular projects, and by eschewing any attempt to dictate how the digital texts should be rendered or exchanged. However, this flexibility has come at the cost of relatively limited success in interoperability. In our view there is a distinct set of uses (primarily in the area of digitized ‘European’-style books) that would benefit from a prescriptive recipe for digital text; this will sit alongside other domain-specific, constrained TEI customizations, such as the very successful EpiDoc in the epigraphic community. TEI Simple may become a prototype for a new family of constrained customizations. For instance, a TEI Simple MS for manuscript-based work could be built on top of the ENRICH project, drawing on many of the lessons learned and some of the code written for TEI Simple.

The TEI has long maintained an introductory subset (TEI Lite), and a constrained customization for use in outsourcing production to commercial vendors (TEI Tite), but both of these permit enormous variation, and have nothing to say about processing. The present project can be viewed in some ways as a revision of TEI Lite, re-examining the basis of the choices therein, focusing it for a more specific area, and adding a "cradle to grave" processing model that associates the TEI Simple schema with explicit and standardized options for displaying and querying texts. This means being able to specify what a programmer should do with particular TEI elements when they are encountered, allowing programmers to build stylesheets that work for everybody and to query a corpus of documents reliably.

This project, TEI Simple, focuses on interoperability, machine generation, and low-cost integration. The TEI architecture facilitates customizations of many kinds; TEI Simple aims to produce a complete 'out of the box' customization which meets the needs of the many users for whom the task of creating a customization is daunting or seems irrelevant. TEI Simple in no way intends to constrain the expressive liberty of encoders who do not think that it is either possible or desirable to follow this path. It does, however, promise to make life easier for those who think there is some virtue in travelling that path as far as it will take you, which for quite a few projects will be far enough. Some users will never feel the need to move beyond it, others will outgrow it, and when they do they will have learned enough to do so.

A major driver for this project is the texts created by phase 1 of the EEBO-TCP project, which were placed in the public domain on 1 January 2015. Another 45,000 texts will join them over the following five years, creating by 2020 an archive of 70,000 consistently encoded books published in England from 1475 to 1700, including works of literature, philosophy, politics, religion, geography, science and all other areas of human endeavor. Comparing the EEBO-TCP texts in their current and quite simple encoding with flat-file versions of those texts shows that the encoded texts have far greater query potential, especially if that coarse encoding is supplemented with simple forms of linguistic annotation or named entity tagging that can be added in a largely algorithmic fashion. During 2012 and 2013 extensive work was undertaken at Northwestern, Michigan and Oxford to enrich these texts and bring them into line with the current TEI Guidelines (where necessary working with the TEI to modify the Guidelines). TEI Simple uses this corpus as a point of departure and will provide its users with a friendlier environment for manipulating EEBO texts in various projects. But TEI Simple should not be understood as an EEBO-specific project. We believe that, given the extraordinary degree of internal diversity in the EEBO source files, a project that starts from them can, with appropriate modifications, accommodate a wide range of printed texts differing in language, genre, or time and place of origin.

The TEI Simple schema

The TEI infrastructure

The header

The default set of elements for the header is loaded using the header module. In addition, elements from other modules are loaded if they are tagged in the classification as being needed for the header only.

Elements intended only for use in the header are banned from the text by a Schematron rule.
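The effect of such a rule can be sketched outside Schematron as well. The following Python fragment is only an illustration, not the project's actual rule: the set HEADER_ONLY is a hypothetical sample (the real list comes from the TEI Simple classification), and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample of header-only element names; the real list is
# derived from the TEI Simple classification and is not reproduced here.
HEADER_ONLY = {"fileDesc", "titleStmt", "publicationStmt"}


def header_only_in_text(root):
    """Return local names of header-only elements found inside <text>."""
    found = []
    # Only the top-level <text> child of the root is inspected.
    for text in root.findall(f"{{{TEI_NS}}}text"):
        for el in text.iter():
            local = el.tag.split("}")[-1]  # strip the namespace part
            if local in HEADER_ONLY:
                found.append(local)
    return found


SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc><titleStmt><title>T</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello <titleStmt/></p></body></text>
</TEI>"""

violations = header_only_in_text(ET.fromstring(SAMPLE))
print(violations)
```

Elements inside teiHeader are never reported, because only the content of text is walked.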

Transcription

To support the sourceDoc and facsimile elements, the basic transcriptional elements are loaded, along with two attribute classes.

Attribute classes

The tei module brings with it a default set of attribute classes. We add some more specialized classes from other modules and delete some default classes that we do not plan to use.

Some uncommon attributes are removed from global linking.

Pointers are constrained so that every local pointer must have a corresponding ID in the same document.

Error: Every local pointer in "" must point to an ID in this document ()
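The constraint can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions: it assumes that pointers live in an attribute named target, that a local pointer is any whitespace-separated token beginning with "#", and that IDs are carried by xml:id. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_local_pointers(root, attr="target"):
    """Return local pointers (tokens starting with '#') that match no xml:id."""
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    dangling = []
    for el in root.iter():
        # A pointer attribute may hold several whitespace-separated values.
        for token in (el.get(attr) or "").split():
            if token.startswith("#") and token[1:] not in ids:
                dangling.append(token)
    return dangling


DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p xml:id="p1">See <ref target="#p1">here</ref>,
       <ref target="#missing http://example.org/">there</ref>.</p>
  </body></text>
</TEI>"""

result = dangling_local_pointers(ET.fromstring(DOC))
print(result)
```

External URLs are ignored by the check; only tokens that claim to point into the current document are tested.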

Constrained value lists are added to attribute classes where possible.

Position:

above the line
below the line
at the top of the page
at the top right of the page
at the top left of the page
at the top center of the page
at the bottom right of the page
at the bottom left of the page
at the bottom center of the page
at the foot of the page
underneath a table
in the outer margin
in the left margin
in the right margin
on the opposite, i.e. facing, page
on the other side of the leaf
at the end of the volume
at the end of the current division
at the end of the current paragraph
within the body of the text
in a predefined space, for example left by an earlier scribe
formatted like a quotation

Units:

characters
lines
pages
words
centimetres
millimetres
inches

Error: Each of the rendition values in "" must point to a local ID or to a token in the Simple scheme ()

Error: Every local pointer in "" must point to an ID in this document ()

Rendition:

all capitals
black letter or gothic typeface
bold typeface
marked with a brace under the bottom of the text
border around the text
centred
cursive typeface
block display
strikethrough with double line
underlined with double line
initial letter larger or decorated
floated out of main flow
with a hyphen here (e.g. in line break)
inline rendering
italic typeface
larger type
aligned to the left or left-justified
marked with a brace on the left side of the text
letter-spaced
upright shape and default weight of typeface
normal typeface weight
aligned to the right or right-justified
marked with a brace to the right of the text
rotated to the left
rotated to the right
small caps
smaller type
strike through
subscript
superscript
marked with a brace above the text
fixed-width typeface, like typewriter
underlined with single line
underlined with wavy line
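The rendition constraint quoted in the error message above can be sketched as follows. Several details here are assumptions made for illustration only: the "simple:" prefix for tokens in the Simple scheme, the three token names in SIMPLE_TOKENS, and the function name. Only the rule itself (each value is either a local ID or a token in the Simple scheme) comes from the text.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical token names; the real closed list belongs to the schema.
SIMPLE_TOKENS = {"bold", "italic", "smallcaps"}


def bad_rendition_values(root, prefix="simple:"):
    """Return rendition tokens that are neither a local ID nor a known token."""
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    bad = []
    for el in root.iter():
        for token in (el.get("rendition") or "").split():
            if token.startswith("#"):
                ok = token[1:] in ids              # local pointer
            elif token.startswith(prefix):
                ok = token[len(prefix):] in SIMPLE_TOKENS  # scheme token
            else:
                ok = False
            if not ok:
                bad.append(token)
    return bad


DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><rendition xml:id="r1">color: red;</rendition></teiHeader>
  <text><body><p>
    <hi rendition="simple:italic #r1">fine</hi>
    <hi rendition="simple:wobbly #r2">not fine</hi>
  </p></body></text>
</TEI>"""

bad = bad_rendition_values(ET.fromstring(DOC))
print(bad)
```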

Model classes

A set of unused model classes is removed.

Elements

The main part of Simple is the set of selected elements.

color: green; text-decoration: underline; margin-top: 2em; margin-left: 2em; margin-right: 2em; margin-bottom: 2em; white-space: nowrap; margin-bottom: 0.5em;
Element "" may not be empty.
Insert list.
Insert item, rendered as described in parent list rendition.
list-style: ordered;
Insert table cell.
Element "" must have at least two child elements.
Element "" must have corresponding corr/sic, expand/abbr, reg/orig
Insert cit.
margin-top: 1em; margin-left: 1em; margin-left: 1em;
Omit, if handled in parent choice.
content: '['; content: ']'; text-decoration: line-through; border: 1px solid black; padding: 5px;
Omit if located in teiHeader.
Omit if located in teiHeader.
Omit if located in teiHeader.
Omit if located in teiHeader.
Omit if located in teiHeader.
font-size: large; content: '[..'; content: '..]'; color: grey; font-style: italic; display: block; border-top: solid 1pt blue; border-bottom: solid 1pt blue; margin: 6pt; border: solid black 1pt; font-style: italic; color: grey; content: '[..'; content: '..]'; color: grey; content: '[...]'; font-style: italic; font-style: italic; font-style: italic; font-weight: bold; font-style: italic; font-style: italic; margin-left: 1em; margin-left: 10px; margin-right: 10px; font-size: smaller; content: " ["; content: "] "; font-size: small;
Omit, if handled in parent choice.
text-align: justify;
please make sure pb elements are not at the start or end of mixed content
display: block; color: grey; float: right; content: '[Page '; content: ']';
Omit if located in teiHeader.
Omit if located in teiHeader.
margin-left: 10px; margin-right: 10px; content: '‘'; content: '’'; margin-left: 10px; margin-right: 10px;
If it is inside a paragraph then it is inline, otherwise it is block level
content: '‘'; content: '’';
If it is inside a paragraph then it is inline, otherwise it is block level
margin-left: 10px; margin-right: 10px;
Omit, if handled in parent choice.
font-weight: bold;
Insert table row.
content: '{'; content: '}'; text-align: right; font-style: italic; font-style: italic; font-style: italic; content: "<"; content: ">"; content: "["; content: "]"; content: "("; content: ")"; content: "{"; content: "}"; font-size: smaller; background-color: #F0F0F0; max-width: 80%; margin: auto; font-family: Verdana, Tahoma, Geneva, Arial, Helvetica, sans-serif; color: red; font-size: 2em; font-style: italic; font-style: italic; font-style: italic; font-style: italic; text-align: center; color: green; content: ' [?] ';
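Two of the processing notes above ("Omit if located in teiHeader" and "If it is inside a paragraph then it is inline, otherwise it is block level") are context-dependent rules that a processor could evaluate by looking at an element's ancestors. The sketch below combines them for a single element purely for illustration; in the source they may belong to different elements, the choice of quote as the example element is an assumption, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def ancestors(el, parents):
    """Yield the ancestors of el, nearest first, using a child-to-parent map."""
    while el in parents:
        el = parents[el]
        yield el


def behaviour(el, parents):
    """Decide how to present el from its context (illustrative rule set)."""
    tags = {a.tag for a in ancestors(el, parents)}
    if TEI + "teiHeader" in tags:
        return "omit"      # "Omit if located in teiHeader."
    if TEI + "p" in tags:
        return "inline"    # inside a paragraph
    return "block"         # otherwise block level


DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><quote>in header</quote></teiHeader>
  <text><body>
    <p>He said <quote>in paragraph</quote>.</p>
    <quote>standalone</quote>
  </body></text>
</TEI>"""

root = ET.fromstring(DOC)
# ElementTree has no parent pointers, so build a child-to-parent map once.
parents = {child: parent for parent in root.iter() for child in parent}
modes = [behaviour(q, parents) for q in root.iter(TEI + "quote")]
print(modes)
```

Keeping the decision in one function per rule set means a stylesheet author only has to map the three outcomes to output, rather than re-deriving the context test for each output format.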

A small number of elements have constrained value lists added.

Using TeX or LaTeX notation

data cell
label cell
row or column sum data
table total data

data cell
label cell
row or column sum data
table total data

text-transform: uppercase; font-family: fantasy; font-weight: bold; padding-bottom: 2pt; border-bottom: dashed gray 2pt; padding: 2pt; border: solid black 1pt; text-align: center; font-family: cursive; text-decoration: line-through; color: red; text-decoration: underline; color: red; font-size: 6em; font-family: cursive; font-weight: bold; vertical-align: top; height: 1em; line-height: 1em; float: left; width: 1em; color: #c00; margin: 0em; padding: 0px; float: right; display: block; font-size: smaller; clear: right; padding: 4pt; width: 15%; display: inline; display: block; font-style: italic; font-size: larger; text-align: left; padding-left: 2pt; border-left: dotted gray 2pt; letter-spacing: 0.5em; font-style: roman; font-weight: normal; text-align: right; padding-right: 2pt; border-right: dotted gray 2pt; -webkit-transform: rotate(90deg); transform: rotate(90deg); -webkit-transform: rotate(-90deg); transform: rotate(-90deg); font-variant: small-caps; font-size: smaller; text-decoration: line-through; vertical-align: bottom; font-size: smaller; vertical-align: super; font-size: smaller; padding-top: 2pt; border-top: dotted gray 2pt; font-family: monospace; text-decoration: underline; text-decoration: underline; text-decoration-style: wavy;
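The CSS fragments above appear to be the default renderings attached to rendition values. One way a web application might use such a table is to turn it into a stylesheet of classes. The sketch below is an assumption-laden illustration: the three token names and the "simple_" class prefix are hypothetical, and only the CSS declarations themselves (uppercase, bold, small caps) are taken from the text.

```python
# Hypothetical token names paired with CSS declarations quoted in the text.
RENDITION_CSS = {
    "allcaps": "text-transform: uppercase;",
    "bold": "font-weight: bold;",
    "smallcaps": "font-variant: small-caps;",
}


def to_stylesheet(mapping, prefix="simple_"):
    """Emit one CSS class rule per rendition token, sorted by token name."""
    return "\n".join(
        f".{prefix}{name} {{ {css} }}" for name, css in sorted(mapping.items())
    )


stylesheet = to_stylesheet(RENDITION_CSS)
print(stylesheet)
```

An HTML renderer could then translate each rendition token on an element into the matching class name instead of writing inline styles.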