Optional Markup Guidelines

Guidelines for Optional Digital Editing

TypeWright provides a tool that matches digital images of printed pages to plain-text transcriptions of those images. The TypeWright-enabled documents in 18thConnect start as plain text generated by optical character recognition (OCR) software. The TypeWright tool allows human correction of this plain text, but the result is still only plain text.

Some scholars have asked for optional ways to move beyond plain text by using markup notation, that is, indicators for formatting and for characters not found on a standard computer keyboard. The TypeWright team is pleased to make this selection of markup notations available, using a combination of two standard, widely documented systems: Unicode character references for many special characters, and Text Encoding Initiative (TEI) tags for formatting and for more obscure special characters. This markup is completely optional; every editor is welcome to use it or not.

An important note:
All TEI markup must start and end on the same line. Even if the text needing markup continues onto a subsequent line, please close the element at the end of the first line and open a new element on the next line.
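
As an illustration of this rule, here is a minimal Python sketch that wraps each line of a multi-line passage in its own element. The function name wrap_per_line is hypothetical and is not part of TypeWright; it only shows the shape of the result you would type by hand, one closed element per line.

```python
def wrap_per_line(text: str, rend: str) -> str:
    """Wrap every non-empty line in its own <hi> element.

    Each element opens and closes on the same line, so no tag
    ever spans a line break.
    """
    wrapped = []
    for line in text.splitlines():
        if line.strip():
            wrapped.append(f'<hi rend="{rend}">{line}</hi>')
        else:
            # Leave blank lines untouched.
            wrapped.append(line)
    return "\n".join(wrapped)


passage = "an italic phrase that runs\nover onto a second line"
print(wrap_per_line(passage, "italic"))
# <hi rend="italic">an italic phrase that runs</hi>
# <hi rend="italic">over onto a second line</hi>
```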

Missing or Unreadable Text
TypeWright already uses one indicator for missing or unreadable text: the @ sign. Please do not guess at unreadable print; use this sign in place of whatever cannot be determined.

TEI Tags
For formatting indications, we include the following TEI on-off switches, or toggles, which are called tags. Each complete tag has a start notation and a close notation. These tags can include attributes that specify descriptive values. This system describes what is on the page that has been scanned.

Text appearance
For the textual content of TypeWright-enabled documents, the TypeWright team has chosen the tags <hi> for highlight, <del> for delete, and <c> for character. All three of these tags can include the attribute rend= with a value describing how the text is rendered. See the Appendix for our list of these tags for TypeWright.

Form Notations
For words that appear on the page but can be considered outside the textual content, such as titles, page numbers, and catchwords, the TypeWright team chose the forme work tag <fw>. This tag uses the attribute type=, again with a value describing what the text represents on the page. The Appendix gives the complete list of TypeWright-approved values.

Non-Keyboard characters
You can also use the TEI tag <subst>, combined with Unicode values, to substitute keyboard characters for characters that appear in a document but cannot be typed into TypeWright with a standard keyboard. These include standard characters with accent marks, ligatures, and characters not defined in the standard Unicode set. Below is an example that uses <subst> to put an s where a long s (ſ, part of the standard Unicode set) appears in the document, and another that replaces a c-t ligature (not part of the standard Unicode set, so it cannot be displayed here):

<subst>
  <add>s</add>
  <del>&#383;</del>
</subst>

<subst>
  <add>ct</add>
  <del>&#61125;</del>
</subst>

However, when you enter this into the TypeWright correction box, the whole series must be typed on one line:

<subst> <add>s</add> <del>&#383;</del> </subst>

<subst> <add>ct</add> <del>&#61125;</del> </subst>

Using the substitution tag to make an “editorial intervention” in the TypeWright editing interface can be thought of as saying: “I am substituting (<subst>) the keyboard characters ct (<add>) for the glyph or special character represented in Unicode as &#61125; (<del>).”
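
The pattern above can be sketched in Python as a small helper that builds the one-line form for the correction box. The function name subst is hypothetical and not part of TypeWright; the sketch assumes the original glyph is available as a character (or as a decimal codepoint passed through chr()) and always writes decimal references.

```python
def subst(keyboard: str, original: str) -> str:
    """Build a one-line <subst> element.

    keyboard: the characters you can type (goes in <add>).
    original: the glyph(s) on the page (written in <del> as
              decimal character references).
    """
    refs = "".join(f"&#{ord(ch)};" for ch in original)
    return f"<subst> <add>{keyboard}</add> <del>{refs}</del> </subst>"


# Long s, a standard Unicode character.
print(subst("s", "ſ"))
# <subst> <add>s</add> <del>&#383;</del> </subst>

# c-t ligature, given by the decimal codepoint used in this guide.
print(subst("ct", chr(61125)))
# <subst> <add>ct</add> <del>&#61125;</del> </subst>
```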

Two main sets of Unicode characters are useful for this work. The first, the standard Unicode set, is an internationally established method of representing and encoding text characters on a computer system. The standard set includes over 110,000 characters from over 100 scripts. Several encoding schemes are used to represent these characters, the most popular being UTF-8 and UTF-16. Each character in the set is identified by a decimal or hexadecimal number, which is its Unicode codepoint. To represent a Unicode character, the following notation is used:

&#(x)<codepoint>;

The “&#…;” characters are required in all cases. The “x” is used when the codepoint is represented by a hexadecimal number (contains the digits 0-9 and A-F). No “x” means it’s a decimal number (digits 0-9 only).
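
To make the two notations concrete, the following Python sketch prints the decimal and hexadecimal references for a character and confirms that both decode to the same character. The function name char_refs is hypothetical; ord() and html.unescape() come from the Python standard library.

```python
import html


def char_refs(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal references for one character."""
    codepoint = ord(ch)
    return f"&#{codepoint};", f"&#x{codepoint:X};"


decimal_ref, hex_ref = char_refs("ſ")  # long s
print(decimal_ref, hex_ref)
# &#383; &#x17F;

# Both notations name the same codepoint, so they decode identically.
assert html.unescape(decimal_ref) == html.unescape(hex_ref) == "ſ"
```

Either form identifies the same character. The examples in this guide use the decimal form.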

You can find lists of the standard Unicode codepoints on many websites, or by doing a web search on individual characters (for example, “latin small a macron” or “latin long s”).

The Medieval Unicode Font Initiative (MUFI) has established an alternative set of character references which map characters common to pre-modern printed and handwritten documents to Unicode codepoints that are NOT CURRENTLY ASSIGNED to any other character. While these codepoints are not currently assigned in standard Unicode, MUFI has lobbied the international standards body that oversees Unicode for their inclusion in the standard set. Using these values should ensure some relative consistency in your encoding of these non-standard characters into the future. An excellent website for finding MUFI characters and their codepoints is: https://www.abdn.ac.uk/skaldic/db.php?if=mufi&table=mufi_char&val=&view=. Once you’ve found the character you’re looking for, you can copy and paste the decimal number representation of that character right out of the table.

The system of TEI markup is very flexible and very large. Other tags, which you can use later in the exported document we send you, can be found in the TEI P5 Guidelines at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.

APPENDIX: TYPEWRIGHT MARKUP

TypeWright-accepted tags, attributes, and attribute values.

MARKUP                TAG/ATTRIBUTE

Italics               <hi rend="italic">…</hi>
Boldface              <hi rend="bold">…</hi>
Superscript           <hi rend="sup">…</hi>
Subscript             <hi rend="sub">…</hi>
Smallcaps             <hi rend="smallcaps">…</hi>
Underlining           <hi rend="underline">…</hi>
Centered Line         <hi rend="center">…</hi>
Drop Cap              <hi rend="dropcap">…</hi>
Page Header           <fw type="header">…</fw>
Page Number           <fw type="pageNum">…</fw>
Signature Mark        <fw type="sig">…</fw>
Catchword             <fw type="catch">…</fw>
Strikethrough         <del rend="overstrike">…</del>
Inverted character    <c rend="inverted">…</c>

Substitutions (per Unicode Character Table at http://unicode-table.com/en)
                      <subst> <add>s</add> <del>&#383;</del> </subst>

Substitutions (per MUFI character database at https://www.abdn.ac.uk/skaldic/db.php?table=mufi_char&view=&val=&if=mufi)
                      <subst> <add>ct</add> <del>&#61125;</del> </subst>
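
If you would like to check a line before entering it, the accepted values listed above can be expressed as a small lookup. The following Python sketch is not a TypeWright feature, and the names ACCEPTED and unaccepted_tags are hypothetical. It assumes attribute values are typed with straight double quotes, and it only inspects opening tags that carry one attribute, so tags without attributes (such as those inside <subst>) are not checked.

```python
import re

# Tag -> (attribute name, accepted values), copied from the list above.
ACCEPTED = {
    "hi": ("rend", {"italic", "bold", "sup", "sub", "smallcaps",
                    "underline", "center", "dropcap"}),
    "fw": ("type", {"header", "pageNum", "sig", "catch"}),
    "del": ("rend", {"overstrike"}),
    "c": ("rend", {"inverted"}),
}

# Matches an opening tag with exactly one straight-quoted attribute.
OPEN_TAG = re.compile(r'<(\w+)\s+(\w+)="([^"]*)">')


def unaccepted_tags(line: str) -> list[str]:
    """Return the opening tags on a line that are not in the accepted list."""
    problems = []
    for match in OPEN_TAG.finditer(line):
        tag, attr, value = match.groups()
        expected = ACCEPTED.get(tag)
        if expected is None or expected[0] != attr or value not in expected[1]:
            problems.append(match.group(0))
    return problems


print(unaccepted_tags('<fw type="pageNum">12</fw>'))
# []
print(unaccepted_tags('<hi rend="blackletter">Preface</hi>'))
# ['<hi rend="blackletter">']
```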

If you have any questions, please email us: technologies@18thConnect.org