Digital Syriac Corpus Documentation

DRAFT

Table of Contents

Overview

ODD Chaining

The Srophé ODD

Element Modules

The Syriaca All ODD

Final Syriaca ODDs

Schematron

Overview

Syriaca.org is a Linked Open Data (LOD) oriented project using the Text Encoding Initiative (TEI) standard of XML to encode data about core entities (persons, places, works, manuscripts, and bibliography) relevant to the field of Syriac Studies. Syriaca.org uses the TEI because it offers a widely used standard for humanities projects and because it allows us to capture the nuances of the textual source base we draw upon when encoding data. Because the project is LOD-oriented, and in particular because it encodes LOD derived from texts rather than the texts themselves, Syriaca.org's use of the TEI is somewhat idiosyncratic.

Syriaca.org's research goals have also required the development of the Srophé Application, an open source eXist-DB application for TEI projects dealing with Cultural Heritage Information. As more and more projects begin to use the Srophé App, users have expressed interest in having a customized TEI schema that will facilitate the encoding of data in ways that the Srophé App expects.

Please note that the aim of this document is not to provide comprehensive documentation on the use of each element and attribute discussed. Rather, the aim is to outline the customizations made to the TEI. This will be sufficient for users familiar with the TEI. For those less familiar with the TEI, we recommend that you also consult the TEI Guidelines for a fuller discussion of the elements and attributes mentioned below.

The various needs of Syriaca.org and users of the Srophé App have required the creation of a series of customized schemas that can be applied to different data types and different projects. This project is very much a work in progress. This wiki documents the current state of this customization work.

[For more on TEI schema customization, see the chapters on "Documentation Elements" and "Using the TEI" in the TEI Guidelines.]

ODD Chaining

Out of the various processes available for producing TEI customizations, Syriaca.org writes ODD files (a distinctive type of TEI XML file) that can be used to produce a RelaxNG schema against which a TEI file can be validated. Given the need for schema customizations that apply to different data types and can offer some assistance to projects outside Syriaca.org, we have taken the approach of producing a series of chained ODD files. When generating a schema out of an independent (i.e. un-chained) ODD, the source for that file is the TEI (generally TEI P5 All, but any TEI subset, e.g. TEI Lite, could be the source). ODD chaining entails having an initial ODD with the TEI as its source, followed by subsequent ODD files that each use an earlier ODD customization as their source.

Syriaca.org has produced an initial ODD customization for all Srophé projects. This customization takes a subset of TEI P5 All elements and attributes, constrains how these can be used in different contexts, and introduces a customized attribute in the Srophé namespace. That customization then becomes the source for a Syriaca All ODD that constrains the Srophé ODD for features common to most Syriaca.org data types: persons, places, manuscripts, works, bibliography. The final link in the chain is separate customized ODD files for each of these data types. Each of these ODD files uses the Syriaca All ODD as its source.
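As a rough illustration of the chain described above, the following sketch shows how each link might declare its source. The @ident values and the compiled file names are hypothetical and are not taken from the Syriaca.org repository; the three <schemaSpec> elements would live in three separate ODD files.

```xml
<!-- File 1 (hypothetical name: srophe.odd). No @source is given,
     so the customization is built directly on the TEI. -->
<schemaSpec ident="srophe" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <!-- further module references and constraints go here -->
</schemaSpec>

<!-- File 2 (hypothetical name: syriaca-all.odd). @source points at
     the compiled form of the Srophé ODD rather than at the TEI. -->
<schemaSpec ident="syriaca-all" start="TEI" source="srophe.compiled.odd">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <!-- constraints common to most Syriaca.org data types go here -->
</schemaSpec>

<!-- File 3 (hypothetical name: syriaca-places.odd). @source points at
     the compiled Syriaca All ODD, the previous link in the chain. -->
<schemaSpec ident="syriaca-places" start="TEI" source="syriaca-all.compiled.odd">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <!-- constraints specific to place records go here -->
</schemaSpec>
```

Each later link can only narrow or adjust what its source already provides, which is why the most general constraints belong in the first ODD.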

As of June 2020, only the Syriaca Places ODD is in use and the other data types are validating against the Syriaca All ODD.

[For more on ODD chaining see Lou Burnard's ODD Chaining for Beginners.]

The Srophé ODD

The Srophé ODD customization constrains the TEI P5 All to remove certain elements, to constrain where certain elements may nest, and to require attributes on certain elements. What follows is a summary of these customizations.

Element Modules

Customization requires the inclusion of some or all of the elements available in the TEI. The Srophé ODD begins by including only select TEI modules. If you are unfamiliar with TEI Modules, the TEI Guidelines contain a discussion of Modules, and Appendix C allows you to organize elements according to modules.

The Srophé ODD excludes the following TEI modules and all of their constituent elements: analysis, corpus, drama, figures, gaiji, iso-fs, nets, spoken, and verse.

It includes select elements from the following modules:

  • certainty: <precision>
  • core: <abbr>, <author>, <biblScope>, <choice>, <citedRange>, <corr>, <date>, <desc>, <editor>, <expan>, <foreign>, <label>, <listBibl>, <measure>, <name>, <note>, <orig>, <p>, <ptr>, <quote>, <reg>, <ref>, <resp>, <respStmt>, <sic>, <title>
  • dictionaries: <entryFree>
  • header: <authority>, <availability>, <catDesc>, <category>, <change>, <classDecl>, <edition>, <editionStmt>, <editorialDecl>, <encodingDesc>, <fileDesc>, <funder>, <idno>, <interpretation>, <langUsage>, <licence>, <principal>, <profileDesc>, <publicationStmt>, <revisionDesc>, <seriesStmt>, <sourceDesc>, <sponsor>, <taxonomy>, <teiHeader>, <titleStmt>
  • linking: <link>
  • msdescription: [all elements]
  • namesdates: <addName>, <bloc>, <country>, <death>, <district>, <event>, <forename>, <geo>, <listPerson>, <listPlace>, <listRelation>, <location>, <offset>, <persName>, <person>, <place>, <placeName>, <region>, <relation>, <roleName>, <settlement>, <sex>, <state>, <surname>
  • tagdocs: <gi>
  • tei: [all elements]
  • textcrit: [all elements]
  • textstructure: <back>, <body>, <div>, <front>, <TEI>, <text>
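In an ODD, module selections like those listed above are typically expressed with <moduleRef>. The sketch below covers a few of the modules from the list. Whether the Srophé ODD uses @include (as here) or @except to make its selections is an assumption, and the @ident value is hypothetical.

```xml
<schemaSpec ident="srophe" start="TEI">
  <!-- Modules taken over in full -->
  <moduleRef key="tei"/>
  <moduleRef key="msdescription"/>
  <moduleRef key="textcrit"/>

  <!-- Modules from which only selected elements are included;
       @include takes a space-separated list of element names -->
  <moduleRef key="certainty" include="precision"/>
  <moduleRef key="dictionaries" include="entryFree"/>
  <moduleRef key="linking" include="link"/>
  <moduleRef key="tagdocs" include="gi"/>
  <moduleRef key="textstructure" include="back body div front TEI text"/>

  <!-- Excluded modules (analysis, corpus, drama, and so on) are
       simply never referenced -->
</schemaSpec>
```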

The TEI Header

The <teiHeader> is a mandatory element of every TEI file, and it encodes important information about the process which created this file. Every <teiHeader> element contains a namespace declaration, a <fileDesc> element (information about the creation of a file), an <encodingDesc> element (editorial rules), a <profileDesc> element (non-bibliographic aspects of a text), and a <revisionDesc> element (history of revisions).

<fileDesc>

Each <fileDesc> element contains (in order) a <titleStmt>, an <editionStmt>, a <publicationStmt>, an optional <seriesStmt>, and a <sourceDesc>.

<titleStmt>

The <titleStmt> provides the title of the TEI document and identifies the people and institutions responsible for the creation of the document. The <titleStmt> must contain a <title> and an <editor>. It may also contain <sponsor>, <funder>, <principal>, and <respStmt>.

titleStmt/respStmt

If the document includes one or more <respStmt> elements, each must contain a <resp> followed by either an <orgName> element indicating the organization responsible or a <name> element indicating the person responsible.
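A <titleStmt> that satisfies the rules above might look like the following sketch. All names and text values are placeholders, and the order of the optional elements is an assumption rather than something this documentation prescribes.

```xml
<titleStmt>
  <!-- Required: the title of the TEI document -->
  <title xml:lang="en">Example Record Title</title>
  <!-- Optional statements of institutional responsibility -->
  <sponsor>Example Sponsoring Institution</sponsor>
  <funder>Example Funding Agency</funder>
  <principal>Example Principal Investigator</principal>
  <!-- Required: at least one editor -->
  <editor>Example Editor</editor>
  <!-- Optional: a <resp> followed by a person's <name> -->
  <respStmt>
    <resp>Data entry by</resp>
    <name>Example Contributor</name>
  </respStmt>
  <!-- Optional: a <resp> followed by an <orgName> -->
  <respStmt>
    <resp>Hosting provided by</resp>
    <orgName>Example Organization</orgName>
  </respStmt>
</titleStmt>
```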

<editionStmt>

The <editionStmt> contains an <edition> element indicating an edition number for the TEI file. As the document goes through revisions, editors can apply different edition numbers to the file.

<publicationStmt>

The <publicationStmt> contains information on the publication of the TEI file. It must contain (in order) an <authority>, an <idno>, an <availability>, and a <date>.
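The required order can be seen in a sketch such as the one below. The authority name, the URI, the licence, and the date are all placeholder values chosen for illustration.

```xml
<publicationStmt>
  <!-- 1. The body responsible for making the file available -->
  <authority>Example Project</authority>
  <!-- 2. An identifier for the TEI file (placeholder URI) -->
  <idno type="URI">https://example.org/place/1/tei</idno>
  <!-- 3. Terms under which the file may be used -->
  <availability>
    <licence target="https://creativecommons.org/licenses/by/4.0/">
      <p>Distributed under a Creative Commons Attribution 4.0 licence.</p>
    </licence>
  </availability>
  <!-- 4. Publication date -->
  <date>2020-06-01</date>
</publicationStmt>
```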

<seriesStmt>

The <seriesStmt> contains a <title>, an <editor>, an optional <respStmt>, an <idno>, and an optional <biblScope>.

<sourceDesc>

The <sourceDesc> must contain one and only one of the following: <biblStruct> for the encoding of a published work, <msDesc> for the encoding of a manuscript, or <p> for a prose description of the source. The <p> element is used for a born-digital project.

<encodingDesc>

The <encodingDesc> contains information about the practices used in encoding the document. It must contain an <editorialDecl> and it may also contain a <classDecl>.

<editorialDecl>

The <editorialDecl> is used to describe the editorial practices followed in the encoding of the document. The <editorialDecl> must contain at least one <p> element and may contain an optional <interpretation> element.

<classDecl>

The optional <classDecl> can contain a <taxonomy> used to define classifications used elsewhere in the document.
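Put together, an <encodingDesc> following the rules above could be sketched as follows. The prose and the @xml:id values are placeholders invented for this example.

```xml
<encodingDesc>
  <!-- Required: editorial practices, with at least one <p> -->
  <editorialDecl>
    <p>Example statement of the editorial practices followed in this record.</p>
    <!-- Optional: scope of interpretive information added by editors -->
    <interpretation>
      <p>Example note on interpretive information added to the record.</p>
    </interpretation>
  </editorialDecl>
  <!-- Optional: classifications referred to elsewhere in the document -->
  <classDecl>
    <taxonomy xml:id="example-taxonomy">
      <category xml:id="example-category">
        <catDesc>Example category description</catDesc>
      </category>
    </taxonomy>
  </classDecl>
</encodingDesc>
```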

<profileDesc>

The <profileDesc> is used to describe aspects of a text not included in normal bibliographic information. The TEI has a particular focus on languages used in the text.

<revisionDesc>

The <revisionDesc> is used for tracking revisions made to the TEI file. An optional @status attribute takes the values: "draft", "incomplete", "published", or "underReview". It may contain a <change> element.
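One common way to restrict an attribute to a fixed set of values in an ODD is a closed <valList>. The sketch below shows how the @status values listed above could be declared. It is an assumption about how such a constraint might be written, not a copy of the Srophé ODD.

```xml
<elementSpec ident="revisionDesc" module="header" mode="change">
  <attList>
    <!-- Keep @status optional but replace its permitted values -->
    <attDef ident="status" mode="change" usage="opt">
      <valList type="closed" mode="replace">
        <valItem ident="draft"/>
        <valItem ident="incomplete"/>
        <valItem ident="published"/>
        <valItem ident="underReview"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
```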

<change>

The <change> element is used to describe the changes made to the document. It must take a @when attribute and a @who attribute indicating the date and person responsible for changes to the file. It may also take an optional @n attribute.
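A minimal <revisionDesc> using these attributes might look like this. The pointers in @who, the dates, and the @n values are placeholders; the format Syriaca.org uses for @who is not specified here.

```xml
<revisionDesc status="draft">
  <!-- @when and @who are required; @n is optional -->
  <change who="#example-editor" when="2020-06-01" n="1.0">Created record.</change>
  <change who="#example-editor" when="2020-06-15">Corrected a place name.</change>
</revisionDesc>
```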

The TEI Text

Where the <teiHeader> is used for metadata about the file, the next major division in a TEI file shifts to the data and begins with the <text> element.

text//@source

Given the Srophé Application's focus on the production of linked open data, it expects certain data standards throughout the text of the file. A schematron rule in the Srophé ODD requires that all @source attributes within <text> point to the @xml:id attribute on a <bibl> or <listBibl> element.
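The following sketch shows a @source attribute that would satisfy this rule: its value is a local pointer to the @xml:id of a <bibl> in the same file. The identifier, the name, and the URI are invented for illustration.

```xml
<text>
  <body>
    <listPlace>
      <place>
        <!-- @source points to the <bibl> below by its @xml:id -->
        <placeName xml:lang="en" source="#bib1-1">Example Place</placeName>
        <bibl xml:id="bib1-1">
          <ptr target="https://example.org/bibl/1"/>
          <citedRange unit="p">42</citedRange>
        </bibl>
      </place>
    </listPlace>
  </body>
</text>
```

A @source value such as "#bib9-9" with no matching @xml:id on a <bibl> or <listBibl> would be reported by the schematron rule.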

The Syriaca All ODD -- DETAILS FORTHCOMING

The Syriaca All ODD customization builds on the Srophé ODD to establish a TEI customization that applies to most of the Syriaca.org data types: persons, places, works, manuscripts, and bibliography.

Final Syriaca ODDs

Each Syriaca.org data type (persons, places, manuscripts, works, bibliography, and taxonomy) will eventually have a unique ODD that uses the Syriaca All ODD as its source. Each of these will have an encoding manual describing how that type of data is encoded by Syriaca.org researchers. Readers interested in the final stage of these chained TEI customizations are encouraged to consult the individual encoding manuals. As of June 2020, the only one currently available is the TEI Encoding Manual for The Syriac Gazetteer.

Schematron

As discussed in the Srophé ODD section, we have created the @srophe:tags attribute. Schematron can handle paths and rules that include a prefix designation (in this case "srophe:"), but ODD files with embedded schematron rules cannot properly render those rules in the RelaxNG schema. As a result, Syriaca.org uses a stand-alone schematron file for validation rules that involve the @srophe:tags attribute.

Syriaca.org indicates that certain name variants are "headwords." We do not consider these to be canonical forms of the name. Rather they are variants that appear in headings for each entity. The schematron rules in this stand-alone file ensure that there is one and only one headword in English (@xml:lang="en") and that no language contains more than one headword.
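The two headword checks could be expressed in a stand-alone schematron file roughly as follows. This is a sketch under stated assumptions: the namespace URI bound to "srophe", the tag value "#syriaca-headword", and the choice of <place> and <person> as rule contexts are all assumptions for illustration and should be checked against the actual Syriaca.org file.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- Assumed namespace URI for the srophe prefix -->
  <sch:ns prefix="srophe" uri="https://srophe.app"/>

  <!-- Check 1: exactly one English headword per entity -->
  <sch:pattern>
    <sch:rule context="tei:place | tei:person">
      <sch:assert test="count(*[@srophe:tags = '#syriaca-headword'][@xml:lang = 'en']) = 1">
        Each entity must have one and only one English headword.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <!-- Check 2: no language has more than one headword -->
  <sch:pattern>
    <sch:rule context="*[@srophe:tags = '#syriaca-headword']">
      <sch:report test="@xml:lang = preceding-sibling::*[@srophe:tags = '#syriaca-headword']/@xml:lang">
        A language may not have more than one headword.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The two checks sit in separate patterns so that each is evaluated independently of the other.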

For more on Syriaca.org headwords in the context of place data, see the Encoding Manual for the Syriac Gazetteer.
