Povestʹ vremennyx let


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2014-04-13T03:11:20+0000


Aligning the PVL with CollateOS

Synopsis

CollateOS, a preprocessing protocol for CollateX, is a knowledge-based system that normalizes variation in early Cyrillic orthography so as to improve the accuracy of the CollateX alignment of early Slavic materials. The CollateOS preprocessing protocol was developed by David J. Birnbaum (djbpitt@gmail.com) and Minas Abovyan. For more information about CollateX see http://collatex.net.

The PVL is the Rus′ primary chronicle (Повесть временных лет in modern Russian), a history of Rus′ that was compiled and edited several times before the early twelfth century, and that survives in manuscripts no earlier than 1377, which do not always agree with one another in their readings. Collation is the process of aligning the variant readings in the manuscript witnesses so that they can be compared, with the goal of determining the original text, as well as its subsequent history as it was copied repeatedly, with each new copy introducing deliberate or inadvertent deviations from its immediate source text. The digital edition available at http://pvl.obdurodon.org is based on The Povest′ vremennykh let: An interlinear collation and paradosis, ed. Donald Ostrowski, with David Birnbaum and Horace G. Lunt. (Harvard Ukrainian Research Institute Publications: Texts, 10.) Cambridge, MA: Harvard UP, 2003. ISBN 9780916458911.

System requirements

Pipeline

[Pipeline]

In the pipeline diagram above, the bottom tier lists the document formats: XML, JSON, and HTML. The middle tier lists the programming languages used for each stage of the transformation: XSLT and Python. The one item on the top tier indicates that CollateX, a Java application, is launched by a Python program, and not directly by the user. There is a wrapper script that enables the user to pipeline all of the Python processes with a single command (see below).

The original input

The PVL source file, pvl.xml (9.2M), contains a line-level (not word-level) collation of all manuscript witnesses, with the lineation based on E. F. Karskijʹs 1926 edition of the Laurentian manuscript, which has come to be regarded as the closest thing we have to a canonical reference system for the PVL.1 The information was originally marked up in troff, an old UNIX typesetting language, to support the publication of the Harvard University Press print edition. The troff files were subsequently converted first to SGML2 and then to XML; the XML version is now regarded as our archival source, and the locus where all corrections and modifications are made.

The following is a single collation block (one Karskij line) from the XML input file:

<!-- snip -->
<block column="1" line="4">
    <manuscripts>
        <Lav>персида. ватрь. тоже <lb/> и до индикиꙗ в долготу</Lav>
        <Tro>персида ватрь даже и до индикия в долготу</Tro>
        <Rad>персида. ватрь. доже и до ин<sup>д</sup>икиа. в до<lb/>лготоу</Rad>
        <Aka>персїда. ватръ. до<sup>ж</sup>и и до индикїа. в долготѹ <lb/></Aka>
        <Ipa>перь<lb/>сида. ватрь. доже и до инь<lb/>дикиꙗ. в долготу</Ipa>
        <Xle>персида. ватръ. даже и до индикїа. въ <lb/> долготоу</Xle>
    </manuscripts>
    <Bych>Персида, Ватрь, доже и до Индикия в долготу,</Bych>
    <Shakh>Персида, Ватрь доже и до Индикия въ дълготу,</Shakh>
    <Likh>Персида, Ватрь, доже и до Индикия в долготу,</Likh>
    <paradosis>
        <Ost>Персида, Ватрь доже и до Индикия въ дълготу,</Ost>
    </paradosis>
</block>
<!-- snip -->

The edition lists manuscripts first (the principal manuscripts are Lav, Tro, Rad, Aka, Ipa, and Xle; others are used selectively), then print editions (Byč, Šax, and Lix), and then our own reconstruction (Ost, which we translate to α before rendering). The XML input document can be converted to HTML in a way that aligns the manuscripts by line, but not by word within the lines; after conversion, the HTML for the preceding block looks like:

1, 4

Lav персида. ватрь. тоже | и до индикиꙗ в долготу
Tro персида ватрь даже и до индикия в долготу
Rad персида. ватрь. доже и до индикиа. в до|лготоу
Aka персїда. ватръ. дожи и до индикїа. в долготѹ |
Ipa перь|сида. ватрь. доже и до инь|дикиꙗ. в долготу
Xle персида. ватръ. даже и до индикїа. въ | долготоу
Byč Персида, Ватрь, доже и до Индикия в долготу,
Šax Персида, Ватрь доже и до Индикия въ дълготу,
Lix Персида, Ватрь, доже и до Индикия в долготу,
α Персида, Ватрь доже и до Индикия въ дълготу,
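
The line-level rendering above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the project's actual XML-to-HTML stylesheet: the function names (flatten, render_block) and the LABELS table are hypothetical, the label mapping is read off the sample above, and <lb/> is shown as a vertical bar while other inline tags such as <sup> are simply flattened.

```python
import xml.etree.ElementTree as ET

# Display labels read off the rendered sample; other tags keep their own name.
LABELS = {"Bych": "Byč", "Shakh": "Šax", "Likh": "Lix", "Ost": "α"}
# Elements that are wrappers or inline markup rather than witnesses.
NOT_WITNESSES = {"block", "manuscripts", "paradosis", "lb", "sup"}

def flatten(elem):
    """Return the text of an element, showing <lb/> as '|' and dropping other tags."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append("|" if child.tag == "lb" else flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)

def render_block(block_xml):
    """Return one heading line ('column, line') plus one line per witness."""
    block = ET.fromstring(block_xml)
    lines = [f'{block.get("column")}, {block.get("line")}']
    for elem in block.iter():  # document order: manuscripts, editions, paradosis
        if elem.tag not in NOT_WITNESSES:
            lines.append(f"{LABELS.get(elem.tag, elem.tag)} {flatten(elem)}")
    return lines

# Abbreviated copy of the block shown earlier (three witnesses only).
sample = (
    '<block column="1" line="4"><manuscripts>'
    "<Lav>персида. ватрь. тоже <lb/> и до индикиꙗ в долготу</Lav>"
    "<Rad>персида. ватрь. доже и до ин<sup>д</sup>икиа. в до<lb/>лготоу</Rad>"
    "</manuscripts>"
    "<paradosis><Ost>Персида, Ватрь доже и до Индикия въ дълготу,</Ost></paradosis>"
    "</block>"
)
print("\n".join(render_block(sample)))
```

Because the witnesses are emitted whole, this rendering aligns them only by line, which is exactly the limitation the next section addresses.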

The problem

CollateOS was designed to address the problem of aligning manuscript witnesses word by word. Since for the PVL we had already performed manual alignment at the line level, we treated each line block as a separate collation problem, thus leveraging the initial coarse alignment and removing the possibility of misalignment that might cross line blocks. Within our project, then, we converted a line-by-line alignment, like the one above, into a word-by-word one. Since the witnesses may disagree not only in the spellings of words, but also through insertions, deletions, replacements, and transpositions, this is not a simple task: it requires aligning the words in the witnesses not just by position in the line, but also by their textual content. A simple Levenshtein edit-distance calculation is too crude for this task because not all variation is equally important for alignment purposes; intelligent normalization before collation must ignore orthographic variation that is not textually significant while remaining attentive to variation that philologists consider meaningful. The remainder of this paper describes the process of applying intelligent normalization to the manuscript evidence in order to obtain more accurate results from CollateX. The normalized version shadows the original; it is used by CollateX to align the manuscripts, but the resulting edition retains all original orthographic detail.

Transformation 1: Custom XML to word-tagged TEI XML (XSLT)

We first convert the initial custom XML markup to markup that is consistent with the guidelines of the Text Encoding Initiative (TEI) using xml-to-tei.xsl. Our initial custom XML schema and the TEI XML markup have different advantages; by authoring in the former and then converting to the latter, we are able to use each at the stage in production where it is most beneficial.3

The output of the xml-to-tei.xsl process looks like:

<!-- snip -->
<div type="block" n="4">
    <p>
        <app>
            <lem wit="#lem">Персида, Ватрь доже и до Индикия въ дълготу,</lem>
            <rdgGrp type="manuscripts">
                <rdg wit="#lav">персида. ватрь. тоже <lb/> и до индикиꙗ в долготу</rdg>
                <rdg wit="#tro">персида ватрь даже и до индикия в долготу</rdg>
                <rdg wit="#rad">персида. ватрь. доже и до ин<hi rend="sup">д</hi>икиа. в до<lb/>лготоу</rdg>
                <rdg wit="#aka">персїда. ватръ. до<hi rend="sup">ж</hi>и и до индикїа. в долготѹ <lb/></rdg>
                <rdg wit="#ipa">перь<lb/>сида. ватрь. доже и до инь<lb/>дикиꙗ. в долготу</rdg>
                <rdg wit="#xle">персида. ватръ. даже и до индикїа. въ <lb/> долготоу</rdg>
            </rdgGrp>
            <rdgGrp type="editions">
                <rdg wit="#bych">Персида, Ватрь, доже и до Индикия в долготу,</rdg>
                <rdg wit="#shakh">Персида, Ватрь доже и до Индикия въ дълготу,</rdg>
                <rdg wit="#likh">Персида, Ватрь, доже и до Индикия в долготу,</rdg>
            </rdgGrp>
        </app>
    </p>
</div>
<!-- snip -->

We next split the different blocks into separate files, using split-tei-into-collation-blocks.xsl, which creates a separate TEI file for each block, writing them into a blocks subdirectory. Filenames take the form xxxx_xxxx_xxxx.xml, where the first xxxx is a decimal number representing the order of the block in the entire edition, the second is the Karskij column number, and the third is the Karskij line number. For example, 7981_0277_0028.xml is the 7981st file and represents column 277, line 28. (Line numbers may also contain hyphens and letters, about which see pp. lxxv–lxxvi of the Principles of transcription of the 2003 Harvard University Press print edition.)
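
As a small illustration of the filename scheme just described, the following sketch splits a block filename into its three parts. The function name parse_block_filename is hypothetical. Because line numbers may contain hyphens and letters, the line part is kept as a string with only the zero padding removed; how such non-numeric line numbers are actually padded in filenames is not specified here, so that behavior is an assumption.

```python
from pathlib import Path

def parse_block_filename(filename):
    """Split a name like 7981_0277_0028.xml into (order, column, line)."""
    order, column, line = Path(filename).stem.split("_", 2)
    # Order and column are decimal numbers; the line may hold hyphens or
    # letters, so it stays a string (assumption: only leading zeros are padding).
    return int(order), int(column), line.lstrip("0") or "0"

print(parse_block_filename("7981_0277_0028.xml"))  # (7981, 277, '28')
```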

Because tokenization for our alignment purposes is not as simple as splitting the input files on white space, we next use add-word-markup.xsl to wrap <w> tags around the individual words. We run this from a shell script, add-word-markup, which iterates over the 8239 blocks and writes the results into a blocks/word-tagged subfolder. The word-tagged version of the block above is:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>1, 4</title>
            </titleStmt>
            <publicationStmt>
                <p>Unpublished interim file for collation</p>
            </publicationStmt>
            <sourceDesc>
                <p>Extracted from pvl-tei.xml</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <body>
            <milestone unit="column" n="1"/>
            <div type="block" n="4">
                <p>
                    <app>
                        <lem wit="#lem">
                            <w n="1,4">Персида,</w>
                            <w n="1,4">Ватрь</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">Индикия</w>
                            <w n="1,4">въ</w>
                            <w n="1,4">дълготу,</w>
                        </lem>

                        <rdg wit="#lav">
                            <w n="1,4">персида.</w>
                            <w n="1,4">ватрь.</w>
                            <w n="1,4">тоже <lb/></w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">индикиꙗ</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготу</w>
                        </rdg>

                        <rdg wit="#tro">
                            <w n="1,4">персида</w>
                            <w n="1,4">ватрь</w>
                            <w n="1,4">даже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">индикия</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготу</w>
                        </rdg>

                        <rdg wit="#rad">
                            <w n="1,4">персида.</w>
                            <w n="1,4">ватрь.</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">ин<hi rend="sup">д</hi>икиа.</w>
                            <w n="1,4">в</w>
                            <w n="1,4">до<lb/>лготоу</w>
                        </rdg>

                        <rdg wit="#aka">
                            <w n="1,4">персїда.</w>
                            <w n="1,4">ватръ.</w>
                            <w n="1,4">до<hi rend="sup">ж</hi>и</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">индикїа.</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготѹ <lb/></w>
                        </rdg>

                        <rdg wit="#ipa">
                            <w n="1,4">перь<lb/>сида.</w>
                            <w n="1,4">ватрь.</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">инь<lb/>дикиꙗ.</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготу</w>
                        </rdg>

                        <rdg wit="#xle">
                            <w n="1,4">персида.</w>
                            <w n="1,4">ватръ.</w>
                            <w n="1,4">даже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">индикїа.</w>
                            <w n="1,4">въ <lb/></w>
                            <w n="1,4">долготоу</w>
                        </rdg>

                        <rdg wit="#bych">
                            <w n="1,4">Персида,</w>
                            <w n="1,4">Ватрь,</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">Индикия</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготу,</w>
                        </rdg>

                        <rdg wit="#shakh">
                            <w n="1,4">Персида,</w>
                            <w n="1,4">Ватрь</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">Индикия</w>
                            <w n="1,4">въ</w>
                            <w n="1,4">дълготу,</w>
                        </rdg>

                        <rdg wit="#likh">
                            <w n="1,4">Персида,</w>
                            <w n="1,4">Ватрь,</w>
                            <w n="1,4">доже</w>
                            <w n="1,4">и</w>
                            <w n="1,4">до</w>
                            <w n="1,4">Индикия</w>
                            <w n="1,4">в</w>
                            <w n="1,4">долготу,</w>
                        </rdg>
                    </app>
                </p>
            </div>
        </body>
    </text>
</TEI>
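
To show how a word-tagged file like the one above can be consumed downstream, here is a minimal sketch that collects the words of each witness. It is not the project's XMLtoJSON.py; the function name words_by_witness is hypothetical, and, as a simplification, inline markup such as <lb/> and <hi> is dropped so that each word is returned as plain text.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

def words_by_witness(tei_xml):
    """Map each witness id (the wit attribute) to its words as plain text."""
    root = ET.fromstring(tei_xml)
    result = {}
    for reading in root.iter():
        if reading.tag in (TEI + "lem", TEI + "rdg"):
            result[reading.get("wit")] = [
                "".join(w.itertext()).strip()  # text of <w>, inline tags removed
                for w in reading.iter(TEI + "w")
            ]
    return result

# Abbreviated copy of the word-tagged sample (two witnesses, a few words each).
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p><app>'
    '<lem wit="#lem"><w n="1,4">въ</w><w n="1,4">дълготу,</w></lem>'
    '<rdg wit="#rad"><w n="1,4">ин<hi rend="sup">д</hi>икиа.</w>'
    '<w n="1,4">до<lb/>лготоу</w></rdg>'
    "</app></p></div></body></text></TEI>"
)
for witness, words in words_by_witness(sample).items():
    print(witness, words)
```

A production version would need to keep the inline markup, since the edition retains all original orthographic detail.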

The entire initial XML-to-XML conversion procedure, run from a directory that contains the original custom XML file, is:

saxon -o:pvl-tei.xml -s:pvl.xml -xsl:xml-to-tei.xsl
saxon pvl-tei.xml split-tei-into-collation-blocks.xsl
./add-word-markup

Transformation 2: Word-tagged TEI XML to word-aligned XML

We transform the word-tagged TEI XML to word-aligned XML by piping it through three Python programs (XMLtoJSON.py, runCollatex.py, and JSONtoXML.py), which we invoke using a single wrapper program.

These programs call Preprocessing.py, which applies Soundex normalization rules (see below) to each token, creating a shadow version that will be used by CollateX to generate the aligned output. This module is called only by other scripts, and is not intended to be run directly by the user.

The three Python programs can be run by a wrapper program, Wrapper.py, which calls XMLtoJSON.py, pipes the output to runCollatex.py, and then pipes that output to JSONtoXML.py. To process all XML input files in the blocks subdirectory of the current working directory, run:

python /path-to-scripts/wrapper.py XMLtoJSON.py -new -i blocks

The -new switch creates a collatexOutput folder in the original input directory and writes the output there, so as not to overwrite input and interim files. When the smoke clears, blocks will contain the original word-tagged XML input files and the JSON output of XMLtoJSON.py; it will also contain a collatexOutput subdirectory with the JSON output of CollateX and the XML output created by converting the CollateX JSON output to custom (not TEI-conformant) XML. The filenames are consistent with the input filenames, with appropriate extensions (.xml and .json).

Transformation 2, part 1: Word-tagged TEI XML to CollateX JSON input

The JSON output of XMLtoJSON.py, in a format expected by CollateX, looks like the following (the u value is the column and line number, the t value is the original XML input, and the n value is the Soundex-normalized form):

{"witnesses": [
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "Персида,",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "Ватрь",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "Индикия",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "въ",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "дълготу,",
                "n": "длгт"
            }
        ],
        "id": "#lem"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "персида.",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватрь.",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "тоже <lb/>",
                "n": "тжXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "индикиꙗ",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготу",
                "n": "длгт"
            }
        ],
        "id": "#lav"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "персида",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватрь",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "даже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "индикия",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготу",
                "n": "длгт"
            }
        ],
        "id": "#tro"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "персида.",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватрь.",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "ин<hi rend=\"sup\">д<\/hi>икиа.",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "до<lb/>лготоу",
                "n": "длгт"
            }
        ],
        "id": "#rad"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "персїда.",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватръ.",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "до<hi rend=\"sup\">ж<\/hi>и",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "индикїа.",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготѹ <lb/>",
                "n": "длгт"
            }
        ],
        "id": "#aka"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "перь<lb/>сида.",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватрь.",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "инь<lb/>дикиꙗ.",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготу",
                "n": "длгт"
            }
        ],
        "id": "#ipa"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "персида.",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "ватръ.",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "даже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "индикїа.",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "въ <lb/>",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготоу",
                "n": "длгт"
            }
        ],
        "id": "#xle"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "Персида,",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "Ватрь,",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "Индикия",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготу,",
                "n": "длгт"
            }
        ],
        "id": "#bych"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "Персида,",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "Ватрь",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "Индикия",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "въ",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "дълготу,",
                "n": "длгт"
            }
        ],
        "id": "#shakh"
    },
    {
        "tokens": [
            {
                "u": "1,4",
                "t": "Персида,",
                "n": "прсд"
            },
            {
                "u": "1,4",
                "t": "Ватрь,",
                "n": "втрX"
            },
            {
                "u": "1,4",
                "t": "доже",
                "n": "джXX"
            },
            {
                "u": "1,4",
                "t": "и",
                "n": "иXXX"
            },
            {
                "u": "1,4",
                "t": "до",
                "n": "дXXX"
            },
            {
                "u": "1,4",
                "t": "Индикия",
                "n": "индк"
            },
            {
                "u": "1,4",
                "t": "в",
                "n": "вXXX"
            },
            {
                "u": "1,4",
                "t": "долготу,",
                "n": "длгт"
            }
        ],
        "id": "#likh"
    }
]}
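
The n values in the sample above are consistent with a simple recipe: strip markup and punctuation, lowercase, keep the first letter, drop vowels and jers from the rest, truncate to four characters, and pad with X. The sketch below implements only that inferred recipe. It is an assumption based on these samples, not the actual rule set in Preprocessing.py, which may treat other letters and letter combinations differently; the function name normalize and the VOWELS set (limited to letters that occur in the samples) are hypothetical.

```python
import re

# Assumption: letters dropped after the first position, limited to the vowels
# and jers that occur in the sample data.
VOWELS = set("аеиоуюяѣїѹꙗъь")

def normalize(token, length=4, pad="X"):
    """Return a fixed-length consonant skeleton for one token."""
    text = re.sub(r"<[^>]+>", "", token)                # drop inline markup
    letters = [c for c in text.lower() if c.isalpha()]  # drop punctuation, spaces
    if not letters:
        return pad * length
    kept = letters[0] + "".join(c for c in letters[1:] if c not in VOWELS)
    return kept[:length].ljust(length, pad)

for token in ["Персида,", "ватръ.", 'до<hi rend="sup">ж</hi>и', "и", "долготѹ <lb/>"]:
    print(token, "->", normalize(token))
```

Under this recipe, spellings such as долготу, долготоу, долготѹ, and дълготу all collapse to the same skeleton, so CollateX sees them as one reading for alignment purposes while the t value preserves the original form.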

Transformation 2, part 2: CollateX JSON input to CollateX JSON output

CollateX reads JSON files like the one above and outputs JSON that looks like the sample below. We set a Levenshtein distance threshold value of 1, so that approximate matches not covered by our Soundex normalization will also be aligned:

{
    "witnesses": [
        "#aka",
        "#bych",
        "#ipa",
        "#lav",
        "#lem",
        "#likh",
        "#rad",
        "#shakh",
        "#tro",
        "#xle"
    ],
    "table": [
        [
            [{
                "u": "1,1",
                "t": "се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "Се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "Се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "Се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "Се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "се",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "Се",
                "n": "сXXX"
            }]
        ],
        [
            [{
                "u": "1,1",
                "t": "на<lb/>чнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемь",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начьнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начн<hi rend=\"sup\">м<\/hi>е",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начьнѣмъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемъ",
                "n": "нчнм"
            }],
            [{
                "u": "1,1",
                "t": "начнемъ <lb/>",
                "n": "нчнм"
            }]
        ],
        [
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "по<lb/>вѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }],
            [{
                "u": "1,1",
                "t": "повѣсть",
                "n": "пвст"
            }]
        ],
        [
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию.",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сию",
                "n": "сXXX"
            }],
            [{
                "u": "1,1",
                "t": "сїю.",
                "n": "сXXX"
            }]
        ]
    ]
}
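
To make the threshold concrete, the following sketch computes a plain Levenshtein distance between normalized forms. It illustrates only the distance calculation; how CollateX applies the threshold internally is not shown here, and the function names levenshtein and near_match are hypothetical. In the block 1,4 input, the Laurentian reading тоже normalizes to тжXX while the other witnesses have джXX, one substitution apart, so a threshold of 1 allows the two to be treated as a match.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def near_match(n1, n2, threshold=1):
    """True when two normalized forms are within the distance threshold."""
    return levenshtein(n1, n2) <= threshold

print(levenshtein("тжXX", "джXX"))  # 1
print(near_match("тжXX", "джXX"))   # True
print(near_match("джXX", "длгт"))   # False
```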

Transformation 2, part 3: CollateX JSON output to custom XML

JSONtoXML.py converts the CollateX JSON output to XML like the following:

<?xml version="1.0" ?>
<witnesses>
    <block>
        <token n="прсд" u="1,4" witness="#aka">персїда.</token>
        <token n="прсд" u="1,4" witness="#bych">Персида,</token>
        <token n="прсд" u="1,4" witness="#ipa">перь<lb/>сида.</token>
        <token n="прсд" u="1,4" witness="#lav">персида.</token>
        <token n="прсд" u="1,4" witness="#lem">Персида,</token>
        <token n="прсд" u="1,4" witness="#likh">Персида,</token>
        <token n="прсд" u="1,4" witness="#rad">персида.</token>
        <token n="прсд" u="1,4" witness="#shakh">Персида,</token>
        <token n="прсд" u="1,4" witness="#tro">персида</token>
        <token n="прсд" u="1,4" witness="#xle">персида.</token>
    </block>
    <block>
        <token n="втрX" u="1,4" witness="#aka">ватръ.</token>
        <token n="втрX" u="1,4" witness="#bych">Ватрь,</token>
        <token n="втрX" u="1,4" witness="#ipa">ватрь.</token>
        <token n="втрX" u="1,4" witness="#lav">ватрь.</token>
        <token n="втрX" u="1,4" witness="#lem">Ватрь</token>
        <token n="втрX" u="1,4" witness="#likh">Ватрь,</token>
        <token n="втрX" u="1,4" witness="#rad">ватрь.</token>
        <token n="втрX" u="1,4" witness="#shakh">Ватрь</token>
        <token n="втрX" u="1,4" witness="#tro">ватрь</token>
        <token n="втрX" u="1,4" witness="#xle">ватръ.</token>
    </block>
    <block>
        <token n="джXX" u="1,4" witness="#aka">до<hi rend="sup">ж</hi>и</token>
        <token n="джXX" u="1,4" witness="#bych">доже</token>
        <token n="джXX" u="1,4" witness="#ipa">доже</token>
        <token n="тжXX" u="1,4" witness="#lav">тоже <lb/></token>
        <token n="джXX" u="1,4" witness="#lem">доже</token>
        <token n="джXX" u="1,4" witness="#likh">доже</token>
        <token n="джXX" u="1,4" witness="#rad">доже</token>
        <token n="джXX" u="1,4" witness="#shakh">доже</token>
        <token n="джXX" u="1,4" witness="#tro">даже</token>
        <token n="джXX" u="1,4" witness="#xle">даже</token>
    </block>
    <block>
        <token n="иXXX" u="1,4" witness="#aka">и</token>
        <token n="иXXX" u="1,4" witness="#bych">и</token>
        <token n="иXXX" u="1,4" witness="#ipa">и</token>
        <token n="иXXX" u="1,4" witness="#lav">и</token>
        <token n="иXXX" u="1,4" witness="#lem">и</token>
        <token n="иXXX" u="1,4" witness="#likh">и</token>
        <token n="иXXX" u="1,4" witness="#rad">и</token>
        <token n="иXXX" u="1,4" witness="#shakh">и</token>
        <token n="иXXX" u="1,4" witness="#tro">и</token>
        <token n="иXXX" u="1,4" witness="#xle">и</token>
    </block>
    <block>
        <token n="дXXX" u="1,4" witness="#aka">до</token>
        <token n="дXXX" u="1,4" witness="#bych">до</token>
        <token n="дXXX" u="1,4" witness="#ipa">до</token>
        <token n="дXXX" u="1,4" witness="#lav">до</token>
        <token n="дXXX" u="1,4" witness="#lem">до</token>
        <token n="дXXX" u="1,4" witness="#likh">до</token>
        <token n="дXXX" u="1,4" witness="#rad">до</token>
        <token n="дXXX" u="1,4" witness="#shakh">до</token>
        <token n="дXXX" u="1,4" witness="#tro">до</token>
        <token n="дXXX" u="1,4" witness="#xle">до</token>
    </block>
    <block>
        <token n="индк" u="1,4" witness="#aka">индикїа.</token>
        <token n="индк" u="1,4" witness="#bych">Индикия</token>
        <token n="индк" u="1,4" witness="#ipa">инь<lb/>дикиꙗ.</token>
        <token n="индк" u="1,4" witness="#lav">индикиꙗ</token>
        <token n="индк" u="1,4" witness="#lem">Индикия</token>
        <token n="индк" u="1,4" witness="#likh">Индикия</token>
        <token n="индк" u="1,4" witness="#rad">ин<hi rend="sup">д</hi>икиа.</token>
        <token n="индк" u="1,4" witness="#shakh">Индикия</token>
        <token n="индк" u="1,4" witness="#tro">индикия</token>
        <token n="индк" u="1,4" witness="#xle">индикїа.</token>
    </block>
    <block>
        <token n="вXXX" u="1,4" witness="#aka">в</token>
        <token n="вXXX" u="1,4" witness="#bych">в</token>
        <token n="вXXX" u="1,4" witness="#ipa">в</token>
        <token n="вXXX" u="1,4" witness="#lav">в</token>
        <token n="вXXX" u="1,4" witness="#lem">въ</token>
        <token n="вXXX" u="1,4" witness="#likh">в</token>
        <token n="вXXX" u="1,4" witness="#rad">в</token>
        <token n="вXXX" u="1,4" witness="#shakh">въ</token>
        <token n="вXXX" u="1,4" witness="#tro">в</token>
        <token n="вXXX" u="1,4" witness="#xle">въ <lb/></token>
    </block>
    <block>
        <token n="длгт" u="1,4" witness="#aka">долготѹ <lb/></token>
        <token n="длгт" u="1,4" witness="#bych">долготу,</token>
        <token n="длгт" u="1,4" witness="#ipa">долготу</token>
        <token n="длгт" u="1,4" witness="#lav">долготу</token>
        <token n="длгт" u="1,4" witness="#lem">дълготу,</token>
        <token n="длгт" u="1,4" witness="#likh">долготу,</token>
        <token n="длгт" u="1,4" witness="#rad">до<lb/>лготоу</token>
        <token n="длгт" u="1,4" witness="#shakh">дълготу,</token>
        <token n="длгт" u="1,4" witness="#tro">долготу</token>
        <token n="длгт" u="1,4" witness="#xle">долготоу</token>
    </block>
</witnesses>
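
The conversion can be sketched in a few lines of Python. This is a minimal illustration and not the code of JSONtoXML.py: the function name collatex_json_to_xml is hypothetical, the top-level "witnesses" and "table" keys are assumed from the shape of the JSON sample above, and each "t" value is assumed to be a well-formed XML fragment.

```python
import json
import xml.etree.ElementTree as ET


def collatex_json_to_xml(json_text):
    """Turn CollateX-style JSON into <witnesses>/<block>/<token> XML.

    Assumed input shape (inferred from the sample, not from JSONtoXML.py):
    {"witnesses": [siglum, ...], "table": [[cell, ...], ...]}, where each
    cell is a list of token objects with "u", "t", and "n" properties.
    """
    data = json.loads(json_text)
    sigla = data["witnesses"]
    root = ET.Element("witnesses")
    for row in data["table"]:
        # One aligned row of the table becomes one <block>.
        block = ET.SubElement(root, "block")
        for siglum, cell in zip(sigla, row):
            # An empty or null cell means the witness has no reading here.
            for tok in cell or []:
                # "t" may contain inline markup such as <lb/>, so parse it.
                token = ET.fromstring("<token>" + tok["t"] + "</token>")
                token.set("n", tok["n"])
                token.set("u", tok["u"])
                token.set("witness", "#" + siglum)
                block.append(token)
    return ET.tostring(root, encoding="unicode")
```

Parsing the "t" value instead of inserting it as a string keeps inline markup such as <lb/> and <hi> as real elements in the output, which is what the sample XML shows.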

Transformation 3: Custom XML to HTML

For web publication we use xmlobjects-to-column-html-tables.xsl to combine the blocks for each column into a single HTML fragment, which can be incorporated into a browsing interface using server-side includes and AJAX. The resulting word-aligned block looks like this:

1,4

Lav персида. ватрь. тоже и до индикиꙗ в долготу
Tro персида ватрь даже и до индикия в долготу
Rad персида. ватрь. доже и до индикиа. в долготоу
Aka персїда. ватръ. дожи и до индикїа. в долготѹ
Ipa перьсида. ватрь. доже и до иньдикиꙗ. в долготу
Xle персида. ватръ. даже и до индикїа. въ долготоу
Byč Персида, Ватрь, доже и до Индикия в долготу,
Šax Персида, Ватрь доже и до Индикия въ дълготу,
Lix Персида, Ватрь, доже и до Индикия в долготу,
α Персида, Ватрь доже и до Индикия въ дълготу,

The full edition is available at http://pvl.obdurodon.org/browser.xhtml.
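
The project performs this step in XSLT. Purely as an illustration of the underlying pivot (one row per witness, one cell per block), the same idea can be sketched in Python. The function names blocks_to_rows and rows_to_html are hypothetical, and the table markup is an assumption that does not reproduce the output of xmlobjects-to-column-html-tables.xsl.

```python
import xml.etree.ElementTree as ET
from html import escape


def blocks_to_rows(xml_text):
    """Pivot <block> elements into {witness id: [word, word, ...]}."""
    root = ET.fromstring(xml_text)
    rows = {}
    for block in root.iter("block"):
        for token in block.iter("token"):
            witness = token.get("witness", "").lstrip("#")
            # itertext() drops inline markup such as <lb/> and <hi>;
            # a real stylesheet would probably render it instead.
            word = "".join(token.itertext()).strip()
            rows.setdefault(witness, []).append(word)
    return rows


def rows_to_html(rows):
    """Render the pivoted rows as a simple HTML table fragment."""
    lines = ["<table>"]
    for witness, words in rows.items():
        cells = "".join("<td>" + escape(word) + "</td>" for word in words)
        lines.append("<tr><th>" + escape(witness) + "</th>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The sketch assumes that every witness has a token in every block, as in the sample above. If a witness could be absent from a block, the rows would need empty cells added to keep the columns aligned.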


Appendix 1: Normalization algorithm

The original XML input is retained throughout the process and used in the eventual edition, but a normalized shadow view is created as follows:

  1. Remove all opening and closing tags of the following elements (some TEI, some from alternative markup schemes that are also supported): <add>, <hi>, <unclear>, keeping their inner text node intact.
  2. Remove the following elements entirely: <del>, <gap>, <lacuna>, <lb>, <pb>.
  3. If a <choice> element is present, choose the second member of each of the following pairs: (<sic>, <corr>), (<orig>, <reg>), (<abbr>, <expan>), (<seg>, <seg>).
  4. Ignore other markup, operating just on the data content of the words from now on.
  5. Remove all punctuation.
  6. If the length of the text is 0 at this point, the node originally contained only punctuation, so return a normalized value of PUNC. Otherwise, continue.
  7. Check for presence of Arabic numbers in the string. If any are found, transform them to their Cyrillic counterparts.
  8. Apply Soundex normalization, as modified for early Slavic Cyrillic writing (see the description of the Cyrillic-specific algorithm, below), as follows:
    1. Apply orthographic transformation rules outlined in soundex-rules.xml in the following order: $manyToOne, $oneToMany, $oneToOne. This merges letterforms that frequently appear as orthographic variants that are not meaningful for collation purposes, much as the original Soundex algorithm ignored orthographic variation that interfered with retrieving English-language surnames (see the brief description and history of Soundex at http://en.wikipedia.org/wiki/Soundex).
    2. Eliminate vowels from the Special vowels category in vowels.xml, unless they are word-initial.
    3. Degeminate consonants (e.g., нн becomes н) and remove all non-initial vowels from the rest of the word. Note that the word-initial letter is retained even if it is a vowel, but other vowels are eventually deleted.
    4. If the resulting normalization is longer than 4 characters, truncate it to 4. If it is shorter than 4, pad it with X (upper-case Latin letter) characters.
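
The core of the steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the names flatten and normalize and the VOWELS inventory are hypothetical (the real lists are in vowels.xml), step 7 and the soundex-rules.xml transformations of step 8.1 are omitted, whitespace is discarded along with punctuation, and case is folded because the sample output pairs Персида, with прсд.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Step 2: elements removed together with their content.
DROP = {"del", "gap", "lacuna", "lb", "pb"}
# Step 3: the member of a <choice> pair that is kept.
PREFERRED = {"corr", "reg", "expan"}
# Hypothetical vowel inventory, jers included; not taken from vowels.xml.
VOWELS = set("аеиоуыэюяѣіїѹꙗѧѫєѡъь")


def flatten(elem):
    """Steps 1-4: return the text of a word with its markup resolved."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            members = list(child)
            # Prefer <corr>, <reg>, <expan>; otherwise take the last member,
            # which is the second <seg> of a (<seg>, <seg>) pair.
            chosen = next((m for m in members if m.tag in PREFERRED),
                          members[-1] if members else None)
            if chosen is not None:
                parts.append(flatten(chosen))
        elif child.tag not in DROP:
            # <add>, <hi>, <unclear>, and any other markup: keep the text.
            parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def normalize(text):
    """Steps 5, 6, 8.3, and 8.4 on the flattened text of one word."""
    letters = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )
    if not letters:
        return "PUNC"
    letters = letters.lower()  # assumption: case is not significant
    # Step 8.3: degeminate consonants, then drop non-initial vowels.
    kept = []
    for ch in letters:
        if kept and kept[-1] == ch and ch not in VOWELS:
            continue
        kept.append(ch)
    skeleton = kept[0] + "".join(ch for ch in kept[1:] if ch not in VOWELS)
    # Step 8.4: truncate to 4 characters or pad with Latin X.
    return skeleton[:4].ljust(4, "X")
```

Under these assumptions, normalize("повѣсть") returns пвст and normalize("сию.") returns сXXX, which agrees with the "n" values in the samples above. Step 8.2 needs no separate pass in this sketch because every non-initial vowel is dropped in step 8.3.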

Appendix 2: Soundex normalization

[Describe Soundex normalization here]

Notes

  1. Карский, Евфимий Федорович, ред. 1926. Лаврентьевская летопись. 2-е изд. Вып. 1. Повесть временных лет. Ленинград: АН СССР.
  2. Birnbaum, David J. 2000. A TEI-Compatible Edition of the Rusʹ Primary Chronicle. In Medieval Slavic Manuscripts and SGML: Problems and Perspectives, ed. Anisava Miltenova and David J. Birnbaum, 15–43. Sofia: Institute of Literature, Bulgarian Academy of Sciences, Marin Drinov Publishing House.
  3. Birnbaum, David J. 2000. The relationship between general and specific DTDs: criticizing TEI critical editions. Markup languages: theory and practice 3, no. 1:17–53.

PVL project site: http://pvl.obdurodon.org. CollateOS GitHub repository: https://github.com/obdurodon/CollateOS. Thanks to Ronald Dekker and the rest of the CollateX development team for their advice.