The main goal of this package is to develop an algorithm of conversion TEI XML into Edition Formats (HTML, DOCX, JSON).
The algorithm consists of two main parts:
- an algorithm for converting TEI XML to a format (TEITransformer);
- a front and back application architecture for creating a digital publication and integrating it into an application or website (https://github.com/Stoneberry/tei_platform.git).
The client interface is implemented by the TEITransformer
class. The user interacts with the algorithm only using this module. When initializing the object, the user must specify the scenario according to which the transformation will take place.
The algorithm for enabling visualization uses a set of XSLT stylesheets.
XSLT stands for the Extensible Stylesheet Language for Transformations. The main idea is to describe the template of the output document and fill it with the extracted information from XML. The extracting process is conducted by writing rules that specify which element should be converted and under what condition.
You can install package from PyPi website:
pip install TEItransformer
There are only two methods available in the class:
- load_tei()
- transform()
First, you need to load the data using the load_tei
method. The function receives the path to the TEI XML file and the path to the schema as input.
import TEItransformer as TTR
tei_path = "filename.xml"
schema_path = "schema.rng"
TT = TTR.TEITransformer(scenario='drama')
TT.load_tei(tei_path, schema_path=schema_path)
TT.transform(
output_format='html',
keep_all=False,
full_page=True,
enable_valid=False
)
tei_path = "filename.xml"
schema_path = "schema.rng"
TT = TTR.TEITransformer(scenario='drama')
TT.load_tei(tei_path, schema_path=schema_path)
TT.transform(
output_format='docx',
modules=[],
keep_all=False,
enable_valid=False,
output_filename='output',
odd_path=None,
custom_css_path=None)
tei_path = "filename.xml"
schema_path = "schema.rng"
TT = TTR.TEITransformer(scenario='drama')
TT.load_tei(tei_path, schema_path=schema_path)
TT.transform(
output_format='json',
enable_valid=False,
output_filename="output",
)
The concept of a scenario is key in the current algorithm. TEI allows you to describe different types of sources. It could be dramatic texts or linguistic corpora, for example. Each of these types differs not only in its purpose, but also in its structure and presentation. In order to take these features into account and create the most suitable output product, transformation and stylization rules have been defined for several data types.
At the moment, the algorithm has two scenarios:
- drama
- plain
Drama is a scenario that describes performance texts. Based on the Guidelines, the main elements of the dramatic texts characteristic were collected and their behavior in XSLT templates and styles was determined.
Plain is a simple format that has no predefined structure. It was created specifically so that the users could determine their own settings of the output. The plain format is selected as a default value.
The templates are further divided into two subcategories:
- complete
- selective
Complete template allows users to convert XML file to HTML while preserving the entire structure of the document. The only changes are that:
- all internal nodes are saved in the div tag;
- all leaf nodes with text are saved in the p tag;
- the node name becomes a class name.
TEI:
HTML:
A selective template also preserves the overall structure of the document. However, the text is saved for pre-selected items only. The main idea is to create a standardized representation without unnecessary elements.
Before proceeding to the description of the templates, it is important to define the TEI infrastructure. The organization of the elements is based on a modular system. Elements that occur under the same conditions are combined into a single module. For example, the drama module describes elements for dramatic text encoding. It is convenient when forming your own TEI vocabulary. You can easily select the necessary elements depending on the specifics and subject matter of your documents.
A similar system was created for the current algorithm. For each of the selected scenarios, lists of core elements were compiled based on the TEI Guidelines and organized into separate files by modules. Then, a document for connecting these modules was created. The user can independently determine which modules to use. By default, the modules that are specific to the scenario are connected. The overall system goes as follows:
Elements considered in modules:
module | elements |
---|---|
drama elements | titleStmt, author, castItem, castItem, castGroup, roleDesc, role, actor, camera, sound, caption, tech, view, set, set, set, pb, sp, speaker, sp, div[@type='act'], div[@type='scene'], castGroup, castList, head, div[@type='scene'], sp, sp, sp, sp, stage, titleStmt |
verse elements | l, lg |
As mentioned earlier, the program supports 3 output formats: HTML, DOCX, and JSON. Each of the formats is created by a separate class that has its own settings.
Arguments:
argument | data type | description |
---|---|---|
module | list | list of module names to include in the transformation process |
keep_all | bool | whether to keep all data (True) or select only modular ones (False) |
output_filename | str | output file path |
odd_path | str | odd file path |
custom_css_path | str | css file path |
links | bool | whether to include link processing or not |
full_page | bool | whether to create a full interface representation (True) or not (False) |
enable_valid | bool | whether to consider validity obligatory for transformation |
The user can also select one of the following output formats:
- a simple text representation which includes the text of the document stylized in accordance with predefined styles (whether by the user or default settings);
- a standalone responsive text viewer interface.
The standalone interface includes features such as:
- increasing/decreasing the text size
- file description
- automatically generated document content
- search engine.
The document also consists of an HTML structure and CSS code for stylization. However, it also requires JavaScript (JS) code for interactive work with the page. The autonomy of the file is provided by the fact that all the additional code is included in the document. Since it is just one file, it can easily be used not only for personal browsing but also to be shared with others or embedded in any resource.
- 1 — collapsible sidebar with the content of the document;
- 2 — text size increasing/decreasing buttons;
- 3 — pop-up file description window;
- 4 — pop-up search engine window;
- 5 — link to the author's wikipedia page.
The formation of the description and content of the document requires an analysis of its structure. According to the guidelines of the default text structure any bibliographical description about the file should be stored inside the fileDesc tag (TEI Consortium. n.d., f). Thus, the program extracts all the data from the tag and presents it in the pop-up file description window (see Figure 7).
As for document content generation, the algorithm searches for the head element. Despite the fact that the head tag most often indicates the beginning of a new section. It can also be part of lower-level elements. Therefore, the user needs to define the behavior of the tag in the ODD file for better document processing.
There are two types of search in:
- content
- structure of the document.
The program extracts all text elements from the document, divides them into sentences, and finds matches with the user's query using a fuzzy string searching algorithm.
The structure is searched using XPATH. The user is prompted to enter a string with XPATH expressions to select nodes. This is possible due to the preservation of the document structure at the stage of transformation of XML to HTML. The only exception is that the tag names are converted to classes. Accordingly, to find all the phrases of a certain character the expressions .sp[who=“character_name”]
should be used instead of sp[who=“character_name”]
, where the dot denotes that the pursuing sequence is a class name. It is also possible to save the search results in a txt file.
In addition, the program considers several types of links during the analysis process:
- links defined by the @ref attribute;
- links decorated in nodes specified with the attribute @type “URL” or “UTI”;
- Wikidata ID decorated in nodes specified with the attribute @type “wikidata”;
- Wikidata ID references in @key attribute in form of “wikidata:Q729569”
Arguments:
argument | data type | description |
---|---|---|
module | list | list of module names to include in the transformation process |
keep_all | bool | whether to keep all data (True) or select only modular ones (False) |
output_filename | str | output file path |
odd_path | str | odd file path |
custom_css_path | str | css file path |
enable_valid | bool | whether to consider validity obligatory for transformation |
DOCX styling is limited to the following methods:
- text alignment;
- indent the first line of text;
- adjusting the line spacing;
- changing font characters: weight, size, color and style;
- color can be specified only with RGB notation;
- size cab be specified only in cm, pt or in
Another limitation is that all styles are applied only to the paragraph, i.e. text elements. Accordingly, to add margins, for example, you need to specify the appropriate style not for the block as a whole, but for its last element or for the first element of the next block.
Arguments:
argument | data type | description |
---|---|---|
output_filename | str | output file path |
enable_valid | bool | whether to consider validity obligatory for transformation |
The transformation to the JSON format occurs by traversing all the elements of the XML tree. Each node in the tree is examined for the presence of descendants. If no children are found, the node is perceived as a leaf node. Each node is presented as a dictionary of 2 primary parameters: attributes and value. For internal nodes, the children list parameter is also defined. The choice of presenting children in the form of a list, rather than a dictionary, is due to the possibility of the presence of several tags of the same name in the structure. Everything is stored in a Python dictionary and then converted to JSON. The resulting structure goes as follows:
{node_name:
{value: “”,
attrs: {id: “node1”},
children: [
{“children1”: {...}},
{“children2”: {...}}
]
}
For each of the scenarios its own stylesheet was defined. The display rules were determined based on the typesetting standards.
Users can also specify their own rules by providing a path to the CSS file. In this case, the program analyzes the user's data and compares it with the default sheets. There are three possible cases:
- the user defined rule set for a tag that is not specified in the default styles;
- the user defined a not specified rule for an existing tag in the default styles;
- the user defined an already specified rule for an existing tag in the default styles.
In the first case, the entire rule set is added to the style constructor. In the second case, the program adds a new rule to the existing set of rules. Finally, the program replaces the existing rule with a new one.
Due to the fact that at the transformation stage, the tag names become the name of the class. It is important that the user uses class notation when defining rules for tags.
ODD is used for allowing the user to customize the output of the algorithm in question. Since ODD is based on XML, the user does not need to have additional knowledge like XSLT to adjust the transformation behavior.
However, for now, this ability is limited to the processing of several types of tag behaviour with the tag. A single element can be specified with the tag. The tag, as a part of , documents the rules for the node rendering. For example, CSS style name identification through a @cssClass attribute or predefined function name selection through @behaviour attribute. In the present paper, I consider only two following functions of @behaviour attribute: omit mode to ignore the element and heading mode to view the element as part of a document content. The program also takes into account conditions in which the element occurs via @predicate attribute. Nevertheless, it is worth mentioning that the CSS is sufficient to implement most of the behaviour functions. For instance, line display: none can remove an item from the document.