Important: This repository is still a work in progress.
This repository collects concepts, procedures, and simple XSLT files for text processing, e.g. to convert InDesign documents (.idml) or Office formats (.fodt, .odt, .docx) to simplified XML. The simplified XML may subsequently serve as a foundation from which nested TEI-P5-XML can be generated.
The following sections explain the scripts that may be used to process the source files. The scripts sections are followed by a section that exemplifies several concepts, workflows, and approaches to text processing and transformation of XML based text files into TEI-XML.
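As a rough illustration of the step from flat, simplified XML to nested, TEI-like markup, the following minimal sketch groups a flat sequence of headings and paragraphs into nested divisions. The element names `h` and `p` and the `level` attribute are assumptions made for this example only; they are not taken from the output of the XSLT files in this repository, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat input: headings carry their depth in a "level" attribute.
FLAT = """<doc>
  <h level="1">Chapter</h>
  <p>Intro.</p>
  <h level="2">Section</h>
  <p>Details.</p>
  <h level="1">Next chapter</h>
</doc>"""


def nest(flat_root):
    """Build a <body> with nested <div>/<head>/<p> from a flat element list."""
    body = ET.Element("body")
    # Stack of (level, container); the body itself counts as level 0.
    stack = [(0, body)]
    for child in flat_root:
        if child.tag == "h":
            level = int(child.get("level", "1"))
            # Close every open division at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            div = ET.SubElement(stack[-1][1], "div")
            head = ET.SubElement(div, "head")
            head.text = child.text
            stack.append((level, div))
        else:
            # Any other element is carried over as a plain-text paragraph.
            p = ET.SubElement(stack[-1][1], "p")
            p.text = child.text
    return body


nested = nest(ET.fromstring(FLAT))
print(ET.tostring(nested, encoding="unicode"))
```

The stack keeps track of the currently open divisions, so a level-2 heading ends up inside the preceding level-1 division, while the next level-1 heading starts a new sibling.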
The scripts are organised in folders that follow this convention: input_format/software/output_format/xslt_file.xsl
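The convention can be read mechanically. The sketch below splits a script path into its four components; the example path `fodt/saxon/json/fodt2base_json_reduced.xsl` and the names `ScriptLocation` and `parse_script_path` are assumptions chosen for illustration, and the actual folder names in the repository may differ.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ScriptLocation(NamedTuple):
    input_format: str
    software: str
    output_format: str
    xslt_file: str


def parse_script_path(path):
    """Split 'input_format/software/output_format/xslt_file.xsl' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or not parts[3].endswith(".xsl"):
        raise ValueError(f"path does not follow the folder convention: {path}")
    return ScriptLocation(*parts)


# Hypothetical path, used only to demonstrate the convention.
loc = parse_script_path("fodt/saxon/json/fodt2base_json_reduced.xsl")
print(loc.input_format, loc.software, loc.output_format, loc.xslt_file)
```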
The Flat Open Document Format (FODT) is a clearly and concisely structured XML format representing the structure of a LibreOffice text document. Every ODT/F document created by LibreOffice may be saved as an FODT file.
FODT contains all the metadata, styles, and structural information one may expect from a LibreOffice document. Its conciseness, however, makes it easy to transform. The schematic below gives an overview of the basic structure of an FODT file.
Unlike ODT/F and DOCX, an FODT file is a standalone XML file, not an archive file that bundles several (heterogeneous) files. It may be opened and processed as XML without changing the file extension.
The fact that FODT is a single file format makes it especially useful for workflows that incorporate version control via Git or SVN.
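Because an FODT file is plain XML, it can be read with any XML parser. The following minimal sketch uses Python's standard library to pull the paragraph text out of a heavily reduced FODT document. The sample document is an assumption written for this example (a real file saved by LibreOffice carries many more namespaces, styles, and metadata); the two namespace URIs are those of the OpenDocument format.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespaces needed for this reduced example.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Hand-written, heavily reduced stand-in for a real .fodt file.
SAMPLE_FODT = """<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">A heading</text:h>
      <text:p text:style-name="Standard">First paragraph.</text:p>
      <text:p text:style-name="Standard">Second <text:span>paragraph</text:span>.</text:p>
    </office:text>
  </office:body>
</office:document>"""

root = ET.fromstring(SAMPLE_FODT)

# Collect the full text of every paragraph, including text inside spans.
paragraphs = ["".join(p.itertext()) for p in root.iterfind(".//text:p", NS)]
print(paragraphs)
```

For a file on disk, `ET.parse("flat_open_office_document.fodt").getroot()` would take the place of `ET.fromstring(...)`.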
- fodt2base_XML_elements.xsl (oxygen, saxon)
- fodt2base_XML_elements_attributes.xsl (oxygen, saxon)
- fodt2base_XML_complex.xsl (oxygen)
- fodt2base_HTML_elements_attributes.xsl (oxygen)
- fodt2base_json_elements.xsl (saxon)
- fodt2base_json_reduced.xsl (saxon)
Nothing here yet …
The IDML specification may.
- idml2base_XML.xsl.xsl (oxygen)
Nothing here yet …
The JavaScript Object Notation (JSON, see 1, 2, 3) provides a slim, hierarchical data structure that essentially consists of key-value pairs.
Keys must be strings; values may be strings (in "…"), numbers, booleans, null, arrays (aka lists, in […]), or, in turn, JSON objects (in {…}).
A basic, explanatory JSON structure modelling a bibliographical entry:
{
  "title": {
    "main": "Digital Humanities",
    "sub": "Eine Einführung"
  },
  "editors": [
    "Jannidis, Fotis",
    "Kohle, Hubertus",
    "Rehbein, Malte"
  ],
  "published": 2017,
  "publisher": "J.B. Metzler",
  "chapters": [
    {
      "author": "Thaller, Manfred",
      "title": "Geschichte der Digital Humanities",
      "pages": [ 3, 12 ]
    },
    {
      "author": "Thaller, Manfred",
      "title": "Digital Humanities als Wissenschaft",
      "pages": [ 13, 18 ]
    }
  ],
  "price": {
    "ebook": 22.99,
    "softcover": 29.95
  }
}
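To show how such a structure maps onto native data types, the following minimal sketch loads an abbreviated copy of the entry above with Python's standard `json` module. JSON objects become dictionaries, arrays become lists, and numbers become `int` or `float`. The variable names are chosen for this example only.

```python
import json

# Abbreviated copy of the bibliographical entry shown above.
ENTRY = """{
  "title": { "main": "Digital Humanities", "sub": "Eine Einführung" },
  "editors": [ "Jannidis, Fotis", "Kohle, Hubertus", "Rehbein, Malte" ],
  "published": 2017,
  "chapters": [
    {
      "author": "Thaller, Manfred",
      "title": "Geschichte der Digital Humanities",
      "pages": [ 3, 12 ]
    }
  ],
  "price": { "ebook": 22.99, "softcover": 29.95 }
}"""

entry = json.loads(ENTRY)

# Nested objects are reached key by key, arrays by index.
main_title = entry["title"]["main"]
first_editor = entry["editors"][0]
first_page, last_page = entry["chapters"][0]["pages"]

print(main_title, "edited by", first_editor)
print("First chapter covers pages", first_page, "to", last_page)
```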
- Flanders, Julia/Jannidis, Fotis (2019): Data modeling in a digital humanities context, in: Flanders, J./Jannidis, F.: The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources. London, pp. 3–25.
- Flanders, Julia/Jannidis, Fotis (2019): A gentle introduction to data modeling, in: Flanders, J./Jannidis, F.: The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources. London, pp. 26–96.
- Jannidis, Fotis (2017): Grundlagen der Datenmodellierung, in: Jannidis, F. et al.: Digital Humanities. Eine Einführung. Stuttgart, pp. 99–108.
- Vogeler, Georg/Sahle, Patrick (2017): XML, in: Jannidis, F. et al.: Digital Humanities. Eine Einführung. Stuttgart, pp. 128–146.
The oXygen XML Editor provides one of the best proprietary integrated development environments (IDE) for working with XML and XML-related technologies.
Additionally, the oXygen XML Editor gives scholars the opportunity to set up their own individualized working environment by implementing a document type association (DTA). A DTA within oXygen is a bundle of configurations and configuration files, transformation scenarios, and CSS files that generate an individualized GUI overlay for editing, transforming, and querying XML files. One well-known DTA, at least in the German-speaking DH community, is Ediarum.
- Beautiful Soup 4
- lxml
  - Docs: https://lxml.de/
  - Tutorial: https://lxml.de/tutorial.html
Important: Before moving, copying, or deleting files in your file system, please make sure that you know what you are doing! Careless moving and deleting of files may have disastrous consequences! The walkthrough below is based on my own file system and system setup and should only be used as a guideline, with due care.
Since October 2019, the Saxon/C library for XSLT & XQuery processing has had a native Python API (C++, Java, and PHP APIs are available as well). Below, I give a short walkthrough on how to set everything up on macOS for use in a Jupyter Notebook (the walkthrough follows the information provided with the Saxon/C library):
- Installing Python 3 and the Jupyter library.
  - Download and install Python from the official website (don’t forget to let the installer add Python to the PATH variable).
  - Install the Jupyter library, e.g. via pip:
    pip3 install jupyter
  - Install the Cython library, e.g.:
    pip3 install Cython
- Installing Saxon/C for Python on macOS (please consult the README file distributed with the Saxon/C library as well).
  - Download the Saxon/C-HE ZIP file from the Saxonica website.
  - Navigate into your Downloads folder:
    cd ~/Downloads/
  - Create a temporary folder for the files, e.g.:
    mkdir temp_saxon
  - Move the ZIP file into the temporary folder:
    mv libsaxon-HEC-mac-setup-v1.2.0.zip temp_saxon/
  - Move into the temporary folder and extract the ZIP file, e.g.:
    cd temp_saxon
    unzip libsaxon-HEC-mac-setup-v1.2.0.zip
  - Copy the files into your /usr/local/lib/ folder:
    cp libsaxonhec.dylib /usr/local/lib/
    cp -r rt /usr/local/lib/
  - Set the following environment variables in your .bash_profile or your .zshrc shell configuration file:
    export JET_HOME=/usr/local/lib/rt
    export DYLD_LIBRARY_PATH=$JET_HOME/lib/jetvm:$DYLD_LIBRARY_PATH
  - Now move back into your temporary folder and from there into the folder where the Python API files are located:
    cd ~/Downloads/temp_saxon/Saxon.C.API
  - Copy the folder python-saxon to where you want to keep the Python extension. From this location the Saxon/C library will be imported into your Python scripts, e.g.:
    cp -r python-saxon /Users/houzi/
  - Then move into this folder and build the Python extension:
    cd /Users/houzi/python-saxon
    python3 saxon-setup.py build_ext -if
  - Now you may import the saxonc library from your scripts after adding your python-saxon folder to sys.path. In your script or console, start with the following:
    import sys
    sys.path.append("/Users/houzi/python-saxon")
    import saxonc
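Before running a transformation, the setup steps above can be sanity-checked from Python. The following is a minimal sketch using only the standard library; the function name `check_saxon_setup` and the check labels are assumptions made for this example and are not part of Saxon/C. It only inspects files and environment variables and does not load the library itself.

```python
import os
from importlib.machinery import PathFinder
from pathlib import Path


def check_saxon_setup(api_dir, lib_dir="/usr/local/lib", env=None):
    """Return a dict of named checks; True means the check passed."""
    env = os.environ if env is None else env
    api_dir = Path(api_dir)
    lib_dir = Path(lib_dir)
    jet_home = env.get("JET_HOME", "")
    return {
        "libsaxonhec.dylib present": (lib_dir / "libsaxonhec.dylib").is_file(),
        "rt folder present": (lib_dir / "rt").is_dir(),
        "JET_HOME set": bool(jet_home),
        "DYLD_LIBRARY_PATH mentions JET_HOME": bool(jet_home)
        and jet_home in env.get("DYLD_LIBRARY_PATH", ""),
        "API folder present": api_dir.is_dir(),
        # Looks for a module named saxonc inside api_dir without importing it.
        "saxonc found in API folder": PathFinder.find_spec("saxonc", [str(api_dir)])
        is not None,
    }


# The folder below is the example location used in this walkthrough.
report = check_saxon_setup("/Users/houzi/python-saxon")
for name, ok in report.items():
    print(("ok     " if ok else "MISSING"), name)
```

A failed check only narrows down which step to revisit; the README file shipped with the Saxon/C library remains the reference for the installation itself.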
# import the sys library to be able to append your Saxon/C Python API folder to the library loading path
import sys
sys.path.append("/Users/houzi/python-saxon")
# import the Saxon/C library
import saxonc
# import other libraries you may need, e.g. JSON
import json

with saxonc.PySaxonProcessor(license=False) as proc:
    print(proc.version)
    # initialize the XSLT 3.0 processor
    xsltproc = proc.new_xslt30_processor()
    # set the directory where your XML & XSLT files are located
    xsltproc.set_cwd('docs')
    # set the XSLT 3.0 processor’s result to a raw string
    xsltproc.set_result_as_raw_value(True)
    # set your source file, e.g. the XML file you want to transform, on the XSLT 3.0 processor
    xsltproc.set_initial_match_selection(file_name="flat_open_office_document.fodt")
    # apply your XSLT stylesheet on the XSLT 3.0 processor
    result = xsltproc.apply_templates_returning_string(stylesheet_file="fodt2base_json_reduced.xsl")
    # write the string output to a file, e.g. to a JSON file
    with open("test.json", 'w') as file:
        file.write(result)
    # load the JSON string result for further work within Python
    j = json.loads(result)
    # print the result
    print(j)
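If the transformation result contains non-ASCII characters (e.g. German umlauts), it may help to validate the string as JSON first and then write it with an explicit UTF-8 encoding. The sketch below shows one way to do this with the standard library only; the helper name `save_json_result` and the sample string are assumptions for illustration, and the sample stands in for the `result` string returned by the stylesheet.

```python
import json
import tempfile
from pathlib import Path


def save_json_result(result, path):
    """Parse a JSON string, write it pretty-printed as UTF-8, and return the data."""
    data = json.loads(result)  # raises json.JSONDecodeError on invalid JSON
    text = json.dumps(data, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
    return data


# Hypothetical stand-in for the string produced by the XSLT transformation.
sample_result = '{"title": "Eine Einführung", "paragraphs": 2}'

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test.json"
    data = save_json_result(sample_result, target)
    written = target.read_text(encoding="utf-8")

print(written)
```

Parsing before writing means an invalid result is caught before a broken file is left on disk.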
- Docs: http://www.exist-db.org/exist/apps/doc/
- Tutorial: https://howto.acdh.oeaw.ac.at/blog/books/how-to-build-a-digital-edition-web-app/
MIT License
Copyright (c) 2019–2020 Max Grüntgens (猴子)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This work is licensed under a Creative Commons Attribution 4.0 International License.