
TEI serializer #30

Open
tenzin3 opened this issue Sep 4, 2024 · 5 comments

Comments

tenzin3 commented Sep 4, 2024

Is there a TEI serializer for STAM Python?

We currently plan to develop a text API based on the DTS specification. That specification requires the response text to be in TEI format, and since we use STAM as our data format, a built-in TEI serializer would be very helpful.

proycon (Collaborator) commented Sep 4, 2024 via email

dirkroorda (Member) commented:

Not to speak of how to serialize annotations whose textual targets do not form a clean hierarchy!

awagner-mainz commented:

TL;DR: +1

This is more of a comment than a question/issue, my apologies:

I would also be interested in TEI XML round-tripping. So far, I know of several tools that each have their own approach or internal implementation, but they are not as generic as STAM.

I wish there were some standardization here. STAM seems to go in this direction and identifies as a "pivot model", but it does not address TEI XML serialization.

Perhaps an example of a mapping configuration and processor would be feasible for the STAM project? Or checking whether a STAM object is serializable to TEI XML as a precondition for such processing (e.g. with stam-vocab or via a JSON schema)? Hopefully at some point someone will come up with such things.

proycon (Collaborator) commented Sep 26, 2024 via email

dirkroorda (Member) commented Oct 1, 2024

Related to this, I have been pondering the following question:

We have 10,000 pages of 17th-century Italian letters, up-converted from Word + Excel to TEI (with customisations).
After that we convert them to Text-Fabric, and from there we use machinery to mark up ca. 12,000 entities.
The entities are delivered as a TSV file and then baked into a new Text-Fabric dataset.

But what about porting the entities back to TEI? The difficult thing is that the entities may span other elements, e.g. note, hi, pb, lb, and possibly even p elements.

It is hard to weave those entities back in: I think we have to fragment them around the intervening markup, while also allowing some markup within the entities. This is getting messy.
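That fragmentation step can be sketched in plain Python (a hypothetical helper, not tied to Text-Fabric or any TEI tooling): given an entity as a character span in the plain text, and the character spans of the text nodes between which other markup intervenes, the entity is simply cut at every text-node boundary it crosses.

```python
# Sketch: fragment one entity span around intervening markup.
# "entity" and "segments" are both (start, end) character offsets
# into the same plain-text rendering of the document.
def fragment(entity, segments):
    """Return the sub-spans of `entity` that fall inside each text
    segment, in document order; segments the entity misses are skipped."""
    start, end = entity
    parts = []
    for seg_start, seg_end in segments:
        lo, hi = max(start, seg_start), min(end, seg_end)
        if lo < hi:  # non-empty overlap -> one fragment
            parts.append((lo, hi))
    return parts

# An entity spanning three text nodes, interrupted by e.g. <pb/> and <hi>:
print(fragment((5, 22), [(0, 10), (10, 15), (15, 30)]))
# → [(5, 10), (10, 15), (15, 22)]
```

Each resulting fragment can then be wrapped in its own element without crossing any existing element boundary, which is exactly what makes the in-line approach so messy.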

Also, there is every reason to assume that future runs of entity detection will result in different entities, so we would have to regenerate all those TEI files.

I would prefer to leave the TEI as is, and deliver the entities as stand-off annotations to the TEI instead.

One way to do that could be to address text in a concrete TEI file by means of the stack of its containing elements, e.g.

tei[0]/text[0]/body[0]/div[2]/p[3]/_text_[4] = "foo"

where _text_[4] refers to the fourth text node of the element. You get multiple text nodes inside an element when its content is interrupted by other elements. Text nodes are always taken maximally.

Then we can address every piece of text in Text-Fabric in this way, and from there we can compute the exact places in the XML file where the entities start and end, in terms of XPath-like expressions, so that XML processing tools can find them again.
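Such element-stack paths can be resolved with nothing but the standard library. A minimal sketch, assuming 0-based indices throughout and the `_text_` convention described above (elem.text plus each child's tail gives the maximal text nodes, in order):

```python
# Sketch: resolve paths like tei[0]/text[0]/body[0]/div[2]/p[3]/_text_[4]
# against an XML tree. The path syntax and 0-based indexing are assumptions.
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"(\w+)\[(\d+)\]")

def text_nodes(elem):
    """The maximal text nodes of an element: the text before the first
    child, then the tail after each child (None becomes '')."""
    nodes = [elem.text or ""]
    nodes.extend(child.tail or "" for child in elem)
    return nodes

def resolve(root, path):
    """Follow tag[index] steps down from the root; a final _text_[i]
    step selects the i-th text node of the element reached so far."""
    elem = None
    for tag, idx in STEP.findall(path):
        idx = int(idx)
        if tag == "_text_":
            return text_nodes(elem)[idx]
        if elem is None:  # first step names the root element itself
            assert root.tag == tag and idx == 0
            elem = root
        else:  # idx-th child with this tag name
            elem = [c for c in elem if c.tag == tag][idx]
    return elem

doc = ET.fromstring(
    "<tei><text><body>"
    "<div><p>first</p></div>"
    "<div><p>a<hi>b</hi>c<pb/>foo</p></div>"
    "</body></text></tei>"
)
print(resolve(doc, "tei[0]/text[0]/body[0]/div[1]/p[0]/_text_[2]"))
# → foo
```

Here the second p has three maximal text nodes ("a", "c", "foo") because hi and pb interrupt its content, which is exactly the situation the addressing scheme has to handle.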

I really do think that TEI is good for archiving, but that the results of processing TEI are not always fit to land in TEI format again, or, for that matter, in XML. It is much better to use plain text plus standoff annotations.
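To illustrate that last point, here is a minimal standoff sketch in plain Python (deliberately not STAM's own model; the text and field names are made up): the text stays frozen, and each entity-detection run simply produces a fresh set of offset-based records, with no XML regeneration needed.

```python
# Sketch: plain text plus standoff annotations. Re-running entity
# detection only replaces the annotation records; the text is untouched.
text = "Galileo wrote to Castelli from Florence."

annotations = [
    {"id": "e1", "begin": 0, "end": 7, "type": "person"},
    {"id": "e2", "begin": 17, "end": 25, "type": "person"},
    {"id": "e3", "begin": 31, "end": 39, "type": "place"},
]

for ann in annotations:
    # Offsets remain valid exactly because the text is never rewritten.
    surface = text[ann["begin"]:ann["end"]]
    print(ann["id"], repr(surface), ann["type"])
# → e1 'Galileo' person
# → e2 'Castelli' person
# → e3 'Florence' place
```

This is essentially what STAM's text selectors do in a more principled way; the TEI file can then stay exactly as archived.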
