TEI serializer #30
Comments
On Wed Sep 4, 2024 at 11:11 AM CEST, Tenzin Tsundue wrote:
Is there a TEI serializer for STAM Python?
We currently plan to develop a text API based on the [DTS specifications](https://distributed-text-services.github.io/specifications/).
That specification requires the response text to be in TEI format, and since we use STAM as our data format, a built-in TEI serializer would be very helpful.
No. A serialisation from STAM to TEI would only be feasible if a STAM
model were strictly constrained to TEI's vocabulary, but STAM by definition
allows any kind of vocabulary and doesn't predefine anything. TEI,
on the other hand, defines a lot of vocabulary. So this is something that needs to be
implemented at a higher level, and it depends greatly on your use-case and on how you decide to map
whatever vocabulary you use to TEI. You could build a library that
does this (for your particular use-case) using stam-python.
We do have a tool for the reverse, mapping formats like TEI XML to STAM
(via `stam fromxml` in stam-tools). That doesn't help you much here, but
the TEI configuration there (https://github.com/annotation/stam-tools/blob/master/config/fromxml/tei.toml)
may give you an idea what vocabulary mappings could look like.
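
To make the idea of a use-case-specific mapping concrete, here is a minimal sketch in plain Python. It deliberately does not use the stam-python API: `Span`, `TEI_MAP` and `to_tei_fragment` are hypothetical names invented for this illustration, and the vocabulary keys are made up. It assumes the annotations are non-empty and properly nested (no partial overlap), which is exactly the precondition a real serializer would have to check first.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Span:
    begin: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    key: str    # vocabulary key from your own annotation model


# Hypothetical mapping from a project-specific vocabulary to TEI element names.
TEI_MAP = {"paragraph": "p", "person": "persName", "place": "placeName"}


def to_tei_fragment(text, spans, mapping=TEI_MAP):
    """Serialize properly nested, non-empty spans as inline TEI-like elements."""
    opens, closes = {}, {}
    for order, span in enumerate(spans):
        if span.key not in mapping:
            raise KeyError(f"no TEI mapping for vocabulary key {span.key!r}")
        opens.setdefault(span.begin, []).append((order, span))
        closes.setdefault(span.end, []).append((order, span))

    out = []
    for i in range(len(text) + 1):
        # Close the shortest (innermost) spans first; ties close in reverse input order.
        for _, span in sorted(closes.get(i, []),
                              key=lambda item: (item[1].end - item[1].begin, -item[0])):
            out.append(f"</{mapping[span.key]}>")
        # Open the longest (outermost) spans first; ties open in input order.
        for _, span in sorted(opens.get(i, []),
                              key=lambda item: (item[1].begin - item[1].end, item[0])):
            out.append(f"<{mapping[span.key]}>")
        if i < len(text):
            out.append(escape(text[i]))
    return "".join(out)


text = "Anna went to Rome."
spans = [Span(0, 18, "paragraph"), Span(0, 4, "person"), Span(13, 17, "place")]
fragment = to_tei_fragment(text, spans)
print(fragment)
```

Raising on an unmapped key is a design choice for the sketch: it makes the gap between an open vocabulary and TEI's fixed vocabulary visible instead of silently dropping annotations.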
Not to speak of how to serialize annotations whose textual targets do not form a clean hierarchy!
TL;DR: +1. This is more of a comment than a question/issue, I apologize. I would also be interested in TEI XML roundtripping. So far, I know of these tools, which each have their own approach or internal implementation, but they are not as generic as STAM:
I wish there was some standardization here. STAM seems to go in this direction and identifies as a "pivot model", but does not address the TEI XML serialization. Perhaps an example of a mapping configuration and processor would be feasible for the STAM project? Or checking a STAM object if it is serializable to TEI XML as a precondition for such processing (f. ex. with stam-vocab or via json schema)? Hopefully at some point someone will come up with such things.
@awagner-mainz Thanks for your comment! That is most welcome. I'll have
a closer look at all the links you provided.
> I wish there was some standardization here. STAM seems to go in this direction and identifies as a "pivot model", but does not address the TEI XML serialization.
> Perhaps an example of a mapping configuration and processor would be
> feasible for the STAM project? Or checking a STAM object if it is
> serializable to TEI XML as a precondition for such processing (f. ex.
> with stam-vocab or via json schema)? Hopefully at some point someone
> will come up with such things.
Yes, a mapping and processor are definitely feasible on top of STAM. I
do wonder to what extent it can really be done generically, considering
that one TEI often differs from the other and there are often
use-case-specific issues. But a base template sounds feasible.
Stam-vocab might indeed provide the means to then test the vocabulary in
a STAM model programmatically, once a mapping is defined. Aside from the
vocabulary there is the overlap/hierarchy issue @dirkroorda raised
above, but that too is something that could be checked.
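
As a rough illustration of what such a precondition check could look like, the sketch below tests two things on plain tuples: whether every vocabulary key has a mapping, and whether any two spans cross (overlap without one containing the other). It is not based on stam-vocab or the stam-python API; `crosses`, `precheck` and the `(key, begin, end)` tuple layout are assumptions made for this example.

```python
from itertools import combinations


def crosses(a, b):
    """True if spans a and b (begin, end) overlap without one containing the other."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return a0 < b0 < a1 < b1


def precheck(annotations, mapping):
    """Return (unmapped_keys, crossing_pairs) for (key, begin, end) annotations.

    An empty result on both sides means the annotations could, in principle,
    be written out as a single XML hierarchy under the given mapping.
    """
    unmapped = sorted({key for key, _, _ in annotations if key not in mapping})
    crossing = [
        (a, b)
        for a, b in combinations(annotations, 2)
        if crosses(a[1:], b[1:])
    ]
    return unmapped, crossing


annotations = [("person", 0, 4), ("quote", 2, 9), ("place", 13, 17)]
mapping = {"person": "persName", "place": "placeName"}
unmapped, crossing = precheck(annotations, mapping)
print(unmapped)
print(crossing)
```

The pairwise comparison is quadratic in the number of annotations; for large models a sort-and-stack sweep would be the obvious replacement, but the pairwise form keeps the definition of "crossing" easy to read.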
I can't promise I can get to this myself anytime soon, as I'd need a
proper use-case to justify it to my employer. (The project that funded
STAM thus far has reached its end, but I have every intention of
continuing.) But I do recognize the value (and the challenges) in a TEI
serialisation.
Related to this, I am pondering the following question: we have 10,000 pages of 17th-century Italian letters, upconverted from Word + Excel to TEI (with customisations). But what about porting the entities back to TEI? The difficult thing is that the entities may span other elements, e.g.
It is hard to weave those entities in. I think we have to fragment them around intervening markup, while we can also allow some markup within the entities. This is getting messy. Also, there is every reason to assume that future runs of entity detection will result in different entities, so we would have to regenerate all those TEI files.
I prefer to leave the TEI as is, and deliver the entities as stand-off annotations to the TEI instead. A way to do that could be to address text in a concrete TEI file by means of the stack of its containing elements, e.g.
where Then we can address every piece of text in Text-Fabric in this way, and from there we can generate the exact places in the XML file where the entities start and end, in terms of XPath-like expressions, so that XML processing tools can find them again.
I really think that TEI is good for archiving, but that the results of processing TEI are not always fit to land in TEI format again, or, for that matter, in XML. It is much better to use plain text plus stand-off annotations.
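
One way to picture the addressing idea is the following sketch, which uses only Python's standard `xml.etree.ElementTree` and not Text-Fabric or STAM. It walks an XML fragment, records for every text node an XPath-like address plus its offsets in the concatenated plain text, and then splits an entity span into one fragment per text node. The function names `text_chunks` and `locate`, the address format, and the sample sentence are all assumptions for illustration; namespaces (which real TEI files use) are ignored here.

```python
import xml.etree.ElementTree as ET


def text_chunks(root):
    """Return (address, start, end) for every text node, in document order.

    address is an XPath-like string built from the stack of containing
    elements; start and end are offsets into the concatenated plain text.
    """
    chunks, pos = [], 0

    def walk(elem, path):
        text_no = 0       # position of the text node among its siblings
        sibling_no = {}   # per-tag counter for child elements

        def emit(s):
            nonlocal pos, text_no
            if s:
                text_no += 1
                chunks.append((f"{path}/text()[{text_no}]", pos, pos + len(s)))
                pos += len(s)

        emit(elem.text)
        for child in elem:
            sibling_no[child.tag] = sibling_no.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{sibling_no[child.tag]}]")
            emit(child.tail)

    walk(root, f"/{root.tag}")
    return chunks


def locate(chunks, begin, end):
    """Split the plain-text span [begin, end) into per-text-node fragments."""
    return [
        (address, max(begin, s) - s, min(end, e) - s)
        for address, s, e in chunks
        if s < end and begin < e
    ]


root = ET.fromstring("<p>Letter to <hi>Cosimo</hi> de' Medici, Florence</p>")
chunks = text_chunks(root)
plain = "".join(root.itertext())
begin = plain.index("Cosimo de' Medici")
fragments = locate(chunks, begin, begin + len("Cosimo de' Medici"))
for fragment in fragments:
    print(fragment)
```

Here the entity spans the `hi` element and part of the following text, so it comes back as two fragments, each with an address and local offsets. The TEI file itself stays untouched, and the fragments can be regenerated whenever entity detection is rerun.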