DCL JATS import to TB using Pensoft pathway? #103

Open
myrmoteras opened this issue Sep 27, 2023 · 4 comments

@myrmoteras
Contributor

myrmoteras commented Sep 27, 2023

issue

We have many old scanned publications with few illustrations that take a lot of time to process via GGI because of the OCR errors we get.

A solution is to send those articles/journals to DCL and get them converted to JATS, maybe a simple, minimal version of TaxPub.

To import them, we would use the TaxPub import pathway we have for the Pensoft journals, which would allow us to use GGe to annotate names and, mostly manually, material citations.

feasibility

@gsautter how feasible is this from the TB import point of view?

@gsautter

In general, this should be quite feasible ... however, most likely not via the Pensoft pathway, as that is specifically tailored to their use of TaxPub (and variations therein over time), so more generic JATS would most likely require a somewhat different approach, as well as a good few additional taggers to run after the fact.

Another aspect is that while DCL transcripts are good, they might not be perfect, so it'd be a shame to disconnect the XML from the page images ... an alternative approach would be to take the XML for structure and text, still run OCR, create an IMF, and then structure and OCR correct the latter by means of the JATS ... which in essence will give us the best of both worlds.

@myrmoteras
Contributor Author

myrmoteras commented Sep 27, 2023

Yes, we can combine and make the best out of it, but this takes a lot of time.

We assume that DCL JATS is good enough for our purpose, and the few OCR issues we can fix with GGe. They are the OCR specialists.

Why not make a simple version first so we can start processing, and then later add the more complex version that depends on a few major changes you need to make (core split, XMF)?

Could we use the Pensoft XML to develop a minimal XML?

What is the effort to get a straight JATS import as described above?

@gsautter

> Yes, we can combine and make the best out of it, but this takes a lot of time.
>
> We assume that DCL JATS is good enough for our purpose, and the few OCR issues we can fix with GGe. They are the OCR specialists.

Sure, and I am not questioning that ... the idea is to OCR the page images, and not correct them at all, but simply use the OCR result (or whichever parts of it are correct) for matching to the XML, so basically use our own OCR as anchor points for matching and pinning the DCL XML to the page images, thereby adding positional information, etc., and then all but discard the original OCR result altogether.
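The anchoring idea can be sketched roughly as follows. This is only an illustration in Python using the standard library's `difflib`, not how GoldenGATE or the TB import actually does it; the function name `pin_transcript_to_page`, the word/bounding-box tuples, and the sample data are all made up for the example.

```python
from difflib import SequenceMatcher

def pin_transcript_to_page(ocr_words, transcript_words):
    """Attach OCR bounding boxes to transcript words (hypothetical helper).

    ocr_words: list of (text, bbox) pairs from our own OCR run.
    transcript_words: list of words taken from the DCL XML text.
    Returns a list of (transcript_word, bbox_or_None).
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=transcript_words, autojunk=False)

    # Start with every transcript word unpinned.
    pinned = [(word, None) for word in transcript_words]

    # Words where OCR and transcript agree exactly serve as anchor points.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            _, bbox = ocr_words[block.a + offset]
            word = transcript_words[block.b + offset]
            pinned[block.b + offset] = (word, bbox)
    return pinned

# Made-up sample: the OCR misread "Linn." as "L1nn.".
ocr = [
    ("Formica", (10, 10, 60, 20)),
    ("rufa", (65, 10, 90, 20)),
    ("L1nn.", (95, 10, 130, 20)),
    ("1761", (135, 10, 160, 20)),
]
transcript = ["Formica", "rufa", "Linn.", "1761"]
pinned = pin_transcript_to_page(ocr, transcript)
```

In this sketch the text always comes from the transcript, and the OCR contributes positions only. Words without an exact OCR match stay unpinned (`None`); a real implementation would presumably interpolate their position from the neighbouring anchors.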

> Why not make a simple version first so we can start processing, and then later add the more complex version that depends on a few major changes you need to make (core split, XMF)?

Frankly, by the time we insert all the detail markup into the JATS, most of the work is done either way ... and by not fitting to the page images, we'll basically simply create more legacy XML that we have to get back to at some point, synchronize UUIDs, etc.

> Could we use the Pensoft XML to develop a minimal XML?

Sounds like a sensible plan to me ... in order of importance, I'd want to have at least the following elements: paragraph, italics, bold, heading, section, pageBreak, caption. These elements should give us enough structure to base the semantic enhancement on.
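A quick way to check whether a given JATS file carries that minimal structure could look like the sketch below (Python, standard library only). The mapping from the element names above to JATS tags (`p`, `italic`, `bold`, `title`, `sec`, `caption`) is my assumption, and page breaks are deliberately left unmapped because the sketch does not assume how the DCL output would encode them. The function name and the sample XML are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed mapping from the minimal element list to JATS tag names.
# pageBreak is left unmapped (None): its encoding in DCL output is unknown here.
JATS_TAG_FOR = {
    "paragraph": "p",
    "italics": "italic",
    "bold": "bold",
    "heading": "title",
    "section": "sec",
    "caption": "caption",
    "pageBreak": None,
}

def count_minimal_elements(jats_xml):
    """Count occurrences of each minimal element in a JATS string.

    Returns a dict of element name -> count, or None where no tag is mapped.
    Assumes un-namespaced element names.
    """
    root = ET.fromstring(jats_xml)
    tag_counts = Counter(element.tag for element in root.iter())
    return {
        name: (tag_counts[tag] if tag is not None else None)
        for name, tag in JATS_TAG_FOR.items()
    }

# Invented sample document, not real DCL output.
SAMPLE = (
    "<article><body><sec><title>Formica rufa</title>"
    "<p>Found in <italic>Pinus</italic> forest. <bold>Type</bold> lost.</p>"
    "</sec></body></article>"
)
counts = count_minimal_elements(SAMPLE)
```

A count of zero for, say, `caption` on a document that visibly has figures would flag a file that needs a closer look before semantic enhancement.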

> What is the effort to get a straight JATS import as described above?

That depends on looking at a bunch of examples, what layout peculiarities they have, and what level of detail we want to take the semantic enhancement to ... "Order out of Chaos" took a sweet while, mainly because the target granularity wasn't clear, and also because there were a good few OCR errors in quite crucial places, namely in the punctuation around in-line treatment citations, which foiled automated tagging of the latter and required a ton of manual correction (including character guesswork, which would have been straightforward in the presence of the page images).

@lyubomirpenev

lyubomirpenev commented Sep 29, 2023 via email
