
Flexible fusion backend #813

Open

c-poley opened this issue Nov 1, 2024 · 3 comments

c-poley commented Nov 1, 2024

With Annif it is possible to use several specialised models for prediction in an ensemble. However, all models in an Annif ensemble can only be given one and the same text for prediction; it is not possible to pass a different kind of text to each individual model.
Currently, the only way to adapt the text for prediction is the transform parameter, which lets us read either a limited number of characters from the beginning or the whole text. A parameter that allowed a specific range (from character x to character y) to be read from the text would give us an additional way to tailor/cut down the text for specific models.

Could the Annif ensemble functionality be extended in such a way that the individual models of an ensemble could be given different kinds of text (expressions of a document) for processing?

Another way to make the ensemble functionality more flexible would be the use of subsets of vocabularies for individual models in ensembles, as discussed in issue #596.

In any case, the interface for predictions needs enhancements. In the following we describe some ideas we already discussed a few weeks ago:

  1. Allow text to be submitted as structured data:
    Using JSON:
    -d '{"headline": "Wonderful", "fulltext": "Oh, what a wonderful world"}'
    or using XML tags:
    -d '<headline>Wonderful</headline><fulltext>Oh, what a wonderful world</fulltext>'
    ... with the possibility to define in projects.cfg which tags are used, and how, like:
    submitted_text=headline,' ',fulltext
    The quoted space between "headline" and "fulltext" defines the character(s) used to glue the parts of the submitted text together.
    submitted_text=headline,'.',toc
    Here a headline and a toc are submitted, joined with a ".".

  2. A step towards fusion would be an enhancement of the limit parameter, allowing us to select which part of the submitted text is used (a sketch follows below this list).
    In projects.cfg this only needs a small extension that defines the starting point and the number of characters to process:
    transform=limit(500,2000)
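
To make the intended semantics concrete, here is a minimal sketch of the proposed two-argument limit (illustrative only, not Annif's actual transform implementation):

    # Sketch of the proposed limit(start, length) enhancement.
    # limit(n) keeps the first n characters (current behaviour);
    # limit(start, length) keeps `length` characters from `start`.
    def limit_transform(text: str, a: int, b: int | None = None) -> str:
        if b is None:
            return text[:a]        # current single-argument form: limit(n)
        return text[a:a + b]       # proposed form: limit(start, length)

    fulltext = "x" * 10_000
    assert limit_transform(fulltext, 500) == fulltext[:500]            # limit(500)
    assert limit_transform(fulltext, 500, 2000) == fulltext[500:2500]  # limit(500,2000)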

We (and, we think, the whole community) would really benefit from the implementation of a fusion with freely configurable structured data as in (1). We have to admit that the use of structured data would be our favourite and the cleanest implementation.
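
As an illustration, the submitted_text idea from (1) could be applied to a structured document roughly like this (a hypothetical sketch; none of these names are existing Annif code):

    import json

    # Glue the configured document parts into one prediction text.
    # Quoted tokens in the setting act as the "text glue" between parts.
    def assemble_text(document: dict, submitted_text: str) -> str:
        pieces = []
        for token in (t.strip() for t in submitted_text.split(",")):
            if token.startswith("'") and token.endswith("'"):
                pieces.append(token[1:-1])              # literal glue, e.g. ' ' or '.'
            else:
                pieces.append(document.get(token, ""))  # named document part
        return "".join(pieces)

    doc = json.loads('{"headline": "Wonderful", "fulltext": "Oh, what a wonderful world"}')
    print(assemble_text(doc, "headline,' ',fulltext"))
    # -> Wonderful Oh, what a wonderful world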

Best regards,
Christoph, Frank, Jan-Helge and Sandro from the German National Library

juhoinkinen (Member) commented

A third possible approach to passing different variants/subsets of a full text to different backends, a kind of fusion of approaches (1) and (2):

  3. Add a new select(<tag>) transform, which would retain only the input text between the given tags. The tags themselves should be removed when transforming the text.
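
A minimal sketch of what such a select(<tag>) transform could do (hypothetical, not existing Annif code):

    import re

    # Keep only the text between the given tags; the tags themselves are dropped.
    def select_transform(text: str, tag: str) -> str:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        return " ".join(part.strip() for part in pattern.findall(text))

    text = "<headline>Wonderful</headline><fulltext>Oh, what a wonderful world</fulltext>"
    print(select_transform(text, "headline"))  # -> Wonderful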

@c-poley et al., how would you identify the different parts of the text? Would you use some existing metadata of a document to get e.g. the headline, or would a user manually input it separately in your workflow when inputting the text/file?

I started to wonder whether there could also be a transform that detects and tags particular parts of texts, e.g. title, abstract, TOC, authors, publishers, etc. (it could be advantageous to deselect the authors and publishers and remove them from the text when performing subject indexing, so there could be a deselect(<tag>) transform too).
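
The complementary deselect(<tag>) could then be sketched as (again hypothetical):

    import re

    # Remove all <tag>...</tag> spans, including the tags themselves.
    def deselect_transform(text: str, tag: str) -> str:
        return re.sub(rf"<{tag}>.*?</{tag}>", "", text, flags=re.DOTALL)

    text = "<authors>J. Doe</authors><fulltext>Oh, what a wonderful world</fulltext>"
    print(deselect_transform(text, "authors"))
    # -> <fulltext>Oh, what a wonderful world</fulltext>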

c-poley (Author) commented Nov 1, 2024

Well, the third possibility can also fulfil our requirements. What may be important is how the text parts are connected; we call this "text glue". Otherwise it becomes our own job to append an extra character to the end of the headline, or the like. For our purposes, information such as "headline", "toc", "blurb" or "fulltext" is available separately.

But the idea mentioned at the end of your answer could get very interesting. With such a feature, Annif would move from a toolbox for automatic suggestions to a toolbox that makes it possible to identify structures in plain text with the help of algorithms and perhaps AI magic. Maybe it would become a research feature, because low-error structure extraction needs a lot of knowledge about the plain text (or what we think is text). One of my colleagues is in the process of looking more closely at text quality. Maybe it helps to get better suggestions, and maybe we get some side effects.

juhoinkinen (Member) commented Nov 5, 2024

Maybe I got carried away with the text-part identification; there are dedicated packages for this, and an Annif workflow could simply use such software via its API to analyse the input documents.

Annif itself could then use either approach (1) or (3) to pass different parts of the texts to different backends (or do some cleaning of the text, like removing the mentioned authors).
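
Such a workflow could look roughly like this, with a hypothetical analyze() standing in for the external layout-analysis tool and Annif's REST suggest endpoint receiving only the chosen part (a sketch; the URL and project id are placeholders):

    import requests

    def analyze(pdf_path: str) -> dict:
        # Hypothetical stand-in for the output of an external layout-analysis
        # tool: a mapping from detected text parts to their contents.
        return {"title": "...", "abstract": "...", "authors": "..."}

    parts = analyze("document.pdf")
    response = requests.post(
        "http://localhost:5000/v1/projects/my-project/suggest",
        data={"text": parts["abstract"], "limit": 10},
    )
    print(response.json()["results"])  # suggested subjects for the abstract only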

Some possibly useful software for such PDF layout analysis are:

  • MinerU
  • PaperMage:
    • Aimed for scientific publications.
    • Can detect many parts of the text, importantly the abstract.
    • A demo here.
  • Docling
    • Supports many formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown)
    • Detects quite many parts, like title, headers and footers etc.
  • Surya:
    • An OCR toolkit with layout analysis.
    • Cannot identify as many parts as PaperMage, but still title, section and page headers.
    • A demo here.
  • Parsr
    • Detects headings, tables, lists, table of contents, page numbers, headers/footers, links.
  • LayoutParser
    • Apparently meant for documents in image format, not for PDFs?
    • The detected text parts apparently depend on the used model.
    • A demo in a documentation.

Anyway, the benefit of passing different variants of a document text to different backends should be evaluated before investing too much time in the implementation. It would be good if someone could experiment with this!
