Add hOCR functionality #1006

Merged
merged 13 commits into from
Apr 19, 2024

joecorall
Member

GitHub Issue: (link)

Pulls the relevant hOCR functionality out of #983, which is a PR focused on TIFF width/height calculations

What does this Pull Request do?

A brief description of what the intended result of the PR will be and/or what
problem it solves.

What's new?

An in-depth description of the changes made by this PR. Technical details and
possible side effects.

  • Changes x feature such that y
  • Added x
  • Removed y
  • Does this change add any new dependencies?
  • Does this change require any other modifications to be made to the repository
    (i.e. Regeneration activity, etc.)?
  • Could this change impact execution of existing code?

How should this be tested?

A description of what steps someone could take to:

  • Reproduce the problem you are fixing (if applicable)
  • Test that the Pull Request does what is intended.
  • Please be as detailed as possible.
  • Good testing instructions help get your PR completed faster.

Documentation Status

  • Does this change existing behaviour that's currently documented?
  • Does this change require new pages or sections of documentation?
  • Who does this need to be documented for?
  • Associated documentation pull request(s): ___ or documentation issue ___

Additional Notes:

Any additional information that you think would be helpful when reviewing this
PR.

Interested parties

Tag (@ mention) interested parties or, if unsure, @Islandora/committers

@alxp

alxp commented Mar 13, 2024

Thanks for pulling this out. I was using that branch as kind of a junk drawer.

It took a while, but I finally figured out how to test this, including the search features provided by discoverygarden/islandora_hocr and born-digital-us/islandora_iiif_hocr:

To test

Check out the "solr-hocr" branch of Islandora-Devops/isle-dc

Run "make hocr_test"

You may need to provide a GitHub token so Composer can check out a custom branch.

In the new site, add a Repository Item content type object with model "Paged Content".

Then batch upload as many TIFFs or JP2s as you like, with Media Type of 'File', Model 'Page', Media Use 'Original File'.

Then when you go back to the parent book object you should see Mirador, with embedded hOCR text if your image has text on it.

Then click on the button to activate Mirador's sidebar. Then click 'Search'.

Enter a search term that can be found in the hOCR and press Enter.

Search results should appear in the sidebar.
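
To pick a search term that is known to be in the hOCR, one option is to list the words in a page's hOCR file first. The following is a minimal Python sketch and not part of this PR. It assumes the hOCR follows the common convention (as in Tesseract output) of wrapping each word in a span with class ocrx_word whose title carries a bbox property. The sample markup and the helper names parse_bbox and hocr_words are made up for illustration.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []       # list of (text, bbox) pairs
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._in_word:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._in_word = False


def hocr_words(hocr_html):
    parser = HocrWordParser()
    parser.feed(hocr_html)
    return parser.words


# Hypothetical sample markup, not taken from a real derivative.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1080 1920'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'>world</span>
  </span>
</div>
"""

for text, bbox in hocr_words(SAMPLE):
    print(text, bbox)
```

Any word printed this way should be a reasonable candidate to type into Mirador's search box for that page.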


@alxp alxp marked this pull request as ready for review March 20, 2024 11:51
@seth-shaw-asu seth-shaw-asu self-requested a review March 27, 2024 17:07
@Natkeeran

@alxp

Is this limited to File with jp2 and tiff? I'm wondering whether we can use this with Image and jpg as well. Is there any config we have to adjust to do that?

@alxp

alxp commented Apr 10, 2024

@Natkeeran I think if you create an Original File that is a JPEG, the hOCR derivative task should run. Right now the config just looks for Service File, so either that would need to be changed in the manifest view, or you would need to make sure there is a service file with the same size as the original. (I don't know if we ever got around to fixing the problems with an item having both Original File and Service File tags.)

@alxp

alxp commented Apr 17, 2024

Made 2 changes:

Added a null check that @seth-shaw-asu found was necessary.

Also fixed a problem with the ISLE-DC branch I noted above for testing this PR: it turns out you can't tell Composer to check out a branch on GitHub when using the drupal/ namespace.

So I just made "make hocr_test" delete the islandora repo and check it out from Git.

Previously it was unpredictable whether the testing branch of islandora or an older release tag was checked out.

@seth-shaw-asu
Member

seth-shaw-asu commented Apr 18, 2024

Went to test again and it appears to be getting the hOCR:

{
  "@type": "sc:Manifest",
  "@id": "/node/1/book-manifest",
  "label": "IIIF Manifest",
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "sequences": [
    {
      "@context": "http://iiif.io/api/presentation/2/context.json",
      "@id": "https://islandora.traefik.me/node/1/sequence/normal",
      "@type": "sc:Sequence",
      "canvases": [
        {
          "@id": "https://islandora.traefik.me/node/1/canvas/4",
          "@type": "sc:Canvas",
          "label": "7a9230aa-8-Service File.jpg",
          "height": 1920,
          "width": 1080,
          "images": [
            {
              "@id": "https://islandora.traefik.me/node/1/annotation/4",
              "@type": "oa:Annotation",
              "motivation": "sc:painting",
              "resource": {
                "@id": "https://islandora.traefik.me/cantaloupe/iiif/2/https%3A%2F%2Fislandora.traefik.me%2Fsites%2Fdefault%2Ffiles%2F2024-04%2F7a9230aa-8-Service%2520File.jpg/full/full/0/default.jpg",
                "@type": "dctypes:Image",
                "format": "image/jpeg",
                "height": 1920,
                "width": 1080,
                "service": {
                  "@id": "https://islandora.traefik.me/cantaloupe/iiif/2/https%3A%2F%2Fislandora.traefik.me%2Fsites%2Fdefault%2Ffiles%2F2024-04%2F7a9230aa-8-Service%2520File.jpg",
                  "@context": "http://iiif.io/api/image/2/context.json",
                  "profile": "http://iiif.io/api/image/2/profiles/level2.json"
                }
              },
              "on": "https://islandora.traefik.me/node/1/canvas/4"
            }
          ],
          "seeAlso": {
            "@id": "https://islandora.traefik.me/sites/default/files/hocr/2024-04/2-hOCR.html",
            "format": "text/vnd.hocr+html",
            "profile": "http://kba.cloud/hocr-spec",
            "label": "hOCR embedded text"
          }
        },
...
  ],
  "service": [
    {
      "@context": "http://iiif.io/api/search/0/context.json",
      "@id": "https://islandora.traefik.me/paged-content-search/1",
      "profile": "http://iiif.io/api/search/0/search",
      "label": "Search inside this work"
    }
  ]
} 
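
A quick way to confirm the same thing for every page is to walk the manifest and report which canvases carry an hOCR seeAlso. This is a minimal Python sketch under stated assumptions: the manifest is IIIF Presentation 2 shaped like the one above, the hOCR link is identified by the format "text/vnd.hocr+html", and the function name hocr_links is made up. The sample reuses canvas 4 from the manifest above and adds a hypothetical canvas 5 without hOCR for contrast.

```python
HOCR_FORMAT = "text/vnd.hocr+html"


def hocr_links(manifest):
    """Map each canvas @id to the @id of its hOCR seeAlso, or None."""
    links = {}
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            see_also = canvas.get("seeAlso") or []
            # Presentation 2 allows a single value or a list here.
            if isinstance(see_also, (dict, str)):
                see_also = [see_also]
            links[canvas["@id"]] = next(
                (
                    entry["@id"]
                    for entry in see_also
                    if isinstance(entry, dict)
                    and entry.get("format") == HOCR_FORMAT
                ),
                None,
            )
    return links


SAMPLE_MANIFEST = {
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {
                    "@id": "https://islandora.traefik.me/node/1/canvas/4",
                    "seeAlso": {
                        "@id": "https://islandora.traefik.me/sites/default/files/hocr/2024-04/2-hOCR.html",
                        "format": "text/vnd.hocr+html",
                    },
                },
                # Hypothetical canvas with no hOCR derivative.
                {"@id": "https://islandora.traefik.me/node/1/canvas/5"},
            ]
        }
    ],
}

for canvas_id, hocr_url in hocr_links(SAMPLE_MANIFEST).items():
    print(canvas_id, "->", hocr_url or "no hOCR")
```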

The search icon appears, but none of my searches show any results...

Screenshot 2024-04-18 163825

Side note: the text does appear in the regular Solr search, and I can see it when I open the hOCR files themselves, so the text is there; it is just not found by the Mirador search.
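
One way to narrow this down would be to query the search service directly, outside Mirador. The sketch below only builds the request URL from the manifest's service block. It assumes the endpoint follows the IIIF Content Search API convention of passing the term in a q query parameter; whether the Islandora endpoint accepts exactly that is an assumption, and the function name search_url is made up.

```python
from urllib.parse import urlencode


def search_url(manifest, term):
    """Build a Content Search request URL from a manifest, or None."""
    services = manifest.get("service") or []
    if isinstance(services, dict):
        services = [services]
    for service in services:
        # Match on the profile so other service types are skipped.
        if "iiif.io/api/search/" in service.get("profile", ""):
            return service["@id"] + "?" + urlencode({"q": term})
    return None


# Service block copied from the manifest above.
MANIFEST = {
    "service": [
        {
            "@context": "http://iiif.io/api/search/0/context.json",
            "@id": "https://islandora.traefik.me/paged-content-search/1",
            "profile": "http://iiif.io/api/search/0/search",
            "label": "Search inside this work",
        }
    ]
}

print(search_url(MANIFEST, "hello world"))
```

Opening the resulting URL in a browser or with curl shows what the endpoint returns for a term; if the response itself has no hits, the problem is likely on the indexing side rather than in Mirador.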

@alxp

alxp commented Apr 19, 2024

Hi @seth-shaw-asu, thanks for testing. I'll look into why you might be having problems with searching, but since this PR is just about getting the hOCR into the manifest, I am happy to see that working.

@seth-shaw-asu
Member

It is searching, just not finding any results:
Screenshot 2024-04-19 075049

Member

@seth-shaw-asu seth-shaw-asu left a comment

:shipit:

@seth-shaw-asu seth-shaw-asu merged commit 9b26616 into 2.x Apr 19, 2024
24 checks passed
@joecorall joecorall deleted the hocr branch April 19, 2024 15:24
6 participants