New gbifIDs issued for unchanged occurrence records #1105

Open
dshorthouse opened this issue Nov 28, 2024 · 4 comments


@dshorthouse

dshorthouse commented Nov 28, 2024

For some BioCASE-based datasets that lack source occurrenceIDs such as https://www.gbif.org/dataset/e88fdc62-ea3b-41a3-8967-b3780f602c0b, I've seen re-issuance of new gbifIDs upon recent re-harvest but apparently no significant/relevant new adjustments or enhancements were made to records in their source. For instance, this occurrence record is deprecated https://www.gbif.org/occurrence/1946076733 whereas it has been issued a new gbifID, https://www.gbif.org/occurrence/4981443192. In other instances, it looks like the interpretation of occurrenceID and occurrenceStatus has been mistakenly swapped, resulting in an issuance of new gbifIDs. See https://www.gbif.org/occurrence/144842729 and what appears to be its replacement https://www.gbif.org/occurrence/4979384474 by way of example.

@dshorthouse
Author

Ping @MattBlissett. If there's a relatively quick fix & reharvest because issues are discovered to be a result of processing datasetKey e88fdc62-ea3b-41a3-8967-b3780f602c0b and 85714c48-f762-11e1-a439-00145eb45e9a & not originating at their sources, I'll hold off on a Bionomia refresh. Otherwise, I'll proceed and swallow the ~20k broken links.

@timrobertson100
Member

Hi @dshorthouse

I can't diagnose this fully tonight, but I looked quickly at the last example. It looks like something might have changed at the source and produced a broken archive.

If you look at the DwC-A for the dataset, you will find that the CSV headers and the metadata contradict each other (occurrenceStatus and occurrenceID are swapped).

In the CSV the first columns read:

"catalogNumber","institutionCode","collectionCode","basisOfRecord","occurrenceStatus","occurrenceID"

but in the meta.xml:

        <files>
            <location>occurrence.txt</location>
        </files>
        <id index="0" />
        <!-- Occurrence fields -->
        <field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
        <field index="1" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
        <field index="2" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
        <field index="3" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
        <field index="4" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
        <field index="5" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>

Because the meta.xml exists and defines the schema, the column header is ignored, so the values in those two columns are read as each other's term.
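A mismatch like this can be caught mechanically by comparing the header row against the field indexes in meta.xml. The following is a minimal sketch rather than anything GBIF or BioCASe actually runs: the function name `header_mismatches` is hypothetical, the header and field lines are copied from the excerpts above, and the fragment is wrapped in a made-up `<core>` element so it parses on its own. Tag names are compared by local name in case the file declares a namespace.

```python
import csv
import io
import xml.etree.ElementTree as ET


def header_mismatches(header_line, meta_fields_xml):
    """Return (index, header_name, meta_term_name) for columns that disagree."""
    header = next(csv.reader(io.StringIO(header_line)))
    # Wrap the fragment in a root element so it is well-formed by itself.
    root = ET.fromstring("<core>" + meta_fields_xml + "</core>")
    mismatches = []
    for element in root.iter():
        # Compare on the local tag name so a namespaced file also works.
        if element.tag.rsplit("}", 1)[-1] != "field":
            continue
        index = int(element.get("index"))
        term_name = element.get("term").rsplit("/", 1)[-1]
        if index < len(header) and header[index] != term_name:
            mismatches.append((index, header[index], term_name))
    return sorted(mismatches)


# Header row and field declarations as quoted in this thread.
HEADER = ('"catalogNumber","institutionCode","collectionCode",'
          '"basisOfRecord","occurrenceStatus","occurrenceID"')
META = """
<field index="0" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
<field index="2" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
<field index="3" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
<field index="4" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
<field index="5" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>
"""

for index, in_header, in_meta in header_mismatches(HEADER, META):
    print(f"column {index}: header says {in_header}, meta.xml says {in_meta}")
```

Run against the excerpts above, this reports columns 4 and 5 as the only disagreements.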

Could this explain other anomalies you see please?

(Pinging @jholetschek too)

@dshorthouse
Author

dshorthouse commented Nov 28, 2024

Thanks @timrobertson100 for the prompt response. What you discovered appears to be the source of some of the new gbifIDs, though I'm confused by the fact that some gbifIDs have evidently remained the same before and after the mistaken swap of the terms in the meta.xml file, such as https://www.gbif.org/occurrence/165319672. Perhaps incomplete harvesting, or a logic fallback to the triplet, stuck the landing in some instances but not in others?
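To make the triplet-fallback hypothesis concrete, here is a toy sketch. It is not GBIF's actual identifier logic: the function names, the lookup order, and every identifier and value in it are invented for illustration. It assumes a lookup that tries occurrenceID first, then the (institutionCode, collectionCode, catalogNumber) triplet, and mints a new ID only when both miss.

```python
import itertools


def triplet(record):
    """Build the (institutionCode, collectionCode, catalogNumber) key."""
    return (record["institutionCode"], record["collectionCode"],
            record["catalogNumber"])


def resolve_id(record, ids_by_occurrence_id, ids_by_triplet, mint):
    """Hypothetical lookup: occurrenceID, then triplet, else mint a new ID."""
    occurrence_id = record.get("occurrenceID")
    if occurrence_id in ids_by_occurrence_id:
        return ids_by_occurrence_id[occurrence_id]
    if triplet(record) in ids_by_triplet:
        return ids_by_triplet[triplet(record)]
    return mint()


# Made-up state from an earlier crawl: both records are known by
# occurrenceID, but only the first was also registered under its triplet.
ids_by_occurrence_id = {"urn:example:A": 101, "urn:example:B": 102}
ids_by_triplet = {("INST", "COLL", "A-1"): 101}

counter = itertools.count(9001)

# After the swap, the occurrenceID column carries occurrenceStatus values.
record_a = {"institutionCode": "INST", "collectionCode": "COLL",
            "catalogNumber": "A-1", "occurrenceID": "present"}
record_b = {"institutionCode": "INST", "collectionCode": "COLL",
            "catalogNumber": "B-2", "occurrenceID": "present"}

for record in (record_a, record_b):
    resolved = resolve_id(record, ids_by_occurrence_id, ids_by_triplet,
                          lambda: next(counter))
    print(record["catalogNumber"], "->", resolved)
```

Under these assumptions the first record keeps its old ID through the triplet, while the second gets a freshly minted one, which would match the mixed outcome described above.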

@jholetschek

Hi Tim, David,

Yes, there was a bug in the DwC archive generation in BioCASe, which caused the entries occurrenceStatus and occurrenceID in meta.xml to be swapped. It has been fixed, but that was an old archive. The DwC-A shouldn't have been used, though; maybe a metadata update caused this endpoint to pop up again. I removed it, so that ABCD gets used again.

But the other dataset (https://www.gbif.org/dataset/e88fdc62-ea3b-41a3-8967-b3780f602c0b) doesn't have a DwC endpoint, and its contents haven't changed in years; it's a static dataset.

Cheers
Jörg
