Create sitemap publishing and versioning control #193

Open
ksonda opened this issue Jul 10, 2023 · 2 comments

@ksonda
Member

ksonda commented Jul 10, 2023

Ideally, harvest.geoconnex.us will automatically recrawl all new or modified resources added to the PID registry.

harvest.geoconnex.us uses sitemap.xml to crawl resources.

Therefore, we need a way to process diffs between versions of sitemap.xml, driven by PID additions or by other recrawls triggered by data contributors.

Suggestion: publish releases of zipped sitemap_XXX.xml files, so that harvest.geoconnex.us can download the last release and compare it with the contents of the sitemap_XXX.xml files that the live sitemap index at https://geoconnex.us/sitemap.xml points to.
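A minimal sketch of the comparison step described above, assuming both the released and the live sitemap_XXX.xml files are standard sitemaps.org urlset documents. The function names `parse_urlset` and `diff_urlsets` are hypothetical and not part of any existing geoconnex tooling; downloading and unzipping the release is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_urlset(xml_text):
    """Return {loc: lastmod or None} for every <url> entry in a urlset."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            entries[loc.strip()] = lastmod.strip() if lastmod else None
    return entries


def diff_urlsets(released_xml, live_xml):
    """Compare the last released urlset with the live one.

    Returns the URLs that were added, removed, or whose lastmod changed.
    """
    old = parse_urlset(released_xml)
    new = parse_urlset(live_xml)
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(loc for loc in shared if old[loc] != new[loc]),
    }
```

Under this sketch, the harvester would only need to recrawl the `added` and `modified` URLs. The `modified` list is only as reliable as the lastmod values in the sitemaps.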

Suggestion: change how sitemap.xml is generated so that lastmod reflects the true last-modified datetime of each csv file in /namespaces.
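One way this could look, as a sketch only: build the sitemap index so that each entry's lastmod comes from the source file behind that sitemap. The names `file_lastmod` and `build_sitemap_index`, and the mapping from sitemap URL to source path, are assumptions for illustration. File modification time is used as the default, but in a fresh git checkout that reflects checkout time rather than commit time, so a GitHub Action would likely pass a different `lastmod_for` callable (for example one that reads the last commit date for the path).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def file_lastmod(path):
    """Modification time of a file as a W3C datetime string in UTC."""
    mtime = Path(path).stat().st_mtime
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_sitemap_index(sources, lastmod_for=file_lastmod):
    """Build a sitemap index document as a string.

    sources maps each sitemap URL to the local source file (csv or xml)
    that its urlset is generated from. lastmod_for turns a source path
    into the lastmod string for that entry.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for sitemap_url, source_path in sorted(sources.items()):
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
        ET.SubElement(entry, "lastmod").text = lastmod_for(source_path)
    return ET.tostring(index, encoding="unicode")
```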

@webb-ben
Member

I have a working version of the sitemap index generation in which lastmod reflects the last time an update was made to the source of the urlset (csv file or xml files). I am planning on modifying this to run as a standalone GitHub Action that can be implemented in pids.geoconnex.us.

Thinking about generating sitemaps for regex namespaces: an explicit truth of a regex namespace is that it has a lot of features. Having a URL to download a file from which the corresponding sitemap can be generated becomes a problem as the PID list grows. I suggest we invest more effort in the tooling to generate sitemap indexes and urlsets from arbitrary sources (as I am planning to implement in the GitHub Action), to encourage contributors to generate and include their own urlset files and so reduce the exponential growth this entails.

We also need a mechanism to regenerate only the sitemaps that have changed, instead of all sitemaps any time ANY namespace changes.
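A rough sketch of selective regeneration, assuming change detection by content hash against a manifest saved from the previous run. This is one possible technique, not the approach used in the working version mentioned above. The names `file_digest` and `changed_sources`, the manifest file, and the choice to scan only `*.csv` files are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_sources(namespace_dir, manifest_path):
    """Find source files that changed since the previous run.

    Returns (changed, removed): relative paths of csv files that are new
    or have different contents, and paths that no longer exist. The
    manifest is rewritten with the current digests before returning.
    """
    manifest_file = Path(manifest_path)
    previous = {}
    if manifest_file.exists():
        previous = json.loads(manifest_file.read_text())

    current = {}
    for path in sorted(Path(namespace_dir).rglob("*.csv")):
        name = path.relative_to(namespace_dir).as_posix()
        current[name] = file_digest(path)

    changed = [n for n, digest in current.items() if previous.get(n) != digest]
    removed = [n for n in previous if n not in current]
    manifest_file.write_text(json.dumps(current, indent=2, sort_keys=True))
    return changed, removed
```

Only the sitemaps whose sources appear in `changed` would need to be rebuilt, and sitemaps for `removed` sources could be dropped from the index.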

@ksonda
Member Author

ksonda commented Jul 24, 2023

As per meeting, proposed strategy:

Use a GitHub Action / pygeoapi container to generate sitemaps from

a) a zipped csv template PR'd directly to GitHub, or
b) ESRI/CKAN/Socrata/any remote geospatial file with the URIs in the data, plus the attribute name for the URIs
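To illustrate option b), here is a sketch that assumes the remote geospatial file has already been fetched and parsed as a GeoJSON FeatureCollection, and that the contributor has supplied the attribute name holding each URI. The function name `urlsets_from_features` is hypothetical. Output is split into several documents because the sitemaps.org protocol caps a single sitemap file at 50,000 URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def urlsets_from_features(feature_collection, uri_attribute, max_urls=50000):
    """Build urlset documents from the URIs stored in a feature attribute.

    Features without the attribute and duplicate URIs are skipped.
    Returns a list of urlset XML strings, each with at most max_urls URLs.
    """
    uris = []
    seen = set()
    for feature in feature_collection.get("features", []):
        uri = (feature.get("properties") or {}).get(uri_attribute)
        if uri and uri not in seen:
            seen.add(uri)
            uris.append(str(uri))

    documents = []
    for start in range(0, len(uris), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
        for uri in uris[start:start + max_urls]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = uri
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents
```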

The user's decision tree is:

  1. (<300,000 sites) Submit a regular csv
  2. (between 300,000 and 2,000,000 sites) Submit multiple regular csvs with different filenames
  3. (>2,000,000 sites, OR can maintain a remote endpoint and doesn't want to interact with GitHub to update) Submit a regex csv (w/ endpoint + pygeoapi provider name + attribute id)
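The decision tree above can be written out as a small function. This is only an illustration: the function name and return labels are made up, and treating exactly 300,000 and exactly 2,000,000 sites as belonging to the middle branch is an assumption, since the thread does not say which side the boundaries fall on.

```python
def submission_route(site_count, prefers_remote_endpoint=False):
    """Pick a submission route from the proposed decision tree.

    prefers_remote_endpoint is True when the contributor can maintain a
    remote endpoint and does not want to update through GitHub.
    """
    if prefers_remote_endpoint or site_count > 2_000_000:
        return "regex csv"
    if site_count >= 300_000:  # boundary handling is an assumption
        return "multiple regular csv"
    return "regular csv"
```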

@webb-ben changed the title from "sitemap publishing and versioning" to "Create sitemap publishing and versioning control" on Jan 25, 2024