-
The easiest solution for you would be to disable the Google Scholar crawler during local debugging and only re-enable it when pushing to GitHub. Otherwise, with your approach the crawler would need to run at regular intervals, or be run manually, to fetch new content. It might be more useful to create a GitHub Action for that, along the lines of the sketch below.
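As a rough illustration of what such a scheduled or manually triggered run could execute, here is a minimal sketch of a refresh script using the `scholarly` package. The script name, the `GOOGLE_SCHOLAR_ID` environment variable, the output file name, and the JSON layout are all assumptions for this example, not part of any existing crawler.

```python
# google_scholar_refresh.py (hypothetical name)
# Fetch all citation counts in one pass and cache them to a JSON file,
# so the site build can read the cache instead of querying Google Scholar.
import json
import os
from datetime import datetime

from scholarly import scholarly

# Placeholder: the author's Google Scholar profile ID.
AUTHOR_ID = os.environ.get("GOOGLE_SCHOLAR_ID", "YOUR_AUTHOR_ID")


def fetch_citation_data(author_id: str) -> dict:
    """Fetch the author's publication list once and collect citation counts."""
    author = scholarly.search_author_id(author_id)
    scholarly.fill(author, sections=["basics", "publications"])
    data = {
        "updated": datetime.utcnow().isoformat(),
        "publications": {},
    }
    for pub in author["publications"]:
        # author_pub_id looks like "AUTHORID:PAPERID"; use it as the lookup key.
        data["publications"][pub["author_pub_id"]] = {
            "title": pub["bib"].get("title", ""),
            "num_citations": pub.get("num_citations", 0),
        }
    return data


if __name__ == "__main__":
    with open("gs_data.json", "w", encoding="utf-8") as f:
        json.dump(fetch_citation_data(AUTHOR_ID), f, indent=2, ensure_ascii=False)
```

A scheduled workflow could run this script, say, once a day and commit the resulting JSON to a dedicated branch, so local rebuilds never have to talk to Google Scholar at all.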
-
Hi, thanks to the author and the contributing community members for this amazing template!
While I am debugging the website locally, each file change triggers a rebuild, and every rebuild queries Google Scholar to update the citation counts. With many papers and heavy debugging, it is very easy to exceed the rate limit and hit HTTP 429 "Too Many Requests", which slows down the local build and, more importantly, seems to block GitHub Pages auto-deployment. It is unclear how long the 429 block persists, but it can last for days.
Before I switched to al-folio, I used academicpages for my personal website. On that site I implemented a Google Scholar crawler that caches the Scholar info in another branch as .json files. With this method the build is faster, since the crawler fetches all the GS info in one go rather than querying each paper individually, and it never triggers HTTP 429 "Too Many Requests". The crawler originally comes from the Jekyll theme AcadHomepage; I added a few improvements of my own and got it working on academicpages. Unfortunately, it does not work with al-folio: when I add it, the local debug build freezes partway through with no error message.
My modified version of this decoupled Google Scholar crawler is here. I wonder whether this small project could be useful for al-folio, since users may already have the Google Scholar paper ID in their bibtex. All the integration would need to do is look up that paper ID in the gs_data.json database on the crawler's sub-branch, instead of sending queries to Google too frequently; a rough sketch of that lookup is below. The crawler is also optimized for internet conditions in China and grabs its info from a mirrored source. I am no expert in Jekyll coding and cannot implement this alone, so I am opening this discussion to propose a potential solution and see if you are interested in avoiding the very annoying "too many requests" error.
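To make the lookup idea concrete, here is a minimal sketch in Python, assuming the cached file has the shape produced by the refresh sketch above. The file name, key layout, and example paper ID are assumptions; in the theme itself this lookup would more likely live in the Liquid/JavaScript layer than in Python.

```python
# Look up a cached citation count by the Google Scholar paper ID already
# present in the bibtex entry, instead of querying Google on every rebuild.
import json
from typing import Optional


def cached_citations(gs_data_path: str, author_pub_id: str) -> Optional[int]:
    """Return the cached citation count for a paper ID, or None if absent."""
    with open(gs_data_path, encoding="utf-8") as f:
        gs_data = json.load(f)
    entry = gs_data.get("publications", {}).get(author_pub_id)
    return entry["num_citations"] if entry else None


# Example usage with a hypothetical paper ID:
# print(cached_citations("gs_data.json", "YOUR_AUTHOR_ID:PAPER_ID"))
```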