
Add caching to optimize calls to ->canIncludeInGoogleSitemap() #180

Open
patricknelson opened this issue Oct 3, 2021 · 1 comment


patricknelson (Contributor) commented Oct 3, 2021

This isn't a typical scenario, but: I manage a site with a little over 4.3k pages, and I found that with proper caching applied you can dramatically reduce server-side load. For example, some areas that could use improvement:

  • Caching the filtered results from calls to ->canIncludeInGoogleSitemap() somehow
  • Adding <% cached ... %> blocks for partial template caching to save the overhead of generating the XML
  • Incorporating sane defaults in cache-control headers (a minor benefit if you have a CDN and your sitemap.xml is popular)
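To illustrate the second point, here is a partial-caching sketch for the sitemap template. This is not the module's actual template; the block name and `$Items` variable are illustrative, and the key varies on the newest LastEdited plus the item count so any edit or addition busts the cache:

```
<% cached 'googlesitemap', $Items.max(LastEdited), $Items.count %>
    <% loop $Items %>
        <url>
            <loc>$AbsoluteLink</loc>
            <lastmod>$LastEdited.Rfc3339</lastmod>
        </url>
    <% end_loop %>
<% end_cached %>
```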

The performance boost is particularly noticeable if you have any logic inside ->canIncludeInGoogleSitemap() on your pages. Even if not, caching these calls can help a lot because of the overhead of going through the call stack to reach the extension, plus the cost of the extension's own implementation of that method. The extension's ->canIncludeInGoogleSitemap() then calls ->hasPublishedParent() recursively up the page hierarchy, which is an expensive operation when done 4.3k times (or, say, 500 times per page), particularly if the site content has a lot of organizational depth.

On my site, after some quick hacking, I found that I could reduce load time from ~4s on every request down to ~4s on the first hit, then ~300ms on subsequent loads, as long as no SiteTree instances listed in the current sitemap.xml page have been modified (since you want to spoil the cache if any modification could have impacted the result of that function call). This would be especially beneficial in a distributed cluster where the cache layer is backed by a distributed k/v store like Memcached or Redis.

Two main approaches:

  • My first approach was to cache per request: take the current paginated page number combined with the class being looked at, verify that the list of instances is the same as the last time we cached the output, and if so just reuse the cached output. This can be done very quickly by using ->column(...) on the list to pull a few fields with which to vary/spoil the cache.
  • Alternatively, you could cache the value of ->canIncludeInGoogleSitemap() on ->write() in a special column on the root SiteTree object itself, instead of calling it every time. The benefit is extremely fast performance, since you just filter on that column. The downside is that you lose any complicated side-effects from ->canIncludeInGoogleSitemap() that don't rely purely on the state of that DataObject (e.g. if it depends on the state of some other unrelated object or a global value). So doing this would be a breaking change, I'd imagine.
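A rough sketch of the first approach using SilverStripe 4 conventions (the cache name, renderSitemapChunk(), and the surrounding variables are hypothetical, not the module's API):

```php
use Psr\SimpleCache\CacheInterface;
use SilverStripe\Core\Injector\Injector;

// Sketch only: the key varies on the class, the paginated page number, and a
// digest of ID + LastEdited for every record in the list, so modifying any
// listed SiteTree record spoils the cached output automatically.
$cache = Injector::inst()->get(CacheInterface::class . '.googlesitemap');

$key = md5(implode('|', [
    $class,
    $pageNumber,
    implode(',', $list->column('ID')),         // cheap: one query, no object hydration
    implode(',', $list->column('LastEdited')),
]));

$xml = $cache->get($key);

if ($xml === null) {
    $xml = $this->renderSitemapChunk($list);   // hypothetical render step
    $cache->set($key, $xml);
}

return $xml;
```

Since ->column() skips object hydration entirely, the cache-hit path costs only the two column queries plus one cache read.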

Just some ideas. I would have submitted a PR, but my site is still stuck on SilverStripe 3.x, so there's no point. I'm also going to leave my modified version of the extension (mentioned in my testing above) out of production code, since we have a pretty performant cluster, send proper HTTP cache headers (via an extension to GoogleSitemapController), and have a CDN that will force clients to use the cached version (in case a bot spots this weakness and uses it to pound our server, for example).


wilr (Owner) commented Oct 3, 2021

@patricknelson happy to see any improvements you've got. If you want to email anything through, I can adapt for v4.

Even though caching logic can be quite site-specific, for this usage we can rely on the LastEdited timestamp on the record 95% of the time. It would need some logical APIs to override whatever the system does out of the box, but it seems an easy improvement.
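For example, a key derived from the newest LastEdited in the list would invalidate automatically on any edit (illustrative snippet only; variable names are made up):

```php
// One cheap aggregate query yields a cache key that changes whenever
// any record in the list is modified, added, or removed.
$key = md5($class . '|' . $page . '|' . $list->max('LastEdited') . '|' . $list->count());
```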

Not sure if you're running an old version of this module, but everything is paginated to a maximum of X records per view to prevent overwhelming the server (https://github.com/wilr/silverstripe-googlesitemaps/blob/main/src/GoogleSitemap.php#L51). If your can* check is slow, then set that to a smaller figure (e.g. 100) to reduce load.
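For reference, that limit is configurable via YAML (class name per the module's v4 namespace; check the property against your installed version):

```yaml
Wilr\GoogleSitemaps\GoogleSitemap:
  objects_per_sitemap: 100
```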
