
[2.1], [3.0]: Google/Crawler Inefficiencies #8367

Open
sbulen opened this issue Dec 22, 2024 · 16 comments · May be fixed by #8382

Comments

@sbulen
Contributor

sbulen commented Dec 22, 2024

Basic Information

I'm sharing some observations here & a suggestion.

A couple notes to start...

  • Google does not honor crawl-delay; it will hit your forum as often as it wants, sometimes quite a lot...
  • Google does not honor nofollow, so when crawling a page it will follow all of the links on that page.
  • Google only indexes pages whose URLs follow the canonical form - for all SMF topics & messages, that is: https://forum.com/index.php?topic=######.##.
  • Google Search Console documents the excluded non-canonical URLs as duplicates, with the message: "Alternate page with proper canonical tag - These pages aren't indexed or served on Google".

I believe most other crawlers exhibit similar behavior. Some honor crawl-delay. I don't see anybody honoring nofollow. And from what I can see (e.g., Bing), they also only index canonicals.

Some notes on SMF...

  • For each message displayed on a page, SMF provides a link in the format https://forum.com/index.php?msg=######, with a nofollow.
  • When given a "msg=" format URL, SMF issues a 302 redirect to a URL of the form https://forum.com/index.php?topic=######.msg#######msg######.
  • The redirect works; the redirected request returns a status code of 200.
  • When the page is built, the canonical URL provided by SMF in the head is of the form https://forum.com/index.php?topic=######.##.

So... Putting this all together... For a forum that is configured to show 25 posts on a page, the sequence of events looks like this:

  • Let's assume Google starts a crawl with https://forum.com/index.php?topic=######.##. This is canonical form, so it is indexed.
  • This page has 25 messages on it, including one URL per message, each of the form https://forum.com/index.php?msg=######.
  • Google will then, 25 times:
    • Request each message URL provided: https://forum.com/index.php?msg=######
    • SMF says hold on, 302, you should really be using https://forum.com/index.php?topic=######.msg#######msg######
    • Google then requests the page https://forum.com/index.php?topic=######.msg#######msg######, but that page has a canonical of https://forum.com/index.php?topic=######.## in the head, so Google discards it. Note that this canonical URL matches the one requested in the very first step of this sequence.

So... There was one successful page load & index. However, there were 25 subsequent requests to SMF that resulted in redirects, and 25 more requests to SMF that were discarded...

Literally 50x the number of requests to the site for the actual content indexed.

Note that all the content was already properly indexed - all 25 messages were on the original https://forum.com/index.php?topic=######.## request. It's just requesting that same content an additional 50x.

Suggestions:

  • When building pages, don't emit the per-message links in the https://forum.com/index.php?msg=###### format; use the https://forum.com/index.php?topic=######.msg#######msg###### format instead, as that will avoid the 302s.
  • An alternative approach would be to not issue a 302 at all and just build the page.

I believe either of the above suggestions will reduce the # of crawler requests to the site by half.

I don't know how to get Google to honor nofollow... But they do honor robots.txt. Maybe site admins can disallow msg= via robots.txt. I don't believe there is any valid msg= canonical anywhere that would be indexed.
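For example, a minimal robots.txt rule along these lines (assuming the forum lives at the web root; adjust the path otherwise) should keep compliant crawlers away from the msg= form entirely:

    User-agent: *
    Disallow: /index.php?msg=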

Steps to reproduce

You can see all of the above in:

  • Your web access logs
  • Google Search Console
  • Google search results

Expected result

Closer to a 1:1 relationship between requests & indexed URLs

Actual result

Crawlers hit the site roughly 50 times more often than necessary.

Version/Git revision

3.0 alpha 2 & 2.1.4

Database Engine

All

Database Version

8.4

PHP Version

8.3.8

Logs

No response

Additional Information

No response

@Sesquipedalian
Member

Yeah, that's a problem, and I see that @live627 has already submitted a pair of PRs with a simple change to address it.

However, there is a reason why the ?msg=<id_msg> URLs exist. Since topics can be split or merged, the value for topic in ?topic=<id_topic>.msg<id_msg>#msg<id_msg> can change, causing previous links to that individual post to break. The ?msg=<id_msg> form ensures that a link to a specific post always resolves to the correct canonical URL.

In light of that, I think we need a more nuanced solution to this problem.

My immediate thought is to use some conditional logic to serve the long form to robots and the short form to humans, but perhaps someone can think of a more clever idea...
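As a rough sketch of that idea (assuming SMF's existing $user_info['possibly_robot'] heuristic is reliable enough for this purpose; the variable names below are illustrative, not an actual patch):

    // Illustrative sketch only: choose the per-post link format based on
    // whether the visitor looks like a crawler.
    // $user_info['possibly_robot'] is SMF's existing heuristic;
    // $topic and $msg_id stand in for the real values at hand.
    if (!empty($user_info['possibly_robot']))
        // Long form for robots: resolves directly, no 302 needed.
        $post_href = $scripturl . '?topic=' . $topic . '.msg' . $msg_id . '#msg' . $msg_id;
    else
        // Short form for humans: survives topic splits and merges.
        $post_href = $scripturl . '?msg=' . $msg_id;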

@sbulen
Contributor Author

sbulen commented Dec 25, 2024

In that case, SMF should ignore the topic & just use the msg. I.e., continue to use the existing msg-oriented logic.

No need to get complex.

@Sesquipedalian
Member

Well, the ?topic=... URL is indeed the canonical form, since single posts are always shown within a topic. That's why we redirect from ?msg=... to ?topic=... in the first place.

Honestly, what should happen is that Google should respect nofollow. We're doing the correct thing, whereas they are not. But since we can't control what they do, we need to do something clever ourselves.

@Sesquipedalian
Member

Sesquipedalian commented Dec 25, 2024

Hm. I wonder whether it would help if we sent a 301 or 308 response code instead of a 302.
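In plain PHP terms (not SMF's actual redirect helper), the difference would just be the status code sent with the Location header; $canonical_url here is illustrative:

    // Hypothetical: send a permanent redirect (301, or 308 to preserve the
    // request method) instead of the default temporary 302.
    header('Location: ' . $canonical_url, true, 301);
    exit;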

@sbulen
Contributor Author

sbulen commented Dec 25, 2024

When we see topic=<id_topic>.msg<id_msg>#msg<id_msg>, we can just use the msg & ignore the topic. Kinda simple, I think.

@Sesquipedalian
Member

Sesquipedalian commented Dec 25, 2024

Not really. As I think about it, I don't think that using ?topic=<id_topic>.msg<id_msg>#msg<id_msg> would actually help with the original problem anyway. Search engines will ignore the fragment, but since the message ID is in the query string, each post will still have a unique URL that ultimately leads back to the same page. So the search engines will still follow all 25 links. It doesn't really matter what form the URLs are presented in. What matters is that the URLs contain the posts' ID values, yet all resolve to the same page in the end.

The only way to stop the search engines from making 25 pointless extra queries is to not include the post ID in the URL query params. We could use something like ?topic=<id_topic>.<page_start>#msg<id_msg>, since that would confine the post ID only to the fragment. But then we are back to the problem that any existing links to an individual post will break if its topic changes.

@Sesquipedalian
Member

Is it really such a problem if the search engine is making these unnecessary queries to the server? Yes, it's inefficient, but is it an inefficiency that makes a practical difference? Do we have evidence that this causes notable performance degradations or other issues?

@sbulen
Contributor Author

sbulen commented Dec 25, 2024

@Sesquipedalian
Member

Then we should use conditional logic to simply not show robots any links for individual posts, while still showing them to humans. Problem solved.
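In template terms, that could be as simple as wrapping the per-post link in the same kind of robot check (again assuming $user_info['possibly_robot'] is good enough; this is a sketch, not the actual Display template code):

    // Sketch: only emit the per-post permalink for human visitors.
    if (empty($user_info['possibly_robot']))
        echo '<a href="', $scripturl, '?msg=', $message['id'], '" rel="nofollow">', $message['subject'], '</a>';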

@sbulen
Contributor Author

sbulen commented Dec 25, 2024

Yep, that would work. The only (minor) hole is that not all bots identify themselves via their user agent. Google Ireland keeps crawling me with a plain browser user agent... (The GoogleOther user agent is a true PITA - I've blocked it outright...)

On high MySQL CPU days, I may see thousands of pairs of these entries in my access logs from Google:

[Screenshot "canonical-3": paired access log entries from Googlebot]

Using ugc instead of nofollow is also a consideration... Note their definition of what 'nofollow' means keeps changing:
https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify

Note that Google does honor disallows in robots.txt (though it ignores crawl-delay). I am experimenting with a disallow for PHPSESSID and msg= and early indications are that it's working great...

@sbulen
Contributor Author

sbulen commented Dec 25, 2024

And oh yeah, by the way, Mele Kalikimaka!

@live627
Contributor

live627 commented Dec 27, 2024

Shall we ship a basic robots.txt in the final built package with a few starter rules?

@Oldiesmann
Contributor

> When we see topic=<id_topic>.msg<id_msg>#msg<id_msg>, we can just use the msg & ignore the topic. Kinda simple, I think.

The reason we do "topic=X.msgY" is that it tells SMF to start on whichever page of the topic the specified post would appear on according to forum and/or user settings, since admins and users can choose how many posts to display per page. We need the topic ID in order to determine which posts to display, and there's no point in looking it up if we already have it in the URL - one less subquery to deal with.
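Roughly speaking, SMF only needs to count how many posts in the topic precede the requested message and round down to a page boundary; a simplified sketch (not the actual Display.php code) of why having the topic ID in the URL saves a lookup:

    // Simplified sketch: compute the page offset ('start') for a message
    // within its topic. Because id_topic is already in the URL, no extra
    // subquery is needed to find which topic the message belongs to.
    $request = $smcFunc['db_query']('', '
        SELECT COUNT(*)
        FROM {db_prefix}messages
        WHERE id_topic = {int:id_topic}
            AND id_msg < {int:id_msg}',
        [
            'id_topic' => $id_topic,
            'id_msg' => $id_msg,
        ]
    );
    list($posts_before) = $smcFunc['db_fetch_row']($request);
    $smcFunc['db_free_result']($request);

    // Round down to the nearest page boundary.
    $start = $posts_before - ($posts_before % $posts_per_page);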

@Sesquipedalian
Member

> Shall we ship a basic robots.txt in the final built package with a few starter rules?

We wouldn't want to risk overwriting any existing robots.txt file.

However, we could add some code that would try to append some rules to an existing robots.txt, possibly creating the file if necessary. This could only be a best effort attempt, though, because the file might be outside of the forum directory and we might not have file permissions, etc.
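As a sketch of what that best-effort attempt might look like (the location of robots.txt and the rule text here are assumptions for illustration, not what the eventual PR does):

    // Best-effort sketch: append crawler rules to robots.txt when possible.
    // $boarddir is assumed to be the document root; if robots.txt lives
    // elsewhere or isn't writable, this quietly does nothing.
    $robots_file = $boarddir . '/robots.txt';
    $rule = "\nUser-agent: *\nDisallow: /index.php?msg=\n";

    $existing = file_exists($robots_file) ? file_get_contents($robots_file) : '';

    $can_write = file_exists($robots_file)
        ? is_writable($robots_file)
        : is_writable(dirname($robots_file));

    // Only append if the rule isn't already present and we have permission.
    if ($can_write && strpos($existing, 'Disallow: /index.php?msg=') === false)
        file_put_contents($robots_file, $existing . $rule);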

@Oldiesmann
Contributor

We could ship one in the install package at least. For the upgrade package we could just name it something else like "smf-robots.txt" and tell users to rename it if they want to use it.

@Sesquipedalian
Member

Sesquipedalian commented Dec 30, 2024

#8382 provides SMF with the ability to add rules to robots.txt. Currently, it adds rules to tell all spiders that they should ignore URLs that match the pattern /path/to/index.php?msg= (where /path/to/index.php will be set to the appropriate value for the individual forum instance), as well as URLs containing PHPSESSID or ;topicseen.
