[2.1], [3.0]: Google/Crawler Inefficiencies #8367
Yeah, that's a problem, and I see that @live627 has already submitted a pair of PRs with a simple change to address it. However, there is a reason why the `topic=######.msg######` form exists. In light of that, I think we need a more nuanced solution to this problem. My immediate thought is to use some conditional logic to serve the long form to robots and the short form to humans, but perhaps someone can think of a more clever idea...
In that case, SMF should ignore the topic & just use the msg. I.e., continue to use the existing msg-oriented logic. No need to get complex.
Well... Honestly, what should happen is that Google should respect `nofollow`.
Hm. I wonder whether it would help if we sent a 301 or 308 response code instead of a 302.
When we see `topic=<id_topic>.msg<id_msg>#msg<id_msg>`, we can just use the msg & ignore the topic. Kinda simple, I think.
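A minimal sketch of that parsing idea (the function name is hypothetical, not SMF's actual code; note the `#msg<id_msg>` fragment never reaches the server, so the server only sees `topic=<id_topic>.msg<id_msg>`):

```python
import re

def extract_msg_id(topic_param: str):
    """If the topic parameter carries a .msg<id> suffix, return that
    message ID so the topic part can be ignored; otherwise return None.
    Hypothetical helper for illustration only."""
    m = re.fullmatch(r'\d+\.msg(\d+)', topic_param)
    return int(m.group(1)) if m else None

# 'topic=12345.msg67890' -> use message 67890, ignore topic 12345
print(extract_msg_id('12345.msg67890'))  # 67890
# A plain page offset like 'topic=12345.0' carries no message ID
print(extract_msg_id('12345.0'))         # None
```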
Not really. As I think about it, I don't think that using a 301 or 308 would help. The only way to stop the search engines from making 25 pointless extra queries is to not include the post ID in the URL query params. We could use something like …
Is it really such a problem if the search engine is making these unnecessary queries to the server? Yes, it's inefficient, but is it an inefficiency that makes a practical difference? Do we have evidence that this causes notable performance degradations or other issues?
Then we should use conditional logic to simply not show robots any links for individual posts, while still showing them to humans. Problem solved.
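As a rough illustration of that conditional logic (the bot-token list and function names are hypothetical; SMF has its own spider-detection machinery), the idea is to emit the per-post link only for human visitors:

```python
# Hypothetical sketch: suppress per-post links for known crawlers.
# The token list is illustrative; real spider detection is more involved,
# and (as noted below) some bots crawl with plain browser useragents.
KNOWN_BOT_TOKENS = ('googlebot', 'bingbot', 'yandex', 'duckduckbot')

def is_probable_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

def post_link(base_url: str, id_msg: int, user_agent: str):
    # Humans get the short per-post link; robots get no per-post link at all.
    if is_probable_bot(user_agent):
        return None
    return f'{base_url}/index.php?msg={id_msg}'

print(post_link('https://forum.com', 67890,
                'Mozilla/5.0 (compatible; Googlebot/2.1)'))  # None
print(post_link('https://forum.com', 67890,
                'Mozilla/5.0 (Windows NT 10.0)'))
```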
Yep, that would work. The only (minor) hole in that is that not all bots identify themselves with useragents. Google Ireland keeps crawling me with a plain browser useragent... (The GoogleOther useragent is a true PITA - I've blocked it outright...)

On high MySQL CPU days, I may see thousands of pairs of these entries in my access logs from Google: …

Using ugc instead of nofollow is also a consideration... Note their definition of what 'nofollow' means keeps changing: …

Note that Google does honor disallows in robots.txt (though it ignores crawl-delay). I am experimenting with a disallow for …
And oh yeah, by the way, Mele Kalikimaka! |
Shall we ship a basic robots.txt in the final built package with a few starter rules? |
The reason we do `topic=X.msgY` is that it tells SMF to start on whichever page of the topic the specified post would appear on according to forum and/or user settings, since admins and users can choose how many posts to display per page. We need the topic ID in order to determine which posts to display, and there's no point in looking it up if we already have it in the URL - one less subquery to deal with.
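The page-start calculation being described can be sketched like this (a simplified model assuming a zero-based post offset within the topic; SMF's real query logic differs):

```python
def topic_start_offset(post_position: int, posts_per_page: int) -> int:
    """Return the .## start offset of the page containing the given post.

    post_position is the zero-based position of the post within its topic.
    Simplified model of how 'topic=X.msgY' resolves to a page start."""
    return (post_position // posts_per_page) * posts_per_page

# With 25 posts per page, the 30th post (position 29) lands on page 2,
# which starts at offset 25 -> topic=X.25
print(topic_start_offset(29, 25))  # 25
print(topic_start_offset(10, 25))  # 0
```

This is why the per-page setting matters: the same message ID maps to a different `.##` start offset depending on how many posts each viewer shows per page.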
We wouldn't want to risk overwriting any existing robots.txt file. However, we could add some code that would try to append some rules to an existing robots.txt, possibly creating the file if necessary. This could only be a best effort attempt, though, because the file might be outside of the forum directory and we might not have file permissions, etc. |
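A best-effort append along those lines might look like this (the path, rule text, and function name are all illustrative, not SMF code; a real implementation would need to locate the web root and handle more failure modes):

```python
import os

# Illustrative rule block; the actual rules SMF would add may differ.
RULES = "\n# Added by SMF (sketch)\nUser-agent: *\nDisallow: /index.php?msg=\n"

def append_robots_rules(robots_path: str) -> bool:
    """Try to append rules to robots.txt, creating the file if needed.
    Returns False instead of raising when the file is out of reach."""
    try:
        existing = ''
        if os.path.exists(robots_path):
            with open(robots_path, 'r', encoding='utf-8') as f:
                existing = f.read()
        if 'Disallow: /index.php?msg=' in existing:
            return True  # rules already present; nothing to do
        with open(robots_path, 'a', encoding='utf-8') as f:
            f.write(RULES)
        return True
    except OSError:
        return False  # no permission / bad path; best effort only
```

Making it idempotent (the `existing` check) matters, since the installer or upgrader may run more than once against the same file.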
We could ship one in the install package at least. For the upgrade package we could just name it something else like "smf-robots.txt" and tell users to rename it if they want to use it. |
#8382 provides SMF with the ability to add rules to robots.txt. Currently, it adds rules to tell all spiders that they should ignore URLs that match the pattern …
Basic Information
I'm sharing some observations here & a suggestion.
A couple notes to start...

- Google only indexes canonical URLs; for SMF, that's the `https://forum.com/index.php?topic=######.##` form. I believe most other crawlers exhibit similar behavior. Some honor crawl-delay. I don't see anybody honoring nofollow. And from what I can see (e.g., Bing), they also only index canonicals.
Some notes on SMF...

- SMF links each individual post as `https://forum.com/index.php?msg=######`, with a nofollow.
- A request for that URL gets a 302 redirect to `https://forum.com/index.php?topic=######.msg#######msg######`.
- That page's canonical, in turn, is `https://forum.com/index.php?topic=######.##`.
So... Putting this all together... For a forum that is configured to show 25 posts on a page, the sequence of events looks like this:

1. Google requests `https://forum.com/index.php?topic=######.##`. This is canonical form, so it is indexed.
2. For each of the 25 posts on that page, Google follows its `https://forum.com/index.php?msg=######` link.
3. Each `https://forum.com/index.php?msg=######` request is answered with a 302 redirect to `https://forum.com/index.php?topic=######.msg#######msg######`.
4. Google then requests `https://forum.com/index.php?topic=######.msg#######msg######`, but that page has a canonical of `https://forum.com/index.php?topic=######.##` in the head, so Google discards it. Note this canonical URL will match what was on the very first line in this sequence.

So... There was one successful page load & index. However, there were 25 subsequent requests to SMF that resulted in redirects, and 25 more requests to SMF that were discarded... Literally 50x the number of requests to the site for the actual content indexed.

Note that all the content was already properly indexed - all 25 messages were on the original `https://forum.com/index.php?topic=######.##` request. It's just requesting that same content an additional 50x.

Suggestions:
- …
- Instead of the `https://forum.com/index.php?msg=######` format, use the `https://forum.com/index.php?topic=######.msg#######msg######` format, as that will avoid the 302s.

I believe either of the above suggestions will reduce the # of crawler requests to the site by half.
I don't know how to get Google to honor nofollow... But they do honor robots.txt. Maybe site admins can disallow `msg=` via robots.txt. I don't believe there is a valid `msg=` canonical anywhere that would be indexed.
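A disallow along those lines can be checked with Python's stdlib robots.txt parser (the rule text here is my own illustration, not an official SMF rule): the `msg=` form is blocked while the canonical topic form stays crawlable.

```python
from urllib import robotparser

# Illustrative robots.txt contents:
#   User-agent: *
#   Disallow: /index.php?msg=
rp = robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /index.php?msg=',
])

# The short per-post form is blocked...
print(rp.can_fetch('GoogleOther', 'https://forum.com/index.php?msg=67890'))
# ...while the canonical topic form stays crawlable.
print(rp.can_fetch('GoogleOther', 'https://forum.com/index.php?topic=123.0'))
```

Note this only models well-behaved crawlers; as mentioned above, robots.txt disallows are honored by Google, while crawl-delay is not.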
Steps to reproduce
You can see all of the above in:
Expected result
Closer to a 1:1 relationship between requests & indexed URLs
Actual result
Crawlers hit the site 50x more than is necessary.
Version/Git revision
3.0 alpha 2 & 2.1.4
Database Engine
All
Database Version
8.4
PHP Version
8.3.8
Logs
No response
Additional Information
No response