Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double-processing post content when parsing block template HTML #3549

Open
wants to merge 8 commits into
base: trunk
Choose a base branch
from

Conversation

felixarntz
Copy link
Member

@felixarntz felixarntz commented Nov 1, 2022

  • get_the_block_template_html() calls various (partially expensive) functions to parse the block template content.
  • A problem with that is that, within the block template, there usually are one or more instances of the post-content block, which itself parses post content using similar functions (which has historically been the established behavior).
  • This means that the post content blobs within the template content are now double-processed. Not only is this wasteful, more importantly it leads to various bugs (e.g. Trac ticket 56930) and is unintuitive for plugin developers (e.g. if you use the wp_content_img_tag filter to make a modification, you now have to consider that this filter may run twice on the same piece of content).
  • This PR addresses this problem by doing the following:
    • After fully parsing the blocks in the block template, the post content has already been fully parsed including all the typical functions.
    • To avoid double parsing, it is then extracted out of the block template temporarily and replaced with placeholders.
    • The similar functions can then process the rest of the block template, without touching the post content again.
    • Once that has happened, the post content is re-injected in the correct places, keeping the overall template content intact while fixing the double parsing.
  • Having benchmarked this with a basic post in Twenty Twenty-Two, there is no notable performance benefit to this (neither faster nor slower, since replacing the content is a similar amount of logic), but the critical part here is fixing the double execution and the bugs that are resulting from it.

Trac ticket: https://core.trac.wordpress.org/ticket/55996


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

$content = wptexturize( $content );
$content = convert_smilies( $content );
$content = shortcode_unautop( $content );
$content = wp_filter_content_tags( $content );
$content = wp_filter_content_tags( $content, 'the_block_template' );
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is a bit unrelated, but since here this function is called outside of any filter, it is crucial to manually provide a $context. I went with the_block_template for now (similar to e.g. the_content), but certainly open to alternatives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this in another commit. No need to block this change as of this ticket.

@carlomanf
Copy link

carlomanf commented Nov 2, 2022

I can see three major flaws with this implementation:

  1. If the render_block filter is used to change the string that was saved to the global variable, the string search will fail and the post content will not get replaced.
  2. If the same placeholder string was already present in the raw content, such as inside a Custom HTML block, it will get replaced unintentionally.
  3. If one of the shortcode callbacks called through do_shortcode corrupts the placeholder string, it will not get replaced and the post content will remain missing from the final output.

There is an alternative implementation at WordPress/gutenberg#44995 that avoids all of the above and is much cleaner.

@felixarntz
Copy link
Member Author

@carlomanf Thanks for sharing these. Some replies:

  1. That's a great catch. We can address that though, by moving the logic of storing in the global variable to the latest possible point, where no external code can interfere anymore.
  2. I think this really is a theoretical concern. We definitely should use a placeholder string that is "unique" enough so that nobody will randomly use a similar placeholder. Of course one can still choose to use exactly the same placeholder, but then you're intentionally breaking this feature, and it goes without saying that if you have code access and want to break WordPress, you already can in millions of ways. Definitely open to choosing another "more unique" placeholder string if that's preferable.
  3. Realistically, under which circumstances could that possibly occur? The template content would need to have a shortcode, which is the first thing I question. Shortcodes are mostly a thing of the past, and while they are still used a ton in post content, I have yet to see or hear about any usage of shortcodes in a block template. More importantly though, even if a shortcode is used in a template, it would need to be a shortcode that wraps the post-content block, which I think is even far more unlikely. So this one I as well consider a theoretical concern.

To be clear, I'm not saying the last 2 things are impossible to happen, but they are extreme edge cases to the degree that makes them irrelevant (<0.1% of sites).

@felixarntz
Copy link
Member Author

@carlomanf I have pushed additional commits which address your points 1. and 2. While I don't think 2. is a significant concern, it's still trivial to address so I think it's worth doing.

The fix to point 1. led me to another neat code cleanup here, as I've now replaced the global with a simple local variable that cannot be tampered with outside of the intended usage (when parsing the core/post-content block).

@carlomanf
Copy link

carlomanf commented Nov 2, 2022

Points 2 and 3 that I raised can also be combined, so one of the shortcode callbacks could add <!-- wp-post-content-placeholder-1 --> and it would then get replaced. I wouldn't say it's necessarily an intentional attempt to break this feature, it might just be a coincidence that they choose the same string.

Why not use a string that is generated uniquely at runtime, such as with uniqid? Then, in case any corruption errors occur, they would likely be solved by refreshing the page.

Also, this solution does not stop double filtering when one post content is embedded inside another post content, which is a far more likely occurrence than the other scenarios and is handled easily by WordPress/gutenberg#44995. How feasible do you think it would be to modify this solution to handle it?

@felixarntz
Copy link
Member Author

Providing a quick update here, I've found another simpler fix for the lazy-loading problem from https://core.trac.wordpress.org/ticket/56930 specifically and opened #3560. Therefore I've removed one unit test from here which is now part of that PR. This PR here remains valid for fixing https://core.trac.wordpress.org/ticket/55996.

@felixarntz
Copy link
Member Author

@carlomanf

Points 2 and 3 that I raised can also be combined, so one of the shortcode callbacks could add <!-- wp-post-content-placeholder-1 --> and it would then get replaced. I wouldn't say it's necessarily an intentional attempt to break this feature, it might just be a coincidence that they choose the same string.

My previous comment remains, if you want to break this feature, you can. Let's be realistic here, what are the chances that there is a shortcode in an FSE template that wraps a post-content block and then adds some placeholder in there (without replacing it with anything, which would probably be a bug by itself!) that is exactly the same placeholder as WP core uses? Obviously WordPress operates at a massive scale and edge-cases occur, but what we're talking about here is extremely far fetched to the degree that I can't see it happening without intentionally wanting to break this.

Why not use a string that is generated uniquely at runtime, such as with uniqid? Then, in case any corruption errors occur, they would likely be solved by refreshing the page.

We can potentially combine this with the current approach. Using uniqid() by itself would arguably worse than the current approach, since now it attempts to find a placeholder that isn't anywhere in the content. A string created by uniqid() could also be in the template, while the current code makes a collision with the parsed post content impossible.

Also, this solution does not stop double filtering when one post content is embedded inside another post content, which is a far more likely occurrence than the other scenarios and is handled easily by WordPress/gutenberg#44995. How feasible do you think it would be to modify this solution to handle it?

What is a realistic use-case to include a post-content block within another post-content block? To my knowledge, that is not even possible since this is a template-specific block.

@carlomanf
Copy link

Realistically, under which circumstances could that possibly occur? The template content would need to have a shortcode, which is the first thing I question. Shortcodes are mostly a thing of the past, and while they are still used a ton in post content, I have yet to see or hear about any usage of shortcodes in a block template. More importantly though, even if a shortcode is used in a template, it would need to be a shortcode that wraps the post-content block, which I think is even far more unlikely. So this one I as well consider a theoretical concern.

Imagine there is a shortcode that wraps some content, performs a word count on it, and prints out the word count appended to the raw content. Currently, if a post content block is added within the shortcode, the post content is included in the word count. After this patch is applied, the shortcode will begin to receive the placeholder and the post content will no longer be included in the word count.

I don't know of a particular shortcode that does the above or something like it, but I wouldn't underestimate the use of legacy features and I wouldn't at all be surprised if the bug gets reported at some point. For example, WordPress/gutenberg#35676 is a separate but similar issue that shows how widely plugin/theme authors are using shortcodes together with blocks and discovering unexpected behaviour.

The issue with the hypothetical word count shortcode would actually be worse, because it would not be the unexpected behaviour of a new feature, but an unexpected change in behaviour after several release cycles given that shortcode support was added in 5.9.

The root cause of all of these bugs is that the block paradigm that was ostensibly intended to be a clean slate ended up receiving support for legacy features, some of which date back to the b2/cafelog days, and neither was ever designed to be compatible with the other. There is probably no way forward here that isn't going to break at least something, and quoting obviously fabricated statistics like "<0.1% of sites" is not going to be helpful, especially not after the bug reports inevitably roll in.

The main difference between WordPress/gutenberg#44995 and this patch is as follows:

  • Both patches ensure that the filters never run more than once, and therefore solve any problems introduced by the double filtering, e.g. the lazy-load images.
  • This one does so by running the filters separately on different pieces of content and re-assembling them at the end.
  • The other patch does so by running the filters just once on the whole content, which might be better from a performance perspective.
  • Both patches are likely to break existing behaviour in different ways.
  • It will be practically impossible to quantify the scale of the breakage, especially before any patch is deployed.

@felixarntz
Copy link
Member Author

@carlomanf I agree with you in that there is a risk of potential breakage in this change. I do however still think that the risk for what you're pointing out to come up as a real-world bug is extremely low. It obviously does not make sense to argue about this because we can't know until it's out there. My take here specifically is that the chance of this causing a bug is so small that we can add it to core, basically "wait and see". If my assessment here shows to be wrong, and we get bug reports of this, I'll happily revert and then we have to look for alternatives. We also have beta period for that purpose (though even if this is in a stable release, we could revert it if it became a problem).

The implementation from WordPress/gutenberg#44995 has other problems, specifically for the wp_filter_content_tags() function it introduces a backward compatibility break. The function is supposed to run specifically on the_content, as otherwise the lazy-loading logic will not apply correctly. It seems to me that the main difference between the two PRs is that this one keeps those functions running on the_content instead of the block template, while the other one runs them on the block template instead of the_content. The first is how WordPress has been behaving since forever basically, which is why I consider that a backward compatible approach.

Comment on lines +265 to +266
$pos = strpos( $content, $post_content );
if ( false !== $pos ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use str_contains here?

// Ensure a placeholder is used that is not already present in the content.
$content_placeholder = '<!-- wp-post-content-placeholder-%d -->';
$existing_placeholder_count = 0;
while ( preg_match( '/' . sprintf( $content_placeholder, '[0-9]+' ) . '/', $content ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love using a while loop and regex ( preg_match ) here. Is there no way we could do a single regex and loop through all the matching blocks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it should be possible with preg_match_all().

Comment on lines +281 to +290
$content = str_replace(
array_map(
function( $i ) use ( $content_placeholder ) {
return sprintf( $content_placeholder, $i );
},
array_keys( $parsed_post_content )
),
$parsed_post_content,
$content
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not say this is specially readable. Any chance we could just use a simple for each loop here? It would mean not requiring this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea. Also, while not huge in real terms, using foreach() will be ~15-30% faster.

foreach ( $parsed_post_content as $item_number => $replacement ) {
    $content = str_replace(
        sprintf( $content_placeholder, $item_number ),
        $replacement,
        $content
    );
}

},
PHP_INT_MAX
);

$content = $wp_embed->run_shortcode( $_wp_current_template_content );
$content = $wp_embed->autoembed( $content );
$content = do_blocks( $content );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this?

remove_filter( 'the_excerpt', 'wp_filter_content_tags' );
remove_filter( 'the_content', 'wp_filter_content_tags' );
$content = do_blocks( $content );
add_filter( 'the_excerpt', 'wp_filter_content_tags' );
add_filter( 'the_content', 'wp_filter_content_tags' );

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this?

remove_filter( 'the_excerpt', 'wp_filter_content_tags' );
remove_filter( 'the_content', 'wp_filter_content_tags' );
$content = do_blocks( $content );
add_filter( 'the_excerpt', 'wp_filter_content_tags' );
add_filter( 'the_content', 'wp_filter_content_tags' );

WordPress/gutenberg#44995 does exactly what you are suggesting. @felixarntz informed me that it does not actually fix the problem, and nor does your similar implementation #3833.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He did not seem at all interested in assisting with a solution based on filter removal, even after I informed him that his implementation would break shortcodes. I just don't have the familiarity with wp_filter_content_tags to know how to implement it correctly, and all I can tell is that the $context variable needs to be set to 'the_content' for it to work as intended.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in my PR, even after applying this PR, filters are applied double, as image filter is applied in template_part funciton.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not talking about the same thing. It's not because of the template part, it's because there are conditions that require the the $context variable to be set to 'the_content' rather than 'the_block_template_html', and I'm not sure of the ramifications of changing the condition to allow a different value or explicitly passing 'the_content' to make the condition pass. If @felixarntz is willing to assist in this area then we might be able to use WordPress/gutenberg#44995 as the solution instead of putting up with his messy implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this?

@spacedmonkey Removing the filters around the do_blocks() call is only a partial solution. This "record before, reinstate after" approach is the only one that solves both "double execution problem" and also the "avoid all execution problem" as discussed at length in 55996.

If I need to open another ticket on Trac so that the "avoid all execution problem" is considered of equal importance to the "double execution problem" then I guess I'll do that :) In the meantime, this approach is the one that works for both problems.

Comment on lines +253 to +254
// Replace all post content within the template with a temporary placeholder so that the post content is not
// unnecessarily processed again. See https://core.trac.wordpress.org/ticket/55996.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Replace all post content within the template with a temporary placeholder so that the post content is not
// unnecessarily processed again. See https://core.trac.wordpress.org/ticket/55996.
/*
* Replace all post content within the template with a temporary placeholder so that the post content is not
* unnecessarily processed again. See https://core.trac.wordpress.org/ticket/55996.
*/

Should use the multiline comment format here.

Copy link
Contributor

@costdev costdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @felixarntz and was glad to see it resolved the issue with shortcodes reported in another ticket. Nice work!

I've left some thoughts above/below.

@@ -226,4 +264,109 @@ public function test_resolve_home_block_template_no_resolution() {

$this->assertNull( $template );
}

/**
* @ticket 55996
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have test descriptions beginning with "Tests" per convention, and covers annotations?

/**
 * Tests that X does/does not Y when Z
 *
 * @ticket 55996
 *
 * @covers
 */

Same for other tests in the PR as well.

Comment on lines +305 to +306
$this->assertStringContainsString( '<img data-attr="value" ', $html );
$this->assertStringNotContainsString( '<img data-attr="value" data-attr="value" ', $html );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test methods with multiple assertions should have a final $message parameter to explain what went wrong should an assertion fail. Same for other tests in the PR as well.

}

public static function wpTearDownAfterClass() {
wp_delete_post( self::$post_id, true );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the parent method deletes all posts. Curious, do other tests in the test suite fail without this line?

// Ensure a placeholder is used that is not already present in the content.
$content_placeholder = '<!-- wp-post-content-placeholder-%d -->';
$existing_placeholder_count = 0;
while ( preg_match( '/' . sprintf( $content_placeholder, '[0-9]+' ) . '/', $content ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it should be possible with preg_match_all().

@carlomanf
Copy link

Also, this solution does not stop double filtering when one post content is embedded inside another post content, which is a far more likely occurrence than the other scenarios and is handled easily by WordPress/gutenberg#44995. How feasible do you think it would be to modify this solution to handle it?

What is a realistic use-case to include a post-content block within another post-content block? To my knowledge, that is not even possible since this is a template-specific block.

It is possible, encouraged, and mentioned in this video at 5–6 minutes, 18 minutes and 32 minutes. It is also mentioned in this article and this article.

Handling post content within post content is essential for this issue to be solved, and currently, this patch does nothing to handle it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants