Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OP_SUBSTR_LEFT - a specialised OP_SUBSTR variant #22785

Open
wants to merge 4 commits into
base: blead
Choose a base branch
from

Conversation

richardleach
Copy link
Contributor

This commit adds OP_SUBSTR_NIBBLE and associated machinery for fast handling of the constructions:

    substr EXPR,0,LENGTH,''

and

    substr EXPR,0,LENGTH

Where EXPR is a scalar lexical, the OFFSET is zero, and either there is no REPLACEMENT or it is the empty string. LENGTH can be anything that OP_SUBSTR supports. These constraints allow for a very stripped back and optimised version of pp_substr.

The primary motivation was for situations where a scalar, containing some network packets or other binary data structure, is being parsed piecemeal. Nibbling away at the scalar can be useful when you don't know how exactly it will be parsed and unpacked until you get started. It also means that you don't need to worry about correctly updating a separate offset variable.

This operator also turns out to be an efficient way to (destructively) break an expression up into fixed size chunks. For example, given:

my $x = ''; my $str = "A"x100_000_000;

This code:

$x = substr($str, 0, 5, "") while ($str);

is twice as fast as doing:

for ($pos = 0; $pos < length($str); $pos += 5) {
    $x = substr($str, $pos, 5);
}

Compared with blead, $y = substr($x, 0, 5) runs 40% faster and $y = substr($x, 0, 5, '') runs 45% faster.


  • This set of changes requires a perldelta entry, and I will add one shortly.

@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from d6f958e to f258d43 Compare November 26, 2024 14:23
Copy link
Contributor

@leonerd leonerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small comments but overall nothing troubling-looking here.

I wonder a bit about the name though. I've usually seen the word "nibble" to mean a half-byte; i.e. a 4-bit value. I wondered if that is what is going on here at first. If there are other candidate names to call it, perhaps something else would be better? Not a huge problem though.

peep.c Outdated Show resolved Hide resolved
peep.c Outdated Show resolved Hide resolved
pp.c Outdated Show resolved Hide resolved
@richardleach
Copy link
Contributor Author

How about substr_chop, since a big part of its design is being a fast route for getting to Perl_sv_chop?

Food related alternatives: substr_peck? substr_graze? substr_smidgen? substr_julienne? substr_shuck? substr_heel, like the end of a loaf? Non-foodie: substr_shave?

@richardleach
Copy link
Contributor Author

Not sure what's going on with the ABRT test failures. Don't get them locally. ./perl -Ilib t/perf/benchmarks.t seemingly only uses 8MB of memory, so running out of memory doesn't seem to be the cause.

@richardleach
Copy link
Contributor Author

Not sure what's going on with the ABRT test failures.

Looks like an op_private flags assertion. I'll dig into it soon.

@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from f258d43 to 98d187d Compare November 28, 2024 07:14
@richardleach richardleach changed the title OP_SUBSTR_NIBBLE - a specialised OP_SUBSTR variant OP_SUBSTR_CHOP - a specialised OP_SUBSTR variant Nov 28, 2024
@richardleach
Copy link
Contributor Author

I'm rebasing and renaming it to SUBSTR_CHOP.

@leonerd
Copy link
Contributor

leonerd commented Nov 28, 2024

Doesn't perl's chop() function eat from the other end though?

@richardleach
Copy link
Contributor Author

The Perl chop takes from the end, but Perl_sv_chop takes from the front. (I don't know who we have to thank for that amazing piece of naming.) The pp_ function for this op calls Perl_sv_chop.

@jkeenan
Copy link
Contributor

jkeenan commented Nov 29, 2024

@richardleach , merge conflicts ^^

@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from 98d187d to 24108a7 Compare November 29, 2024 00:50
@leonerd
Copy link
Contributor

leonerd commented Nov 29, 2024

The Perl chop takes from the end, but Perl_sv_chop takes from the front. (I don't know who we have to thank for that amazing piece of naming.) The pp_ function for this op calls Perl_sv_chop.

Oh wow. Huh. In that case, might as well call this one SUBSTR_CHOP indeed then.

Otherwise my thoughts were going to be something like SUBSTR_PREFIX but that isn't much more descriptive.

@Grinnz
Copy link
Contributor

Grinnz commented Nov 30, 2024

Consider ltrim, with inspiration from PHP and Redis (or lstrip a la Ruby/Python but that sounds more whitespace-specific). Though it is also unrelated to builtin::trim, I think it's a bit more descriptive at least

@richardleach
Copy link
Contributor Author

Consider ltrim, with inspiration from PHP and Redis (or lstrip a la Ruby/Python but that sounds more whitespace-specific). Though it is also unrelated to builtin::trim, I think it's a bit more descriptive at least

Hmmm, I'm not sure about this. It seems only more descriptive to someone who already is familiar with ltrim, otherwise it's likely to lead to confusion with builtin:trim or even reducing the other end of the string. There might be some confusion around _CHOP, but at least the connection to sv_chop is there.

@tonycoz
Copy link
Contributor

tonycoz commented Dec 2, 2024

Maybe OP_SUBSTR_LEFT borrowing from perl's BASIC antecedents.

peep.c Outdated Show resolved Hide resolved
@richardleach
Copy link
Contributor Author

Maybe OP_SUBSTR_LEFT borrowing from perl's BASIC antecedents.

Ok, that seems straightforward enough without colliding with Perlspace. Will rename.

@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from 24108a7 to 2712d8f Compare December 5, 2024 22:23
Variants are named to match the style of macros in op.h
@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from 2712d8f to 3c55fa6 Compare December 5, 2024 22:36
@richardleach richardleach changed the title OP_SUBSTR_CHOP - a specialised OP_SUBSTR variant OP_SUBSTR_LEFT - a specialised OP_SUBSTR variant Dec 12, 2024
@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from 3c55fa6 to c468cc5 Compare December 12, 2024 22:12
@richardleach
Copy link
Contributor Author

OP renamed to OP_SUBSTR_LEFT and, I think, review comments all addressed.

peep.c Outdated Show resolved Hide resolved
pp.c Outdated Show resolved Hide resolved
This commit adds OP_SUBSTR_LEFT and associated machinery for fast
handling of the constructions:

        substr EXPR,0,LENGTH,''
and
        substr EXPR,0,LENGTH

Where EXPR is a scalar lexical, the OFFSET is zero, and either there
is no REPLACEMENT or it is the empty string. LENGTH can be anything
that OP_SUBSTR supports. These constraints allow for a very stripped
back and optimised version of pp_substr.

The primary motivation was for situations where a scalar, containing
some network packets or other binary data structure, is being parsed
piecemeal. Nibbling away at the scalar can be useful when you don't
know how exactly it will be parsed and unpacked until you get started.
It also means that you don't need to worry about correctly updating
a separate offset variable.

This operator also turns out to be an efficient way to (destructively)
break an expression up into fixed size chunks. For example, given:

    my $x = ''; my $str = "A"x100_000_000;

This code:

    $x = substr($str, 0, 5, "") while ($str);

is twice as fast as doing:

    for ($pos = 0; $pos < length($str); $pos += 5) {
        $x = substr($str, $pos, 5);
    }

Compared with blead, `$y = substr($x, 0, 5)` runs 40% faster and
`$y = substr($x, 0, 5, '')` runs 45% faster.
@richardleach richardleach force-pushed the hydahy/op_substr_nibble2 branch from c468cc5 to 0e93328 Compare December 16, 2024 23:13
Comment on lines +3738 to +3749

if (SvROK(sv) && do_chop) {
Perl_ck_warner(aTHX_ packWARN(WARN_SUBSTR),
"Attempt to use reference as lvalue in substr"
);
}

if (do_chop) {
SvGETMAGIC(sv);
tmps = SvPV_force_nomg(sv, curlen);
} else
tmps = SvPV_const(sv, curlen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're checking SvROK(sv) before calling SvGETMAGIC() which would be incorrect if sv is tied - I think I would have caught that before, but maybe I missed it.

You could move the reference check after the SvGETMAGIC() inside the do_chop conditional to fix it (and better match pp_substr).

@bulk88
Copy link
Contributor

bulk88 commented Dec 19, 2024

BINOPs like PP

if(index($str, 'ZZZZZZ) == -1) { }
XS have no concpext of "G_BOOL" content. There is definently a need to deliver bool contet, from runloop to the XS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants