${parameter//""/string} does not work correctly with multibyte string #813

Open
ko1nksm opened this issue Dec 29, 2024 · 4 comments
Labels
bug Something is not working

ko1nksm commented Dec 29, 2024

$ ksh --version
  version         sh (AT&T Research) 93u+m/1.1.0-alpha+be0cfbff 2024-12-29

$ ksh -c 'v=あいう; echo "${v//""/-}"'
-あ-�-�-い-�-�-う-�-�-

$ ksh -c 'v=あいう; echo "${v//""/-}"' | od -tx1
0000000 2d e3 81 82 2d 81 2d 82 2d e3 81 84 2d 81 2d 84
0000020 2d e3 81 86 2d 81 2d 86 2d 0a
0000032
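
One way to confirm that the dump above is malformed UTF-8, rather than a terminal rendering problem, is to feed the bytes through a validator. This is a minimal sketch: `is_valid_utf8` is a hypothetical helper, and it assumes an `iconv` that exits non-zero on invalid input.

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# Assumes an iconv(1) that exits non-zero on invalid input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# First eight bytes of the dump above (2d e3 81 82 2d 81 2d 82), in octal.
if printf '\055\343\201\202\055\201\055\202' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "invalid UTF-8"
fi
```

The continuation bytes 81 and 82 appear on their own after the complete character e3 81 82, which is what makes the sequence invalid.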
@McDutchie McDutchie added the bug Something is not working label Dec 30, 2024
McDutchie commented Dec 30, 2024

Thanks for the report!

It works correctly in 93u+ and ksh2020, so we introduced this bug. :(

Testing my stash of compiled commits shows that the bug was introduced in commit ceae1e4.

Edit: Actually, it doesn't work at all in 93u+ and ksh2020. It has no effect there, but at least it doesn't output invalid characters.

@McDutchie

${v//""/-} has no effect in bash or mksh either; it only works in zsh. So I think it is okay to revert to the previous behaviour in this case. This patch does that.

diff --git a/src/cmd/ksh93/sh/macro.c b/src/cmd/ksh93/sh/macro.c
index c99ac5719..757ea2d9b 100644
--- a/src/cmd/ksh93/sh/macro.c
+++ b/src/cmd/ksh93/sh/macro.c
@@ -1916,7 +1916,7 @@ retry2:
 							flag & STR_MAXIMAL);
 					else
 						nmatch = strngrpmatch(v, vsize,
-							*pattern ? pattern : "~(E)^",
+							*pattern || (flag & STR_MAXIMAL) ? pattern : "~(E)^",
 							(ssize_t*)match,
 							elementsof(match) / 2,
 							flag | STR_INT);
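
The cross-shell comparison described in this comment can be reproduced with a loop like the following. It is only a sketch: `try_empty_pattern` is a hypothetical helper, the list of shell names is an assumption about what is installed, and the test value is plain ASCII so that the result does not depend on the locale.

```shell
# Hypothetical helper: run the empty-pattern substitution in the given shell.
# An ASCII value is used so the result does not depend on the locale.
try_empty_pattern() {
    "$1" -c 'v=abc; echo "${v//""/-}"' 2>&1
}

# Report the result for each shell that is actually installed.
for sh in bash mksh zsh ksh; do
    if command -v "$sh" >/dev/null 2>&1; then
        printf '%-5s %s\n' "$sh" "$(try_empty_pattern "$sh")"
    else
        printf '%-5s (not installed)\n' "$sh"
    fi
done
```

A shell in which the empty pattern has no effect prints the value unchanged.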

@McDutchie

However, that fix just masks a bug in the libast regex code, which can also be triggered in the AT&T versions, like so:

$ /bin/ksh -c 'echo ${.sh.version}; v=あいう; echo "${v//~(E)^/-}"'
Version AJM 93u+ 2012-08-01
-あ-?-?-い-?-?-う-?-?-
$ ksh-2020.0.1 -c 'echo ${.sh.version}; v=あいう; echo "${v//~(E)^/-}"'
Version A 2020.0.0-8-ge4fea8c5
-?-?-?-?-?-?-?-?-?-

ko1nksm commented Dec 31, 2024

I don't mind reverting to the original behavior. I was wondering whether this might be a feature.

I was curious how an empty string in a pattern behaves, so I looked at how other languages handle it (e.g., replaceAll, replaceFirst). Some languages, such as JavaScript and Ruby, seem to insert the replacement between characters (and also at the end). Frankly, I find these specifications counter-intuitive. Python does not seem to accept an empty pattern.

Since there is no pattern to replace, leaving the string unchanged is reasonable enough behavior.
