regrepeat: Use new utf8_to_uv; not utf8_to_uvchr_buf #22827

Open
wants to merge 1 commit into
base: blead

khwilliamson
Contributor

This is a subtle bug fix for when the input is malformed UTF-8. We say we don't support malformed input, but this commit is a step towards better protecting against that eventuality.

Prior to this commit, some patterns that use regrepeat() would match malformed input differently depending on whether utf8 warnings were enabled.

This is because utf8_to_uvchr_buf() returns NUL if utf8 warnings are on, and the REPLACEMENT CHARACTER if they are off. If the match criteria accept one but not the other, the behavior would differ.

Now, the repetition stops immediately, without it being considered a match, when a malformed input character is found.
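
The difference described above can be illustrated with a small standalone sketch. It is not the code in regexec.c and does not call any Perl API: `toy_decode`, `old_style_decode`, `accepts`, `repeat_old`, and `repeat_new` are hypothetical names, the decoder handles only ASCII and two-byte sequences, and the match criterion ("any code point except NUL") is an assumption chosen because it accepts the REPLACEMENT CHARACTER but rejects NUL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical toy decoder: ASCII and two-byte UTF-8 only.
 * Returns false on anything else, which this sketch treats as malformed. */
static bool toy_decode(const uint8_t *s, const uint8_t *e,
                       uint32_t *cp, size_t *len)
{
    if (s >= e)
        return false;
    if (s[0] < 0x80) {
        *cp = s[0];
        *len = 1;
        return true;
    }
    if (s[0] >= 0xC2 && s[0] <= 0xDF && e - s >= 2 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        *len = 2;
        return true;
    }
    return false;
}

/* Models the behavior the PR describes for the older interface: on
 * malformed input the caller only sees a sentinel code point, and which
 * sentinel depends on whether warnings are enabled. */
static uint32_t old_style_decode(const uint8_t *s, const uint8_t *e,
                                 size_t *len, bool warnings_on)
{
    uint32_t cp;
    if (toy_decode(s, e, &cp, len))
        return cp;
    *len = 1;   /* assumption: skip one byte past the malformation */
    return warnings_on ? 0 : REPLACEMENT_CHARACTER;
}

/* Assumed match criterion: any code point except NUL. */
static bool accepts(uint32_t cp)
{
    return cp != 0;
}

/* Old-style loop: the sentinel is fed to the match criterion, so the
 * count of matched characters depends on the warnings setting. */
static size_t repeat_old(const uint8_t *s, const uint8_t *e, bool warnings_on)
{
    size_t count = 0;
    while (s < e) {
        size_t len;
        if (!accepts(old_style_decode(s, e, &len, warnings_on)))
            break;
        s += len;
        count++;
    }
    return count;
}

/* New-style loop: a decode failure ends the repetition immediately and
 * is never offered to the match criterion. */
static size_t repeat_new(const uint8_t *s, const uint8_t *e)
{
    size_t count = 0;
    while (s < e) {
        uint32_t cp;
        size_t len;
        if (!toy_decode(s, e, &cp, &len) || !accepts(cp))
            break;
        s += len;
        count++;
    }
    return count;
}
```

For the five bytes `'a', 'b', 0xFF, 'c', 'd'`, `repeat_old` matches 2 characters with warnings on and 5 with warnings off, while `repeat_new` matches 2 regardless of any warnings setting.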

  • This set of changes does not require a perldelta entry.

@khwilliamson khwilliamson marked this pull request as draft December 6, 2024 04:53
@khwilliamson khwilliamson marked this pull request as ready for review December 19, 2024 00:04
@khwilliamson
Contributor Author

@tonycoz approved this, but I'm wondering if it would be better to treat this and similar issues as run-time errors, and croak if malformed UTF-8 is encountered. The commit causes the repeat to stop at the first malformed character, but maybe it's better not to continue at all. We pretty much assume that by the time strings get here, they have been validated. @demerphq what do you think?
