Merge branch 'Fix utf8 corner cases' into blead

There are around 20 different functions that take a UTF-8 sequence of
bytes and try to find the ordinal code point represented by them. It was
becoming clear that the existing tests in our suite were inadequate, not
finding glaring bugs. And UTF-8 handling is important: failures in it
have been exploited by attackers in various products over the years.

I set out to improve the tests, spending way too much time before
realizing that adding band-aids to the current scheme was not going to
work out. So I undertook rewriting the tests. This turned out to be way
harder and more time-consuming than I expected. And it still isn't ready to
go into blead. But along the way, I discovered that it was finding
corner case bugs that I would never have anticipated. This series of
commits fixes those, while simplifying the code and reducing redundancy.

The new test file needs clean-up, and probably ways to make it faster,
but it is finally far enough along that I believe it has caught most of
the bugs out there. So I'm submitting these now to get into v5.42. The
deadline for the test file is later in the development process.
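
To illustrate the kind of corner case involved, here is a minimal sketch
of decoding one UTF-8 sequence with two policies for overlong encodings.
It is not the code in utf8.c. The names sketch_decode and
sketch_overlong_policy are hypothetical, and the two policies only mirror
in spirit what the utf8.h comment says about UTF8_ALLOW_LONG and
UTF8_ALLOW_LONG_AND_ITS_VALUE. It assumes plain ASCII-platform UTF-8 of
at most 4 bytes and ignores surrogates and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical policy names for this sketch; they are not Perl's flags. */
enum sketch_overlong_policy {
    SKETCH_OVERLONG_TO_REPLACEMENT,  /* overlong becomes U+FFFD */
    SKETCH_OVERLONG_KEEP_VALUE       /* overlong yields what it evaluates to */
};

/* Decode one sequence starting at s, with len bytes available.
 * Returns the code point and stores the number of bytes consumed in
 * *used.  Malformed input returns U+FFFD and consumes one byte
 * (zero bytes if len is 0). */
static uint32_t
sketch_decode(const uint8_t *s, size_t len,
              enum sketch_overlong_policy policy, size_t *used)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t cp;

    assert(s != NULL && used != NULL);

    *used = 0;
    if (len == 0)
        return SKETCH_REPLACEMENT;
    *used = 1;

    if (s[0] < 0x80)
        return s[0];                               /* one-byte (ASCII) */
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else
        return SKETCH_REPLACEMENT;   /* stray continuation or bad start */

    if (len < need)
        return SKETCH_REPLACEMENT;                 /* truncated input */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return SKETCH_REPLACEMENT;             /* not a continuation */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *used = need;

    /* Overlong: the value would have fit in a shorter sequence */
    if (cp < min_for_len[need] && policy == SKETCH_OVERLONG_TO_REPLACEMENT)
        return SKETCH_REPLACEMENT;
    return cp;
}
```

With this sketch, the two bytes C0 AF (an overlong encoding of '/')
decode to U+FFFD under the first policy and to 0x2F under the second,
consuming two bytes in both cases. Accepting the second result where only
the shortest form should be allowed is the classic way a filter that
looks for a literal '/' gets bypassed.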
Karl Williamson 2025-03-17 08:42:50 -06:00
commit a1805b9cc6
4 changed files with 579 additions and 568 deletions


@@ -2020,8 +2020,12 @@ foreach my $test (@tests) {
 @warnings_gotten = @returned_warnings;
 }
+SKIP: {
+skip "$0 doesn't handle _msgs functions AV returns", 1
+if $utf8_func =~ /_msgs/;
 do_warnings_test(@expected_warnings)
 or diag "Call was: " . utf8n_display_call($eval_text);
+}
 undef @warnings_gotten;
 # Check CHECK_ONLY results when the input is


@@ -3244,7 +3244,6 @@ PERL_STATIC_INLINE UV
Perl_utf8_to_uvchr_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
{
PERL_ARGS_ASSERT_UTF8_TO_UVCHR_BUF;
assert(s < send);
UV cp;

utf8.c: 1139 changed lines; file diff suppressed because it is too large

utf8.h: 3 changed lines

@@ -1206,8 +1206,9 @@ point's representation.
  * First one will convert the overlong to the REPLACEMENT CHARACTER; second
  * will return what the overlong evaluates to */
 #define UTF8_ALLOW_LONG 0x2000
-#define UTF8_ALLOW_LONG_AND_ITS_VALUE 0x4000
 #define UTF8_GOT_LONG UTF8_ALLOW_LONG
+#define UTF8_ALLOW_LONG_AND_ITS_VALUE 0x4000
+#define UTF8_GOT_LONG_WITH_VALUE UTF8_ALLOW_LONG_AND_ITS_VALUE
 /* For back compat, these old names are misleading for overlongs and
  * UTF_EBCDIC. */