Merge branch 'Fix utf8 corner cases' into blead

There are around 20 different functions that take a UTF-8 sequence of bytes and try to find the ordinal code point it represents. It was becoming clear that the existing tests in our suite were inadequate, failing to find glaring bugs. And UTF-8 handling is important: failures in it have been exploited by hackers in various products over the years for nefarious purposes.

I set out to improve the tests, spending way too much time before realizing that adding band-aids to the current scheme was not going to work out. So I undertook rewriting the tests. This turned out to be far harder and more time-consuming than I expected, and the new test file still isn't ready to go into blead. But along the way, I discovered that it was finding corner-case bugs that I would never have anticipated. This series of commits fixes those, while simplifying the code and reducing redundancy.

The new test file needs clean-up, and probably ways to make it faster, but it is finally far enough along that I believe it has caught most of the bugs out there. So I'm submitting these fixes now to get them into v5.42. The deadline for the test file is later in the development process.
2026-01-26 08:38:23 +00:00 · 2025-03-17 08:42:50 -06:00 · 2025-03-17 08:42:50 -06:00 · a1805b9cc6
commit a1805b9cc6
parent 6a4f62c873 cab4c62820
4 changed files with 579 additions and 568 deletions
--- a/ext/XS-APItest/t/utf8_warn_base.pl
+++ b/ext/XS-APItest/t/utf8_warn_base.pl
@@ -2020,8 +2020,12 @@ foreach my $test (@tests) {
                        @warnings_gotten = @returned_warnings;
                    }

+                  SKIP: {
+                    skip "$0 doesn't handle _msgs functions AV returns", 1
+                                                    if $utf8_func =~ /_msgs/;
                    do_warnings_test(@expected_warnings)
                      or diag "Call was: " . utf8n_display_call($eval_text);
+                    }
                    undef @warnings_gotten;

                    # Check CHECK_ONLY results when the input is
--- a/inline.h
+++ b/inline.h
@@ -3244,7 +3244,6 @@ PERL_STATIC_INLINE UV
 Perl_utf8_to_uvchr_buf(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
 {
    PERL_ARGS_ASSERT_UTF8_TO_UVCHR_BUF;
-    assert(s < send);

    UV cp;

--- a/utf8.c
+++ b/utf8.c
--- a/utf8.h
+++ b/utf8.h
@@ -1206,8 +1206,9 @@ point's representation.
 * First one will convert the overlong to the REPLACEMENT CHARACTER; second
 * will return what the overlong evaluates to */
 #define UTF8_ALLOW_LONG                 0x2000
-#define UTF8_ALLOW_LONG_AND_ITS_VALUE   0x4000
 #define UTF8_GOT_LONG                   UTF8_ALLOW_LONG
+#define UTF8_ALLOW_LONG_AND_ITS_VALUE   0x4000
+#define UTF8_GOT_LONG_WITH_VALUE        UTF8_ALLOW_LONG_AND_ITS_VALUE

 /* For back compat, these old names are misleading for overlongs and
 * UTF_EBCDIC. */