81629 Commits

Author SHA1 Message Date
Leon Timmermans
ba04a9040a Stop calling Perl_sv_catpvf manually
Call sv_catpvf instead
2025-03-18 04:16:51 +01:00
Leon Timmermans
410115a66c Don't call Perl_warn manually in core
Just call warn instead, we've been able to do that for vararg functions
since d933027ef0a56c99aee8cc3c88ff4f9981ac9fc2
2025-03-18 04:16:51 +01:00
Leon Timmermans
dbd0f2f14f Avoid calling Perl_croak_nocontext from core
In core we almost always have a context, or we can easily get one.
2025-03-18 04:16:51 +01:00
Leon Timmermans
453b1c0d2b Use croak_no_modify directly
There never was any good reason to call it by its long name.
2025-03-18 04:16:51 +01:00
Leon Timmermans
06c3a62f12 Don't call Perl_croak manually in core
Just call croak instead, we've been able to do that for vararg functions
since d933027ef0a56c99aee8cc3c88ff4f9981ac9fc2
2025-03-18 04:16:51 +01:00
Karl Williamson
24ec8e7f78 pv_escape: Use utf8_to_uv, preferred to utf8_to_uvchr_buf 2025-03-17 19:22:10 -06:00
Karl Williamson
a307b4d27d isFOO_utf8_lc: Use utf8_to_uv_or_die not utf8_to_uvchr_buf 2025-03-17 19:21:07 -06:00
Karl Williamson
7097dec4d8 regmatch: Use utf8_to_uv_or_die not utf8_to_uvchr_buf 2025-03-17 19:21:07 -06:00
Karl Williamson
d22f3fb956 find_by_class: Use utf8_to_uv_or_die not utf8_to_uvchr_buf 2025-03-17 19:21:07 -06:00
Karl Williamson
4c15e2931d _generic_GET_BREAK_VAL_UTF8: Use utf8_to_uv_or_die not utf8_to_uvchr_buf 2025-03-17 19:21:07 -06:00
Karl Williamson
c469b4e35c Convert leading underscore to trailing in internal global macro
Leading underscores of global names are reserved for the C implementation
itself. We are gradually fixing ours to conform.
2025-03-17 19:20:35 -06:00
Karl Williamson
dd0471ca54 Turn is_utf8_common() into a macro
This function is now trivial; no need to have it a function
2025-03-17 19:20:04 -06:00
Karl Williamson
3af31f1410 check_utf8_print: Use utf8_to_uv_flags.
This replaces the old-style utf8n_to_uvchr()
2025-03-17 19:19:32 -06:00
Karl Williamson
7b3314d0a2 utf8.c: Fill in commit number in comment
These comments left the commit number vacant until we actually had one
2025-03-17 11:53:53 -06:00
Karl Williamson
a1805b9cc6 Merge branch 'Fix utf8 corner cases' into blead
There are around 20 different functions that take a UTF-8 sequence of
bytes and try to find the ordinal code point represented by them. It was
becoming clear that the existing tests in our suite were inadequate, not
finding glaring bugs. And UTF-8 handling is important, with failures in
it having been exploited by hackers in various products over the years
for various nefarious purposes.

I set out to improve the tests, spending way too much time before
realizing that adding band aids to the current scheme was not going to
work out. So I undertook rewriting the tests. This turned out to be way
harder and more time consuming than I expected. And it still isn't ready to
go into blead. But along the way, I discovered that it was finding
corner case bugs that I would never have anticipated. This series of
commits fixes those, while simplifying the code and reducing redundancy.

The new test file needs clean-up, and probably ways to make it faster,
but it is finally far enough along that I believe it has caught most of
the bugs out there. So I'm submitting these now to get into v5.42. The
deadline for the test file is later in the development process.
2025-03-17 08:42:50 -06:00
Karl Williamson
cab4c62820 utf8_to_uv_msgs: Assert against both returning and warning
This asserts against the flags to the call of this function being
contradictory, in that it is both
    1) to warn and/or die if anything goes wrong; and
    2) not to warn under any circumstances but instead to return to the
       caller objects describing what it would have otherwise warned.

In a non-DEBUGGING build, the warn/die flags are ignored
2025-03-17 08:40:53 -06:00
Karl Williamson
b0ac0a6283 utf8.c: White-space only
Outdent after removing enclosing braces
2025-03-17 08:40:53 -06:00
Karl Williamson
6b30aa3061 utf8_to_uv_msgs: Use already computed value
Instead of doing the subtraction again, use the variable that already
contains the desired value.
2025-03-17 08:40:53 -06:00
Karl Williamson
5104b9a382 utf8_to_uv_msgs: Add, clarify comments 2025-03-17 08:40:53 -06:00
Karl Williamson
e6951798a3 utf8_to_uv_msgs: Remove redundant conditionals
The comments added to the code in this commit explain that to get here,
something needs to be done; no need to test again.
2025-03-17 08:40:53 -06:00
Karl Williamson
061644d72c utf8.c: Remove no longer used #define 2025-03-17 08:40:53 -06:00
Karl Williamson
b48d541824 Reinstate utf8 translation testing
The previous commit fixed the remaining problems that this test finds,
and so it can be turned on again.
2025-03-17 08:40:53 -06:00
Karl Williamson
238a42b9ab utf8_to_uv_msgs: Revamp handling of above-Unicode code points
As stated in a recent commit message, this is complex and problematic.
This commit revamps it, simplifying it and fixing the known remaining
bugs.
2025-03-17 08:40:53 -06:00
Karl Williamson
22d8ec6da1 utf8_to_uv_msgs: Create another common macro
This new macro allows two more case statements in the switch to have
a common macro at their beginnings, instead of having to repeat code.
2025-03-17 08:40:53 -06:00
Karl Williamson
51fbc1cf7b utf8_to_uv_msgs: Convert switch case to use macro
By changing flags earlier in the function, we can convert this case in a
switch to use the macro introduced in the previous commit
2025-03-17 08:40:53 -06:00
Karl Williamson
2a00b11801 utf8_to_uv_msgs: Create a common macro
Previous commits have allowed the beginning of several of the case
statements in this switch() to have the same code.  This commit creates
a macro encapsulating that code and changes the cases to use it.

The macro continues the enclosing loop if no message needs to be
generated.  This allows the removal of various conditional blocks.  And
it means that these conditions don't break to the bottom of the switch()
if no message is needed.

Braces are needed in one case: so as to not run afoul of the C++ rule
against a jump crossing an initialization
2025-03-17 08:40:53 -06:00
Karl Williamson
3aca733b2a perlapi: DIE_IF_MALFORMED overrides CHECK_ONLY
This documents the change in the previous commit
2025-03-17 08:40:53 -06:00
Karl Williamson
9540191231 utf8_to_uv_msgs: Revise and rename macro
This macro is used to hide the details of determining if an abnormal
condition should raise a warning or not.  But I found it more convenient
to expand the macro to return the packed warnings category(ies) if a
warning should be raised or not.  That information is known inside the
macro and was being discarded, and then having to be recalculated.  The
new name reflects its expanded purpose, PACK_WARN.  0 is returned if no
warnings need be raised; and importantly fixing a bug in the old code,
it returns < 0 if no warning should be raised directly, but that an
entry needs to be added to the AV array returned by the function (if the
parameter requesting that has been passed in)

But Encode, for which this form of the translation function was created,
and may be the only user of it, depends on not getting a zero return.
So this has an override until Encode can be fixed.

I introduced the DIE_IF_MALFORMED flag in the previous development
release, making it subservient to the CHECK_ONLY flag.  I have since
realized that the precedence should be reversed.  If a developer
inadvertently passes both flags, it is better to honor the one saying
you need to quit, than the one saying ignore any problems.
2025-03-17 08:40:52 -06:00
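The precedence change described in 9540191231 can be pictured with a small sketch. The flag names, the enum, and the function below are hypothetical stand-ins, not the definitions in utf8.h, and the final "warn" branch is only a placeholder for whatever the default handling is.

```c
/* Hypothetical flag bits; the real ones live in Perl's headers. */
enum {
    SKETCH_CHECK_ONLY       = 1 << 0,
    SKETCH_DIE_IF_MALFORMED = 1 << 1
};

enum sketch_action { ACT_WARN, ACT_RETURN_QUIETLY, ACT_DIE };

/* Decide what to do on a malformation.  The "die" flag is tested first,
 * so it wins when a caller inadvertently passes both flags. */
enum sketch_action on_malformation(unsigned flags)
{
    if (flags & SKETCH_DIE_IF_MALFORMED) {
        return ACT_DIE;
    }
    if (flags & SKETCH_CHECK_ONLY) {
        return ACT_RETURN_QUIETLY;
    }
    return ACT_WARN;    /* placeholder default for this sketch */
}
```

Testing the flags in the opposite order would give the behavior of the previous development release, where CHECK_ONLY took precedence.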
Karl Williamson
78cced2399 Swap comment order
It is more easily understood reversed
2025-03-17 08:40:52 -06:00
Karl Williamson
d1fba02797 utf8_to_uv_msgs: De-duplicate some more code
This moves a conditional found in all cases in a switch() to just before
the switch, so the code is not duplicated.
2025-03-17 08:40:52 -06:00
Karl Williamson
b1a21fc853 utf8_to_uv_msgs: Fix handling of too-short malformations
At this point in the code we know that the input sequence is shorter
than a full character and that it is the legal beginning of a sequence that
could evaluate to a code point that is of interest to the caller of this
function.  It turns out that in some cases any filling out of the input
to a full character must lead to a code point that the caller is
interested in.  That interest has been signalled by flags passed to this
function.

In the past, we filled out the sequence with the minimum legal
continuation byte, but that is wrong for some cases.  This commit fixes
that.

Certain start bytes require the second byte to be higher than the
minimum, or else it is an overlong.  Prior to this commit, we could
generate overlongs.  This commit avoids that pitfall.
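
As a rough illustration of the pitfall on ASCII platforms (a sketch only, not the code in utf8.c; the function name is hypothetical), the smallest second byte that avoids an overlong depends on the start byte.  0xE0 is listed only to show the pattern; it is below the range this commit cares about.

```c
#include <stdint.h>

/* Hypothetical helper, ASCII platforms only: the smallest second byte
 * that keeps a sequence beginning with 'start' from being overlong.
 * For every other start byte the plain minimum continuation, 0x80,
 * is already safe. */
uint8_t min_second_byte(uint8_t start)
{
    switch (start) {
        case 0xE0: return 0xA0;   /* lower would fit in 2 bytes */
        case 0xF0: return 0x90;   /* lower would fit in 3 bytes */
        case 0xF8: return 0x88;   /* lower would fit in 4 bytes */
        case 0xFC: return 0x84;   /* lower would fit in 5 bytes */
        case 0xFE: return 0x82;   /* Perl-extended; lower fits in 6 */
        default:   return 0x80;
    }
}
```

Filling out "\xF0" with 0x80 bytes would give "\xF0\x80\x80\x80", an overlong for 0; filling with 0x90 as the second byte gives "\xF0\x90\x80\x80", which is U+10000.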

It also moves the complex analysis away from the comments in the code,
and to this commit message, adding even more analysis.

There are four classes of code points that the caller can have signalled
to this function that it is interested in.

The noncharacter code point class always needs a full sequence to
determine, and the conditionals prevent the code this analysis is about
from being executed.

Use of Perl extended-UTF-8 is determinable from the first byte in the
input sequence, and that has already been determined.

For the other two classes, a sequence doesn't have to be fully filled
out in order to determine if a partial sequence would lead to them or not.

Consider first, the sequences that evaluate to an above-Unicode code
point, charmingly named "supers" by Perl's poetic coders.
                ASCII platforms          EBCDIC I8
     U+10FFFF: \xF4\x8F\xBF\xBF    \xF9\xA1\xBF\xBF\xBF
     0x110000: \xF4\x90\x80\x80    \xF9\xA2\xA0\xA0\xA0
                 *
(Continuation byte range):
                \x80 to \xbf           \xa0 to \xbf

On ASCII platforms, any start byte \xf3 and below can't be for a super,
and any non-overlong sequence \xf5 and above has to be for a super.  If
the start byte is \xf4, we need a second byte to resolve the ambiguity.
But it takes just the one, or possibly two bytes to make the
determination.  It's similar on EBCDIC, but with different values.
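
The ASCII-platform rule above can be sketched as follows.  This is an illustration under the stated assumptions (the input starts with a start byte and is not overlong); the names are hypothetical and this is not the code in utf8.c.

```c
#include <stddef.h>
#include <stdint.h>

enum super_status { NOT_SUPER, IS_SUPER, NEED_MORE };

/* Hypothetical helper, ASCII platforms, non-overlong input only:
 * decide from at most two bytes whether the sequence is for an
 * above-Unicode code point. */
enum super_status classify_super(const uint8_t *s, size_t len)
{
    if (len == 0) {
        return NEED_MORE;
    }
    if (s[0] <= 0xF3) {
        return NOT_SUPER;       /* at most U+FFFFF */
    }
    if (s[0] >= 0xF5) {
        return IS_SUPER;        /* at least 0x140000 */
    }
    /* s[0] == 0xF4: the second byte resolves the ambiguity */
    if (len < 2) {
        return NEED_MORE;
    }
    return s[1] >= 0x90 ? IS_SUPER : NOT_SUPER;
}
```

With "\xF4\x8F" the result is NOT_SUPER (the prefix of U+10FFFF), and with "\xF4\x90" it is IS_SUPER (the prefix of 0x110000), matching the table above.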

And a similar situation exists for the surrogates.  The range of
non-overlong surrogates is:
     ASCII platforms                  EBCDIC I8
     "\xed\xa0\x80"               "\xf1\xb6\xa0\xa0"
to   "\xed\xbf\xbf".              "\xf1\xb7\xbf\xbf"

In both platforms, if we have the first two bytes, we can tell if it is
a surrogate or not, as all legal continuations in the rest of the byte
positions are for surrogates.  If we have only one byte, we can't tell,
so we have to assume it isn't a surrogate.
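
For the ASCII non-overlong range, the two-byte test can be sketched like this (hypothetical helper name; an illustration, not the code in utf8.c):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, ASCII platforms, non-overlong form only.
 * Returns 1 if every legal completion of the partial sequence is a
 * surrogate, else 0. */
int partial_is_surrogate(const uint8_t *s, size_t len)
{
    if (len < 2) {
        return 0;   /* one byte can't tell us; assume not a surrogate */
    }
    return s[0] == 0xED && s[1] >= 0xA0 && s[1] <= 0xBF;
}
```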

Overlongs don't meaningfully change things.  The shortest ASCII overlong
for the first surrogate is          "\xf0\x8d\xa0\x80"
and for the highest surrogate it is "\xf0\x8d\xbf\xbf".

Note that only the first byte has been changed, into two bytes.  All but
the first byte is the same for any overlong of any code point in either
ASCII or EBCDIC.
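
This can be checked with a small encoder that deliberately ignores the shortest-form rule.  It is a sketch for ASCII platforms with a hypothetical name, handling only 2- to 4-byte forms, and the caller is assumed to pass a code point that fits in the requested length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp in exactly len bytes (2..4), even if a
 * shorter encoding exists, so that overlongs can be produced. */
void encode_fixed_len(uint32_t cp, size_t len, uint8_t *out)
{
    static const uint8_t lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };

    /* Fill continuation bytes from the end, 6 payload bits each */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (uint8_t)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Whatever bits remain go into the start byte */
    out[0] = (uint8_t)(lead[len] | cp);
}
```

Encoding 0xD800 with len 3 gives ED A0 80, and with len 4 gives F0 8D A0 80: the trailing A0 80 is unchanged, and only the first byte has become two.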

This means the algorithm for filling things out works for these two
classes in all cases.  Note also that the upper end of the range
conveniently works out without any extra effort needed.  The highest
surrogate corresponds to the highest continuation bytes.  And the
highest super that fits in the platform will also use the highest
continuation bytes.

The start bytes that need to have the fix in this commit are the ones
that could be the start of overlongs, minus the lower ones which can
represent only code points smaller than any of the ones the caller can
flag as being "interesting" (U+D800 is that value), and minus 0xFF.
Hence 0xE0 can have overlongs, but it and its overlongs can only
represent code points lower than 0xD800.  So we don't have to worry
about it or any smaller start byte.

But the reason 0xFF doesn't have to be considered is more complex.
It isn't the second byte in a sequence beginning with FF that needs to
be higher than the minimum continuation, but one further in.  This
would make things harder except that any sequence beginning with 0xFF is
Perl-extended UTF-8, and has already been considered earlier in this
function.  This code is only executed when 'must_be_super' is false.
'must_be_super' is set true if the sequence overflows or there is no
detectable overlong.  By DeMorgan's laws, this means to get here, it
doesn't overflow, and must be overlong.  To know that it is overlong, we
must have seen enough bytes to get past the point where we need a higher
continuation byte to legally fill it out.  So we can just fill the rest
with the minimum continuation.

(Note that the same reasoning would apply to 0xFE on ASCII platforms.
That is also used only by Perl-extended UTF-8, so would have been
considered earlier, and to get here we know it has to be overlong, and
so we've already seen enough bytes to not need to handle it specially.
But it fits into the same paradigm as the lower start bytes with just
the second byte needing to be higher, and there is no extra code
required to handle it besides including a case: for it in the switch().
This works in both ASCII and EBCDIC.)
2025-03-17 08:40:52 -06:00
Karl Williamson
6507d4a56a utf8_to_uv_msgs: Reverse order of finding overflow/extended UTF-8
This begins the process of fixing the current problematic behavior of
handling UTF-8 that is for code points above the Unicode maximum.

The lowest of these are considered SUPERs, but if you go high enough, it
takes Perl's extended UTF-8 to represent them.  Higher still, and the
extended UTF-8 can represent code points that don't fit in the current
platform's word size.

A complication is overlongs, where the representation for a seemingly
large code point can reduce down to something much smaller; even 0.
Such sequences are considered invalid by fiat from Unicode due to
successful hacker attacks using them.  But Perl has traditionally
allowed XS code to allow them, with flags passed to the translation
functions.  So it is important to get it right.
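
To make the overlong problem concrete, here is a sketch of a lax decoder for ASCII platforms that does not reject overlongs, plus a shortest-length check.  Both function names are hypothetical, only 2- to 4-byte sequences are handled, and this is not how utf8.c is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a 2- to 4-byte sequence without checking
 * for overlongs or validating the continuation bytes. */
uint32_t decode_lax(const uint8_t *s, size_t len)
{
    /* Payload mask of the start byte: 0x1F, 0x0F, 0x07 for len 2, 3, 4 */
    uint32_t cp = s[0] & (0xFF >> (len + 1));

    for (size_t i = 1; i < len; i++) {
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}

/* Number of bytes the shortest (legal) encoding of cp needs */
size_t shortest_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

A sequence is overlong when its length exceeds shortest_len() of the decoded value.  For example "\xF0\x80\x80\x80" looks like a large code point from its start byte but decodes to 0, which needs only one byte.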

A sequence that overflows by necessity is using Perl's extended UTF-8,
as that kicks in below a 32 bit word.  This commit reverses the prior
order of testing for overflow and extended UTF-8.  Steps can be saved
because we now test for Perl-extended first, which is a lot more likely
to happen than overflow.
2025-03-17 08:40:52 -06:00
Karl Williamson
3786151d3e Skip testing utf8 translating for the next few commits
The next few commits will fail these tests.  I could squash them all
together, but that would hide the step by step change progress.

This should allow future bisecting to not fail in this commit window.
2025-03-17 08:40:52 -06:00
Karl Williamson
e0627d5bc8 utf8_to_uv_msgs: De-duplicate common code
This removes the duplicate code from many of the case statements in a
switch to be common before the switch, with a single conditional
controlling them
2025-03-17 08:40:52 -06:00
Karl Williamson
c4df0807ee utf8_to_uv_msgs: Move conditional to earlier to avoid work
By checking before we go to the trouble to do something, rather than in
the middle of it, we can save some work.

The new test looks at the source UTF-8; the previous one looked at the
code point calculated from it
2025-03-17 08:40:52 -06:00
Karl Williamson
71c5788cff utf8_to_uv_msgs: Swap order of switch() cases
The overlong cases more logically belong with the other conditions that
are rejected by default.

Future commits will simplify this to look much more like those other
conditions.
2025-03-17 08:40:52 -06:00
Karl Williamson
7c94d73940 utf8_to_uv_msgs: Revise assert
More extensive testing revealed that more conditions than this assert
previously contained are legitimate.  This required defining the name
for a flag
2025-03-17 08:40:52 -06:00
Karl Williamson
8d31475943 utf8_to_uv_msgs: Simplify checking for overlong
Prior to this commit, there were two different methods for doing this
check; one if no malformations have been found so far, and the other if
some had been found.  The latter method is valid in both cases, and is
just as fast or faster than the first method.  So change to always use
it
2025-03-17 08:40:52 -06:00
Karl Williamson
fa3575aa7c utf8_to_uv_msgs: Add safety assignment
This sets the accumulated code point to UV_MAX when overflow is
detected.  Much further below the REPLACEMENT CHARACTER is returned
instead; but this makes sure that code in between doesn't get confused
by an intermediate value
2025-03-17 08:40:52 -06:00
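The idea behind the safety assignment in fa3575aa7c can be sketched as below.  This is an illustration only: uint64_t and UINT64_MAX stand in for Perl's UV and UV_MAX, and the function name and signature are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold the 6 payload bits of each continuation
 * byte into uv.  On overflow, pin the result to the maximum value so
 * that later code never sees a wrapped-around intermediate value. */
uint64_t accumulate(uint64_t uv, const uint8_t *cont, size_t len,
                    int *overflowed)
{
    *overflowed = 0;
    for (size_t i = 0; i < len; i++) {
        if (uv > (UINT64_MAX >> 6)) {   /* shifting would lose bits */
            *overflowed = 1;
            return UINT64_MAX;
        }
        uv = (uv << 6) | (cont[i] & 0x3F);
    }
    return uv;
}
```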
Karl Williamson
ff7238915c utf8_to_uv_msgs: Make code less brittle
Processing the overlong malformation needed to be last because it likely
would overwrite the calculated UV.  Other cases also overwrote that.
This is unnecessarily brittle, as we can simply store the UV before
processing any cases, and then refer to that copy.
2025-03-17 08:40:52 -06:00
Karl Williamson
4c68b37735 utf8_to_uv_msgs: Extract redundant code to common
This case: has two occurrences of the same statement, within two
different conditionals.  But the case: doesn't get executed unless at
least one of those conditionals is known to be true.  Therefore the
statement is guaranteed to be executed at least once; no need to have
two copies.
2025-03-17 08:40:52 -06:00
Karl Williamson
c2d3dc14ba utf8_to_uvchr_buf: Remove assertion
This variable is no longer accessed directly by this function.  Any
assertion about it should come from the function this passes it to.
2025-03-17 08:40:52 -06:00
Karl Williamson
fda991d326 APItest: Skip some utf8n_to_utf8_msgs tests
This function returns values in an AV instead of raising warnings.  It
turns out that this test file gets some of it wrong.  And this test file
turns out to be inadequate in other ways.  I have rewritten the test
file, but there isn't time to get it in before the code-complete
deadline.  Fixes here will end up being discarded.

In order to get the code that is actually part of the perl interpreter
into this release, I've skipped the test that would fail here, and made
sure it all passes the rewritten test.
2025-03-17 08:40:52 -06:00
Lukas Mai
6a4f62c873 t/porting/diag.t: fix oversights in message extraction regex
- recognize the short form() as well as Perl_form()
- accept/ignore spaces between `Perl_croak(` and `aTHX_`

With this change, diag.t now recognizes several diagnostic messages that
went undetected previously (note the space before `aTHX_`):

- perlio.c

            Perl_croak( aTHX_
                "%s (%" UVuf ") does not match %s (%" UVuf ")",

                Perl_croak( aTHX_
                    "%s (%" UVuf ") smaller than %s (%" UVuf ")",

- regcomp_trie.c

                        Perl_croak( aTHX_ "error creating/fetching widecharmap entry for 0x%" UVXf, uvc );

            default: Perl_croak( aTHX_ "panic! In trie construction, unknown node type %u %s", (unsigned) flags, REGNODE_NAME(flags) );

                            Perl_croak( aTHX_ "panic! In trie construction, no char mapping for %" IVdf, uvc );

This PR partially overlaps with #23017. Merging either will cause
conflicts in the other that will have to be resolved manually.

(In particular, if this PR is merged first, the diag.t changes from
#23017 can be dropped, as can some of the perldiag.pod additions. But
that PR also modifies the perlio.c messages, so their old forms added
here ("%s (%d) does not match %s (%d)", "%s (%d) smaller than %s (%d)")
will have to be deleted.)
2025-03-17 08:14:12 +01:00
Lukas Mai
71d1d453e7 turn croak("%s", "foo") into croak("foo")
There is no point in using a separate format string if the whole error
message is written right next to it. Not only does this change lead to
simpler code (passing one argument instead of two), it also exposes more
error messages to t/porting/diag.t, which relies on croak's first
argument to supply the message template.
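
A minimal sketch of the equivalence, using a hypothetical printf-style stand-in that formats into a buffer instead of dying (it is not croak() itself):

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for a croak-like vararg function */
void format_msg(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    vsnprintf(buf, size, fmt, args);
    va_end(args);
}

/* Both calls below leave the same text in the buffer:
 *
 *     format_msg(buf, sizeof buf, "%s", "Out of memory");
 *     format_msg(buf, sizeof buf, "Out of memory");
 */
```

The rewrite is only safe when the literal contains no '%' characters; a message containing one would need it doubled or would have to stay in the "%s" form.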

Also make an equivalent change to S_open_script, which passes a constant
string in the form of an err variable that is not used anywhere else.
(It used to be, but that code was deleted in commit 5bc7d00e3e.)
2025-03-17 08:13:51 +01:00
Lukas Mai
a453807506 fields: use block eval instead of string eval
... and delete pointless 'require 5.005' as we already have 'use 5.008'
three lines up.
2025-03-17 08:12:56 +01:00
Lukas Mai
8c06163e19 Thread::Semaphore: clean up some tests
Some of the tests (02_errs.t, 03_nothreads.t) already use Test::More
unconditionally, so remove the conditional loading code from the other
tests.
2025-03-16 16:58:34 +01:00
Lukas Mai
6d84e5da23 perldelta for visible Search::Dict changes 2025-03-16 16:57:36 +01:00
Lukas Mai
c17d196c55 Search::Dict: clean up code
- Remove 'require 5.000'. In theory, this would give a nice runtime
  error message when run under perl4; in practice, this file doesn't
  even parse as perl4 due to 'use strict', 'our', and '->' method calls.
- Use numeric comparison with $], not string comparison. (In practice,
  this would probably only start failing once we reach perl 10, but
  still.)
- Don't repeatedly check $fc_available at runtime. Just define a
  fallback fc() in terms of lc() if CORE::fc is not available.
- Add missing $key argument to sample code in SYNOPSIS. This fixes
  <https://rt.cpan.org/Ticket/Display.html?id=97189>.
2025-03-16 16:57:36 +01:00
Lukas Mai
68944d4edf Benchmark: don't import Time::HiRes::time; we don't use it
Also:

- use block eval, not string eval
- BEGIN, not sub BEGIN
2025-03-16 16:55:10 +01:00