2253 Commits

Author SHA1 Message Date
Karl Williamson
24c7fb4c21 Convert Perl utf16 to utf8 functions to macros
These functions are hereby removed in favor of calling the plain macros
that already exist.
2025-12-27 21:24:47 -07:00
Karl Williamson
7e1ae0c850 Remove SBOX case statements from external visibility
I'm pretty sure there is no use case for these, and they are very
unlikely to have any actual uses.
2025-12-10 08:50:19 -07:00
Karl Williamson
ebbe6ac0f7 Remove a few more macros from being visible to XS code
These are a few macros dealing with inversion lists that were never
intended to be visible to general XS code, and they actually can't be in
use in cpan because the mechanisms to create inversion lists are private
to perl.
2025-12-10 08:50:19 -07:00
Karl Williamson
92dcf59a90 Gain control of macro namespace visibility
This commit adds the capability to undefine macros that are visible to
XS code but shouldn't be.  This can be used to stop macro namespace
pollution by perl.

It works by changing embed.h to have two modes, controlled by a #ifdef
that is set by perl.h.  perl.h now #includes embed.h twice.  The first
time works as it always has.  The second sets the #ifdef, and causes
embed.h to #undef the macros that shouldn't be visible.  This call is
just before perl.h returns to its includer, so that these macros have
come and gone before the file that #included perl.h is affected by them.
It comes after the inline headers get included, so they have access to
all the symbols that are defined.
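
The two-pass idea can be sketched in a single C file.  This is only an
illustration, not the generated embed.h: all DEMO_ and demo_ names are
hypothetical, and the "second pass" is reduced to a single #undef.

```c
/* Pass 1: names defined while the headers are being processed. */
#define DEMO_INTERNAL_MAX(a, b) ((a) > (b) ? (a) : (b))  /* hypothetical private macro */
#define DEMO_PUBLIC_ABS(x)      ((x) < 0 ? -(x) : (x))   /* hypothetical documented macro */

/* An "inline header" function.  It is compiled while the private macro
 * still exists, so the macro is already expanded in its body. */
static inline int demo_clamp_low(int value, int floor)
{
    return DEMO_INTERNAL_MAX(value, floor);
}

/* Pass 2: just before control returns to the includer, the private
 * name is removed.  The documented macro is left alone. */
#undef DEMO_INTERNAL_MAX

/* The includer's own code: the name is now free for any other use,
 * here a three-argument function with the same spelling. */
static int DEMO_INTERNAL_MAX(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}
```

The inline function keeps working after the #undef because macro
expansion happened when its body was read, which is why the second pass
has to come after the inline headers.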

The list of macros is determined by the visibility given by the apidoc
lines documenting them, plus several exception lists that allow a symbol
to be visible even though it is not documented as such.

In this commit, the main exception list contains everything that is
currently visible outside the Perl core, so this should not break any
code.  But it means that the visibility control is established for
future changes to our code base.  New macros will not be visible except
when documented as needing to be such.  We can no longer inadvertently
add new names to pollute the user's namespace.

I expect that over time, the exception list will become smaller, as we
go through it and remove the items that really shouldn't be visible.  We
can then see via smoking if someone is actually using them, and either
decide that these should be visible, or work with the module author for
another way to accomplish their needs.  (I would hope this would lead to
proper documentation of the ones that need to be visible.)

There are currently four lists of symbols.

One list is for symbols that are used by libc functions, and that Perl
may redefine (usually so that code doesn't have to know if it is running
on a platform that is lacking the given feature.)  The algorithm added
here catches most of these and keeps them visible, but there are a few
items that currently must be manually listed.

A second list is of symbols that the re extension to Perl requires, but
no one else needs to.  This list is currently empty, as everything
initially is in the main exception list.

A third list is for items that other Perl extensions require, but no one
else needs to.  This list is currently empty, as everything initially is
in the main exception list.

The final list is for items that currently are visible to the whole
world.  It contains thousands of items.  This list should be examined
for:

    1) Names that shouldn't be so visible; and
    2) Names that need to remain visible but should be changed so they
       are less likely to clash with anything the user might come up
       with.

I have wanted this ability to happen for a long time; and now things
have come together to enable it.

This allows us to have a clear-cut boundary with CPAN.

It means you can add macros that have internal-only use without having
to worry about making them likely not to clash with user names.

It shows precisely in one place what our names are that are visible to
CPAN.
2025-12-10 08:50:19 -07:00
Karl Williamson
838b774823 Move hv_stores() declaration from embed.fnc to hv.h
This is required for the next few commits that start automatically
creating long Perl_name functions for the elements in embed.fnc that are
macros and don't already have them in the source.

Only macros can take a parameter that has to be a literal string, so
hv_stores() doesn't fit with the next few commits.  This is the only
case in embed.fnc like that, so I'm deferring dealing with it for now.
2025-12-10 08:50:19 -07:00
Karl Williamson
32aaa22eec embed.fnc: Drop Perl_ on do_aexec my_stat my_lstat
These macros are not for external use, so don't need a Perl_ prefix.
2025-12-10 08:50:19 -07:00
Karl Williamson
4092daf53e Remove some special EBCDIC code
The 'variant_byte_number' function was written to find the byte number
in a word of the first byte whose meaning varies depending on if the
string it is part of is encoded in UTF-8 or not.  On ASCII machines,
that is simply when the upper bit is set.  On EBCDIC machines, there is
no similar pattern, so this function hasn't been compiled on those.
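
For the ASCII case described above, the search can be sketched as
follows.  This is a simplified illustration and not the code of
'variant_byte_number': the name demo_first_high_bit_byte is
hypothetical, and the exact byte is located with a plain loop rather
than with bit tricks on the word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the index of the first byte in s[0..len) whose upper bit is
 * set, or -1 if there is none. */
static ptrdiff_t demo_first_high_bit_byte(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Word at a time: skip 8 bytes whenever none has its upper bit set. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);  /* avoids alignment problems */
        if (word & high_bits)
            break;                          /* the hit is within these 8 bytes */
        i += sizeof word;
    }

    /* Pin down the exact byte; this also handles the tail. */
    for (; i < len; i++) {
        if (s[i] & 0x80)
            return (ptrdiff_t) i;
    }
    return -1;
}
```

Because the final position is found byte by byte, the sketch gives the
same answer on big- and little-endian machines.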

A long time ago, I realized that this function could also handle binary
data by coercing that data so that the bit is set or clear depending on
the pattern being looked for, and then calling that function.

But I actually hadn't realized until now that it was binary data not
tied to a character set that was being worked on.  This commit rectifies
that.  A new alias is added for that function that emphasizes that it
works on binary data, the function is now compiled for EBCDIC, and the
EBCDIC-only code that avoided using it is now removed.
2025-11-01 21:02:37 -06:00
Paul "LeoNerd" Evans
f1a8d7d883 Implement named parameters in signatures (PPC0024)
This adds a major new ability to subroutine signatures, allowing callers
to pass parameters by name/value pairs rather than by position.

  sub f ($x, $y, :$alpha, :$beta = undef) { ... }

  f( 123, 456, alpha => 789 );

Originally specified in

  https://github.com/Perl/PPCs/blob/main/ppcs/ppc0024-signature-named-parameters.md

This feature is currently considered experimental.
2025-10-31 11:31:29 +00:00
Branislav Zahradník
147d5f1b9e [parser] new_block_statement - deduplicate "a block is a loop that happens once" 2025-10-22 17:23:56 +01:00
Branislav Zahradník
e6a443b294 [parser] package - deduplicate coupled call sequence
This function combines the calls to the original `package` and
`package_version` when a new namespace statement is detected.

Instead of the three statements previously required, usage now consists of a single function call.
2025-10-22 17:23:56 +01:00
Karl Williamson
935cdb76e8 embed.fnc: mv definition of more_sv
This was in a #ifdef of being in sv.c, which it is, but since it is
public, it needs to be moved out of this.  This removes the need for a
copy of its prototype to be in sv_inline.h.
2025-10-21 18:58:48 -06:00
Karl Williamson
2e142e0d27 regen/embed.pl: Avoid use of hard-coded list
The list consists of exactly the functions that have the O flag set in
embed.fnc.  No need to keep this data twice.  The entries can be
trivially generated from existing entries as we go along.

And those generated entries have the added advantage of not using the
short name, so potentially less name space pollution.
2025-10-21 18:58:48 -06:00
Karl Williamson
bd4c0d1fc2 Add S_parse_ident_no_copy()
This new function is for callers that are merely checking if the string
being parsed is a legal identifier or not, and aren't interested in the
normalized version of the identifier that parse_ident() generates.

This new function allows callers to not have to think about this buffer;
it just wraps plain parse_ident() using a throw-away buffer to hold the
returned normalized text.  This avoids introducing a bunch of
conditionals inside parse_ident.
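
The wrapper pattern looks roughly like this.  It is a sketch under
assumptions: demo_parse_ident() is a hypothetical stand-in that accepts
only ASCII identifiers, and the signatures are invented for the example
rather than taken from toke.c.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real parser: copies the identifier that
 * starts at s into dest (NUL-terminated) and returns the number of
 * bytes consumed, or 0 if there is no identifier or it does not fit. */
static size_t demo_parse_ident(const char *s, const char *end,
                               char *dest, size_t dest_size)
{
    const char *p = s;

    if (p >= end || !(isalpha((unsigned char) *p) || *p == '_'))
        return 0;
    while (p < end && (isalnum((unsigned char) *p) || *p == '_'))
        p++;
    if ((size_t) (p - s) >= dest_size)
        return 0;
    for (size_t i = 0; i < (size_t) (p - s); i++)
        dest[i] = s[i];
    dest[p - s] = '\0';
    return (size_t) (p - s);
}

/* The wrapper: callers that only want a yes/no answer never see the
 * buffer.  It is created here and thrown away on return. */
static bool demo_is_ident_no_copy(const char *s, const char *end)
{
    char scratch[256];
    size_t used = demo_parse_ident(s, end, scratch, sizeof scratch);

    return used > 0 && s + used == end;
}
```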
2025-10-17 12:26:04 -06:00
Karl Williamson
735e7cc211 toke.c: Change parse_ident to take any string
Prior to this commit, the string passed to this function had to be
pointing to somewhere in PL_bufptr.  But this is only because it assumed
that the initial position is less than PL_bufend.  By passing the upper
bound in, that assumption is automatically removed.
2025-10-17 12:26:00 -06:00
Karl Williamson
e4be402477 toke.c: Use flags parameter to S_parse_ident
This makes it clearer at each call point what is happening, and prepares
for future commits where more flags will be passed to this function.
2025-10-17 12:26:00 -06:00
Karl Williamson
bfbd5f7e35 toke.c: Use flags parameter for S_force_word
This makes it clear at each call point what is happening, instead of
having to jump to the S_force_word definition to know what 'false, true'
vs 'true, false' actually means.
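
As a small illustration of the readability gain, with hypothetical flag
names (the real ones are not listed in this message):

```c
#include <stdbool.h>

/* Hypothetical flag bits replacing two positional bool parameters. */
enum {
    DEMO_CHECK_KEYWORD = 1 << 0,
    DEMO_ALLOW_PACKAGE = 1 << 1
};

struct demo_word_opts {
    bool check_keyword;
    bool allow_package;
};

/* Decodes the flags word; a real function would act on the options. */
static struct demo_word_opts demo_force_word(unsigned flags)
{
    struct demo_word_opts opts;

    opts.check_keyword = (flags & DEMO_CHECK_KEYWORD) != 0;
    opts.allow_package = (flags & DEMO_ALLOW_PACKAGE) != 0;
    return opts;
}
```

A call such as demo_force_word(DEMO_ALLOW_PACKAGE) documents itself,
and a further flag can be added later without touching existing calls.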

And this prepares for future commits.
2025-10-17 12:25:59 -06:00
Karl Williamson
3450d19250 intuit_more: 'use strict' allows much better handling
Most code these days runs under 'use strict'.  That allows us to resolve
ambiguity without resorting to heuristics in far more cases than before.

This commit adds a parameter to intuit_more() that gives the context it
is being called from.  And when that call is to resolve what $foo[...]
is supposed to mean, we can look up foo to see if it is an array or a
scalar.  If the former, the "..." must be a subscript; if a scalar, it
must be a charclass.

Only if there is both a $foo and an @foo is there ambiguity.  If so, we
drop down to using the heuristics.
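
The decision described above amounts to the following.  This is a
sketch only: the names are hypothetical, and treating "neither variable
is known" as a case for the heuristics is an assumption of the example,
not something stated in this message.

```c
#include <stdbool.h>

typedef enum {
    DEMO_SUBSCRIPT,       /* $foo[...] indexes @foo */
    DEMO_CHARCLASS,       /* $foo followed by a [...] character class */
    DEMO_USE_HEURISTICS   /* declarations alone cannot decide */
} demo_bracket_meaning;

/* array_known:  an @foo is visible at this point
 * scalar_known: a $foo is visible at this point */
static demo_bracket_meaning
demo_resolve_bracket(bool array_known, bool scalar_known)
{
    if (array_known && !scalar_known)
        return DEMO_SUBSCRIPT;
    if (scalar_known && !array_known)
        return DEMO_CHARCLASS;
    return DEMO_USE_HEURISTICS;   /* both known (or, by assumption, neither) */
}
```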
2025-10-17 12:09:03 -06:00
Karl Williamson
aa93969e9c toke.c: Create function to see if an identifier is known
This checks first if there is a lexical variable in scope with the given
name, and if not, if there is a global.
Karl Williamson
9fc9ec2818 Change invlist function names to be legal
This continues the process started in #23592 to change names with
leading underscores to be legal C.  See that p.r. or
4bb3572f7a1c1f3944b7f58b22b6e7a9ef5faba6 for extensive discussion.

This commit simply moves the leading underscore to be trailing.
2025-10-12 16:56:21 -06:00
Karl Williamson
59bca40fd0 S_scan_ident: Convert parameter to bool
All calls to it set it to TRUE or FALSE.
2025-10-07 11:48:47 -06:00
Paul "LeoNerd" Evans
215e36f380 Add cop_*_warning() API
This adds three new API functions: a pair to modify a COP by enabling or
disabling a single warning bit within it, and a query function to ask if
a given warning is already enabled.
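
The shape of such an enable/disable/query trio can be sketched on a
plain bitfield.  Everything here is hypothetical: the demo_ names, the
struct, and the one-bit-per-warning layout are simplifications for
illustration and do not describe the real COP or the real functions.

```c
#include <stdbool.h>

#define DEMO_WARN_BYTES 16   /* hypothetical bitfield size */

struct demo_cop {
    unsigned char warnings[DEMO_WARN_BYTES];
};

/* In all three functions the caller must pass bit < DEMO_WARN_BYTES * 8. */

static void demo_cop_enable_warning(struct demo_cop *cop, unsigned bit)
{
    cop->warnings[bit / 8] |= (unsigned char) (1u << (bit % 8));
}

static void demo_cop_disable_warning(struct demo_cop *cop, unsigned bit)
{
    cop->warnings[bit / 8] &= (unsigned char) ~(1u << (bit % 8));
}

static bool demo_cop_has_warning(const struct demo_cop *cop, unsigned bit)
{
    return ((cop->warnings[bit / 8] >> (bit % 8)) & 1u) != 0;
}
```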

This API is provided for CPAN modules to use to modify the set of
warnings present in a COP during compile-time. Currently modules need to
use the `new_warnings_bitfield()` function, which was recently hidden by
09a0707. That change broke the `Syntax::Keyword::Try` module, as
reported in https://github.com/Perl/perl5/issues/23609.
2025-09-23 13:43:47 +01:00
Karl Williamson
c14d142701 Make die() always expand to Perl_die_nocontext()
See 03f24b8a082948e5b437394fa33d0af08d7b80b6 for the motivation.

This commit changes plain die() to not use a thread context parameter.
It and die_nocontext() now behave identically.
2025-09-21 06:55:45 -06:00
Karl Williamson
2cb0034ef5 Unroll valid_utf8_to_uv loop
This gives a bit of a performance boost in this function that can be
called during pattern matching.
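
For readers unfamiliar with the technique, an unrolled decoder has
roughly this shape.  It is a sketch, not the code in this commit: the
name demo_valid_utf8_to_cp is hypothetical, it covers only standard
UTF-8 of up to 4 bytes on an ASCII platform, and like the real function
it assumes the input is already known to be well formed.

```c
#include <stdint.h>

/* Decode one code point from well-formed UTF-8.  One straight-line
 * expression per length, chosen from the first byte, instead of a loop
 * over the continuation bytes. */
static uint32_t demo_valid_utf8_to_cp(const unsigned char *s)
{
    if (s[0] < 0x80)                    /* 1 byte:  0xxxxxxx */
        return s[0];

    if (s[0] < 0xE0)                    /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((uint32_t) (s[0] & 0x1F) << 6)
             |  (uint32_t) (s[1] & 0x3F);

    if (s[0] < 0xF0)                    /* 3 bytes: 1110xxxx + 2 continuations */
        return ((uint32_t) (s[0] & 0x0F) << 12)
             | ((uint32_t) (s[1] & 0x3F) << 6)
             |  (uint32_t) (s[2] & 0x3F);

    /* 4 bytes: 11110xxx + 3 continuations */
    return ((uint32_t) (s[0] & 0x07) << 18)
         | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6)
         |  (uint32_t) (s[3] & 0x3F);
}
```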

Here are some cachegrind comparisons with blead:

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.

valid_utf8_to_uv(0x007f), length is 1

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00    100.69     100.00     99.11
        Dr  100.00    101.47     100.00     99.74
        Dw  100.00    100.00     100.00     99.57
      COND  100.00    101.20     100.00    100.00
       IND  100.00    100.00     100.00     94.12

valid_utf8_to_uv(0x07ff), length is 2

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00    100.68     100.00     99.04
        Dr  100.00    101.47     100.00     99.74
        Dw  100.00    100.00     100.00     99.57
      COND  100.00    102.40     100.00    101.23
       IND  100.00    100.00     100.00     94.12

valid_utf8_to_uv(0xfffd), length is 3

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00    100.83     100.00     99.04
        Dr  100.00    101.47     100.00     99.75
        Dw  100.00    100.00     100.00     99.57
      COND  100.00    102.99     100.00    101.84
       IND  100.00    100.00     100.00     94.12

valid_utf8_to_uv(0xffffd), length is 4

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00    100.91     100.00     99.13
        Dr  100.00    101.46     100.00     99.75
        Dw  100.00    100.00     100.00     99.57
      COND  100.00    103.59     100.00    102.45
       IND  100.00    100.00     100.00     94.12

valid_utf8_to_uv(0x3ffffff), length is 5

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00    101.28     100.00     99.29
        Dr  100.00    101.46     100.00     99.75
        Dw  100.00    100.00     100.00     99.57
      COND  100.00    104.19     100.00    103.07
       IND  100.00    100.00     100.00     94.12

valid_utf8_to_uv(0x7fffffff), length is 6

                   GCC                  CLANG
             blead    hacked      blead    hacked
            ------    ------     ------    ------
        Ir  100.00     89.83     100.00     88.83
        Dr  100.00     95.22     100.00     92.94
        Dw  100.00     92.44     100.00     91.63
      COND  100.00     86.21     100.00     87.11
       IND  100.00    100.00     100.00     88.89

Clang gives slightly worse results than gcc.  But there is an
improvement in both cases for conditionals for two-byte and longer
characters.

This shows that the performance is significantly worse for code points
that take 6 bytes (or more, which I didn't include) to represent.  These
are all well outside the Unicode range; hence are very rarely
encountered.  Performance is improved a bit for the typical cases.

The algorithm used could handle 6 and 7 byte characters, but that
increases memory usage, and can lead to the compiler choosing to not
inline this function.  In blead, experiments with clang gave these
results
    Max bytes inlined   Instances in the code where not inlined
        3                 14
        4                 19
        5                 19
        6                 19
        7                 57

We really need to accommodate any Unicode code point, which is 4 bytes (5
on EBCDIC).  But the others we don't care about.  Even though 6 bytes
doesn't show as being worse than 4, I chose to not include it, because
we don't care about performance for these rare non-Unicode code points,
and it just might cause non-inlining for different compilers or clang
versions.
2025-09-20 10:21:33 -06:00
Karl Williamson
03f24b8a08 Make croak() always expand to Perl_croak_nocontext()
Perl almost always opts for saving time over saving space.  Hence, we
have croak() that saves time at the expense of space, but needs thread
context available; and croak_nocontext() that doesn't need that, but
takes extra time.

But, when we are about to die, time isn't that important.  Even if we
are doing eval after eval in a tight loop, the potential time savings of
passing the thread context to Perl_croak is insignificant compared to
the tear-down that follows.  My claim then is that croak() never needed
a thread context parameter to save a bit of time just before death.  It
is an optimization that isn't worth it.  And having it do so required
the invention of croak_nocontext(), and the extra cognitive load
associated with two methods for the same task.

This commit changes plain croak() to not use a thread context parameter.
It and croak_nocontext() now behave identically.  That means that going
forward, people will likely choose croak() which requires less typing
and occupies fewer columns on the screen, and they won't have to
remember which form to use when.
2025-09-12 14:47:53 -06:00
Karl Williamson
8444d54d4b Move prototype definition of SvPV_helper to embed.fnc
It's usually a bad idea to try to work around a limitation in common
code by copy-pasting and then modifying to taste.  Fixes/improvements
to the common code rarely get propagated to the outlier.

I wrote code in 1ef9039bccb that did just this for the prototype
definition of SvPV_helper, because the place where it really belongs,
embed.fnc, couldn't (and still doesn't) handle function pointers as
arguments (patches welcome).

I should have at least added a comment to the common code noting the
existence of this outlier.

It turns out that that limitation can be worked around by declaring a
typedef of the pointer, and then using that in embed.fnc.
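
The workaround in miniature, with hypothetical names (demo_len_fn and
demo_measure are not from the source):

```c
#include <stddef.h>
#include <string.h>

/* Name the function-pointer type once... */
typedef size_t (*demo_len_fn)(const char *);

/* ...so that every prototype mentioning it is an ordinary
 * "type name" parameter list that a simple declaration parser can
 * handle, with no nested parentheses. */
static size_t demo_measure(const char *s, demo_len_fn measure);

static size_t demo_measure(const char *s, demo_len_fn measure)
{
    return measure ? measure(s) : 0;
}

/* Example caller: strlen() has exactly the typedef'd signature. */
static size_t demo_measure_with_strlen(const char *s)
{
    return demo_measure(s, strlen);
}
```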

That's what this commit does.

This commit removes the final instance of duplicating the work of
embed.fnc in the core, except for some in the regex system whose
comments say the reason is to avoid making a typedef public.  I haven't
investigated these further.
2025-09-01 10:50:08 -06:00
Karl Williamson
d8012228a9 Convert _is_utf8_FOO to legal name 2025-09-01 08:12:24 -06:00
Karl Williamson
8de60a95d1 Convert _is_uni_FOO to legal name 2025-09-01 08:12:23 -06:00
Karl Williamson
8b91a7e5f4 Convert _is_utf8_perl_idcont to legal name 2025-09-01 08:12:23 -06:00
Karl Williamson
ffc38ee761 Convert _is_uni_perl_idcont to legal name 2025-09-01 08:12:22 -06:00
Karl Williamson
9f11f6a038 Convert _is_utf8_perl_idstart to legal name 2025-09-01 08:12:21 -06:00
Karl Williamson
eb3ee9300b Convert _is_uni_perl_idstart to legal name 2025-09-01 08:12:21 -06:00
Karl Williamson
81e1cbe370 Convert _to_utf8_case to legal name 2025-09-01 08:12:20 -06:00
Karl Williamson
8efe6a1425 Convert _to_utf8_upper_flags to legal name 2025-09-01 08:12:20 -06:00
Karl Williamson
a2f5678d13 Convert _to_utf8_title_flags to legal name 2025-09-01 08:12:19 -06:00
Karl Williamson
f79fa08ae1 Convert _to_upper_title_latin1 to legal name 2025-09-01 08:12:18 -06:00
Karl Williamson
f5f6a1be9e Convert _to_utf8_lower_flags to legal name 2025-09-01 08:12:18 -06:00
Karl Williamson
309431c01c Convert _to_utf8_fold_flags to legal name 2025-09-01 08:12:17 -06:00
Karl Williamson
d6909d9413 Convert _tofold_latin1 to legal name 2025-09-01 08:12:17 -06:00
Karl Williamson
5bda2037de Convert _inverse_folds to legal name 2025-09-01 08:12:15 -06:00
Karl Williamson
8bdb0ad55c Convert _to_uni_fold_flags to legal name 2025-09-01 08:12:15 -06:00
Karl Williamson
8decb8ab1a Convert _byte_dump_string() to legal name 2025-09-01 08:12:10 -06:00
Karl Williamson
6a9f2d68fa Expose some short form macros unconditionally
Until C99 we couldn't use the type of macro we have that hides the need
for thread context to call a function that needed both a thread context
parameter and a format with varying numbers of parameters.  Therefore
you had to call the function directly with aTHX_.  For some such
functions, there were parallel functions created that omitted the thread
context parameter (re-deriving it themselves).  And there were
compatibility macros created that called these.  So, for example warn()
would call Perl_warn_nocontext().

That changed in C99, and the calls in core to such functions were
changed to use the macro that now expanded to Perl_warn().
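
What C99 made possible can be sketched as below.  The names are
hypothetical stand-ins (Demo_warn for a long-name function, demo_warn
for its short form, my_ctx for aTHX), and the "context" is just a
struct holding the last message.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the thread context. */
struct demo_interp {
    char last_msg[128];
};

/* Long-name function: explicit context, then a format with a varying
 * number of arguments. */
static void Demo_warn(struct demo_interp *my_ctx, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    vsnprintf(my_ctx->last_msg, sizeof my_ctx->last_msg, fmt, args);
    va_end(args);
}

/* C99 short form: a variadic macro supplies the context argument.  It
 * requires a variable named my_ctx to be in scope at the call site. */
#define demo_warn(...) Demo_warn(my_ctx, __VA_ARGS__)
```

Before variadic macros there was no way to write demo_warn as a macro
that forwards an arbitrary argument list, which is why the
'_nocontext()' function variants existed.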

Not all functions with this problem had '_nocontext()' versions.  It
turns out that the way the macros were #defined in embed.h, a definition
existed for core, and non-threaded builds, but not threaded ones.  This
meant that, likely unknown to you, if you wrote an XS module, and used
one of those macros, such as ck_warner(), it would compile and run on
a non-threaded system, but would not compile on a threaded build.

Commits 13e5ba49b2cfe0add44db552ecbebb2f785aecbc and
d933027ef0a56c99aee8cc3c88ff4f9981ac9fc2 did not affect the
'_nocontext()' versions.  This commit exposes their macros to the
public.  There is no need to worry about breaking existing code, as
these macros existed only on non-threaded builds, and they still work
there.  They now work on threaded builds as well, as long as you have an
aTHX variable available.  This is no different than any newly created
macro for which we are also requiring aTHX availability.
2025-08-27 07:30:04 -06:00
Richard Leach
79b32d926e sv.c: Add Perl_newSVsv_flags_NN and static helpers
Perl_newSVsv_flags_NN creates a fresh SV that contains the values of its
source SV argument. It's like calling `new_SV(dsv)` followed by
`sv_setsv_flags(dsv, ssv, flags)`, but is optimized for a brand new
destination SV and the most common code paths.

The intended initial users for this new function were:
* Perl_sv_mortalcopy_flags (still in sv.c)
* Perl_newSVsv_flags (now a simple function in sv_inline.h)

Perl_newSVsv_flags_NN prioritises the following hot cases:
* SVt_IV containing an IV
* SVt_IV containing an RV
* SVt_NV containing an NV
* SVt_PV containing a PV

It will then check for:
* SVt_NULL
* SVt_IV containing a UV
* SVt_LAST

The helper function S_newSVsv_flags_NN_PVxx is called for everything else.
It will use Perl_sv_setsv_flags as a fallback for rare or tricky cases.

S_newSVsv_flags_NN_POK is a dedicated helper for string swipe/COW/copy
logic and is called from both Perl_newSVsv_flags_NN and
S_newSVsv_flags_NN_PVxx.

With these changes compared with the previous commit:

* `perl -e 'for (1..100_000_0) { my $x = { (1) x 1000 }; }'` runs about 20% faster

* `perl -e 'for (1..100_000_0) { my $x = { ("Perl") x 250 }; }'` runs about 40% faster

* `perl -e 'for (1..100_000_0) { my $x = { a => 1, b => 2, c => 3, d => 4, e => 5 }; }'`
   is a touch faster, but within the margin for error

* `perl -e 'for (1..100_000_0) { my $x = { a => "Perl", b => "Perl", c => "Perl", d => "Perl", e => "Perl" } ; }'`
   runs about 17% faster
2025-08-23 17:44:29 +01:00
Karl Williamson
e9d09605b8 Add detail to -Dy debugging
Commit 6ceb4087860c6ef8e86e0c252feb738d635e9e3f added a way to cleanly
output UTF-8 tr/// values.  This commit uses that to improve the debug
output of compiling and running tr///.

For a simple tr that transliterates Greek capital letters to
lowercase, the output of 'perl -Dy' has these added lines:

 > op.c: 6553: Compiling tr/*t/*r/; /c=0; /d=0; /s=0
 > *t is '\x{391}-\x{3a9}'
 > *r is '\x{3b1}-\x{3c9}'

Before the aforementioned commit the minus sign indicating a range would
not have rendered properly; so things like that were omitted from the
debug output.

The output also now includes special mention of the special casing where
the input is complemented, and/or where some characters are not
translated or get deleted.
2025-08-23 07:54:00 -06:00
Karl Williamson
ef2c06ab92 Create embed.fnc entry for pv_display_flags_
This creates an ARGS_ASSERT for this function.  Previously, the code was
using the one for plain pv_display(), which is kind of ugly.  Now there
is a macro for each function.
2025-08-23 07:54:00 -06:00
Karl Williamson
8543a7ac33 Add valid_utf8_to_uv()
This is identical to valid_utf8_to_uvchr(). They are both internal
functions designed for when you are certain that the utf8 string to be
translated is well formed; generally you created it yourself earlier.

The only reason for this new synonym is to lessen the cognitive load on
programmers who should be using the "_uv" suffix functions, and not the
"_uvchr" suffix ones for these sorts of tasks. By having this synonym,
one doesn't have to learn that there are two.
2025-08-21 13:52:26 -06:00
Karl Williamson
738383d65e Revert wrongly named "Hide function prototypes from ... "
This reverts commit ba4fa056e4e86ad40aee006b0ddd37951f723787 due to a
completely wrong commit title and message.  The next commit will reapply
it with the correct information.
2025-08-21 13:52:26 -06:00
Karl Williamson
ba4fa056e4 Hide function prototypes from unauthorized callers
0351a629e71de127cbfd1b142e9eaa6069deabf5 extended hiding private
functions from callers into the gcc world.

Some functions are allowed only in extensions, so cannot be marked as
hidden; this commit discourages their use, however, by hiding their
prototypes from all but the core and extensions.

It turns out that four functions marked as extensions-only were being
used in modules we ship with, so they had to be made globally
accessible.
2025-08-21 13:31:23 -06:00
Paul "LeoNerd" Evans
4b060dfa97 Add a subsignature_append_fence_op()
A "fence op" is a miscellaneous op fragment that performs some work for
side-effects during processing of a subroutine signature. In terms of
timing, it will run at some time after any previously-defined arguments
have been assigned from argument values passed in by the caller, but
before any defaulting expressions for parameters that come after it are
run.

We specifically make no guarantees about whether parameters defined
after this op have had their values assigned, nor whether defaulting
expressions of earlier parameters have already been invoked. This is
intentional because upcoming changes will change the order of these.

The intention here is that method subroutines will use a fence op for
the `OP_METHSTART` behaviour, ensuring that subsequent defaulting
expressions can see the values of field bindings established by
processing the `$self` parameter.
2025-08-14 17:06:03 +01:00
Karl Williamson
211c07b6fd Convert Perl_uvoffuni_to_utf8_flags to a macro
The function is hereby removed in favor of calling the plain
uvoffuni_to_utf8_flags macro that already exists.
2025-08-07 08:05:34 -06:00