428 Commits

Author SHA1 Message Date
Karl Williamson
4f5164ad13 Convert all core uses of _ASSERT__() to assert()
The former symbol is undefined behavior in C and C++.
2025-09-03 19:46:22 -06:00
Karl Williamson
e830b1872d regcomp.h: Convert _reg_ac_data to legal name 2025-09-01 08:12:09 -06:00
Karl Williamson
1176a99221 regcomp.h: Convert _reg_trie_data to legal name 2025-09-01 08:12:08 -06:00
Karl Williamson
03875508c8 regcomp.h: Convert _reg_trie_state to legal name 2025-09-01 08:12:07 -06:00
Karl Williamson
162b970532 regcomp.h: Convert _reg_trie_trans_list_elem to legal name 2025-09-01 08:12:07 -06:00
Karl Williamson
9e08ae35a0 regcomp.h: Convert _reg_trie_trans to legal name 2025-09-01 08:12:06 -06:00
Tony Cook
3b03ffb49b allow perl to build when the re extension is static
Previously, configuring with -Uusedl built successfully but configuring
with -Dstatic_ext=re did not. Now both build successfully.

Fixes #21550
2024-04-17 09:42:30 +10:00
Yves Orton
ba6e2c38aa regcomp*.c, regexec.c - fixup regex engine build under -Uusedl
The regex engine is built a bit differently from most of the perl
codebase. It is compiled as part of the main libperl.so and it is
also compiled (with DEBUGGING enabled) as part of the re extension.
When perl itself is compiled with DEBUGGING enabled then the code
in the re.so extension and the code in libperl.so is the same.

This all works fine and dandy until you have a static build where the
re.so is linked into libperl.so, which results in duplicate symbols
being defined. These symbols come in two flavours: "auxiliary" and
"debugging" related symbols.

We have basically three cases:

1. USE_DYNAMIC_LOADING is defined. In this case we are doing a dynamic
   build and re.so will be separate from libperl.so, so even if this
   is a DEBUGGING enabled build, debug and auxiliary functions can be
   compiled into *both* re.so and libperl.so. This is basically the
   "standard build".

2. USE_DYNAMIC_LOADING is not defined, and DEBUGGING is not defined
   either. In this case auxiliary functions should only be compiled in
   libperl.so, and the debug functions should only be compiled into
   re.so

3. USE_DYNAMIC_LOADING is not defined, and DEBUGGING *is* defined. In
   this case auxiliary functions AND debug functions should only be
   compiled into libperl.so

It is possible to detect the different build options by looking at the
defines 'USE_DYNAMIC_LOADING', 'PERL_EXT_RE_DEBUG' and
'DEBUGGING_RE_ONLY'. 'USE_DYNAMIC_LOADING' is NOT defined when we are
building a static perl. 'PERL_EXT_RE_DEBUG' is defined only when we are
building re.so, and 'DEBUGGING_RE_ONLY' is defined only when we are
building re.so in a perl that is not itself already a DEBUGGING enabled
perl. The file ext/re/re_top.h is responsible for setting up
DEBUGGING_RE_ONLY.

This patch uses 'PERL_EXT_RE_DEBUG', 'DEBUGGING_RE_ONLY' and
'USE_DYNAMIC_LOADING' to define in regcomp.h two further define flags
'PERL_RE_BUILD_DEBUG' and 'PERL_RE_BUILD_AUX'.

The 'PERL_RE_BUILD_DEBUG' flag determines if the debugging functions
should be compiled into libperl.so or re.so or both. The
'PERL_RE_BUILD_AUX' flag determines if the auxiliary functions should be
compiled into just libperl.so or into it and re.so. We then use these
flags to guard the different types of functions so that we can build in
all three modes without duplicate symbols.
2023-08-03 15:25:02 +02:00
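The three cases described in ba6e2c38aa can be modelled as a small decision function. This is only a sketch of the reasoning in the commit message, not the preprocessor logic that regcomp.h actually contains: `re_build_plan`, `re_plan()` and `check_static_plan()` are hypothetical names, and the assumption that libperl emits debug functions exactly when perl itself is a DEBUGGING build is mine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: which function groups one compilation of the
 * regex engine sources should emit. */
typedef struct {
    bool build_aux;    /* plays the role of PERL_RE_BUILD_AUX   */
    bool build_debug;  /* plays the role of PERL_RE_BUILD_DEBUG */
} re_build_plan;

/* dynamic:        USE_DYNAMIC_LOADING is defined
 * in_re_ext:      we are compiling the re extension copy
 * perl_debugging: perl itself is a DEBUGGING build */
re_build_plan re_plan(bool dynamic, bool in_re_ext, bool perl_debugging)
{
    re_build_plan plan;

    if (dynamic) {
        /* Case 1: re.so and libperl.so are separate objects, so both
         * may carry their own copies. */
        plan.build_aux   = true;
        plan.build_debug = in_re_ext || perl_debugging;
    }
    else if (in_re_ext) {
        /* Static build, extension copy: never emit the auxiliary
         * functions; emit debug functions only when libperl lacks
         * them (case 2). */
        plan.build_aux   = false;
        plan.build_debug = !perl_debugging;
    }
    else {
        /* Static build, libperl copy (cases 2 and 3). */
        plan.build_aux   = true;
        plan.build_debug = perl_debugging;
    }
    return plan;
}

/* In a static build each group must come from exactly one copy. */
void check_static_plan(bool perl_debugging)
{
    re_build_plan core = re_plan(false, false, perl_debugging);
    re_build_plan ext  = re_plan(false, true,  perl_debugging);

    assert(core.build_aux && !ext.build_aux);
    assert(core.build_debug != ext.build_debug);
}
```

Calling `check_static_plan()` with both `false` and `true` confirms that the model never emits the same group twice in a static build, which is the duplicate-symbol problem the commit describes.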
Elvin Aslanov
493e62880e Remove duplicate "the" in comments
Fix spelling on various files pertaining to core Perl.
2023-05-03 11:29:53 -06:00
Yves Orton
53175c6044 replace "define\t" with "define " in most "normal" core files.
The main exceptions being dist/, ext/, and Configure related
files, which will be updated in a subsequent commit. Files in the cpan/
directory are also omitted as they are not owned by the core.

'#define' has seven characters, so following it with a \t makes it look
like '#define ' when it is not, which then frustrates attempts to find
where a given define is. If you *know* then you do a

    git grep -P 'define\s+WHATEVER'

but if you don't, or you forget, you can get very confused trying to find
where a given define is located. This fixes all such cases so they
actually are 'define WHATEVER' instead.

If this patch is getting in your way with blame analysis then view it
with the -w option to blame.
2023-04-29 09:09:53 +02:00
Yves Orton
44eb4cdc27 regcomp.h - use a common union for head and args across all regnodes.
This helps with HPUX builds where we need to ensure everything
is aligned the same (on 32 bit boundaries). It also strongly
encourages everything to use the accessor macros and not access
the members directly.

By using a union for the variadic fields we make it more obvious
that some regops use the field in different ways. This patch
also converts all the arg unions into a standardized union with
standardized member names.
2023-03-29 20:54:49 +08:00
Yves Orton
b292ecb4e4 regcomp.h - use different struct member names for U8 vs U32 str_len
It is confusing to have two different members, at different struct
offsets and with different sizes, with the same name. So rename them
so they have different names that include their size so it is obvious
what is going on.
2023-03-29 20:54:49 +08:00
Yves Orton
b16c8aa582 regcomp.h - document RE_PESSIMISTIC_PARENS and VOLATILE_REF defines
These two defines are related to each other, and even though
VOLATILE_REF is not explicitly used in regexec.c which would require
it being placed in regcomp.h, it is implicitly, and RE_PESSIMISTIC_PARENS
*is* used in regexec.c. So put them both in regcomp.h and document them
together. This adds copious documentation for what they both are for.

RE_PESSIMISTIC_PARENS is effectively a "build option" (although intended
for debugging regex engine bugs only). VOLATILE_REF is the name of a
flag which is used to mark REF nodes as requiring special backtracking
support in regexec.c
2023-03-19 05:27:01 +08:00
Lukas Mai
e34ab2f783 regcomp.h: give names to anonymous union members
Anonymous unions/structs are a C11 feature (previously a GNU extension)
and not available in C90 or C99.

Fixes #20932.
2023-03-14 20:05:29 +08:00
Lukas Mai
b58c18a31b regcomp.h: fix names of regnode_charclass union members 2023-03-14 20:05:29 +08:00
Yves Orton
17e3e02ad1 regex engine - simplify regnode structures and make them consistent
This eliminates the regnode_2L data structure, and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:

    union {
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };

We then expose four macros for accessing each slot: ARG1u() ARG1i() and
ARG1a() and ARG1b(). Code then explicitly designates which they want. The old
logic used ARG() to access an U32 arg1, and ARG1() to access an I32 arg1,
which was confusing to say the least. The regnode_2L structure had a U32 arg1,
and I32 arg2, and the regnode_2 data structure had two I32 args. With the new
set of macros we use the regnode_2 for both, and use the appropriate macros to
show whether we want signed or unsigned values.

This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However as each slot can also store two U16s, a regnode_3 can hold up
to 6 U16s, or 3 I32s, or a combination. For instance the CURLY style nodes
use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count
and ARG3a() and ARG3b() for parens before and inside the quantifier.

It also changes the functions reganode() to reg1node() and changes reg2Lanode()
to reg2node(). The 2L thing was just confusing.
2023-03-13 21:26:08 +08:00
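The argument slot described in 17e3e02ad1 can be sketched as below. The node layout (`sketch_regnode_1` and its header fields) is a simplified stand-in rather than the real definition, and the macro bodies are an assumption about how such accessors could be written. Anonymous members need C11 or a GNU-style extension.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  U8;
typedef uint16_t U16;
typedef int32_t  I32;
typedef uint32_t U32;

/* Simplified stand-in for a one-argument regnode; not the real layout. */
struct sketch_regnode_1 {
    U8  flags;
    U8  type;
    U16 next_off;
    union {                 /* one 32-bit slot, three views of it */
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };
};

/* The slot must not grow when a new view is added. */
static_assert(sizeof(struct sketch_regnode_1) == 8,
              "arg slot must stay 32 bits wide");

/* Callers state which interpretation they want. */
#define ARG1u(p) ((p)->arg1u)
#define ARG1i(p) ((p)->arg1i)
#define ARG1a(p) ((p)->arg1a)
#define ARG1b(p) ((p)->arg1b)
```

Because every access goes through a macro that names the view, a later change to the member names only has to touch the macro definitions.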
Yves Orton
59db194299 regexec.c - make REF into a backtracking state
This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.

This converts the backtracking changes from the previous commit, so that it is
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which
is by default 0. We don't make the new fields in the struct conditional as the
stack frames are large and our changes don't make any real difference and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.

If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means
something is sensitive to us not necessarily restoring the parens properly on
failure. We make some assumptions that the paren state after a failing state
will be corrected by a future successful state, or that the state of the
parens is irrelevant as we will fail anyway. This can be made not true by
EVAL, backrefs, and potentially some other scenarios. Thus I have left this
inefficient logic in place but guarded by the flag.
2023-03-13 21:26:08 +08:00
Yves Orton
acababb42b regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers
In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.

We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.

It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)

This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers also, at two levels.
The overall TRIE op, and for jump tries especially where we emulate
the behavior of branches. So we have to do the same clearing logic if
a trie branch fails as well.
2023-03-13 21:26:08 +08:00
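The clearing step described in acababb42b can be illustrated with a toy capture table. This is a sketch under assumptions: `sketch_captures` and `sketch_reset_branch()` are hypothetical, and the real engine keeps its capture state in different structures.

```c
#include <assert.h>

#define SKETCH_MAX_PARENS 8

/* Hypothetical capture state: start/end offsets per group, with -1
 * meaning "unset". Index 0 stands for the whole match. */
typedef struct {
    int start[SKETCH_MAX_PARENS + 1];
    int end[SKETCH_MAX_PARENS + 1];
} sketch_captures;

/* Unset groups parens_before+1 .. parens_after, that is, the groups
 * that were opened inside the branch that just failed. */
void sketch_reset_branch(sketch_captures *caps,
                         int parens_before, int parens_after)
{
    assert(0 <= parens_before);
    assert(parens_before <= parens_after);
    assert(parens_after <= SKETCH_MAX_PARENS);

    for (int n = parens_before + 1; n <= parens_after; n++) {
        caps->start[n] = -1;
        caps->end[n]   = -1;
    }
}
```

For /((a)(b)|(a))+/ the first branch would carry the pair (1, 3) and the second (3, 4), so a failure of the first branch unsets groups 2 and 3 while leaving groups 1 and 4 alone.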
Yves Orton
05b13cf680 regcomp.c - track parens related to CURLYX and CURLYM
This was originally a patch which made somewhat drastic changes to how
we represent capture buffers, which Dave M and I are still
discussing offline and which has a larger impact than is acceptable to
address at the current time. As such I have reverted the controversial
parts of this patch for now, while keeping most of it intact even if in
some cases the changes are unused except for debugging purposes.

This patch still contains valuable changes, for instance teaching CURLYX
and CURLYM about how many parens there are before the curly[1] (which
will be useful in follow up patches even if strictly speaking they are
not directly used yet), tests and other cleanups. Also this patch is
sufficiently large that reverting it out would have a large effect on
the patches that were made on top of it.

Thus keeping most of this patch while eliminating the controversial
parts of it for now seemed the best approach, especially as some of the
changes it introduces and the follow up patches based on it are very
useful in cleaning up the structures we use to represent regops.

[1] Curly is the regexp internals term for quantifiers, named after
x{min,max} "curly brace" quantifiers.
2023-03-13 21:26:08 +08:00
Yves Orton
c224bbd5d1 regcomp.c - add optimistic eval (*{ ... }) and (**{ ... })
This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and
(??{ ... }). The only difference being that the star variants are
"optimistic" and are defined to never disable optimisations. This is
especially relevant now that use of (?{ ... }) prevents important
optimisations anywhere in the pattern, instead of the older and inconsistent
rules where it only affected the parts that contained the EVAL.

It is also very useful for injecting debugging style expressions to the
pattern to understand what the regex engine is actually doing. The older
style (?{ ... }) variants would change the regex engine's behavior, meaning
this was not as effective a tool as it could have been.

Similarly it is now possible to test that a given regex optimisation
works correctly using (*{ ... }), which was not possible with (?{ ... }).
2023-01-19 18:44:49 +08:00
Yves Orton
0678333e68 regcomp.c - increase size of CURLY nodes so the min/max is an I32
This allows us to resolve a test inconsistency between CURLYX and CURLY
and CURLYM, which have different maximums. We use I32 and not U32 because
the existing count logic uses -1 internally and using an I32 for the min/max
prevents warnings about comparing signed and unsigned values when the
count is compared against the min or max.
2023-01-15 17:21:12 +01:00
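The signed-versus-unsigned point in 0678333e68 comes down to how C converts operands before comparing them. The two helpers below are hypothetical and only demonstrate the language rule; they say nothing about what -1 means inside the engine's count logic.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  I32;
typedef uint32_t U32;

/* Signed bound: a count of -1 compares as less than any
 * non-negative maximum, as one would expect. */
int below_max_signed(I32 count, I32 max)
{
    assert(max >= 0);
    return count < max;
}

/* Unsigned bound: the count is converted to U32 before the
 * comparison, so -1 becomes 0xFFFFFFFF. The cast is written out
 * here; without it compilers warn under -Wsign-compare. */
int below_max_mixed(I32 count, U32 max)
{
    return (U32)count < max;
}
```

With a count of -1 and a maximum of 5 the signed helper reports "below", while the mixed helper does not.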
Yves Orton
b1ad323637 regcomp.h - get rid of EXTRA_STEP defines
They are unused these days.
2023-01-15 13:46:02 +01:00
James E Keenan
0c6362adf0 Correct typos as per GH 20435
In GH 20435 many typos in our C code were corrected.  However, this pull
request was not applied to blead and developed merge conflicts.  I
extracted diffs for the individual modified files and applied them with
'git apply', excepting four files where patch conflicts were reported.
Those files were:

        handy.h
        locale.c
        regcomp.c
        toke.c

We can handle these in a subsequent commit. Also, had to run these two
programs to keep 'make test_porting' happy:

        $ ./perl -Ilib regen/uconfig_h.pl
        $ ./perl -Ilib regen/regcomp.pl regnodes.h
2022-12-29 09:39:58 -05:00
Yves Orton
85900e28cc regcomp.c - decompose into smaller files
This splits a bunch of the subcomponents of the regex engine into
smaller files.

       regcomp_debug.c
       regcomp_internal.h
       regcomp_invlist.c
       regcomp_study.c
       regcomp_trie.c

The only real change, besides the build machinery changes needed to
achieve the split, is to add some new defines which can be used in
embed.fnc to control exports without having to enumerate /every/ regex
engine file. For instance all of the regcomp*.c files define
PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
2022-12-09 16:19:29 +01:00
Yves Orton
6a6e5d037d regex engine - cleanup internal tabs and ws (use -w to ignore)
Having internal tabs causes confusion in diffs and reviews. In the
following patch I will move a lot of code around, creating new files
and they will all be whitespace clean: no trailing whitespace,
tabs expanded to the next tabstop properly, and no trailing empty
lines at the bottom of the file.

This patch prepares for that split, and future splits and changes to
the regex engine by precleaning the main regex engine files with the
same rules.

It should show no changes under '-w'.
2022-12-09 16:19:29 +01:00
Yves Orton
d7c0b58cf6 regcomp.c - add a PARNO() macro to wrap the ARG() macro
We used the ARG() macro to access the parno data for the OPEN
and CLOSE regops. This made it difficult to find what needed to
change when the type and size or location of this data in the
node was modified. Replacing this access with a specific macro
makes the code more legible and future proof.

This was actually backported from finding everything that broke
by changing the regnode type for OPEN and CLOSE to 2L and moving
the paren parameter to the 2L slot. We might do something like this
in the future and separating the PARNO() macros from their
implementation will make it easier.
2022-11-10 08:53:27 +01:00
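The idea behind d7c0b58cf6 is a purpose-named accessor layered over a generic one. A minimal sketch, assuming a simplified node layout (`sketch_open_node` and `sketch_group_of()` are hypothetical, and the macro bodies are not taken from regcomp.h):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;

/* Simplified stand-in for an OPEN/CLOSE node. */
struct sketch_open_node {
    unsigned char  flags, type;
    unsigned short next_off;
    U32            arg1;
};

/* Generic accessor: says nothing about what the argument means. */
#define ARG(p)   ((p)->arg1)

/* Purpose-named accessor: if the paren number ever moves to another
 * slot or type, only this definition has to change. */
#define PARNO(p) ARG(p)

U32 sketch_group_of(const struct sketch_open_node *node)
{
    assert(node);
    return PARNO(node);
}
```

Searching for PARNO then finds every place that reads a paren number, which a search for ARG cannot do.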
Yves Orton
7f7274faab regcomp.h - put STMT_START on its own line and lined up with STMT_END 2022-09-07 09:02:11 +02:00
Tony Cook
c870f3e459 avoid dereferencing prog which may be NULL
CID 353002
2022-08-08 15:15:02 +10:00
Yves Orton
12d173c94b regex engine - replace many attribute arrays with one
This replaces PL_regnode_arg_len, PL_regnode_arg_len_varies,
PL_regnode_off_by_arg and PL_regnode_kind with a single PL_regnode_info
array, which is an array of struct regnode_meta, which contains the same
data but as a struct. Since PL_regnode_name is only used in debugging
builds of the regex engine we keep it separate. If we add more debug
properties it might be good to create a PL_regnode_debug_info[] to hold
that data instead.

This means when we add new properties we do not need to modify any
secondary sources to add new properties, just the struct definition
and regen/regcomp.pl
2022-08-06 11:32:34 +02:00
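The consolidation in 12d173c94b replaces several parallel arrays with one array of structs read through accessor macros. The sketch below assumes field names and a tiny opcode set of its own (`sketch_regnode_meta`, `SK_*` and `sketch_meta()` are hypothetical); the real table is generated by regen/regcomp.pl.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t U8;

/* One record per opcode instead of one array per property.
 * Field names are assumptions for illustration. */
struct sketch_regnode_meta {
    U8 type;
    U8 arg_len;
    U8 arg_len_varies;
    U8 off_by_arg;
};

enum { SK_END, SK_EXACT, SK_OPEN, SK_NODE_MAX };

static const struct sketch_regnode_meta sketch_regnode_info[SK_NODE_MAX] = {
    [SK_END]   = { SK_END,   0, 0, 0 },
    [SK_EXACT] = { SK_EXACT, 0, 1, 0 },   /* string length varies */
    [SK_OPEN]  = { SK_OPEN,  1, 0, 0 },   /* one argument slot    */
};

/* Bounds-checked lookup of an opcode's record. */
const struct sketch_regnode_meta *sketch_meta(int op)
{
    assert(op >= 0 && op < SK_NODE_MAX);
    return &sketch_regnode_info[op];
}

/* Callers keep using the wrapper macros from the earlier commits. */
#define REGNODE_TYPE(op)            (sketch_meta(op)->type)
#define REGNODE_ARG_LEN(op)         (sketch_meta(op)->arg_len)
#define REGNODE_ARG_LEN_VARIES(op)  (sketch_meta(op)->arg_len_varies)
#define REGNODE_OFF_BY_ARG(op)      (sketch_meta(op)->off_by_arg)
```

Adding a property in this arrangement means adding one struct member and one macro, which matches the motivation given in the commit message.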
Yves Orton
cbf5c5ba5f regex engine - wrap PL_regnode_name with macro REGNODE_NAME() 2022-08-06 11:32:34 +02:00
Yves Orton
79a585d60b regex engine - wrap PL_regnode_arg_len_varies with macro REGNODE_ARG_LEN_VARIES() 2022-08-06 11:32:34 +02:00
Yves Orton
e28d2a3533 regex engine - wrap PL_regnode_arg_len with macro REGNODE_ARG_LEN() 2022-08-06 11:32:34 +02:00
Yves Orton
1489b465ff regex engine - wrap PL_regnode_off_by_arg with macro REGNODE_OFF_BY_ARG() 2022-08-06 11:32:34 +02:00
Yves Orton
20f4775e6e regex engine - wrap PL_regnode_kind with macro REGNODE_TYPE()
The code confusingly uses type and kind as synonyms. Let's end that bad habit.
2022-08-06 11:32:34 +02:00
Yves Orton
182f0ba91d regex engine - improved comments explaining REGNODE_AFTER()
This rewrites one comment to include more explanation of the difference
between Perl_regnext() and REGNODE_AFTER().
2022-08-03 11:07:09 +02:00
Yves Orton
1db310d044 regex engine - integrate regnode_after() support for EXACTish nodes
This adds REGNODE_AFTER_varies() which is used when the caller *knows*
that the current regnode is variable length. We then use it to handle
EXACTish style nodes as determined by PL_regnode_arg_len_varies.

As part of this patch Perl_regnext() Perl_regnode_after() and
Perl_check_regnode_after() are moved to reginline.h, which is loaded via
regcomp.c only when we are compiling the regex engine.
2022-08-03 11:07:09 +02:00
Yves Orton
19a5f8d316 regex engine - rename REGNODE_AFTER_dynamic() to REGNODE_AFTER()
Now that REGNODE_AFTER() can handle all cases it makes sense
to remove the dynamic() suffix.
2022-08-03 11:07:09 +02:00
Yves Orton
3bfb2e3bfb regex engine - Rename PL_regkind to PL_regnode_kind 2022-08-03 11:07:09 +02:00
Yves Orton
83ca6c9dc5 regex engine - Rename PL_regarglen to PL_regnode_arg_len 2022-08-03 11:07:09 +02:00
Yves Orton
0e48b698ea regcomp.c - rename NEXTOPER to REGNODE_AFTER and related logic
It is really easy to get confused about the difference between
NEXTOPER() and regnext() of a regnode. The two concepts are related,
similar, but importantly distinct. NEXTOPER() is also defined in such a
way that it is easy to abuse and misunderstand and encourages producing
code that is fragile to larger change, effectively "baking in"
assumptions to the code that are difficult to discover by searching.
Changing the type and storage requirements of a regnode may break things
in subtle and hard to debug ways.

An example of how NEXTOPER() is problematic is that this:
NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the
branch node", it means "jump forward by a regnode which happens to be
two regnodes large". In other words NEXTOPER is just a fancy way of
writing "node+1".

This patch replaces NEXTOPER() with three new macros:

    REGNODE_AFTER_dynamic(node)
    REGNODE_AFTER_opcode(node,op)
    REGNODE_AFTER_type(node,tregnode_OPNAME)

The first is the most generic case, it jumps forward by the size of the
node, and determines that size by consulting OP(node). The second is
where you have already extracted OP(node), and the third is where you
know the actual structure that you want to jump forward by. Every
regnode type has a corresponding type, which is known at compile time,
so using the third will produce the most efficient code. However in many
cases the code operates on one of several types, whose size may be the
same now, but may change in the future, in which case one of the other
forms is preferred. The run time logic in regexec.c should probably
only use the REGNODE_AFTER_type() interface.

Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(),
which is used in a specific piece of legacy logic but should not be
used otherwise. It is not safe to go backwards from an arbitrary node,
we simply have no way to know how large the previous node is and thus
where it starts.

This patch includes some logic that validates assumptions during DEBUG
mode which should catch errors from resizing regnodes.

After this patch changing the size of an existing regnode should be
relatively safe and errors related to sizing should trigger assertion
fails.

This patch includes changes to perlreguts.pod to explain this stuff
better.
2022-08-03 11:07:09 +02:00
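The three macros introduced in 0e48b698ea can be sketched over a toy program array. The node layouts, the two opcodes and the `sketch_arg_len` table are assumptions for illustration, the macro bodies are not copied from regcomp.h, and the plain `regnode_1` struct stands in for a tregnode_OPNAME type.

```c
#include <assert.h>
#include <stdint.h>

/* Base node: 4 bytes. A one-argument node occupies two of them. */
typedef struct { uint8_t flags, type; uint16_t next_off; } regnode;
typedef struct { uint8_t flags, type; uint16_t next_off;
                 uint32_t arg1; } regnode_1;

static_assert(sizeof(regnode_1) == 2 * sizeof(regnode),
              "regnode_1 must be exactly two regnodes");

enum { SK_NOTHING, SK_OPEN, SK_OP_MAX };

/* Extra regnode-sized slots each opcode carries after its header. */
static const uint8_t sketch_arg_len[SK_OP_MAX] = {
    [SK_NOTHING] = 0,
    [SK_OPEN]    = 1,
};

#define OP(p) ((p)->type)

/* Size looked up from an opcode the caller already extracted. */
#define REGNODE_AFTER_opcode(p, op)  ((p) + 1 + sketch_arg_len[(op)])
/* Size looked up from the node itself. */
#define REGNODE_AFTER_dynamic(p)     REGNODE_AFTER_opcode((p), OP(p))
/* Size known at compile time from the node's struct type. */
#define REGNODE_AFTER_type(p, t)     ((p) + sizeof(t) / sizeof(regnode))

/* Debug-style wrapper that validates the opcode before stepping. */
regnode *sketch_after_checked(regnode *p)
{
    assert(OP(p) < SK_OP_MAX);
    return REGNODE_AFTER_dynamic(p);
}
```

Stepping over an SK_OPEN node with any of the three forms lands two slots further on, whereas a bare `node + 1` would land in the middle of the node's argument.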
Yves Orton
f946e55ad0 regen/regcomp.pl - Make regarglen available as PL_regarglen in regexec.c
In a follow up patch we will use this data from regexec.c which
currently cannot see the variable.

This changes a comment in regen/mk_invlists.pl which necessitated
rebuilding several files related to unicode. Only the hashes associated
with mk_invlists.pl were changed.
2022-08-03 11:07:09 +02:00
Yves Orton
24a3add986 regcomp.h: deal with 64 bit aligned pointer data in regex program.
We cannot safely store 64 bit aligned data in a regnode structure due
to the implicit 32 bit alignment the base structure forces on the
data. Thanks to Tony Cook for the suggestion on how to cleanly support
variable sized pointers without alignment issues.

I am pretty sure we should not be storing pointers in the regexp program
like this. In most cases where we need an SV attached to a regnode
structure we store it in the 'data' array which is part of the regexp
structure, and then store an index to that item in the regnode. This
allows the use of a smaller member for the index instead.

This was identified by running "make test_reonly" under the ubsan build:

    ./Configure -d -Doptimize=-g -Dusedevel -DDEBUGGING \
    -Accflags='-fsanitize=address -fsanitize=undefined \
    -ggdb3' -Aldflags='-Wl,--no-as-needed -lasan -lubsan' \
    -Dcc=ccache\ gcc -Dld=gcc
2022-07-15 17:25:20 +02:00
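The alignment problem in 24a3add986 can be illustrated as follows. The commit message does not spell out the fix, so this sketch swaps in a generic technique (copying the pointer through 32-bit words with memcpy) rather than reproducing the commit's approach; `sketch_ptr_node` and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(void *) % sizeof(uint32_t) == 0,
              "pointer size must be a multiple of 32 bits");

/* The node body is only guaranteed 32-bit alignment, so the pointer
 * is held as raw 32-bit words instead of as a void * member. */
typedef struct {
    uint8_t  flags, type;
    uint16_t next_off;
    uint32_t ptr_words[sizeof(void *) / sizeof(uint32_t)];
} sketch_ptr_node;

/* Copy the pointer's bytes in; no aligned 64-bit store is issued. */
void sketch_store_ptr(sketch_ptr_node *node, void *ptr)
{
    assert(node);
    memcpy(node->ptr_words, &ptr, sizeof ptr);
}

/* Copy the bytes back out into a properly aligned local. */
void *sketch_load_ptr(const sketch_ptr_node *node)
{
    void *ptr;
    assert(node);
    memcpy(&ptr, node->ptr_words, sizeof ptr);
    return ptr;
}
```

A direct `*(void **)` access to the same bytes is what an undefined-behavior sanitizer would flag when the node sits on a 4-byte but not 8-byte boundary.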
Karl Williamson
4c8c99df3c regex: Add optimizing regnode
It turns out that any character class whose UTF-8 representation is two
bytes long, and where all elements share the same first byte can be
represented by a compact, fast regnode designed for the purpose.

This commit adds that regnode, ANYOFHbbm.  ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.

The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed.  The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.

Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic,
and several other (lesser-known) scripts are of this form.

It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
2022-07-12 05:14:35 -06:00
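The lookup described in 4c8c99df3c can be sketched as a 64-bit bitmap indexed by the continuation byte. The struct layout and function names below are hypothetical and do not reflect how the ANYOFHbbm node is actually laid out; the sketch relies only on the UTF-8 fact that a continuation byte lies in 0x80..0xBF, giving 64 possible values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical class: every member is a two-byte UTF-8 sequence
 * with the same first byte. */
typedef struct {
    uint8_t first_byte;  /* shared start byte, 0xC2..0xDF          */
    uint8_t bitmap[8];   /* one bit per continuation byte 0x80..0xBF */
} sketch_bbm;

/* Add the character whose continuation byte is cont. */
void sketch_bbm_add(sketch_bbm *cc, uint8_t cont)
{
    assert((cont & 0xC0) == 0x80);
    uint8_t bit = cont & 0x3F;
    cc->bitmap[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}

/* Match the two-byte sequence at s without decoding a code point.
 * The caller must guarantee two readable bytes at s. */
int sketch_bbm_match(const sketch_bbm *cc, const uint8_t *s)
{
    if (s[0] != cc->first_byte || (s[1] & 0xC0) != 0x80)
        return 0;
    uint8_t bit = s[1] & 0x3F;
    return (cc->bitmap[bit >> 3] >> (bit & 7)) & 1;
}
```

For example, GREEK SMALL LETTER ALPHA (U+03B1) encodes as 0xCE 0xB1, so a class holding it stores 0xCE as the first byte and sets a single bit for 0xB1.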
Karl Williamson
ff37df4b7e regcomp.h: Make bitmap lookups more general
This introduces a new macro and converts to use it so that bitmaps other
than the traditional ones in ANYOF nodes may be defined in a common
manner.
2022-07-12 05:14:35 -06:00
Karl Williamson
bcdc9e1e7a regex: Refactor bitmap vs non-bitmap of qr/[]/
A bracketed character class in a pattern is generally represented by
some form of ANYOF node, with matches of characters in the Latin1 range
handled by a bitmap, and an inversion list for higher code point
matches.  But some patterns only have low matches, and some only high,
and some match everything that is high.

This commit refactors a little so that the distinction between nothing
high matches vs everything high matches is done through the same
technique.  Previously one was indicated by a flag, and the other by a
special value in the node's structure.  Now there are two special
values, and the flag is freed up for a potential future use.  In the
past the meaning of the flags has had to be overloaded to accommodate
all the needs.  freeing of a flag means

This all allows for some slight simplifications.
2022-07-10 11:56:49 -06:00
Karl Williamson
6947c4eb35 regex: Refactor a shared flag
In ANYOF nodes (generated for qr/[]/), there is a bitmap component, and
possibly a non-bitmap component.  It turns out that a single flag can be
used to indicate the existence of the latter.  When looked at this way,
the name of the flag becomes simpler, and incorporates the meaning of
another bit, which was previously shared with yet another meaning.  Thus
that other meaning can become an unshared bit.

This allows for some simplification, and being able to handle the
uncommon Turkish locale with fewer main-line conditionals being executed
at runtime.
2022-07-10 11:56:49 -06:00
Karl Williamson
c1387cbdcd regcomp.h: Use mnemonic instead of literal+constant
This is a small thing, but might as well use the mnemonic.
2022-07-05 10:12:32 -06:00
Karl Williamson
ff49def3c5 regcomp.h: Add comments better explaining ANYOF nodes 2022-07-03 19:39:25 -06:00
Karl Williamson
3960d57abe regex: Change some internal macro names for clarity
These long names are designed to remind the coder that they have
multiple meanings.  But move the reminder text to the end, as it
obscures the purposes.

And some have two halves for the separate meanings; change the names so
the halves are split by two underscores to visually emphasize this.
2022-07-03 19:39:25 -06:00
Karl Williamson
35b455082a Revert "regex: Add POSIXA1R node"
This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38.

As explained in its commit message.  It adds some comments to point out
that the commit exists, for the curious.
2022-07-01 11:16:39 -06:00