mirror of
https://github.com/Perl/perl5.git
synced 2026-01-26 16:39:36 +00:00
Previously, it was considered a macro. But the previous commit created a new flag for elements like this, which this commit now changes to use.
960 lines
35 KiB
Plaintext
960 lines
35 KiB
Plaintext
=head1 NAME
|
|
|
|
perlreapi - Perl regular expression plugin interface
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
As of Perl 5.9.5 there is a new interface for plugging and using
|
|
regular expression engines other than the default one.
|
|
|
|
Each engine is supposed to provide access to a constant structure of the
|
|
following format:
|
|
|
|
typedef struct regexp_engine {
|
|
REGEXP* (*comp) (pTHX_
|
|
const SV * const pattern, const U32 flags);
|
|
I32 (*exec) (pTHX_
|
|
REGEXP * const rx,
|
|
char* stringarg,
|
|
char* strend, char* strbeg,
|
|
SSize_t minend, SV* sv,
|
|
void* data, U32 flags);
|
|
char* (*intuit) (pTHX_
|
|
REGEXP * const rx, SV *sv,
|
|
const char * const strbeg,
|
|
char *strpos, char *strend, U32 flags,
|
|
struct re_scream_pos_data_s *data);
|
|
SV* (*checkstr) (pTHX_ REGEXP * const rx);
|
|
void (*free) (pTHX_ REGEXP * const rx);
|
|
void (*numbered_buff_FETCH) (pTHX_
|
|
REGEXP * const rx,
|
|
const I32 paren,
|
|
SV * const sv);
|
|
void (*numbered_buff_STORE) (pTHX_
|
|
REGEXP * const rx,
|
|
const I32 paren,
|
|
SV const * const value);
|
|
I32 (*numbered_buff_LENGTH) (pTHX_
|
|
REGEXP * const rx,
|
|
const SV * const sv,
|
|
const I32 paren);
|
|
SV* (*named_buff) (pTHX_
|
|
REGEXP * const rx,
|
|
SV * const key,
|
|
SV * const value,
|
|
U32 flags);
|
|
SV* (*named_buff_iter) (pTHX_
|
|
REGEXP * const rx,
|
|
const SV * const lastkey,
|
|
const U32 flags);
|
|
SV* (*qr_package)(pTHX_ REGEXP * const rx);
|
|
#ifdef USE_ITHREADS
|
|
void* (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
|
|
#endif
|
|
REGEXP* (*op_comp) (...);
|
|
|
|
|
|
=for apidoc_section $regexp
|
|
=for apidoc Ay||regexp_engine
|
|
|
|
When a regexp is compiled, its C<engine> field is then set to point at
|
|
the appropriate structure, so that when it needs to be used Perl can find
|
|
the right routines to do so.
|
|
|
|
In order to install a new regexp handler, C<$^H{regcomp}> is set
|
|
to an integer which (when casted appropriately) resolves to one of these
|
|
structures. When compiling, the C<comp> method is executed, and the
|
|
resulting C<regexp> structure's engine field is expected to point back at
|
|
the same structure.
|
|
|
|
The pTHX_ symbol in the definition is a macro used by Perl under threading
|
|
to provide an extra argument to the routine holding a pointer back to
|
|
the interpreter that is executing the regexp. So under threading all
|
|
routines get an extra argument.
|
|
|
|
=head1 Callbacks
|
|
|
|
=head2 comp
|
|
|
|
REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags);
|
|
|
|
Compile the pattern stored in C<pattern> using the given C<flags> and
|
|
return a pointer to a prepared C<REGEXP> structure that can perform
|
|
the match. See L</The REGEXP structure> below for an explanation of
|
|
the individual fields in the REGEXP struct.
|
|
|
|
The C<pattern> parameter is the scalar that was used as the
|
|
pattern. Previous versions of Perl would pass two C<char*> indicating
|
|
the start and end of the stringified pattern; the following snippet can
|
|
be used to get the old parameters:
|
|
|
|
STRLEN plen;
|
|
char* exp = SvPV(pattern, plen);
|
|
char* xend = exp + plen;
|
|
|
|
Since any scalar can be passed as a pattern, it's possible to implement
|
|
an engine that does something with an array (C<< "ook" =~ [ qw/ eek
|
|
hlagh / ] >>) or with the non-stringified form of a compiled regular
|
|
expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always
|
|
stringify everything using the snippet above, but that doesn't mean
|
|
other engines have to.
|
|
|
|
The C<flags> parameter is a bitfield which indicates which of the
|
|
C<msixpn> flags the regex was compiled with. It also contains
|
|
additional info, such as if C<use locale> is in effect.
|
|
|
|
The C<eogc> flags are stripped out before being passed to the comp
|
|
routine. The regex engine does not need to know if any of these
|
|
are set, as those flags should only affect what Perl does with the
|
|
pattern and its match variables, not how it gets compiled and
|
|
executed.
|
|
|
|
By the time the comp callback is called, some of these flags have
|
|
already had effect (noted below where applicable). However most of
|
|
their effect occurs after the comp callback has run, in routines that
|
|
read the C<< rx->extflags >> field which it populates.
|
|
|
|
In general the flags should be preserved in C<< rx->extflags >> after
|
|
compilation, although the regex engine might want to add or delete
|
|
some of them to invoke or disable some special behavior in Perl. The
|
|
flags along with any special behavior they cause are documented below:
|
|
|
|
The pattern modifiers:
|
|
|
|
=over 4
|
|
|
|
=item C</m> - RXf_PMf_MULTILINE
|
|
|
|
If this is in C<< rx->extflags >> it will be passed to
|
|
C<Perl_fbm_instr> by C<pp_split> which will treat the subject string
|
|
as a multi-line string.
|
|
|
|
=for apidoc Amnh||RXf_PMf_EXTENDED
|
|
=for apidoc_item RXf_PMf_FOLD
|
|
=for apidoc_item RXf_PMf_KEEPCOPY
|
|
=for apidoc_item RXf_PMf_MULTILINE
|
|
=for apidoc_item RXf_PMf_SINGLELINE
|
|
|
|
=item C</s> - RXf_PMf_SINGLELINE
|
|
|
|
=item C</i> - RXf_PMf_FOLD
|
|
|
|
=item C</x> - RXf_PMf_EXTENDED
|
|
|
|
If present on a regex, C<"#"> comments will be handled differently by the
|
|
tokenizer in some cases.
|
|
|
|
TODO: Document those cases.
|
|
|
|
|
|
=item C</p> - RXf_PMf_KEEPCOPY
|
|
|
|
TODO: Document this
|
|
|
|
=item Character set
|
|
|
|
The character set rules are determined by an enum that is contained
|
|
in this field. This is still experimental and subject to change, but
|
|
the current interface returns the rules by use of the in-line function
|
|
C<get_regex_charset(const U32 flags)>. The only currently documented
|
|
value returned from it is REGEX_LOCALE_CHARSET, which is set if
|
|
C<use locale> is in effect. If present in C<< rx->extflags >>,
|
|
C<split> will use the locale dependent definition of whitespace
|
|
when RXf_SKIPWHITE or RXf_WHITE is in effect. ASCII whitespace
|
|
is defined as per L<isSPACE|perlapi/isSPACE>, and by the internal
|
|
macros C<is_utf8_space> under UTF-8, and C<isSPACE_LC> under C<use
|
|
locale>.
|
|
|
|
=for apidoc Avnh||REGEX_LOCALE_CHARSET
|
|
|
|
=back
|
|
|
|
Additional flags:
|
|
|
|
=over 4
|
|
|
|
=item RXf_SPLIT
|
|
|
|
This flag was removed in perl 5.18.0. C<split ' '> is now special-cased
|
|
solely in the parser. RXf_SPLIT is still #defined, so you can test for it.
|
|
This is how it used to work:
|
|
|
|
If C<split> is invoked as C<split ' '> or with no arguments (which
|
|
really means C<split(' ', $_)>, see L<split|perlfunc/split>), Perl will
|
|
set this flag. The regex engine can then check for it and set the
|
|
SKIPWHITE and WHITE extflags. To do this, the Perl engine does:
|
|
|
|
if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ')
|
|
r->extflags |= (RXf_SKIPWHITE|RXf_WHITE);
|
|
|
|
=back
|
|
|
|
These flags can be set during compilation to enable optimizations in
|
|
the C<split> operator.
|
|
|
|
=for apidoc Amnh||RXf_NO_INPLACE_SUBST
|
|
=for apidoc_item RXf_NULL
|
|
=for apidoc_item RXf_SKIPWHITE
|
|
=for apidoc_item RXf_SPLIT
|
|
=for apidoc_item RXf_START_ONLY
|
|
=for apidoc_item RXf_WHITE
|
|
|
|
=over 4
|
|
|
|
=item RXf_SKIPWHITE
|
|
|
|
This flag was removed in perl 5.18.0. It is still #defined, so you can
|
|
set it, but doing so will have no effect. This is how it used to work:
|
|
|
|
If the flag is present in C<< rx->extflags >> C<split> will delete
|
|
whitespace from the start of the subject string before it's operated
|
|
on. What is considered whitespace depends on if the subject is a
|
|
UTF-8 string and if the C<RXf_PMf_LOCALE> flag is set.
|
|
|
|
If RXf_WHITE is set in addition to this flag, C<split> will behave like
|
|
C<split " "> under the Perl engine.
|
|
|
|
|
|
=item RXf_START_ONLY
|
|
|
|
Tells the split operator to split the target string on newlines
|
|
(C<\n>) without invoking the regex engine.
|
|
|
|
Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp
|
|
== '^'>), even under C</^/s>; see L<split|perlfunc>. Of course a
|
|
different regex engine might want to use the same optimizations
|
|
with a different syntax.
|
|
|
|
=item RXf_WHITE
|
|
|
|
Tells the split operator to split the target string on whitespace
|
|
without invoking the regex engine. The definition of whitespace varies
|
|
depending on if the target string is a UTF-8 string and on
|
|
if RXf_PMf_LOCALE is set.
|
|
|
|
Perl's engine sets this flag if the pattern is C<\s+>.
|
|
|
|
=item RXf_NULL
|
|
|
|
Tells the split operator to split the target string on
|
|
characters. The definition of character varies depending on if
|
|
the target string is a UTF-8 string.
|
|
|
|
Perl's engine sets this flag on empty patterns, this optimization
|
|
makes C<split //> much faster than it would otherwise be. It's even
|
|
faster than C<unpack>.
|
|
|
|
=item RXf_NO_INPLACE_SUBST
|
|
|
|
Added in perl 5.18.0, this flag indicates that a regular expression might
|
|
perform an operation that would interfere with inplace substitution. For
|
|
instance it might contain lookbehind, or assign to non-magical variables
|
|
(such as $REGMARK and $REGERROR) during matching. C<s///> will skip
|
|
certain optimisations when this is set.
|
|
|
|
=back
|
|
|
|
=head2 exec
|
|
|
|
I32 exec(pTHX_ REGEXP * const rx,
|
|
char *stringarg, char* strend, char* strbeg,
|
|
SSize_t minend, SV* sv,
|
|
void* data, U32 flags);
|
|
|
|
Execute a regexp. The arguments are
|
|
|
|
=over 4
|
|
|
|
=item rx
|
|
|
|
The regular expression to execute.
|
|
|
|
=item sv
|
|
|
|
This is the SV to be matched against. Note that the
|
|
actual char array to be matched against is supplied by the arguments
|
|
described below; the SV is just used to determine UTF8ness, C<pos()> etc.
|
|
|
|
=item strbeg
|
|
|
|
Pointer to the physical start of the string.
|
|
|
|
=item strend
|
|
|
|
Pointer to the character following the physical end of the string (i.e.
|
|
the C<\0>, if any).
|
|
|
|
=item stringarg
|
|
|
|
Pointer to the position in the string where matching should start; it might
|
|
not be equal to C<strbeg> (for example in a later iteration of C</.../g>).
|
|
|
|
=item minend
|
|
|
|
Minimum length of string (measured in bytes from C<stringarg>) that must
|
|
match; if the engine reaches the end of the match but hasn't reached this
|
|
position in the string, it should fail.
|
|
|
|
=item data
|
|
|
|
Optimisation data; subject to change.
|
|
|
|
=item flags
|
|
|
|
Optimisation flags; subject to change.
|
|
|
|
=back
|
|
|
|
=head2 intuit
|
|
|
|
char* intuit(pTHX_
|
|
REGEXP * const rx,
|
|
SV *sv,
|
|
const char * const strbeg,
|
|
char *strpos,
|
|
char *strend,
|
|
const U32 flags,
|
|
struct re_scream_pos_data_s *data);
|
|
|
|
Find the start position where a regex match should be attempted,
|
|
or possibly if the regex engine should not be run because the
|
|
pattern can't match. This is called, as appropriate, by the core,
|
|
depending on the values of the C<extflags> member of the C<regexp>
|
|
structure.
|
|
|
|
Arguments:
|
|
|
|
rx: the regex to match against
|
|
sv: the SV being matched: only used for utf8 flag; the string
|
|
itself is accessed via the pointers below. Note that on
|
|
something like an overloaded SV, SvPOK(sv) may be false
|
|
and the string pointers may point to something unrelated to
|
|
the SV itself.
|
|
strbeg: real beginning of string
|
|
strpos: the point in the string at which to begin matching
|
|
strend: pointer to the byte following the last char of the string
|
|
flags currently unused; set to 0
|
|
data: currently unused; set to NULL
|
|
|
|
|
|
=head2 checkstr
|
|
|
|
SV* checkstr(pTHX_ REGEXP * const rx);
|
|
|
|
Return a SV containing a string that must appear in the pattern. Used
|
|
by C<split> for optimising matches.
|
|
|
|
=head2 free
|
|
|
|
void free(pTHX_ REGEXP * const rx);
|
|
|
|
Called by Perl when it is freeing a regexp pattern so that the engine
|
|
can release any resources pointed to by the C<pprivate> member of the
|
|
C<regexp> structure. This is only responsible for freeing private data;
|
|
Perl will handle releasing anything else contained in the C<regexp> structure.
|
|
|
|
=head2 Numbered capture callbacks
|
|
|
|
Called to get/set the value of C<$`>, C<$'>, C<$&> and their named
|
|
equivalents, ${^PREMATCH}, ${^POSTMATCH} and ${^MATCH}, as well as the
|
|
numbered capture groups (C<$1>, C<$2>, ...).
|
|
|
|
The C<paren> parameter will be C<1> for C<$1>, C<2> for C<$2> and so
|
|
forth, and have these symbolic values for the special variables:
|
|
|
|
${^PREMATCH} RX_BUFF_IDX_CARET_PREMATCH
|
|
${^POSTMATCH} RX_BUFF_IDX_CARET_POSTMATCH
|
|
${^MATCH} RX_BUFF_IDX_CARET_FULLMATCH
|
|
$` RX_BUFF_IDX_PREMATCH
|
|
$' RX_BUFF_IDX_POSTMATCH
|
|
$& RX_BUFF_IDX_FULLMATCH
|
|
|
|
=for apidoc Amnh||RX_BUFF_IDX_CARET_FULLMATCH
|
|
=for apidoc_item RX_BUFF_IDX_CARET_POSTMATCH
|
|
=for apidoc_item RX_BUFF_IDX_CARET_PREMATCH
|
|
=for apidoc_item RX_BUFF_IDX_FULLMATCH
|
|
=for apidoc_item RX_BUFF_IDX_POSTMATCH
|
|
=for apidoc_item RX_BUFF_IDX_PREMATCH
|
|
|
|
Note that in Perl 5.17.3 and earlier, the last three constants were also
|
|
used for the caret variants of the variables.
|
|
|
|
The names have been chosen by analogy with L<Tie::Scalar> methods
|
|
names with an additional B<LENGTH> callback for efficiency. However
|
|
named capture variables are currently not tied internally but
|
|
implemented via magic.
|
|
|
|
=head3 numbered_buff_FETCH
|
|
|
|
void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren,
|
|
SV * const sv);
|
|
|
|
Fetch a specified numbered capture. C<sv> should be set to the scalar
|
|
to return, the scalar is passed as an argument rather than being
|
|
returned from the function because when it's called Perl already has a
|
|
scalar to store the value, creating another one would be
|
|
redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and
|
|
friends, see L<perlapi>.
|
|
|
|
This callback is where Perl untaints its own capture variables under
|
|
taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch>
|
|
function in F<regcomp.c> for how to untaint capture variables if
|
|
that's something you'd like your engine to do as well.
|
|
|
|
=head3 numbered_buff_STORE
|
|
|
|
void (*numbered_buff_STORE) (pTHX_
|
|
REGEXP * const rx,
|
|
const I32 paren,
|
|
SV const * const value);
|
|
|
|
Set the value of a numbered capture variable. C<value> is the scalar
|
|
that is to be used as the new value. It's up to the engine to make
|
|
sure this is used as the new value (or reject it).
|
|
|
|
Example:
|
|
|
|
if ("ook" =~ /(o*)/) {
|
|
# 'paren' will be '1' and 'value' will be 'ee'
|
|
$1 =~ tr/o/e/;
|
|
}
|
|
|
|
Perl's own engine will croak on any attempt to modify the capture
|
|
variables, to do this in another engine use the following callback
|
|
(copied from C<Perl_reg_numbered_buff_store>):
|
|
|
|
void
|
|
Example_reg_numbered_buff_store(pTHX_
|
|
REGEXP * const rx,
|
|
const I32 paren,
|
|
SV const * const value)
|
|
{
|
|
PERL_UNUSED_ARG(rx);
|
|
PERL_UNUSED_ARG(paren);
|
|
PERL_UNUSED_ARG(value);
|
|
|
|
if (!PL_localizing)
|
|
Perl_croak(aTHX_ PL_no_modify);
|
|
}
|
|
|
|
Actually Perl will not I<always> croak in a statement that looks
|
|
like it would modify a numbered capture variable. This is because the
|
|
STORE callback will not be called if Perl can determine that it
|
|
doesn't have to modify the value. This is exactly how tied variables
|
|
behave in the same situation:
|
|
|
|
package CaptureVar;
|
|
use parent 'Tie::Scalar';
|
|
|
|
sub TIESCALAR { bless [] }
|
|
sub FETCH { undef }
|
|
sub STORE { die "This doesn't get called" }
|
|
|
|
package main;
|
|
|
|
tie my $sv => "CaptureVar";
|
|
$sv =~ y/a/b/;
|
|
|
|
Because C<$sv> is C<undef> when the C<y///> operator is applied to it,
|
|
the transliteration won't actually execute and the program won't
|
|
C<die>. This is different to how 5.8 and earlier versions behaved
|
|
since the capture variables were READONLY variables then; now they'll
|
|
just die when assigned to in the default engine.
|
|
|
|
=head3 numbered_buff_LENGTH
|
|
|
|
I32 numbered_buff_LENGTH (pTHX_
|
|
REGEXP * const rx,
|
|
const SV * const sv,
|
|
const I32 paren);
|
|
|
|
Get the C<length> of a capture variable. There's a special callback
|
|
for this so that Perl doesn't have to do a FETCH and run C<length> on
|
|
the result, since the length is (in Perl's case) known from an offset
|
|
stored in C<< rx->offs >>, this is much more efficient:
|
|
|
|
I32 s1 = rx->offs[paren].start;
|
|
I32 s2 = rx->offs[paren].end;
|
|
I32 len = t1 - s1;
|
|
|
|
This is a little bit more complex in the case of UTF-8, see what
|
|
C<Perl_reg_numbered_buff_length> does with
|
|
L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>.
|
|
|
|
=head2 Named capture callbacks
|
|
|
|
Called to get/set the value of C<%+> and C<%->, as well as by some
|
|
utility functions in L<re>.
|
|
|
|
There are two callbacks, C<named_buff> is called in all the cases the
|
|
FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks
|
|
would be on changes to C<%+> and C<%-> and C<named_buff_iter> in the
|
|
same cases as FIRSTKEY and NEXTKEY.
|
|
|
|
The C<flags> parameter can be used to determine which of these
|
|
operations the callbacks should respond to. The following flags are
|
|
currently defined:
|
|
|
|
Which L<Tie::Hash> operation is being performed from the Perl level on
|
|
C<%+> or C<%+>, if any:
|
|
|
|
RXapif_FETCH
|
|
RXapif_STORE
|
|
RXapif_DELETE
|
|
RXapif_CLEAR
|
|
RXapif_EXISTS
|
|
RXapif_SCALAR
|
|
RXapif_FIRSTKEY
|
|
RXapif_NEXTKEY
|
|
|
|
=for apidoc Amnh ||RXapif_ALL
|
|
=for apidoc_item RXapif_CLEAR
|
|
=for apidoc_item RXapif_DELETE
|
|
=for apidoc_item RXapif_EXISTS
|
|
=for apidoc_item RXapif_FETCH
|
|
=for apidoc_item RXapif_FIRSTKEY
|
|
=for apidoc_item RXapif_NEXTKEY
|
|
=for apidoc_item RXapif_ONE
|
|
=for apidoc_item RXapif_REGNAME
|
|
=for apidoc_item RXapif_REGNAMES
|
|
=for apidoc_item RXapif_REGNAMES_COUNT
|
|
=for apidoc_item RXapif_SCALAR
|
|
=for apidoc_item RXapif_STORE
|
|
|
|
If C<%+> or C<%-> is being operated on, if any.
|
|
|
|
RXapif_ONE /* %+ */
|
|
RXapif_ALL /* %- */
|
|
|
|
If this is being called as C<re::regname>, C<re::regnames> or
|
|
C<re::regnames_count>, if any. The first two will be combined with
|
|
C<RXapif_ONE> or C<RXapif_ALL>.
|
|
|
|
RXapif_REGNAME
|
|
RXapif_REGNAMES
|
|
RXapif_REGNAMES_COUNT
|
|
|
|
|
|
Internally C<%+> and C<%-> are implemented with a real tied interface
|
|
via L<Tie::Hash::NamedCapture>. The methods in that package will call
|
|
back into these functions. However the usage of
|
|
L<Tie::Hash::NamedCapture> for this purpose might change in future
|
|
releases. For instance this might be implemented by magic instead
|
|
(would need an extension to mgvtbl).
|
|
|
|
=head3 named_buff
|
|
|
|
SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key,
|
|
SV * const value, U32 flags);
|
|
|
|
=head3 named_buff_iter
|
|
|
|
SV* (*named_buff_iter) (pTHX_
|
|
REGEXP * const rx,
|
|
const SV * const lastkey,
|
|
const U32 flags);
|
|
|
|
=head2 qr_package
|
|
|
|
SV* qr_package(pTHX_ REGEXP * const rx);
|
|
|
|
The package the qr// magic object is blessed into (as seen by C<ref
|
|
qr//>). It is recommended that engines change this to their package
|
|
name for identification regardless of if they implement methods
|
|
on the object.
|
|
|
|
The package this method returns should also have the internal
|
|
C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always
|
|
be true regardless of what engine is being used.
|
|
|
|
Example implementation might be:
|
|
|
|
SV*
|
|
Example_qr_package(pTHX_ REGEXP * const rx)
|
|
{
|
|
PERL_UNUSED_ARG(rx);
|
|
return newSVpvs("re::engine::Example");
|
|
}
|
|
|
|
Any method calls on an object created with C<qr//> will be dispatched to the
|
|
package as a normal object.
|
|
|
|
use re::engine::Example;
|
|
my $re = qr//;
|
|
$re->meth; # dispatched to re::engine::Example::meth()
|
|
|
|
To retrieve the C<REGEXP> object from the scalar in an XS function use
|
|
the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP
|
|
Functions>.
|
|
|
|
void meth(SV * rv)
|
|
PPCODE:
|
|
REGEXP * re = SvRX(sv);
|
|
|
|
=head2 dupe
|
|
|
|
void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param);
|
|
|
|
On threaded builds a regexp may need to be duplicated so that the pattern
|
|
can be used by multiple threads. This routine is expected to handle the
|
|
duplication of any private data pointed to by the C<pprivate> member of
|
|
the C<regexp> structure. It will be called with the preconstructed new
|
|
C<regexp> structure as an argument, the C<pprivate> member will point at
|
|
the B<old> private structure, and it is this routine's responsibility to
|
|
construct a copy and return a pointer to it (which Perl will then use to
|
|
overwrite the field as passed to this routine.)
|
|
|
|
This allows the engine to dupe its private data but also if necessary
|
|
modify the final structure if it really must.
|
|
|
|
On unthreaded builds this field doesn't exist.
|
|
|
|
=head2 op_comp
|
|
|
|
This is private to the Perl core and subject to change. Should be left
|
|
null.
|
|
|
|
=head1 The REGEXP structure
|
|
|
|
The REGEXP struct is defined in F<regexp.h>.
|
|
All regex engines must be able to
|
|
correctly build such a structure in their L</comp> routine.
|
|
|
|
=for apidoc Ayh||struct regexp
|
|
=for apidoc Ayh||REGEXP
|
|
|
|
The REGEXP structure contains all the data that Perl needs to be aware of
|
|
to properly work with the regular expression. It includes data about
|
|
optimisations that Perl can use to determine if the regex engine should
|
|
really be used, and various other control info that is needed to properly
|
|
execute patterns in various contexts, such as if the pattern anchored in
|
|
some way, or what flags were used during the compile, or if the
|
|
program contains special constructs that Perl needs to be aware of.
|
|
|
|
In addition it contains two fields that are intended for the private
|
|
use of the regex engine that compiled the pattern. These are the
|
|
C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to
|
|
an arbitrary structure, whose use and management is the responsibility
|
|
of the compiling engine. Perl will never modify either of these
|
|
values.
|
|
|
|
/* copied from: regexp.h */
|
|
typedef struct regexp {
|
|
/*----------------------------------------------------------------------
|
|
* Fields required for compatibility with SV types
|
|
*/
|
|
XPV_HEAD_;
|
|
|
|
/*----------------------------------------------------------------------
|
|
* Operational fields
|
|
*/
|
|
const struct regexp_engine* engine; /* what engine created this regexp? */
|
|
REGEXP *mother_re; /* what re is this a lightweight copy of? */
|
|
HV *paren_names; /* Optional hash of paren names */
|
|
|
|
/*----------------------------------------------------------------------
|
|
* Information about the match that the perl core uses to manage things
|
|
*/
|
|
|
|
/* see comment in regcomp_internal.h about branch reset to understand
|
|
the distinction between physical and logical capture buffers */
|
|
U32 nparens; /* physical number of capture buffers */
|
|
U32 logical_nparens; /* logical_number of capture buffers */
|
|
I32 *logical_to_parno; /* map logical parno to first physical */
|
|
I32 *parno_to_logical; /* map every physical parno to logical */
|
|
I32 *parno_to_logical_next; /* map every physical parno to the next
|
|
physical with the same logical id */
|
|
|
|
SSize_t maxlen; /* maximum possible number of chars in string to match */
|
|
SSize_t minlen; /* minimum possible number of chars in string to match */
|
|
SSize_t minlenret; /* minimum possible number of chars in $& */
|
|
STRLEN gofs; /* chars left of pos that we search from */
|
|
/* substring data about strings that must appear in
|
|
* the final match, used for optimisations */
|
|
|
|
struct reg_substr_data *substrs;
|
|
|
|
/* private engine specific data */
|
|
|
|
void *pprivate; /* Data private to the regex engine which
|
|
* created this object. */
|
|
U32 extflags; /* Flags used both externally and internally */
|
|
U32 intflags; /* Engine Specific Internal flags */
|
|
|
|
/*----------------------------------------------------------------------
|
|
* Data about the last/current match. These are modified during matching
|
|
*/
|
|
|
|
U32 lastparen; /* highest close paren matched ($+) */
|
|
U32 lastcloseparen; /* last close paren matched ($^N) */
|
|
regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */
|
|
char **recurse_locinput; /* used to detect infinite recursion, XXX: move to internal */
|
|
|
|
|
|
/*---------------------------------------------------------------------- */
|
|
|
|
/* offset from wrapped to the start of precomp */
|
|
PERL_BITFIELD32 pre_prefix:4;
|
|
|
|
/* original flags used to compile the pattern, may differ from
|
|
* extflags in various ways */
|
|
PERL_BITFIELD32 compflags:9;
|
|
|
|
/*---------------------------------------------------------------------- */
|
|
|
|
char *subbeg; /* saved or original string so \digit works forever. */
|
|
SV_SAVED_COPY /* If non-NULL, SV which is COW from original */
|
|
SSize_t sublen; /* Length of string pointed by subbeg */
|
|
SSize_t suboffset; /* byte offset of subbeg from logical start of str */
|
|
SSize_t subcoffset; /* suboffset equiv, but in chars (for @-/@+) */
|
|
|
|
/*----------------------------------------------------------------------
|
|
* More Operational fields
|
|
*/
|
|
|
|
CV *qr_anoncv; /* the anon sub wrapped round qr/(?{..})/ */
|
|
} regexp;
|
|
|
|
Most of the fields contained in this structure are accessed via macros
|
|
with a prefix of C<RX_> or C<RXp_>. The fields are discussed in more detail
|
|
below:
|
|
|
|
=head2 C<engine>
|
|
|
|
This field points at a C<regexp_engine> structure which contains pointers
|
|
to the subroutines that are to be used for performing a match. It
|
|
is the compiling routine's responsibility to populate this field before
|
|
returning the regexp object.
|
|
|
|
Internally this is set to C<NULL> unless a custom engine is specified in
|
|
C<$^H{regcomp}>, Perl's own set of callbacks can be accessed in the struct
|
|
pointed to by C<RE_ENGINE_PTR>.
|
|
|
|
=for apidoc Amnh||SV_SAVED_COPY
|
|
|
|
=head2 C<mother_re>
|
|
|
|
This is a pointer to another struct regexp which this one was derived
|
|
from. C<qr//> objects means that the same regexp pattern can be used in
|
|
different contexts at the same time, and as long as match status
|
|
information is stored in the structure (there are plans to change this
|
|
eventually) we need to support having multiple copies of the structure
|
|
in use at the same time. The fields related to the regexp program itself
|
|
are copied from the mother_re, and owned by the mother_re, whereas the
|
|
match state variables are owned by the struct itself.
|
|
|
|
=head2 C<extflags>
|
|
|
|
This will be used by Perl to see what flags the regexp was compiled
|
|
with, this will normally be set to the value of the flags parameter by
|
|
the L<comp|/comp> callback. See the L<comp|/comp> documentation for
|
|
valid flags.
|
|
|
|
=head2 C<minlen> C<minlenret>
|
|
|
|
The minimum string length (in characters) required for the pattern to match.
|
|
This is used to
|
|
prune the search space by not bothering to match any closer to the end of a
|
|
string than would allow a match. For instance there is no point in even
|
|
starting the regex engine if the minlen is 10 but the string is only 5
|
|
characters long. There is no way that the pattern can match.
|
|
|
|
C<minlenret> is the minimum length (in characters) of the string that would
|
|
be found in $& after a match.
|
|
|
|
The difference between C<minlen> and C<minlenret> can be seen in the
|
|
following pattern:
|
|
|
|
/ns(?=\d)/
|
|
|
|
where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
|
|
required to match but is not actually
|
|
included in the matched content. This
|
|
distinction is particularly important as the substitution logic uses the
|
|
C<minlenret> to tell if it can do in-place substitutions (these can
|
|
result in considerable speed-up).
|
|
|
|
=head2 C<gofs>
|
|
|
|
Left offset from pos() to start match at.
|
|
|
|
=head2 C<substrs>
|
|
|
|
Substring data about strings that must appear in the final match. This
|
|
is currently only used internally by Perl's engine, but might be
|
|
used in the future for all engines for optimisations.
|
|
|
|
=head2 C<nparens>, C<logical_nparens>
|
|
|
|
|
|
These fields are used to keep track of the number of physical and logical
|
|
paren capture groups there are in the pattern, which may differ if the
|
|
pattern includes the use of the branch reset construct C<(?| ... | ... )>.
|
|
For instance the pattern C</(?|(foo)|(bar))/> contains two physical capture
|
|
buffers, but only one logical capture buffer. Most internals logic in the
|
|
regex engine uses the physical capture buffer ids, but the user exposed
|
|
logic uses logical capture buffer ids. See the next section for data-structures
|
|
that allow mapping from one to the other.
|
|
|
|
=head2 C<logical_to_parno>, C<parno_to_logical>, C<parno_to_logical_next>
|
|
|
|
These fields facilitate mapping between logical and physical capture
|
|
buffer numbers. C<logical_to_parno> is an array whose Kth element
|
|
contains the lowest physical capture buffer id for the Kth logical
|
|
capture buffer. C<parno_to_logical> is an array whose Kth element
|
|
contains the logical capture buffer associated with the Kth physical
|
|
capture buffer. C<parno_to_logical_next> is an array whose Kth element
|
|
contains the next physical capture buffer with the same logical id, or 0
|
|
if there is none.
|
|
|
|
Note that all three of these arrays are ONLY populated when the pattern
|
|
includes the use of the branch reset concept. Patterns which do not use
|
|
branch-reset effectively have a 1:1 to mapping between logical and
|
|
physical so there is no need for this meta-data.
|
|
|
|
The following table gives an example of how this works.
|
|
|
|
Pattern /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/
|
|
Logical: $1 $2 $3 $4 $2 $3 $2 $5
|
|
Physical: 1 2 3 4 5 6 7 8
|
|
Next: 0 5 6 0 7 0 0 0
|
|
|
|
Also note that the 0th element of any of these arrays is not used as it
|
|
represents the "entire pattern".
|
|
|
|
=head2 C<lastparen>, and C<lastcloseparen>
|
|
|
|
These fields are used to keep track of: which was the highest paren to
|
|
be closed (see L<perlvar/$+>); and which was the most recent paren to be
|
|
closed (see L<perlvar/$^N>).
|
|
|
|
=head2 C<intflags>
|
|
|
|
The engine's private copy of the flags the pattern was compiled with. Usually
|
|
this is the same as C<extflags> unless the engine chose to modify one of them.
|
|
|
|
=head2 C<pprivate>
|
|
|
|
A void* pointing to an engine-defined
|
|
data structure. The Perl engine uses the
|
|
C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
|
|
engine should use something else.
|
|
|
|
=head2 C<offs>
|
|
|
|
A C<regexp_paren_pair> structure which defines offsets into the string being
|
|
matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures, the
|
|
C<regexp_paren_pair> struct is defined as follows:
|
|
|
|
typedef struct regexp_paren_pair {
|
|
I32 start;
|
|
I32 end;
|
|
} regexp_paren_pair;
|
|
|
|
=for apidoc Ayh||regexp_paren_pair
|
|
|
|
If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
|
|
capture group did not match.
|
|
C<< ->offs[0].start/end >> represents C<$&> (or
|
|
C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where
|
|
C<$paren >= 1>.
|
|
|
|
=head2 C<RX_PRECOMP> C<RX_PRELEN>
|
|
|
|
Used for optimisations. C<RX_PRECOMP> holds a copy of the pattern that
|
|
was compiled and C<RX_PRELEN> its length. When a new pattern is to be
|
|
compiled (such as inside a loop) the internal C<regcomp> operator
|
|
checks if the last compiled C<REGEXP>'s C<RX_PRECOMP> and C<RX_PRELEN>
|
|
are equivalent to the new one, and if so uses the old pattern instead
|
|
of compiling a new one.
|
|
|
|
In older perls these two macros were actually fields in the structure
|
|
with the names C<precomp> and C<prelen> respectively.
|
|
|
|
=head2 C<paren_names>
|
|
|
|
This is a hash used internally to track named capture groups and their
|
|
offsets. The keys are the names of the buffers the values are dualvars,
|
|
with the IV slot holding the number of buffers with the given name and the
|
|
pv being an embedded array of I32. The values may also be contained
|
|
independently in the data array in cases where named backreferences are
|
|
used.
|
|
|
|
=head2 C<substrs>
|
|
|
|
Holds information on the longest string that must occur at a fixed
|
|
offset from the start of the pattern, and the longest string that must
|
|
occur at a floating offset from the start of the pattern. Used to do
|
|
Fast-Boyer-Moore searches on the string to find out if its worth using
|
|
the regex engine at all, and if so where in the string to search.
|
|
|
|
=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>
|
|
|
|
Used during the execution phase for managing search and replace patterns,
|
|
and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
|
|
buffer (either the original string, or a copy in the case of
|
|
C<RX_MATCH_COPIED(rx_sv)>), and C<sublen> is the length of the buffer. The
|
|
C<RX_OFFS_START(rx_sv,n)> and C<RX_OFFS_END(rx_sv,n)> macros index into this
|
|
buffer. as does the data structure returned by C<RX_OFFSp(rx_sv)> but you
|
|
should not use that directly.
|
|
|
|
=for apidoc Amh||RX_MATCH_COPIED|const REGEXP * rx_sv
|
|
|
|
In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
|
|
the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
|
|
can choose not to copy the full buffer (although it must still do so in
|
|
the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
|
|
C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
|
|
number of bytes from the logical start of the buffer to the physical start
|
|
(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
|
|
characters in the offset. The latter is needed to support C<@-> and C<@+>
|
|
which work in characters, not bytes.
|
|
|
|
=for apidoc Amnh ||REXEC_COPY_SKIP_POST
|
|
=for apidoc_item ||REXEC_COPY_SKIP_PRE
|
|
=for apidoc_item ||REXEC_COPY_STR
|
|
|
|
=head2 C<RX_WRAPPED> C<RX_WRAPLEN>
|
|
|
|
Macros which access the string the C<qr//> stringifies to. The Perl
|
|
engine for example stores C<(?^:eek)> in the case of C<qr/eek/>.
|
|
|
|
When using a custom engine that doesn't support the C<(?:)> construct
|
|
for inline modifiers, it's probably best to have C<qr//> stringify to
|
|
the supplied pattern, note that this will create undesired patterns in
|
|
cases such as:
|
|
|
|
my $x = qr/a|b/; # "a|b"
|
|
my $y = qr/c/i; # "c"
|
|
my $z = qr/$x$y/; # "a|bc"
|
|
|
|
There's no solution for this problem other than making the custom
|
|
engine understand a construct like C<(?:)>.
|
|
|
|
=head2 C<RX_REFCNT()>
|
|
|
|
The number of times the structure is referenced. When this falls to 0,
|
|
the regexp is automatically freed by a call to C<pregfree>. This should
|
|
be set to 1 in each engine's L</comp> routine. Note that in older perls
|
|
this was a member in the struct called C<refcnt> but in more modern
|
|
perls where the regexp structure was unified with the SV structure this
|
|
is an alias to SvREFCNT().
|
|
|
|
=head1 HISTORY
|
|
|
|
Originally part of L<perlreguts>.
|
|
|
|
=head1 AUTHORS
|
|
|
|
Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
|
|
Bjarmason.
|
|
|
|
=head1 LICENSE
|
|
|
|
Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.
|
|
|
|
This program is free software; you can redistribute it and/or modify it under
|
|
the same terms as Perl itself.
|
|
|
|
=cut
|