468 Commits

Author SHA1 Message Date
Ricardo Signes
14d04a3346 update the editor hints for spaces, not tabs
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
2012-05-29 21:53:17 -04:00
Karl Williamson
a027039367 utf8.c: Add nomix-ASCII option to to_fold functions
Under /iaa regex matching, folds that cross the ASCII/non-ASCII
boundary are prohibited.  This changes _to_uni_fold_flags() and
_to_utf8_fold_flags() functions to take a new flag which, when set,
tells them to not accept such folds.

This allows us to later move the intelligence for handling this
situation to these centralized functions.
2012-05-22 08:24:21 -06:00
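The flag described in a027039367 can be pictured with a toy, single-code-point fold. This is only a sketch: the names `FOLD_NOMIX_ASCII`, `toy_fold` and `fold_with_flags` are hypothetical, the table holds just three illustrative mappings, and real folds may expand to several code points, which is not modelled here.

```c
#include <stdint.h>

#define FOLD_NOMIX_ASCII 0x1u   /* hypothetical flag name */

/* Toy fold table: just enough entries to show the idea. */
static uint32_t toy_fold(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 32;  /* ASCII upper -> lower */
    if (cp == 0x212A) return 'k';                /* KELVIN SIGN -> k */
    if (cp == 0x0391) return 0x03B1;             /* GREEK ALPHA -> alpha */
    return cp;
}

/* Returns the fold of cp, or cp itself when the flag is set and the fold
 * would cross the ASCII/non-ASCII boundary in either direction. */
uint32_t fold_with_flags(uint32_t cp, unsigned flags)
{
    uint32_t folded = toy_fold(cp);

    if ((flags & FOLD_NOMIX_ASCII) && ((cp < 128) != (folded < 128)))
        return cp;
    return folded;
}
```

With the flag clear, the Kelvin sign folds to "k"; with it set, the fold is refused and the caller gets the original code point back.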
Karl Williamson
50ba90ffe5 utf8.c: Add assertion 2012-05-22 08:24:21 -06:00
Karl Williamson
4190d317a5 utf8.c: Re-order if branches for speed
Probably the C optimizer does this anyway, but do the uncomplicated test
before the (mutually exclusive) complicated test (though the
complications are hidden in a macro).  The new first test is a
pre-requisite for the new 2nd test anyway.
2012-05-22 08:24:20 -06:00
Karl Williamson
4230354419 utf8.c: Add comment 2012-05-22 08:24:20 -06:00
Karl Williamson
b5b9af0457 utf8n_to_uvuni(): Add a few compiler hints
Tell the compiler that malformed input is not likely, so it can optimize
accordingly.
2012-05-22 08:24:19 -06:00
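A branch hint of the kind mentioned in b5b9af0457 might look like the following. This is a sketch, not the code from the commit: `MY_UNLIKELY` and `count_start_bytes` are made-up names, and the hint relies on the GCC/Clang `__builtin_expect` builtin, falling back to a plain test elsewhere.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_UNLIKELY(x) (x)
#endif

/* Returns the number of bytes that are not continuation bytes (10xxxxxx),
 * or -1 if the buffer begins with a continuation byte.  That malformation
 * is hinted as rare so the compiler can favour the normal path. */
long count_start_bytes(const unsigned char *s, size_t len)
{
    long n = 0;
    size_t i;

    if (len > 0 && MY_UNLIKELY((s[0] & 0xC0) == 0x80))
        return -1;

    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

The hint changes only code layout and branch prediction assumptions; the result is the same with or without it.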
Karl Williamson
2ff6c1911c utf8.c: Skip extraneous function call
This eliminates an intermediate function call by calling the base level
one directly.
2012-05-22 08:24:19 -06:00
Karl Williamson
3986bb7cb3 utf8.c: Remove unnecessary validation
These two functions are to be called only on strings known to be valid,
so we can skip the validation.
2012-05-22 08:24:19 -06:00
Karl Williamson
979f77b669 utf8.c: Extra branch to avoid others in the typical case
This test eliminates all code points less than U+D800 from having to be
checked more than once, at the expense of an extra test for code points
that are larger.
2012-05-22 08:24:19 -06:00
Karl Williamson
2f8f112e03 utf8n_to_uvuni(): Fix broken malformation interactions
All code points whose UTF-8 representations start with a byte containing
either \xFE or \xFF are considered problematic because they are not
portable.  There are many such code points that are too large to
represent on a 32 or even a 64 bit platform.  Commit
eb83ed87110e41de6a4cd4463f75df60798a9243 failed to properly catch
overflow when the input flags to this function say to warn on, but
otherwise accept FE and FF sequences.  Now overflow is checked for
unconditionally.
2012-05-01 19:08:57 -04:00
Karl Williamson
cd7e6c884f is_utf8_char_slow(): Avoid accepting overlongs
There are possible overlong sequences that this function blindly
accepts.  Instead of developing the code to figure this out, turn this
function into a wrapper for utf8n_to_uvuni() which already has this
check.
2012-04-26 11:58:57 -06:00
Karl Williamson
524080c4d3 perlapi: Update for changes in utf8 decoding 2012-04-26 11:58:57 -06:00
Karl Williamson
f555bc6353 utf8.c: White-space only
This outdents to account for the removal of a surrounding block.
2012-04-26 11:58:57 -06:00
Karl Williamson
eb83ed8711 utf8.c: refactor utf8n_to_uvuni()
The prior version had a number of issues, some of which have been taken
care of in previous commits.

The goal when presented with malformed input is to consume as few bytes
as possible, so as to position the input for the next try to the first
possible byte that could be the beginning of a character.  We don't want
to consume too few bytes, so that the next call has us thinking that
what is the middle of a character is really the beginning; nor do we
want to consume too many, so as to skip valid input characters.  (This
is forbidden by the Unicode standard because of security
considerations.)  The previous code could do both of these under various
circumstances.

In some cases it took as a given that the first byte in a character is
correct, and skipped looking at the rest of the bytes in the sequence.
This is wrong when just that first byte is garbled.  We have to look at
all bytes in the expected sequence to make sure it hasn't been
prematurely terminated from what we were led to expect by that first
byte.

Likewise when we get an overflow: we have to keep looking at each byte
in the sequence.  It may be that the initial byte was garbled, so that
it appeared that there was going to be overflow, but in reality, the
input was supposed to be a shorter sequence that doesn't overflow.  We
want to have an error on that shorter sequence, and advance the pointer
to just beyond it, which is the first position where a valid character
could start.

This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.

And, the old algorithm for finding overflow failed to detect it on some
inputs.  This was spotted by Hugo van der Sanden, who suggested the new
algorithm that this commit uses, and which should work in all instances.
For example, on a 32-bit machine, any string beginning with "\xFE" and
having the next byte be either "\x86" or "\x87" overflows, but this was
missed by the old algorithm.

Another bug was that the code was careless about what happens when a
malformation occurs that the input flags allow. For example, a sequence
should not start with a continuation byte.  If that malformation is
allowed, the code pretended it is a start byte and extracts the "length"
of the sequence from it.  But pretending it is a start byte is not the
same thing as it actually being a start byte, and so there is no
extractable length in it, so the number that this code thought was
"length" was bogus.

Yet another bug fixed is that if only the warning subcategories of the
utf8 category were turned on, and not the entire utf8 category itself,
warnings were not raised that should have been.

And yet another change is that given malformed input with warnings
turned off, this function used to return whatever it had computed so
far, which is incomplete or erroneous garbage.  This commit changes to
return the REPLACEMENT CHARACTER instead.

Thanks to Hugo van der Sanden for reviewing and finding problems with an
earlier version of these commits.
2012-04-26 11:58:57 -06:00
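The behaviour described in eb83ed8711 (look at every expected byte, stop at the first byte that could begin a new character, keep scanning on overflow, and return the REPLACEMENT CHARACTER for malformed input) can be sketched roughly as follows. This is an illustration under stated assumptions, not the refactored function: `implied_len` and `decode_one` are hypothetical names, the accumulator is 32 bits wide, 0xFF start bytes are not modelled, and overlongs and surrogates are not checked. The overflow test shown (top six bits set before shifting in six more) is one way to do it and is not claimed to be the algorithm the commit adopted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper: sequence length implied by a start byte, or 0 if the
 * byte cannot start a sequence in this simplified model. */
static size_t implied_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;   /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    if (b == 0xFE) return 7;
    return 0;                 /* 0xFF: not modelled here */
}

/* Decodes one character from s (avail bytes available).  *consumed is set
 * so that s + *consumed is the first position where a new character could
 * begin. */
uint32_t decode_one(const uint8_t *s, size_t avail, size_t *consumed)
{
    size_t want, i;
    uint32_t uv;
    int bad = 0;

    assert(s != NULL && consumed != NULL);
    if (avail == 0) { *consumed = 0; return REPLACEMENT_CHAR; }

    want = implied_len(s[0]);
    if (want <= 1) {          /* ASCII, or a byte that cannot start */
        *consumed = 1;
        return want == 1 ? s[0] : REPLACEMENT_CHAR;
    }

    uv = s[0] & (0x7Fu >> want);              /* payload bits of start byte */
    for (i = 1; i < want; i++) {
        if (i >= avail || (s[i] & 0xC0) != 0x80) {
            bad = 1;                          /* ended early: stop right here */
            break;
        }
        if (uv & 0xFC000000u)
            bad = 1;                          /* overflow, but keep scanning */
        uv = (uv << 6) | (s[i] & 0x3Fu);
    }
    *consumed = i;
    return bad ? REPLACEMENT_CHAR : uv;
}
```

For the bytes E2 82 41 the sketch reports a malformation and consumes two bytes, leaving the pointer at the "A" so that the next call decodes it normally.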
Karl Williamson
0b8d30e8ba utf8n_to_uvuni: Avoid reading outside of buffer
Prior to this patch, if the first byte of a UTF-8 sequence indicated
that the sequence occupied n bytes, but the input parameters indicated
that fewer were available, the function attempted to read all n.
2012-04-26 11:58:57 -06:00
Karl Williamson
746afd533c utf8.c: Clarify and correct pod
Some of these were spotted by Hugo van der Sanden
2012-04-26 11:58:56 -06:00
Karl Williamson
99ee1dcd04 utf8.c: Use macros instead of if..else.. sequence
There are two existing macros that do the job that this longish sequence
does.  One, UTF8SKIP(), does an array lookup and is very likely to be in
the machine's cache as it is used ubiquitously when processing UTF-8.
The other is a simple test and shift.  These simplify the code and
should speed things up as well.
2012-04-26 11:58:56 -06:00
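The two kinds of macro mentioned in 99ee1dcd04 (an array lookup keyed by the first byte, and a simple test and shift) can be sketched like this. The names `skip_table`, `init_skip_table`, `MY_SKIP` and `MY_START_MASK` are hypothetical stand-ins rather than the real definitions, and the lengths assigned to 0xFE and 0xFF are assumptions about Perl's extended encoding.

```c
#include <stddef.h>

static unsigned char skip_table[256];

/* Fill the lookup table once; afterwards a length is a single array read. */
void init_skip_table(void)
{
    int b;

    for (b = 0; b < 256; b++) {
        if (b < 0xC0)       skip_table[b] = 1;  /* ASCII and continuations */
        else if (b < 0xE0)  skip_table[b] = 2;
        else if (b < 0xF0)  skip_table[b] = 3;
        else if (b < 0xF8)  skip_table[b] = 4;
        else if (b < 0xFC)  skip_table[b] = 5;
        else if (b < 0xFE)  skip_table[b] = 6;
        else if (b == 0xFE) skip_table[b] = 7;  /* assumed */
        else                skip_table[b] = 13; /* assumed */
    }
}

/* Length of the sequence that starts at s, by table lookup. */
#define MY_SKIP(s) (skip_table[*(const unsigned char *)(s)])

/* Mask selecting the payload bits of a start byte, for len >= 2. */
#define MY_START_MASK(len) (((len) >= 7) ? 0x00 : (0x1F >> ((len) - 2)))
```

After `init_skip_table()` has run, `s[0] & MY_START_MASK(MY_SKIP(s))` yields the payload bits of a multi-byte start byte without any if..else chain.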
Karl Williamson
9d50113356 PATCH: [perl #112530] Panic with inversion lists
The code assumed that all property definitions would be well-formed,
meaning, in part, that they would be numerically sorted by code point,
with each range disjoint from all others.  So, the code was just
appending each range as it is found to the inversion list it is
building.

This assumption is true for all definitions generated by mktables, but
it might not be true for user-defined ones.  The solution is merely to
change from calling the function that appends to instead call the
existing function that handles the more general case.

However, that function was not previously used outside the file it was
defined in, so must now be made public.  Also, this whole interface is
considered volatile, so the names of the public functions in it begin
with an underscore to further discourage XS writers from using them.
Therefore the more general add function is renamed to begin with an
underscore.

And, the append function is no longer needed outside the file it is
defined in, so again to keep XS writers from using it, this commit makes
it static.
2012-04-23 11:01:02 -06:00
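The difference between appending to an inversion list and the more general add, as discussed in 9d50113356, can be sketched as below. An inversion list here is a sorted array in which elements alternately start and end a range. The type and function names are hypothetical, the capacity is fixed for brevity, and `end` is assumed to be below `UINT32_MAX`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INVLIST_MAX 64

typedef struct {
    uint32_t v[INVLIST_MAX];  /* v[even] starts a range, v[odd] ends it */
    size_t n;
} invlist;

/* Fast path: only valid when start lies beyond everything stored so far. */
int invlist_append_range(invlist *l, uint32_t start, uint32_t end)
{
    assert(l != NULL && start <= end);
    if (l->n > 0 && start <= l->v[l->n - 1]) return -1;
    if (l->n + 2 > INVLIST_MAX) return -1;
    l->v[l->n++] = start;
    l->v[l->n++] = end + 1;
    return 0;
}

/* General case: union with [start, end], in any order, overlapping or not. */
int invlist_add_range(invlist *l, uint32_t start, uint32_t end)
{
    uint32_t out[INVLIST_MAX + 2];
    uint32_t stop = end + 1;
    size_t lo = 0, hi, k = 0, i;

    assert(l != NULL && start <= end);
    while (lo < l->n && l->v[lo] < start) lo++;   /* elements below start */
    hi = lo;
    while (hi < l->n && l->v[hi] <= stop) hi++;   /* elements up to stop */

    for (i = 0; i < lo; i++) out[k++] = l->v[i];
    if (lo % 2 == 0) out[k++] = start;  /* start was outside the set */
    if (hi % 2 == 0) out[k++] = stop;   /* stop is outside the set */
    for (i = hi; i < l->n; i++) out[k++] = l->v[i];

    if (k > INVLIST_MAX) return -1;
    memcpy(l->v, out, k * sizeof out[0]);
    l->n = k;
    return 0;
}
```

Adding 30..39 and then 10..19 works with the general function, whereas the append function refuses the second call because its input is out of order.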
Karl Williamson
a099aed4f4 PATCH: [perl #111338] Warnings in utf8 subcategories do nothing in isolation
This was the result of assuming that these would not be on unless
the main category was also on.
2012-04-17 14:06:53 -06:00
Karl Williamson
39e518fd05 utf8.c: Add back inadvertently deleted pod text
This was deleted by mistake in commit
4b88fb76efce8c436e63b907c9842345d4fa77c7
2012-03-30 21:40:35 -06:00
Karl Williamson
6bd1c396b2 Remove more uses of utf8_to_uvchr()
Commit 4b88fb76efce8c436e63b907c9842345d4fa77c7 missed 2 occurrences of
this, one of which is #ifdef'd out.
2012-03-30 21:40:35 -06:00
Karl Williamson
977c1d31ff Deprecate utf8_to_uvchr() and utf8_to_uvuni()
These functions can read beyond the end of their input strings if
presented with malformed UTF-8 input.  Perl core code has been converted
to use other functions instead of these.
2012-03-19 18:23:44 -06:00
Karl Williamson
4b88fb76ef Use the new utf8 to code point functions
These functions should be used in preference to the old ones which can
read beyond the end of the input string.
2012-03-19 18:23:44 -06:00
Karl Williamson
27d6c58a7e utf8.c: Add valid_utf8_to_uvuni() and valid_utf8_to_uvchr()
These functions are like utf8_to_uvuni() and utf8_to_uvchr(), but their
name implies that the input UTF-8 has been validated.

They are not currently documented, as it's best for XS writers to call
the functions that do validation.
2012-03-19 18:23:44 -06:00
Karl Williamson
ec5f19d099 utf8.c: Add utf8_to_uvchr_buf() and utf8_to_uvuni_buf()
The existing functions (utf8_to_uvchr and utf8_to_uvuni) have a
deficiency in that they could read beyond the end of the input string if
given malformed input.  This commit creates two new functions which
behave as the old ones did, but have an extra parameter each, which
gives the upper limit to the string, so no read beyond it is done.
2012-03-19 18:23:44 -06:00
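The extra upper-limit parameter introduced in ec5f19d099 amounts to checking the length claimed by the first byte against the bytes actually available before reading any of them. A minimal sketch of that check, with hypothetical names (`first_byte_len`, `char_len_buf`) and a simplified length rule covering only 1 to 4 byte sequences:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length implied by the first byte.  As a
 * simplification, stray continuation bytes and bytes of 0xF8 and above
 * count as length 1. */
static size_t first_byte_len(unsigned char b)
{
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 1;
}

/* Returns the number of bytes the character at s claims to occupy, or 0 if
 * s is at or past send, or the claimed length would run past send. */
size_t char_len_buf(const unsigned char *s, const unsigned char *send)
{
    size_t len;

    assert(s != NULL && send != NULL);
    if (s >= send)
        return 0;
    len = first_byte_len(*s);
    if (len > (size_t)(send - s))
        return 0;             /* would read beyond the end: refuse */
    return len;
}
```

A caller that gets 0 back knows not to touch the bytes at all, which is the guarantee the older two-argument style could not give.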
Karl Williamson
d0460f306d utf8.c: pod clarification 2012-03-19 18:23:44 -06:00
Karl Williamson
a1433954f5 utf8.c: pod (mostly formatting) + comments changes 2012-03-19 18:23:44 -06:00
Karl Williamson
2e2b25717d perl #77654: quotemeta quotes non-ASCII consistently
As described in the pod changes in this commit, this changes quotemeta()
to consistently quote non-ASCII characters when used under
unicode_strings.  The behavior is changed for these and UTF-8 encoded
strings to more closely align with Unicode's recommendations.

The end result is that we *could* at some future point start using other
characters as metacharacters than the 12 we do now.
2012-02-15 18:02:35 -07:00
Karl Williamson
f7d739d151 is_utf8_char_slow(): Make consistent, correct docs.
This function is only used by the Perl core for very large code points,
though it is designed to be able to be used for all code points.

For any variant code points, it doesn't succeed unless the passed in
length is exactly the same as the number of bytes the code point
occupies.  The documentation says it succeeds if the length is at least
that number.  This commit updates the documentation to match the
behavior.

Also, for an invariant code point, it succeeds no matter what the
passed-in length says.  This commit changes this to be consistent with
the behavior for all other code points.
2012-02-13 13:42:54 -07:00
Karl Williamson
768483871f Deprecate is_utf8_char()
This function assumes that there is enough space in the buffer to read
however many bytes are indicated by the first byte in the alleged UTF-8
encoded string.  This may not be true, and so it can read beyond the
buffer end.  is_utf8_char_buf() should be used instead.
2012-02-11 14:35:46 -07:00
Karl Williamson
492a624f4a Add is_utf8_char_buf()
This function is to replace is_utf8_char(), and requires an extra
parameter to ensure that it doesn't read beyond the end of the buffer.

Convert is_utf8_char(), and the only place in the Perl core that calls
it, to use the new function, assuming in each case that there is enough
space.

Thanks to Jarkko Hietaniemi for suggesting this function name
2012-02-11 14:35:46 -07:00
Karl Williamson
d11155ec2b Unicode::UCD::prop_invmap(): New improved API
Thanks to Tony Cook for suggesting this.

The API is changed from returning deltas of code points, to storing the
actual correct values, but requiring adjustments for the non-initial
elements in a range, as explained in the pod.

This makes the data less confusing to look at, and gets rid of
inconsistencies if we didn't make the same sort of deltas for entries
that were, e.g. arrays of code points.
2012-02-10 15:54:26 -07:00
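The "actual value plus adjustment" scheme described in d11155ec2b can be illustrated with a pair of parallel arrays. This is a toy in C rather than the Perl API: the arrays, the function name `toy_lower`, and the convention that a map value of 0 means "maps to itself" are assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Each toy_list[i] starts a range; toy_map[i] is the mapping of the FIRST
 * code point in that range.  0 is taken to mean "maps to itself". */
static const uint32_t toy_list[] = { 0x00, 0x41, 0x5B };
static const uint32_t toy_map[]  = { 0,    0x61, 0    };

uint32_t toy_lower(uint32_t cp)
{
    size_t n = sizeof toy_list / sizeof toy_list[0];
    size_t i = 0;

    /* Find the range containing cp (linear for brevity). */
    while (i + 1 < n && toy_list[i + 1] <= cp)
        i++;

    if (toy_map[i] == 0)
        return cp;

    /* Non-initial elements are adjusted by their offset into the range. */
    return toy_map[i] + (cp - toy_list[i]);
}
```

The stored value for the range starting at "A" is "a" itself rather than a delta of 32, and "C" is obtained by adding its offset of 2 within the range.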
Karl Williamson
ea317ccb31 regcomp.c: Use compiled-in inversion lists
This uses the compiled inversion lists to generate Posix character
classes and things like \v, \s inside bracketed character classes.

This paves the way for future optimizations, and fixes the bug which has
no formal bug number that /[[:ascii:]]/i matched non-ASCII characters,
such as the Kelvin sign, unlike /\p{ascii}/i.
2012-02-09 10:13:58 -07:00
Karl Williamson
a9d188b349 utf8.c: white-space only
This adds an indent now that the code is in a newly created block
2012-02-04 16:29:32 -07:00
Karl Williamson
f90a9a0230 utf8.c: Use the new compact case mapping tables
This changes the Perl core when looking up the
upper/lower/title/fold-case of a code point to use the newly created
more compact tables.  Currently the look-up is done by a linear search,
and the new tables are 54-61% of the size of the old ones, so that on
average searches are that much shorter.
2012-02-04 16:29:32 -07:00
Karl Williamson
cdc18eb6b4 mktables: Add duplicate tables
This is for backwards compatibility.  Future commits will change these
tables that are generated by mktables to be more efficient.  But the
existence of them was advertised in v5.12 and v5.14, as something a Perl
program could use because the Perl core did not provide access to their
contents.  We can't change the format of those without some notice.

The solution adopted is to have two versions of the tables, one kept in
the original file name has the original format; and the other is free to
change formats at will.

This commit just creates copies of the original, with the same format.
Later commits will change the format to be more efficient.

We state in v5.16 that using these files is now deprecated, as the
information is now available through Unicode::UCD in a stable API.  But
we don't test for whether someone is opening and reading these files; so
the deprecation cycle should be somewhat long;  they will be unused, and
the only drawbacks to having them are some extra disk space and the time
spent in having to generate them at Perl build time.

This commit also changes the Perl core to use the original tables, so
that the new format can be gradually developed in a series of patches
without having to cut over the whole thing at once.
2012-02-04 16:29:29 -07:00
Nicholas Clark
5637ef5b34 Provide as much diagnostic information as possible in "panic: ..." messages.
The convention is that when the interpreter dies with an internal error, the
message starts "panic: ". Historically, many panic messages had been terse
fixed strings, which means that the out-of-range values that triggered the
panic are lost. Now we try to report these values, as such panics may not be
repeatable, and the original error message may be the only diagnostic we get
when we try to find the cause.

We can't report diagnostics when the panic message is generated by something
other than croak(), as we don't have *printf-style format strings. Don't
attempt to report values in panics related to *printf buffer overflows, as
attempting to format the values to strings may repeat or compound the
original error.
2012-01-16 23:04:12 +01:00
Karl Williamson
e0aa61c655 utf8.c: fix typo in pod 2012-01-13 09:58:38 -07:00
Karl Williamson
88d45d285b regcomp.c: Optimize a single Unicode property in a [character class]
All Unicode properties actually turn into bracketed character classes,
whether explicitly done or not.  A swash is generated for each property
in the class.  If that is the only thing not in the class's bitmap, it
specifies completely the non-bitmap behavior of the class, and can be
passed explicitly to regexec.c.  This avoids having to regenerate the
swash.  It also means that the same swash is used for multiple instances
of a property.  And that means the number of duplicated data structures
is greatly reduced.  This currently doesn't extend to cases where
multiple Unicode properties are used in the same class:
[\p{greek}\p{latin}] will not share the same swash as another character
class with the same components.  This is because I don't know of an
efficient method to determine if a new class being parsed has the
same components as one already generated.  I suppose some sort of
checksum could be generated, but that is for future consideration.
2012-01-13 09:58:36 -07:00
Karl Williamson
69794297b0 utf8.c: White-space only
As a result of previous commits adding and removing if() {} blocks,
indent and outdent and reflow comments and statements to not exceed 80
columns.
2012-01-13 09:58:36 -07:00
Karl Williamson
9a53f6cf43 utf8.c: Add ability to pass inversion list to _core_swash_init()
Add a new parameter to _core_swash_init() that is an inversion list to
add to the swash, along with a boolean to indicate if this inversion
list is derived from a user-defined property.  This capability will prove
useful in future commits.
2012-01-13 09:58:35 -07:00
Karl Williamson
934970aa10 utf8.c: Add flag to swash_init() to not croak on error
This adds the capability, to be used in future commits, for swash_init()
to return NULL instead of croaking if it can't find a property, so that
the caller can choose how to handle the situation.
2012-01-13 09:58:35 -07:00
Karl Williamson
fd05e0032c utf8.c: Prevent reading before buffer start
Make sure there is something before the character being read before
reading it.
2012-01-13 09:58:34 -07:00
Karl Williamson
36eb48b449 utf8.c: Generate and use inversion lists for binary swashes
Prior to this patch, every time a code point was matched against a swash,
and the result was not previously known, a linear search through the
swash was performed.  This patch changes that to generate an inversion
list whenever a swash for a binary property is created.  A binary search
is then performed for missing values.

This change does not have much effect on the speed of Perl's regression
test suite, but the speed-up in worst-case scenarios is huge.  The
program at the end of this commit is crafted to avoid the caching that
hides much of the current inefficiencies.  At character classes of 100
isolated code points, the new method is about an order of magnitude
faster; two orders of magnitude at 1000 code points.  The program at the
end of this commit message took 97s to execute on my box using blead,
and 1.5 seconds using this new scheme.  I was surprised to see that even
with classes containing fewer than 10 code points, the binary search
trumped, by a little, the linear search.

Even after this patch, under the current scheme, one can easily run out
of memory due to the permanent storing of results of swash lookups in
hashes.  The new search mechanism might be fast enough to enable the
elimination of that memory usage.  Instead, a simple cache in each
inversion list that stored its previous result could be created, and
that checked to see if it's still valid before starting the search,
under the assumption, which the current scheme also makes, that probes
will tend to be clustered together, as nearby code points are often in
the same script.
===============================================
 # This program creates longer and longer character class lists while
 # testing code point matches against them.  By adding or subtracting
 # 65 from the previous member, caching of results is eliminated (as of
 # this writing), so this essentially tests for how long it takes to
 # search through swashes to see if a code point matches or not.

use Benchmark ':hireswallclock';

my $string = "";
my $class_cp = 2**30;   # Divide the code space in half, approx.
my $string_cp = $class_cp;
my $iterations = 10000;
for my $j (1..2048) {

    # Append the next character to the [class]
    my $hex_class_cp = sprintf("%X", $class_cp);
    $string .= "\\x{$hex_class_cp}";
    $class_cp -= 65;

    next if $j % 100 != 0;  # Only test certain ones

    print "$j: lowest is [$hex_class_cp]: ";

    timethis(1, "no warnings qw(portable non_unicode);my \$i = $string_cp; for (0 .. $iterations) { chr(\$i) =~ /[$string]/; \$i+= 65 }");
    $string_cp += ($iterations + 1) * 65;
}
2012-01-13 09:58:34 -07:00
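The binary search over an inversion list that 36eb48b449 describes can be sketched in a few lines. The function name and the sample list in the usage note are illustrative only. The list is a sorted array whose even-indexed elements start a range in the set and whose odd-indexed elements start a range outside it.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is in the set described by the inversion list, else 0.
 * The search finds how many list elements are <= cp; an odd count means
 * the most recent boundary opened a range, so cp is inside it. */
int invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}
```

For the list { 0x41, 0x5B, 0x61, 0x7B }, which covers A-Z and a-z, the search takes at most three probes, and the number of probes grows only logarithmically as the class gets longer.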
Karl Williamson
786861f559 utf8.c: Refactor code slightly in prep
Future commits will split up the necessary initialization into two
components.  This patch prepares for that without adding anything new.
2012-01-13 09:58:34 -07:00
Karl Williamson
c4a5db0c44 utf8.c: New function to retrieve non-copy of swash
Currently, swash_init returns a copy of the swash it finds.  The core
portions of the swash are read-only, and the non-read-only portions are
derived from them.  When the value for a code point is looked up, the
results for it and adjacent code points are stored in a new element,
so that the lookup never has to be performed again.  But since a copy is
returned, those results are stored only in the copy, and any other uses
of the same logical swash don't have access to them, so the lookups have
to be performed for each logical use.

Here's an example.  If you have 2 occurrences of /\p{Upper}/ in your
program, there are 2 different swashes created, both initialized
identically.  As you start matching against code points, say "A" =~
/\p{Upper}/, the swashes diverge, as the results for each match are
saved in the one applicable to that match.  If you match "A" in each
swash, it has to be looked up in each swash, and an (identical) element
will be saved for it in each swash.  This is wasteful of both time and
memory.

This patch renames the function and returns the original and not a copy,
thus eliminating the overhead for swashes accessed through the new
interface.  The old function name is serviced by a new function which
merely wraps the new name result with a copy, thus preserving the
interface for existing calls.

Thus, in the example above, there is only one swash, and matching "A"
against it results in only one new element, and so the second use will
find that, and not have to go out looking again.  In a program with lots
of regular expressions, the savings in time and memory can be quite
large.

The new name is restricted to use only in regcomp.c and utf8.c (unless
XS code cheats the preprocessor), where we will code so as to not
destroy the original's data.  Otherwise, a change to that would change
the definition of a Unicode property everywhere in the program.

Note that there are no current callers of the new interface; these will
be added in future commits.
2012-01-13 09:58:34 -07:00
Karl Williamson
b0e3252edb utf8.c: Change name of static function
This function has always confused me, as it doesn't return a swash, but
a swatch.
2012-01-13 09:58:33 -07:00
Karl Williamson
8ed25d5335 utf8.c: Move test out of loops
We set the upper limit of the loops before entering them to the min of
the two possible limits, thus avoiding a test each time through.
2012-01-13 09:58:33 -07:00
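The loop change in 8ed25d5335 is the familiar pattern of computing the smaller of two limits once, instead of testing both on every iteration. A generic sketch, with a made-up function name and not the code from utf8.c:

```c
#include <assert.h>
#include <stddef.h>

/* Copies bytes until either buffer is exhausted.  The two limits are
 * folded into one before the loop, so the loop condition is one test. */
size_t copy_bounded(unsigned char *dst, size_t dst_len,
                    const unsigned char *src, size_t src_len)
{
    size_t limit = dst_len < src_len ? dst_len : src_len;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < limit; i++)
        dst[i] = src[i];
    return limit;
}
```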
Karl Williamson
dbe7a39153 Comment additions, typos, white-space.
And the reordering for clarity of one test
2012-01-13 09:58:32 -07:00
Father Chrysostomos
dcbac5bbcd diag_listed_as galore
In two instances, I actually modified the code to avoid %s for a
constant string, as it should be faster that way.
2011-12-28 22:58:52 -08:00