mirror of
https://https.git.savannah.gnu.org/git/grep.git
synced 2026-01-26 15:39:06 +00:00
pcre: use UCP in UTF mode
This fixes a serious bug affecting word-boundary and word-constituent
regular expressions when the desired match involves non-ASCII UTF8
characters.

* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
* THANKS.in: Add Gro-Tsen and Karl Petterson.

Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185

This bug was present from grep-2.5, when --perl-regexp (-P) support
was added.
parent 45e1158a4b
commit 5e3b760f65
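The before/after behavior described in the commit message can be approximated outside grep. The sketch below is an illustration only: it uses Python's `re` module, not PCRE2, and treats `re.ASCII` as a stand-in for compiling without PCRE2_UCP. The names `grep_o` and `news_command` are hypothetical helpers, and the string is assumed to contain a precomposed U+00FA.

```python
import re

TEXT = "Per\u00fa"  # "Perú", assuming a precomposed U+00FA


def grep_o(pattern, text, ascii_only):
    """Return every non-overlapping match, roughly like `grep -o`."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(pattern, text, flags)


def news_command(ascii_only):
    """Mimic: echo $(f 'r\\w'):$(f '.\\b') from the NEWS entry."""
    left = " ".join(grep_o(r"r\w", TEXT, ascii_only))
    right = " ".join(grep_o(r".\b", TEXT, ascii_only))
    return f"{left}:{right}"


before = news_command(ascii_only=True)   # ASCII-only \w and \b
after = news_command(ascii_only=False)   # Unicode-aware \w and \b

print(before)  # :r
print(after)   # rú:ú
```

With ASCII-only classes, ú is not a word character, so `r\w` finds nothing and the only word boundary that `.\b` can use falls right after `r`. With Unicode-aware classes, ú counts as a word character and both patterns match as intended.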
6
NEWS
@@ -4,6 +4,12 @@ GNU grep NEWS -*- outline -*-
 
 ** Bug fixes
 
+  With -P, some non-ASCII UTF8 characters were not recognized as
+  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
+  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
+  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
+  After the fix, it prints the correct results: "rú:ú".
+
   When given multiple patterns the last of which has a back-reference,
   grep no longer sometimes mistakenly matches lines in some cases.
   [Bug#36148#13 introduced in grep 3.4]
2
THANKS.in
@@ -35,6 +35,7 @@ Gerald Stoller gerald_stoller@hotmail.com
 Grant McDorman grant@isgtec.com
 Greg Boyd gboyd.ccsf@gmail.com
 Greg Louis glouis@dynamicro.on.ca
+Gro-Tsen https://twitter.com/gro_tsen
 Guglielmo 'bond' Bondioni g.bondioni@libero.it
 H. Merijn Brand h.m.brand@hccnet.nl
 Harald Hanche-Olsen hanche@math.ntnu.no
@@ -50,6 +51,7 @@ Joel N. Weber II devnull@gnu.org
 John Hughes john@nitelite.calvacom.fr
 Jorge Stolfi stolfi@dcc.unicamp.br
 Karl Heuer kwzh@gnu.org
+Karl Petterson karl.pettersson@klpn.se
 Kaveh R. Ghazi ghazi@caip.rutgers.edu
 Kazuro Furukawa furukawa@apricot.kek.jp
 Keith Bostic bostic@bsdi.com
2
src/pcresearch.c
@@ -149,7 +149,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
     {
       if (! localeinfo.using_utf8)
         die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
-      flags |= PCRE2_UTF;
+      flags |= (PCRE2_UTF | PCRE2_UCP);
 #if 0
       /* Do not match individual code units but only UTF-8. */
       flags |= PCRE2_NEVER_BACKSLASH_C;
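The change to src/pcresearch.c sets PCRE2_UCP wherever PCRE2_UTF is set. As background, the PCRE2 documentation describes `\w` without PCRE2_UCP as, by default, never matching code points above 127. The sketch below does not call PCRE2. It is a Python illustration, with a hypothetical helper `is_ascii_word_char`, of why the final character of "Perú" falls outside an ASCII-only word class while still being a letter by Unicode properties. The string is assumed to use a precomposed U+00FA.

```python
WORD = "Per\u00fa"  # "Perú", assuming a precomposed U+00FA


def is_ascii_word_char(ch):
    """ASCII-only word class [A-Za-z0-9_] (hypothetical helper)."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


rows = []
for ch in WORD:
    rows.append({
        "char": ch,
        "code_point": ord(ch),                  # Unicode code point
        "utf8": ch.encode("utf-8"),             # bytes as stored in the file
        "ascii_word": is_ascii_word_char(ch),   # ASCII-only classification
        "unicode_word": ch.isalnum() or ch == "_",  # property-based view
    })

for row in rows:
    print(f"{row['char']!r} U+{row['code_point']:04X} "
          f"utf8={row['utf8'].hex()} "
          f"ascii_word={row['ascii_word']} "
          f"unicode_word={row['unicode_word']}")
```

Only the last row differs between the two classifications: U+00FA is above 127 and is encoded as two bytes in UTF-8, so an ASCII-only word class rejects it even though it is a letter.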
1
tests/Makefile.am
@@ -147,6 +147,7 @@ TESTS = \
   pcre-jitstack \
   pcre-o \
   pcre-utf8 \
+  pcre-utf8-w \
   pcre-w \
   pcre-wx-backref \
   pcre-z \
28
tests/pcre-utf8-w
Executable file
@@ -0,0 +1,28 @@
+#!/bin/sh
+# Ensure non-ASCII UTF-8 characters are correctly identified as word-consituent
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+fail=0
+
+echo 'Perú'> in || framework_failure_
+
+echo 'ú' > exp || framework_failure_
+grep -Po '.\b' in > out || fail=1
+compare exp out || fail=1
+
+echo 'rú' > exp || framework_failure_
+grep -Po 'r\w' in > out || fail=1
+compare exp out || fail=1
+
+Exit $fail