pcre: use UCP in UTF mode

This fixes a serious bug affecting word-boundary and word-constituent
regular expressions when the desired match involves non-ASCII UTF-8
characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF.
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
* THANKS.in: Add Gro-Tsen and Karl Pettersson.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support
was added.
2023-01-06 19:34:56 -08:00
commit 5e3b760f65
parent 45e1158a4b
5 changed files with 38 additions and 1 deletion
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,12 @@ GNU grep NEWS                                    -*- outline -*-

 ** Bug fixes

+  With -P, some non-ASCII UTF8 characters were not recognized as
+  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
+  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
+  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
+  After the fix, it prints the correct results: "rú:ú".
+
  When given multiple patterns the last of which has a back-reference,
  grep no longer sometimes mistakenly matches lines in some cases.
  [Bug#36148#13 introduced in grep 3.4]
--- a/THANKS.in
+++ b/THANKS.in
@@ -35,6 +35,7 @@ Gerald Stoller                      gerald_stoller@hotmail.com
 Grant McDorman                      grant@isgtec.com
 Greg Boyd                           gboyd.ccsf@gmail.com
 Greg Louis                          glouis@dynamicro.on.ca
+Gro-Tsen                            https://twitter.com/gro_tsen
 Guglielmo 'bond' Bondioni           g.bondioni@libero.it
 H. Merijn Brand                     h.m.brand@hccnet.nl
 Harald Hanche-Olsen                 hanche@math.ntnu.no
@@ -50,6 +51,7 @@ Joel N. Weber II                    devnull@gnu.org
 John Hughes                         john@nitelite.calvacom.fr
 Jorge Stolfi                        stolfi@dcc.unicamp.br
 Karl Heuer                          kwzh@gnu.org
+Karl Pettersson                     karl.pettersson@klpn.se
 Kaveh R. Ghazi                      ghazi@caip.rutgers.edu
 Kazuro Furukawa                     furukawa@apricot.kek.jp
 Keith Bostic                        bostic@bsdi.com
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -149,7 +149,7 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
    {
      if (! localeinfo.using_utf8)
        die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
-      flags |= PCRE2_UTF;
+      flags |= (PCRE2_UTF | PCRE2_UCP);
 #if 0
      /* Do not match individual code units but only UTF-8.  */
      flags |= PCRE2_NEVER_BACKSLASH_C;
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -147,6 +147,7 @@ TESTS =						\
  pcre-jitstack					\
  pcre-o					\
  pcre-utf8					\
+  pcre-utf8-w					\
  pcre-w					\
  pcre-wx-backref				\
  pcre-z					\
--- a/tests/pcre-utf8-w
+++ b/tests/pcre-utf8-w
@@ -0,0 +1,28 @@
+#!/bin/sh
+# Ensure non-ASCII UTF-8 characters are correctly identified as word-constituent
+#
+# Copyright (C) 2023 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+LC_ALL=en_US.UTF-8
+export LC_ALL
+require_pcre_
+
+fail=0
+
+echo 'Perú'> in || framework_failure_
+
+echo 'ú' > exp || framework_failure_
+grep -Po '.\b' in > out || fail=1
+compare exp out || fail=1
+
+echo 'rú' > exp || framework_failure_
+grep -Po 'r\w' in > out || fail=1
+compare exp out || fail=1
+
+Exit $fail
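
For readers who want to check an installed grep outside the test suite, the following is a minimal sketch of a standalone probe, not part of this commit. The function name probe_pcre_word_class and its three output words are hypothetical, and en_US.UTF-8 is assumed as the default locale only because the new test uses it. It reuses the 'Perú' input and the 'r\w' pattern from tests/pcre-utf8-w.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): report how `grep -P` classifies
# a non-ASCII letter.  Prints one of:
#   unicode-aware  \w matched the "ú" in "Perú" (behavior after this fix)
#   ascii-only     \w did not match it (behavior before this fix)
#   skipped        no usable UTF-8 locale, or grep lacks -P support
probe_pcre_word_class() {
  # Locale name is an assumption; pass another one as the first argument.
  loc=${1:-en_US.UTF-8}

  # Skip unless the requested locale really uses the UTF-8 charmap.
  charmap=$(LC_ALL=$loc locale charmap 2>/dev/null)
  if [ "$charmap" != UTF-8 ]; then
    echo skipped
    return 0
  fi

  # Skip when this grep was built without PCRE support.
  if ! echo a | LC_ALL=$loc grep -P 'a' >/dev/null 2>&1; then
    echo skipped
    return 0
  fi

  # "Perú" with ú written as its two UTF-8 bytes in octal (0xC3 0xBA).
  out=$(printf 'Per\303\272\n' | LC_ALL=$loc grep -Po 'r\w' 2>/dev/null)
  if [ "$out" = "$(printf 'r\303\272')" ]; then
    echo unicode-aware
  else
    echo ascii-only
  fi
}

probe_pcre_word_class
```

The probe prints "skipped" instead of failing when the locale or -P support is missing, in the same spirit as the require_en_utf8_locale_ and require_pcre_ guards in the test above.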