Regex [a-z] Do Not Recognize Local Characters
Solution 1:
The problem is that Ş is not in the range [A-Z]. That range is the class of all characters whose code points lie between U+0041 and U+005A (inclusive). (If you were using bytes mode, it would be all bytes between 0x41 and 0x5A.) And Ş is U+015E (or, e.g., 0xAA in bytes, assuming latin2), which isn't in that range.
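A quick way to check this yourself is to compare code points directly. This is only a minimal sketch; the variable names are illustrative and not part of the original question.

```python
import re

ch = "Ş"

# Code point of Ş, shown in hex.
print(hex(ord(ch)))

# [A-Z] covers exactly the code points from ord("A") to ord("Z").
in_range = ord("A") <= ord(ch) <= ord("Z")
print(in_range)

# So the character class cannot match it.
match = re.search(r"[A-Z]", ch)
print(match)

# The same letter as a single byte, assuming the latin2 (ISO 8859-2) encoding.
print(ch.encode("iso8859_2"))
```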
And using a locale won't change that. As the documentation for re.LOCALE explains, all it does is:
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale.
Also, you almost never want to use re.LOCALE. As the docs say:
The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales.
If you only care about a single script, you can build a class of the appropriate ranges for that script.
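For example, if Turkish is the only script you care about, a hand-written class might look like the sketch below. The list of extra letters (Ç, Ğ, İ, Ö, Ş, Ü) is my assumption about which uppercase letters you need beyond A-Z, and the name TR_UPPER is made up for illustration.

```python
import re

# Assumption: uppercase Turkish letters that fall outside A-Z.
TR_UPPER = "A-ZÇĞİÖŞÜ"

# A word character followed by one uppercase letter from the class above.
pattern = re.compile(rf"(\w)([{TR_UPPER}])")

result = pattern.sub(r"\1 \2", "minikŞeker bir kedi")
print(result)  # minik Şeker bir kedi
```

This stays within the standard library, at the cost of having to maintain the letter list by hand.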
If you want to work with all scripts, you need to build a class out of a Unicode character class like Lu for "all uppercase letters". Unfortunately, Python's re doesn't have a mechanism for doing this directly. You can build a giant class out of the information in unicodedata, but that's pretty annoying:
import unicodedata
# range(0x110000) covers every code point up to and including U+10FFFF
Lu = '[' + ''.join(chr(c) for c in range(0x110000)
                   if unicodedata.category(chr(c)) == 'Lu') + ']'
And then:
pattern = re.compile(r"([\w]{1})()(" + Lu + r"{1})", re.U)
… or maybe:
pattern = re.compile(rf"([\w]{{1}})()({Lu}{{1}})", re.U)
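If the size of that class bothers you, one variation is to collapse consecutive code points into ranges before handing the class to re. This is only a sketch: the helper name category_class is hypothetical, and it assumes the category you ask for matches at least one character (otherwise it would produce an empty, invalid class).

```python
import re
import sys
import unicodedata


def category_class(category):
    """Build a regex character class for one Unicode general category."""
    runs = []  # each entry is [first, last] for a run of consecutive code points
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if runs and runs[-1][1] == cp - 1:
            runs[-1][1] = cp        # extend the current run
        else:
            runs.append([cp, cp])   # start a new run
    # Write each run as \UXXXXXXXX or \UXXXXXXXX-\UXXXXXXXX, which re understands.
    body = "".join(
        f"\\U{lo:08x}" if lo == hi else f"\\U{lo:08x}-\\U{hi:08x}"
        for lo, hi in runs
    )
    return "[" + body + "]"


LU = category_class("Lu")
pattern = re.compile(r"(\w)(" + LU + ")")
result = pattern.sub(r"\1 \2", "minikŞeker bir kedi")
print(result)
```

The scan over every code point still takes a moment, so you would normally build the class once at import time and reuse the compiled pattern.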
Part of the reason re doesn't have any way to specify Unicode classes is that, for a long time, the plan was to replace re with a new module, so many suggested new features for re were rejected. The good news is that the intended new module is available as a third-party library, regex. It works just fine, and is a near drop-in replacement for re; it was just improving too quickly to lock it down to the slower Python release schedule. If you install it, then you can write your code this way:
import regex
corp = "minikŞeker bir kedi"
pattern = regex.compile(r"([\w]{1})()(\p{Lu}{1})", regex.U)
corp = regex.sub(pattern, r"\1 \3", corp)
print(corp)
The only change I made was to replace re with regex, and then use \p{Lu} instead of [A-Z].
There are, of course, lots of other regex engines out there, and many of them also support Unicode character classes. Most of those that do follow some variation on the same \p syntax. (They all copied it from Perl, but the details differ: for example, regex's idea of Unicode classes comes from the unicodedata module, while PCRE and PCRE2 attempt to be as close to Perl as possible, and so on.)
Solution 2:
abarnet's answer is great, but if all you want to do is find uppercase characters, str.isupper() works without the need for an extra module.
>>> foo = "minikŞeker bir kedi"
>>> for i, c in enumerate(foo):
...     if c.isupper():
...         print(foo[i-1:i+2])
...         break
...
kŞe
or perhaps
>>> foo = "minikŞeker bir kedi"
>>> ''.join((' ' if c.isupper() else '') + c for c in foo)
'minik Şeker bir kedi'
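Note that the one-liner above adds a space before every uppercase character, including one at the very start of the string or right after an existing space. If you only want to split where an uppercase letter directly follows a word character (roughly what the regex in the question does), a small helper might look like this. The function name space_before_upper is made up for illustration.

```python
def space_before_upper(text):
    """Insert a space before an uppercase letter that directly follows a word character."""
    out = []
    # Pair every character with the one before it; the first character gets " ".
    for prev, ch in zip(" " + text, text):
        if ch.isupper() and (prev.isalnum() or prev == "_"):
            out.append(" ")
        out.append(ch)
    return "".join(out)


print(space_before_upper("minikŞeker bir kedi"))  # minik Şeker bir kedi
print(space_before_upper("Şeker bir Kedi"))       # Şeker bir Kedi
```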