💥 Regular Expressions
Today-I-learned-and-wish-I-hadn't
2022-01-21
One statement, no questions,
Put your fist up for your regular expressions.
—Dual Core, Regular Expressions
Regular expressions are powerful.
Regular expressions are fun.
Right up until they are not. Today was one such day.
At some point, anyone who's had to work with regular expressions has needed to match a specific range of characters. Let's say—for no particular reason—alpha characters. A to Z. Or a to z. The basics, none of those fancy Unicode ones, thankyouverymuch.
$ grep --perl-regexp 'i' <<< "testing"
testing
$ grep --perl-regexp 'a' <<< "testing"
To match any alpha character, you'd need to include them all somehow. You could, of course, go all-in:
$ grep --perl-regexp '[abcdefghijklmnopqrstuvwxyz]' <<< "testing"
testing
…which would work, but it's not very concise, is it? You could, of course, use a character class and match any word character:
$ grep --perl-regexp '\w' <<< "testing"
testing
…which would, again, work, but you could accidentally match a non-ASCII character (or a digit, or an underscore) and you wouldn't want that, would you?
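To make that concrete, here is a rough sketch in Python rather than grep. It uses Python's re module, not PCRE, so treat it as an illustration of the idea and not a statement about what your grep will do in your locale. The helper names and sample strings are made up for this example.

```python
import re

def is_word_chars(s: str) -> bool:
    # For str patterns in Python 3, \w is Unicode-aware by default:
    # letters (including non-ASCII ones), digits, and the underscore.
    return re.fullmatch(r"\w+", s) is not None

def is_ascii_lower(s: str) -> bool:
    # The explicit range only covers the 26 lower-case ASCII letters.
    return re.fullmatch(r"[a-z]+", s) is not None

# Hypothetical inputs: plain, underscore, digit, and an accented letter.
for sample in ["testing", "test_ing", "test1ng", "t\u00e9sting"]:
    print(f"{sample!r}: \\w+ -> {is_word_chars(sample)}, [a-z]+ -> {is_ascii_lower(sample)}")
```

Only the first sample passes both checks; the other three get through \w+ but not [a-z]+.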
Of course, you could also use one of the convenient character ranges:
$ grep --perl-regexp '[a-z]' <<< "testing"
testing
Wonderful! But that won't match upper-case characters, will it?
$ grep --perl-regexp '^[a-z]+$' <<< "Testing"
Hmmmm…no. Using the upper-case variant would leave us in the same pickle, no?
$ grep --perl-regexp '^[A-Z]+$' <<< "Testing"
Curses! But, of course, there's another option: a combination of the two! How convenient!
$ grep --perl-regexp '^[A-z]+$' <<< "Testing"
Testing
It worked perfectly! Matching only alpha characters and both casing variants! All is well, right?
Right?!
No
$ grep --perl-regexp '^[A-z]+$' <<< 'Test[\]^_`ing'
Test[\]^_`ing
Errrr…what? A-z. That's fairly clear, right?
No. No, it is not. Because A-z is not, in fact, shorthand for "the range of characters from a to z, upper- and lower-case." It is, in fact, the shorthand for "the range of ASCII characters from A to z".
$ python3 -c 'print("".join(chr(c) for c in range(ord("A"), ord("z") + 1)))'
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
Oh dear. Oh dear, oh dear.
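To see exactly which characters sneak in, here is a small sketch that lists every code point strictly between Z and a and checks each against the range. It uses Python's re module as a stand-in for grep's PCRE (an assumption, though a range in a character class is compared by code point in both), and the variable name gap is just a label for this example.

```python
import re

# Every code point strictly between "Z" (90) and "a" (97).
gap = [chr(c) for c in range(ord("Z") + 1, ord("a"))]

for ch in gap:
    # [A-z] is one range over code points 65..122, so each of these is inside it.
    matched = re.fullmatch(r"[A-z]", ch) is not None
    print(f"{ord(ch):3d}  {ch}  matched by [A-z]: {matched}")
```

That is six uninvited guests, code points 91 through 96, and every one of them matches.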
Quite what was going on at the American Standards Association between Z and a, I'd hate to guess. Presumably their cat walked across the keyboard while they were celebrating the addition of the capital letters.
But yes, there's a whole bunch of punctuation (^ and ` are technically symbols, not punctuation; fun fact) smack in the middle of those character ranges which will get matched should you ever opt to use A-z in a regular expression.
Which no one would, right?
Right?!
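For anyone who would: a minimal sketch of the less exciting spelling, using two explicit ranges so that nothing between Z and a comes along for the ride. It is written against Python's re module, and the names ASCII_ALPHA and is_ascii_alpha are invented for this example.

```python
import re

# Two explicit ranges instead of one range that spans the gap.
ASCII_ALPHA = re.compile(r"[A-Za-z]+")

def is_ascii_alpha(s: str) -> bool:
    """Return True only if s consists of one or more ASCII letters."""
    return ASCII_ALPHA.fullmatch(s) is not None

print(is_ascii_alpha("Testing"))         # True
print(is_ascii_alpha("Test[\\]^_`ing"))  # False
```

The same character class should carry over to the grep examples above as '^[A-Za-z]+$', though that is worth checking against your own grep before trusting it.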