💥 Regular Expressions

Today-I-learned-and-wish-I-hadn't

2022-01-21

#regular-expressions #gotcha

One statement, no questions,
Put your fist up for your regular expressions.
—Dual Core, Regular Expressions

Regular expressions are powerful.

Regular expressions are fun.

xkcd: Regular Expressions

Right up until they are not. Today was one such day.

At some point, anyone who's had to work with regular expressions has needed to match a specific range of characters. Let's say—for no particular reason—alpha characters. A to Z. Or a to z. The basics, none of those fancy Unicode ones, thankyouverymuch.

$ grep --perl-regexp 'i' <<< "testing"
testing

$ grep --perl-regexp 'a' <<< "testing"

To match all of the alpha characters, you'd need to include them all somehow. You could, of course, go all-in:

$ grep --perl-regexp '[abcdefghijklmnopqrstuvwxyz]' <<< "testing"
testing

…which would work, but it's not very concise, is it? You could, of course, use a character class and match any word character:

$ grep --perl-regexp '\w' <<< "testing"
testing

…which would, again, work, but you could accidentally match a non-ASCII character (or a digit, or an underscore) and you wouldn't want that, would you?
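For a quick look at that worry, here is a small Python sketch. It uses Python's re module rather than grep's PCRE, so treat it as an analogy only: the helper name word_match and the sample strings are made up for illustration, and how grep --perl-regexp treats non-ASCII input can depend on its version and your locale.

```python
import re


def word_match(text, ascii_only=False):
    """Return True if the whole string consists of \\w characters.

    Hypothetical helper for illustration. With ascii_only=True, re.ASCII
    restricts \\w to [a-zA-Z0-9_]; otherwise \\w is Unicode-aware.
    """
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w+", text, flags=flags) is not None


# Made-up samples: plain ASCII, an accented letter, and digit plus underscore.
for sample in ["testing", "naïve", "test_1"]:
    print(sample, word_match(sample), word_match(sample, ascii_only=True))
```

Even in ASCII-only mode, test_1 still passes, because \w covers digits and the underscore as well as letters.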

Of course, you could also use one of the convenient character ranges:

$ grep --perl-regexp '[a-z]' <<< "testing"
testing

Wonderful! But that won't match upper-case characters, will it?

$ grep --perl-regexp '^[a-z]+$' <<< "Testing"

Hmmmm…no. Using the upper-case variant would leave us in the same pickle, no?

$ grep --perl-regexp '^[A-Z]+$' <<< "Testing"

Curses! But, of course, there's another option: a combination of the two! How convenient!

$ grep --perl-regexp '^[A-z]+$' <<< "Testing"
Testing

It worked perfectly! Matching only alpha characters and both casing variants! All is well, right?

Right?!

No

$ grep --perl-regexp '^[A-z]+$' <<< 'Test[\]^_`ing'
Test[\]^_`ing

Errrr…what? A-z. That's fairly clear, right?

No. No, it is not. Because A-z is not, in fact, shorthand for "the range of characters from a to z, upper- and lower-case." It is, in fact, the shorthand for "the range of ASCII characters from A to z".

$ python3 -c 'print("".join(chr(c) for c in range(ord("A"), ord("z") + 1)))'
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

Oh dear. Oh dear, oh dear.

Quite what was going on at the American Standards Association between Z and a, I'd hate to guess. Presumably their cat walked across the keyboard while they were celebrating the addition of the capital letters.

But yes, there's a whole bunch of punctuation (^ and ` are technically symbols, not punctuation; fun fact) smack in the middle of those character ranges which will get matched should you ever opt to use A-z in a regular expression.
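To see exactly which stowaways come along for the ride, here is a minimal Python sketch (Python's re module standing in for grep; the names sloppy, strict, and extras are mine, purely for illustration). It lists every ASCII character that [A-z] matches but the spelled-out [A-Za-z] does not, then runs the troublesome test string against both.

```python
import re

sloppy = re.compile(r"[A-z]")     # ASCII 65 ("A") through 122 ("z"), gap included
strict = re.compile(r"[A-Za-z]")  # two separate ranges, letters only

# Every 7-bit ASCII character the sloppy class accepts and the strict one rejects.
extras = [
    chr(code)
    for code in range(128)
    if sloppy.fullmatch(chr(code)) and not strict.fullmatch(chr(code))
]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# The same string as in the grep example; the backslash is doubled for Python.
text = "Test[\\]^_`ing"
sloppy_ok = re.fullmatch(r"[A-z]+", text) is not None
strict_ok = re.fullmatch(r"[A-Za-z]+", text) is not None
print(sloppy_ok)  # True
print(strict_ok)  # False
```

Spelling out both ranges is the boring fix, which is usually the right kind.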

Which no one would, right?

Right?!