🗃️ Wordle on Sunday

Wordle seems different…but do the data back it up?

2022-03-10

#web-archiving #wordle

Wordle is everywhere.

Wordle is everything.

Soon all will be Wordle.

Or, at least, it's beginning to feel that way around here. Whether it's the midnight die-hards, the early risers or—like me—the filthy casuals, nary an hour goes past without a post to our dedicated Teams channel…

"I think the NYT are using fancier words."—Anna
"Oh, that feels like a fun Sunday activity…to the Internet Archive!"—me.

Posted early one recent Sunday morning, the above seems to be a popular sentiment (the "fancier" bit, not that this is fun): that things have been "different" since the New York Times purchased Wordle in January 2022. "Pretentious" is a word that's been bandied around more than a little. But do the data back up such a sentiment? Have the New York Times actually changed the word list? If only we had a time machine…

To the time machine

The Internet Archive have been archiving web pages (and associated resources) for over a quarter of a century now and, thankfully, that includes Wordle, both the original and its new home. So we have our time machine: we can view Wordle's word list at various points in its history and compare them, both along its journey and at its current location.

Both the previous and current versions of Wordle have the entire word-list contained in a JavaScript file downloaded when viewing the page:

main.X.js

…where X was some hex value which would occasionally change.

The Internet Archive provides an API which allows you to query its CDX¹ index. Its usage is better documented elsewhere but, for our purposes, all previously archived versions of the Wordle word-list can be queried via the URL: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.powerlanguage.co.uk%2Fwordle%2Fmain.&matchType=prefix&limit=1000.

import requests
from urllib.parse import quote_plus


wordle_prefix = "https://www.powerlanguage.co.uk/wordle/main."
r = requests.get(
    f"http://web.archive.org/cdx/search/cdx?url={quote_plus(wordle_prefix)}&matchType=prefix&limit=1000",
)
print(len(r.text.splitlines()))

That gives us (at time of writing) 396 archived instances of the Wordle JavaScript. 396 is perhaps a bit much².

The fourth field in each result shows the MIME type of each snapshot. Most are, as expected, application/javascript, indicating that they are indeed JavaScript files. However, there are also several of the type warc/revisit³: these are simply metadata records indicating that the discovered version was unchanged since the previous visit, so no new copy was archived. So…

results = list(filter(
    lambda line: line[3] == "application/javascript",
    (line.split() for line in r.text.splitlines()),
))
print(len(results))

That still gives us 206. Okay, I'm regretting this now.
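As an aside, the positional indices used here (and below) assume the CDX server's default field order, which, if I'm reading its documentation right, is urlkey, timestamp, original, mimetype, statuscode, digest and length. A quick namedtuple sketch makes that a little less cryptic:

from collections import namedtuple

# Assumed default CDX field order; adjust if the API is queried with an explicit fl= list.
CdxRecord = namedtuple(
    "CdxRecord",
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
)
records = [CdxRecord(*fields) for fields in results]
print(records[0].mimetype, records[0].digest)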

The sixth field in the results is the hash of the content. If two files share the same hash, they can be (with reasonable certainty) said to be the same file. So…

unique_hashes = set(line[5] for line in results)
unique_results = []
for result in results:
    if result[5] in unique_hashes:
        unique_results.append(result)
        unique_hashes.remove(result[5])
print(len(unique_results))

That leaves us only 19 results. Okay, that's much more manageable. Now we need to get the word list out of each of those archived JavaScript files. They're declared as a JavaScript array but, for our purposes at least, they're still just a blob of text.

Time for some regular expressions!

import re
import typing


def get_js_words(text: str) -> typing.List[str]:
    for line in text.splitlines():
        # The word-list is declared inline as e.g. `var Xx=["…","…",…]`.
        if match := re.match(r'^.+var [A-Z][a-z]=(\[[^\]]+\]).+$', line):
            (words,) = match.groups()
            # A JavaScript array of string literals is also a valid Python list literal.
            words = eval(words)
            return words


wordle_words_list = []
for result in unique_results:
    r = requests.get(f"https://web.archive.org/web/{result[1]}/{result[2]}")
    # Some actually return a 404. Great.
    if r.status_code != 200:
        wordle_words_list.append([])
        continue
    wordle_words_list.append(get_js_words(r.text))

The wordle_words_list variable now contains the 19⁴ different (or not) word lists from each archived version.
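If you want to eyeball what came back before diffing (purely a sanity check, nothing in the analysis depends on it), something like this will do:

# Print each snapshot's timestamp alongside how many words were extracted from it.
for result, words in zip(unique_results, wordle_words_list):
    print(result[1], len(words) if words else "no word list recovered")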

We can easily see the differences between each version of the word-list by using Python's set() type:

previous_result = unique_results[0]
previous_words = wordle_words_list[0]
for result, words in zip(unique_results[1:], wordle_words_list[1:]):
    # One result is missing so…
    if not words:
        continue
    added = set(words) - set(previous_words)
    removed = set(previous_words) - set(words)
    if added or removed:
        print(result[1])
    if added:
        print(f"Added: {', '.join(added)}")
    if removed:
        print(f"Removed: {', '.join(removed)}")
    previous_result = result
    previous_words = words

That actually gives us:

20211015045445
Added: based, bowed, gored, lough…  # lots more words
Removed: pixel, wight, inbox, wryly…  # lots more words
20211214013727
Added: pixel, wight, inbox, wryly…  # lots more words
Removed: based, bowed, gored, lough…  # lots more words

Although omitted for brevity, those are the same lists of words being added, then removed, then added, etc. It would appear that in its entire lifetime the word list only changed twice, and even then that was a change and a subsequent reversion.

Interesting.

How about the much-discussed changes during the New York Times' ownership? For that, the live site presents the most current version:

last_wordle_words = wordle_words_list[-1]

r = requests.get("https://www.nytimes.com/games/wordle/main.4d41d2be.js")
nyt_words = get_js_words(r.text)

added = set(nyt_words) - set(last_wordle_words)
removed = set(last_wordle_words) - set(nyt_words)

print(f"NYT added: {', '.join(added)}")
print(f"NYT removed: {', '.join(removed)}")

Interestingly, that shows:

NYT added: 
NYT removed: agora, pupal, fibre, slave, wench, lynch

Nope, no new words and a handful removed.


Footnotes

1

the CDX format is somewhat esoteric but details are here. It's basically a text file containing details of web resources.

2

I made an assumption that the CDX results would be sorted—they're not. I'll leave resolving that as an exercise for the reader but it didn't change the outcome: the results seem to flip-flop a few more times over Wordle's lifetime but the final result is the same.
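If you do want to tidy that up, a minimal sketch (assuming the second CDX field is the 14-digit timestamp, as above) would be:

# Sort the parsed CDX rows chronologically before de-duplicating.
results.sort(key=lambda fields: fields[1])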

3

"WARC" is an ISO standard, effectively a series of concatenated records either containing or pertaining to web resources. warc/revisit is a form of the latter indicating that a web resource has been seen again, unchanged.

4

actually slightly fewer: some of the responses were non-200s and were omitted.