🐛 Go Down the Rabbit Hole

On the em-dash, static site generators, Go and why you should really RTFM.

2019-02-13

#go #markdown #static-site-generator

I like a good em-dash. It's delightfully versatile, easily taking the place of other, lesser punctuation marks.

However, while writing my last blog post I ran into a problem. The post was written in Markdown, and my beloved em-dash was causing all sorts of trouble. Specifically, in this paragraph:

Well that's a lot of results, 511 in fact (at time of writing). Of course, that doesn't mean that our password is actually in that list—*because, of course, no one would ever use it*—but we can double-check by seeing if the suffix from our SHA1 is in the list:

That little aside—not only parenthetical but italicised also (I was having one of those days, clearly)—wasn't displaying as intended, failing to lean to the right and instead keeping those silly asterisks (they're not even punctuation!). Why? Down the proverbial rabbit-hole we go…

Blog

In no way prompted by the ~~incessant harassment~~ gentle coaxing of our CEO, I've been trying to maintain a blog for a while now (and when I say "maintain", I largely mean "start"). To that end, I've installed, uninstalled and reinstalled all manner of static site generators in a handful of languages: Pelican (Python), Nikola (also Python), Cobalt (Rust)…before finally settling on Gutenberg (Rust). Then it went and changed its name to Zola (still Rust).

So, having finally settled on the right tool, I promptly thought "this seems like an opportune moment to learn some Go" and opted for Hugo (Go).

Having only dabbled so far, and finding that Hugo had seemingly opted for Markdown and not the infinitely superior reStructuredText, I ran into the aforementioned problem.

Shall we check the documentation and see if this is a known issue? "No!" I said to myself. "Let's learn some Go and try and get to the bottom of this", I decided.

Go

Currently the programming language of choice practically everywhere (definitely in some dark corners of our office), Go needs little introduction. It's snappy, it's relatively easy to read and it has had an amusing logo.

Hugo

Getting and building Hugo's source is easy enough:

git clone https://github.com/gohugoio/hugo.git
cd hugo
go install

Wait, what happened? No output. Did that work? Apparently that should have built a binary and put it in my path…

$ hugo
Total in 0 ms
Error: Error building site: open /home/roger/content: no such file or directory

Crikey, it worked. I'm not enjoying the lack of output but so far so good (that error coming from the fact I'm running this in Hugo's repository, not a site built by Hugo).

Blackfriday

Perusing the source code a little, it looks like Hugo is using a Markdown library called Blackfriday. Maybe that's the source of our problem? Let's write some Go!

package main

import (
    "fmt"
    "github.com/russross/blackfriday"
)

func main() {
    output := blackfriday.MarkdownBasic([]byte("\n\nThis is a sentence—*here's an italicised aside*—and here's the end.\n\n"))
    fmt.Println(string(output))
}

So with some furious Googling, we go build and run the resulting binary and we get…

<p>This is a sentence—<em>here's an italicised aside</em>—and here's the end.</p>

Damn; it worked (note the correct <em> tags around our aside). Maybe it's a version thing? This would seem an opportune moment to investigate how Go handles its dependencies.

Oh, but I really wish that weren't true…

Go Dependencies

Dependency management in Go appears to be in something of an "interesting" state at the time of writing. If you start looking at the documentation, you learn about dep:

dep was the "official experiment." The Go toolchain, as of 1.11, has (experimentally) adopted an approach that sharply diverges from dep.

So the "official experiment" got canned and there's an "experimental" replacement?

How is Hugo doing this?

Well there's a go.mod file and a go.sum file. Which apparently aren't Go files, they're Go module files. And what happens if we try and initialise our earlier code as a Go module?

go mod init test

Interesting: now I have a go.mod file. What happens when I do a go build again?

$ go build
# test
./test.go:9:15: undefined: blackfriday.MarkdownBasic

That doesn't look right. The contents of our new go.mod appear to hold some clues:

$ cat go.mod
module test

require (
    github.com/russross/blackfriday v2.0.0+incompatible
    github.com/shurcooL/sanitized_anchor_name v1.0.0 // indirect
)

It would appear that Go has helpfully installed the latest version of Blackfriday. What's Hugo running?

$ grep blackfriday go.mod
    github.com/russross/blackfriday v0.0.0-20180804101149-46c73eb196ba

Aha! So it's an older version; that must be the cause of our problem, right? Let's replace the version in our go.mod file with the above one from Hugo, re-run go build and run our binary!

$ ./test
<p>This is a sentence—<em>here's an italicised aside</em>—and here's the end.</p>

Oh fudge.

Debugging

For the purposes of this experiment, I'm running JetBrains' GoLand. Thankfully their tooling caters to the likes of myself and their debug settings kindly inform me that I need to run Delve, apparently.

GoLand Run/Debug Configuration

Following those useful instructions, we recompile Hugo with the information provided:

go install -gcflags "all=-N -l"

Then start Delve:

$ dlv --listen=:2345 --headless=true --api-version=2 exec /home/roger/go/bin/hugo -- server -D
API server listening at: [::]:2345

At this point we should cut to a kinetic montage—complete with inspirational power chords—while I haplessly click around, adding and removing breakpoints, trying to find a suitable point of ingress into Blackfriday's code.

Eventually I find myself in the emphasis() function, in inline.go. Seemingly, when inline() passes the text to emphasis() it's truncated to:

*because, of course, no one would ever use it*—but we can double-check by seeing if the suffix from our SHA1 is in the list:

Okay, fair enough: this is the first point in this block of text in which we encounter markup so we're processing the remaining text on that line.

A few more clicks and we're down into helperEmphasis().

Now helperFindEmphChar().

It's looking promising as this function correctly identifies the position of the next * as index 44. So the code is correctly identifying the start and end positions of the emphasis but still failing to italicise the text…?
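That index is easy to sanity-check outside the debugger. The following is a minimal sketch using only the standard library; it assumes the slice being searched starts just after the opening asterisk, and `closingIndex` is a hypothetical helper for illustration, not a Blackfriday function.

```go
package main

import (
	"bytes"
	"fmt"
)

// closingIndex is a hypothetical helper (not part of Blackfriday): it returns
// the byte offset of the first '*' in data, or -1 if there is none.
func closingIndex(data []byte) int {
	return bytes.IndexByte(data, '*')
}

func main() {
	// The text after the opening asterisk, as assumed in the lead-in.
	rest := []byte("because, of course, no one would ever use it*—but we can double-check")

	i := closingIndex(rest)
	fmt.Println("closing asterisk at index", i)

	// The byte right after the asterisk is the first byte of the em-dash,
	// not a whole character.
	fmt.Printf("next byte: %d\n", rest[i+1])
}
```

Everything before the asterisk is plain ASCII, so the byte offset and the character offset agree here; that stops being true from the em-dash onwards.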

We continue a little further, to find this block:

if p.flags&EXTENSION_NO_INTRA_EMPHASIS != 0 {
    if !(i+1 == len(data) || isspace(data[i+1]) || ispunct(data[i+1])) {
        continue
    }
}

At this point it slowly dawns on me: data here is a slice of bytes; both the isspace() and ispunct() functions specifically compare a single byte with a list of known, relevant characters. The latter, for instance, uses the static list !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. The problem is that the em-dash is three bytes in UTF-8 (226, 128 and 148 expressed as decimals, if you're interested); this is never going to match.
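To see the mismatch in isolation, here is a small sketch. `isPunctByte` is a simplified stand-in written for this illustration (it is not copied from Blackfriday); it compares one byte against the same kind of fixed ASCII list.

```go
package main

import "fmt"

// isPunctByte mimics a byte-wise punctuation check against a fixed ASCII
// list. It is a simplified stand-in for illustration, not Blackfriday's code.
func isPunctByte(c byte) bool {
	for _, p := range []byte("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~") {
		if c == p {
			return true
		}
	}
	return false
}

func main() {
	emDash := []byte("—") // U+2014, three bytes in UTF-8
	fmt.Println(len(emDash), emDash)

	// Each byte is tested on its own, and none is in the ASCII list.
	for _, b := range emDash {
		fmt.Println(b, isPunctByte(b))
	}

	// An ASCII comma, by contrast, is a single byte and does match.
	fmt.Println(isPunctByte(','))
}
```

Every entry in the list is a single ASCII byte (below 128), while all three bytes of the em-dash are 128 or above, so no byte-at-a-time comparison can succeed.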

That explains why Blackfriday is behaving this way—we're trying to make three bytes match one—but not why we're failing to see this error during the test code we wrote.

So what's different? If we wander back up the call stack to find the point at which Hugo is calling the Blackfriday library (at helpers/content.go:347) we find:

blackfriday.Markdown(ctx.Content,
    c.getHTMLRenderer(blackfriday.HTML_TOC, ctx),
    getMarkdownExtensions(ctx))

However, in my earlier example I was, as per the documentation, calling blackfriday.MarkdownBasic(). Great! So what's the difference in those calls?

In fact, if we try blackfriday.MarkdownCommon(), we can see the problem:

package main

import (
    "fmt"
    "github.com/russross/blackfriday"
)

func main() {
    output := blackfriday.MarkdownCommon([]byte("\n\nThis is a sentence—*with an italicised aside*—before carrying on.\n\n"))
    fmt.Println(string(output))
}

This gives us:

<p>This is a sentence—*with an italicised aside*—before carrying on.</p>

So the problem is something common to blackfriday.MarkdownCommon() and Hugo's blackfriday.Markdown() call. The MarkdownCommon() function isn't too long:

func MarkdownCommon(input []byte) []byte {
    // set up the HTML renderer
    renderer := HtmlRenderer(commonHtmlFlags, "", "")
    return MarkdownOptions(input, renderer, Options{
        Extensions: commonExtensions})
}

Taking one look at the contents of the commonExtensions variable and, ladies and gentlemen, we have a winner: EXTENSION_NO_INTRA_EMPHASIS. It might not be immediately obvious but, according to the documentation:

Intra-word emphasis supression. The _ character is commonly used inside words when discussing code, so having markdown interpret it as an emphasis command is usually the wrong thing. Blackfriday lets you treat all emphasis markers as normal characters when they occur inside a word.

As Blackfriday isn't recognising the em-dash as a punctuation character, it's considering it part of the same word and, thanks to the above extension being active, not allowing emphasis in the middle of a word.
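The effect of that rule can be sketched without Blackfriday at all. In the example below, `endsWord` is a hypothetical helper that applies the same kind of byte-wise test (end of input, ASCII whitespace or ASCII punctuation) to the byte following an emphasis marker; the names and character lists are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ASCII-only character classes, mirroring a byte-wise check.
const (
	asciiSpace = " \t\n\r\f\v"
	asciiPunct = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
)

// endsWord reports whether the marker at byte offset i is followed by the end
// of the input, ASCII whitespace or ASCII punctuation. Hypothetical helper.
func endsWord(data []byte, i int) bool {
	if i+1 == len(data) {
		return true
	}
	return strings.IndexByte(asciiSpace+asciiPunct, data[i+1]) >= 0
}

func main() {
	for _, s := range []string{"snake_case", "aside*, and", "aside*—and"} {
		i := strings.IndexAny(s, "_*") // byte offset of the first marker
		fmt.Printf("%q ends word after marker: %v\n", s, endsWord([]byte(s), i))
	}
}
```

The first input is the case the extension exists for: the underscore sits inside a word, so it should not close emphasis. The third input is the unintended casualty: the marker is followed by an em-dash, but a byte-wise test sees only its first byte and treats it like a letter.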

Sure enough, once we disable that extension in our Hugo config.toml file, everything renders as intended:

[blackfriday]
  extensionsmask = ["noIntraEmphasis"]

RTFM

It looks like this has already been flagged as an issue on GitHub.

Several times.

In fact, it's in the README.

Perhaps some more reading before beginning this little exercise might have been beneficial but wasn't it a learning experience? And isn't that what really matters?

Back to Daylight

The aforementioned fix does work in the short term but we do run the risk of the very problem the extension was intended to remedy. A longer-term fix, however, is likely to involve a fairly extensive rewrite of parts of Blackfriday.

There have been some great articles written on how Go handles strings, bytes and runes. It's likely in this latter type that we'd find a solution—largely analogous to characters in ASCII but recognising multi-byte characters for Unicode.

Making some crude changes to the functions we found earlier (ispunct(), for instance) such that they accept rune (or []rune) data types does seem to fix the issue (I won't include the details here; it involves a lot of me swearing at Go's module management and how it handles local repositories) but there's likely a bigger picture to consider with the rest of the codebase.
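For a flavour of what a rune-aware check might look like, here is a rough sketch. It is not the change described above (those details are omitted) and not Blackfriday code: `endsWordRune` is a hypothetical helper that decodes the whole UTF-8 character after the marker using the standard `unicode` and `unicode/utf8` packages.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// endsWordRune is a hypothetical rune-aware boundary check: it decodes the
// full UTF-8 character after byte offset i before classifying it.
func endsWordRune(data []byte, i int) bool {
	if i+1 >= len(data) {
		return true
	}
	r, _ := utf8.DecodeRune(data[i+1:])
	return unicode.IsSpace(r) || unicode.IsPunct(r)
}

func main() {
	// Offset 5 is the marker ('*' or '_') in each of these inputs.
	for _, s := range []string{"aside*—and", "aside*, and", "snake_case"} {
		fmt.Printf("%q ends word after marker: %v\n", s, endsWordRune([]byte(s), 5))
	}
}
```

One caveat with this approach: `unicode.IsPunct` covers the Unicode punctuation categories only, so characters such as `+`, `<` or `~` (classed as symbols) would need `unicode.IsSymbol` as well to match the original ASCII list.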

So yeah. Go.

I'd better start blogging.