Pandocmonium
I switched the blog over to using Pandoc instead of commonmark. That's why the bash snippets on other pages are now syntax highlighted!
It's why I can do something like this!
```lua
local function throw3d6()
    return math.random(1,6) + math.random(1,6) + math.random(1,6)
end

local distribution = {}
for i=1,1000000 do
    local roll = throw3d6()
    distribution[roll] = (distribution[roll] or 0) + 1
end

for k,v in pairs(distribution) do
    print(k,v)
end
```
(And yes, that does mean that there's going to be dice stuff coming up.)
I'm specifically using pypandoc, which is pretty much just python bindings around the actual pandoc binary. While this is easier than invoking a shell from python, it doesn't make it as easy as it should be. You'd think it would just be a drop-in replacement for commonmark, right?
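For the curious, the swap itself really is about one line. Here's a minimal sketch, assuming the old renderer was the `commonmark` PyPI package; the function names are made up for illustration:

```python
import commonmark   # the old path: pure-Python CommonMark, no highlighting
import pypandoc     # the new path: thin bindings around the pandoc binary

def render_old(content):
    return commonmark.commonmark(content)

def render_new(content):
    # pypandoc execs pandoc directly (no shell involved) and hands back the HTML,
    # including the syntax-highlighted code blocks.
    return pypandoc.convert_text(content, "html", format="md")
```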
I ran into four issues.
- By default it wrapped lines at like... 75 characters. It was consistently far enough under 80 (and at times it wrapped at 70 when the next line started with a short word like "and") that I'm comfortable claiming it was trying to wrap before 80 columns. Well, easy, I just change
```python
text = pypandoc.convert_text(content, "html", "md")
```
to
```python
text = pypandoc.convert_text(content, "html", "md", extra_args=['--wrap=none'])
```
- It was replacing all of the quotes, em-dashes, ellipses, and so forth, with their smart variants. On its own that would have just been annoying, but it was replacing them in such a way that they displayed as the dreaded �.
Well okay. A quick google says that I just need to
```python
text = pypandoc.convert_text(content, "html", "md", extra_args=['--wrap=none', '--smart'])
```
And for our trouble we get
```
--smart/-S has been removed. Use +smart or -smart extension instead.
For example: pandoc -f markdown+smart -t markdown-smart.
```
Okay, fair enough. Pandoc has a lot of extensions and supporting a unique flag for every one of them, possibly per-target, does sound like a maintenance nightmare. We're going from markdown and we want to disable the smart quotes, so we do
```python
text = pypandoc.convert_text(content, "html", "md", extra_args=['--wrap=none', '-f markdown-smart'])
```
Guess what?
```
RuntimeError: Pandoc died with exitcode "21" during conversion: Unknown input format markdown
```
Yeah okay. Like hell you don't know what format "markdown" is. So what's going on? This is where the problems of just being a thin wrapper reveal themselves. See, every C program has a structure like this:
```c
int main(int argc, char **argv) {
    return 0;
}
```
That probably compiles. It's been a while since I wrote C. Those two parameters are the number of args being passed in and the pointer-to-pointer of characters. For those not familiar with languages as unsafe as C, that means that `argv` is an array of pointers, `argc` entries long (be careful not to iterate too far!). What do they point to? A character! What good is a single character? Not much, but these are C-style strings, which means you get to start reading character by character until you hit a `0` character, which means stop reading here. You can go further, but here be dragons and unrelated memory.
What is in charge of populating these structures? The shell. It takes your input, such as `--wrap=none -f markdown-smart`, and dices it up so that your program can read it.
So how many args is that? First consider the shell. How does the shell know that `-f` takes a parameter there and isn't just a flag like in `tar -xzvf`? It doesn't. It splits on whitespace and populates the array. That's 3 args.
So while `" ".join(['--wrap=none', '-f markdown-smart'])` gives you `"--wrap=none -f markdown-smart"`, `"--wrap=none -f markdown-smart".split(" ")` does not give you back `['--wrap=none', '-f markdown-smart']`, and that's why the correct way to pass `-f markdown-smart` is
```python
text = pypandoc.convert_text(content, "html", "md", extra_args=['--wrap=none', '-f', 'markdown-smart'])
```
Those `extra_args` never go through a shell; they're copied straight into the argument list of the pandoc binary the wrapper invokes, one list element per argument. (There's a runnable sketch of this after the list.)
This took me long enough to figure out that I wrote and later deleted this:
```python
text = text.replace('“', '"').replace('”', '"')
text = text.replace('‘', "'").replace('’', "'")
text = text.replace("…", "...")
text = text.replace("–", "-").replace("—", "--")
```
Incidentally, if I told pandoc to write the file itself, I got the actual characters. The problem was when python was writing the file.
- The file was getting way too many extra newlines. Every newline was duplicated, even inside the pre-formatted code blocks. Inspecting the output with `xxd` revealed that it contained `0d0d0a`, or Carriage Return Carriage Return Line Feed: `\r\r\n`. Why was pandoc doing this? There was a lot of flailing in figuring this one out.
Because it wasn't pypandoc! Commonmark was outputting single `\n`'s for its newlines, which is the Linux standard. Pandoc was outputting `\r\n`'s, which is the Windows standard (and I am writing this on Windows). PYTHON of all things was doing this! Everyone knows that when you open a file you open it in one of read, write, or append, and optionally in binary mode. But when you open a file in python, well, let me just quote the docs.
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
So I set it to `'\n'` and fixed that. (There's a short sketch of this after the list.)
- When pandoc outputs its highlighted syntax it marks it up with css classes. Makes sense. But it doesn't, by default, export those. It took a little bit of arcane incantation to get it to drop that css via a template.
First I made a template file that just contained `$highlighting-css$`.
Then I made a markdown file that would have highlighting.
```html
<p>hello</p>
```
Then I invoked the words of power: `pandoc --template=highlighting-css test.md -o highlighting.css`
And got the file with all the highlighting css.
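Since the argument-splitting thing burned me, here it is as a runnable sketch. Nothing below is my actual build code; `shlex.split` just stands in for the shell's word-splitting:

```python
import shlex

extra_args = ['--wrap=none', '-f markdown-smart']   # what I originally passed

# A shell would take the joined string and split it back into three argv entries...
joined = " ".join(extra_args)
print(shlex.split(joined))   # ['--wrap=none', '-f', 'markdown-smart']

# ...but pypandoc never involves a shell: each list element becomes exactly one
# argv entry, so pandoc sees the single, bogus argument '-f markdown-smart'.
print(extra_args)            # ['--wrap=none', '-f markdown-smart']
```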
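And the `open()` fix for the newline issue, as a sketch; the input text, encoding choice, and output filename here are placeholders rather than my real build code:

```python
import pypandoc

html = pypandoc.convert_text("# hello", "html", "md", extra_args=["--wrap=none"])

# newline="\n" stops Python from translating every "\n" it writes into os.linesep
# ("\r\n" on Windows), which is what was turning pandoc's "\r\n" endings into "\r\r\n".
with open("output.html", "w", encoding="utf-8", newline="\n") as f:
    f.write(html)
```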
As part of this upgrade I resolved the other issues I mentioned: I can now have it decide not to publish a file, and I can have the name of the published file be different from the name of the source file. The process for figuring out a better way to fix that was looking at the code and asking, at a few key points, "Is this generic code actually generic, or is it handling one specific edge case?" and the converse, "Can this hardcoded rule be changed into something more generic?"
Most development is about finding the points of irreducible complexity and shrinking everything else around them.