Writing binary by hand

2022-09-24 | Martin Hoppenheit | 27 min read

This post is based on a talk I gave at iPres 2022 (slides). It explains how to read a file format specification (TIFF, in this example) and, based on that, build a minimal binary file by hand.

The file we are going to build will look rather boring, but that’s not the point: we are not so much interested in the image content as in the file structure. (Note that the image shown here is not a TIFF but a PNG file because that’s easier to display in most web browsers. The TIFF file is here.)

Why?

First off, why would anyone want to build binary files by hand? For me, the most important reason is learning about file formats. When I really want to know how a file format is structured, where and in which form it stores its payload and metadata, in which parts it is robust or fragile, where it has ambiguous edge cases, … then reading the file format specification alone is not enough. Then I want to build something from scratch, play around, try out different variants, and intentionally break (and fix) things. Here is a bunch of example files I built for that purpose. Apart from that, creating binary files by hand can be useful for all kinds of documentation, like describing issues (example, another example), or for creating test data for file format-related tools like JHOVE (example).

How?

The most general tool when working with binary files is a hex editor, which displays binary data as a sequence of hex characters. (Look here if you need a primer or refresher on binary/hex notation.) A hex editor is quite versatile and can be used to view, edit, and write files on the binary level. You could absolutely use one to follow along with this tutorial. However, since it looks like this …

00000000: 4d4d 002a 0000 0008 0008 0100 0003 0000  MM.*............
00000010: 0001 0200 0000 0101 0003 0000 0001 0140  ...............@
00000020: 0000 0106 0003 0000 0001 0000 0000 0111  ................
00000030: 0003 0000 0003 0000 006e 0116 0003 0000  .........n......
00000040: 0001 0080 0000 0117 0003 0000 0003 0000  ................
00000050: 0074 011a 0005 0000 0001 0000 007a 011b  .t...........z..
00000060: 0005 0000 0001 0000 0082 0000 0000 008a  ................
00000070: 208a 408a 2000 2000 1000 0000 012c 0000   .@. . ......,..
00000080: 0001 0000 012c 0000 0001 ffff ffff ffff  .....,..........
00000090: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000a0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000b0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000c0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000d0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000e0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
000000f0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000100: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000110: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000120: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000130: ffff ffff ffff ffff ffff ffff ffff ffff  ................

… I would like to propose a different approach: something I call Literate Binary. Literate Binary (a shameless rip-off of Donald Knuth’s Literate Programming) is a combination of Markdown and hex code. The hex code is (mostly) just like the hex code you see in a hex editor, but embedded in human-readable structure and plain text documentation expressed in Markdown, a lightweight markup syntax (think HTML without all those tags). It looks like this:

## Image File Header

Some text explaining what the file header contains.

    4d4d 002a 00000008

Here you see three Markdown elements, which are all you need for this tutorial:

1. A (second-level) heading, introduced by ## signs.
2. A regular paragraph, delimited by blank lines.
3. A so-called code block, indented by four spaces. It contains a sequence of hex characters that represent binary content. Restricting the code block to hex syntax is what makes a Markdown document Literate Binary.
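
To see what the hex code in such a code block stands for, the three groups of the header example can be decoded with a few lines of Python. This is only an illustrative sketch, not part of Literate Binary or lb. The function name `decode_tiff_header` is made up for this example, and the field meanings (byte order mark, magic number 42, offset of the first IFD) are those of the TIFF header as described in the spec.

```python
import struct


def decode_tiff_header(hex_text: str) -> dict:
    """Decode the 8-byte TIFF image file header from a hex string.

    Hypothetical helper for illustration; spaces between hex groups
    are ignored, as in a Literate Binary code block.
    """
    data = bytes.fromhex(hex_text)
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    # Bytes 0-1: "MM" means big endian, "II" means little endian.
    order = data[:2]
    if order == b"MM":
        prefix = ">"
    elif order == b"II":
        prefix = "<"
    else:
        raise ValueError(f"unknown byte order mark: {order!r}")

    # Bytes 2-3: magic number 42; bytes 4-7: offset of the first IFD.
    magic, ifd_offset = struct.unpack(prefix + "HI", data[2:8])
    return {
        "byte_order": order.decode("ascii"),
        "magic": magic,
        "first_ifd_offset": ifd_offset,
    }


header = decode_tiff_header("4d4d 002a 00000008")
print(header)
```

For the example above this reports big endian byte order, the magic number 42, and a first IFD starting at offset 8, directly after the header.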

Markdown is widely supported (e.g., it renders nicely on GitHub) and there are many tools (e.g., Pandoc) that convert from Markdown to other document formats like PDF, HTML or Microsoft Word. Furthermore (and that’s the interesting part), a Literate Binary document can easily be turned into a proper binary file with a tool called lb that takes the hex code from all code blocks in a Markdown file and turns it into a binary. This is how it is used on the command line:
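
The conversion idea itself is simple enough to sketch in Python. The following is not lb and says nothing about its actual command line or behavior; it is a minimal illustration which assumes that every line indented by at least four spaces belongs to a code block and contains nothing but hex digits and whitespace. The function name `literate_binary_to_bytes` is hypothetical.

```python
def literate_binary_to_bytes(markdown: str) -> bytes:
    """Collect hex code from indented code blocks and return it as bytes.

    Simplifying assumption: every non-blank line indented by four or
    more spaces is part of a code block and holds only hex digits and
    whitespace. Real Markdown parsing has more rules than that.
    """
    hex_digits = []
    for line in markdown.splitlines():
        if line.startswith("    ") and line.strip():
            # Drop all whitespace; only the hex digits carry content.
            hex_digits.append("".join(line.split()))
    return bytes.fromhex("".join(hex_digits))


document = """\
## Image File Header

Some text explaining what the file header contains.

    4d4d 002a 00000008
"""

binary = literate_binary_to_bytes(document)
print(binary.hex())
```

Headings and paragraphs are skipped, so only the eight header bytes end up in the result. Writing them to disk would be a matter of opening a file in binary mode and calling `write(binary)`.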

What’s next?

So you made it to the end of this post, congratulations! If you would like to play around some more on your own, here are some ideas for further tinkering:

• Switch PhotometricInterpretation to 1 (BlackIsZero) and see what happens.
• Read sections 4–6 of the TIFF spec to learn about grayscale and color images, building on the bilevel file we just created (it’s easy!).
• Switch the byte order to little endian (rather boring, but maybe you’ll learn something).
• Find out about pad bytes and word boundaries (search the spec!).
• Move things around in the file and adapt pointers.
• Insert garbage (or hide data) between the regular file components.
• Try to violate the specification and raise validation errors with JHOVE or other tools, then fix them. Examples:
  • wrong order of IFD entries (see page 15)
  • duplicate pointers (see page 26, example, another example)
  • everything in Section 7: Additional Baseline TIFF Requirements (page 26)
• Add example files or other information to the JHOVE wiki.
• Read the specification of a different file format and write Literate Binary based on that.
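
For the byte order idea, a small Python sketch can show what has to change. It packs the file header and a PhotometricInterpretation entry (tag 0x0106, type SHORT, count 1, value 1) in both byte orders using the standard `struct` module. The helper names are made up for this example, and the layout assumes the 12-byte IFD entry from the TIFF spec, with a SHORT value left-justified in the 4-byte value field.

```python
import struct


def pack_header(little_endian: bool, first_ifd_offset: int = 8) -> bytes:
    """Pack the 8-byte TIFF header (hypothetical helper)."""
    mark, prefix = (b"II", "<") if little_endian else (b"MM", ">")
    # Byte order mark, magic number 42, offset of the first IFD.
    return mark + struct.pack(prefix + "HI", 42, first_ifd_offset)


def pack_short_entry(little_endian: bool, tag: int, value: int) -> bytes:
    """Pack a 12-byte IFD entry holding a single SHORT (hypothetical helper)."""
    prefix = "<" if little_endian else ">"
    # Tag, type 3 (SHORT), count 1, then the value left-justified in
    # the 4-byte value field, followed by two unused bytes.
    return struct.pack(prefix + "HHIHH", tag, 3, 1, value, 0)


for little_endian in (False, True):
    header = pack_header(little_endian)
    entry = pack_short_entry(little_endian, 0x0106, 1)  # 1 = BlackIsZero
    print(header.hex(), entry.hex())
```

In the big endian case the entry comes out as `0106 0003 00000001 0001 0000`, in the little endian case as `0601 0300 01000000 0100 0000`. Every other multi-byte value in the file, including the pointers, has to be swapped in the same way.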

Whatever you do, if you create something interesting (or funny), please consider sending me a pull request on GitHub to extend my collection of Literate Binary example files!