What's in a PDF

11 March 2024

If you’ve been a computer person long enough, then you’ll eventually come to terms with the fact that all the files on a computer are made up of the same ingredients: lists of bytes. Whether you save a text file or a movie or an executable on your PC, the circuitry in the storage drive has no idea what it is. Even the operating system is just guessing based on context. If you give a file the correct extension, and it hasn’t been corrupted, the operating system knows what to do with it. Jim Nielsen has a good write-up about the reusability of files on his blog.

Technically you could create every file you’ve ever used by writing out raw hexadecimal and using the xxd command to convert it into binary. If you’re on Windows like me, you can use Git Bash. Let’s make an ASCII text file by smashing some hexadecimal into a terminal:

    echo '4865 6c6c 6f20 776f 726c 6421' | xxd -r -p > test.txt

If you open the new test.txt file with a text editor, you should see a text file saying “Hello world!”. You can use this chart to write out any ASCII letter you want. Congratulations, you now have an extremely tedious way to create a text file. UTF-8 is a bit more complicated as it covers almost every written language and every emoji.

However, you can print a skull emoji like this:

    echo 'f0 9f 92 80' | xxd -r -p > test10.txt
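
If you don’t have xxd to hand, the same trick can be sketched in Python. This is just an illustration: the hex_to_bytes helper is a name I made up for this example, and it prints the text instead of writing a file.

```python
def hex_to_bytes(hex_text: str) -> bytes:
    """Turn a string of hex digits (spaces allowed) into raw bytes, like `xxd -r -p`."""
    return bytes.fromhex(hex_text)


greeting = hex_to_bytes("4865 6c6c 6f20 776f 726c 6421")
skull = hex_to_bytes("f0 9f 92 80")

print(greeting.decode("ascii"))   # the ASCII text from the first command
print(skull.decode("utf-8"))      # four bytes that decode to a single emoji
```

Notice that the ASCII text is one byte per character, while the emoji needs four bytes for one character. That difference comes back to bite us later when we start counting byte offsets.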

These are ways to create text files. But binary files are more of a mystery. They are files whose secrets are known by the programs that read them, spec-writers who define them, and those who are willing to go through the painstaking work of reverse-engineering them.

We’re going to take a look at a type of binary file: the pdf. Parts of the pdf format are actually understandable as ASCII text, and we’re going to write a PDF from scratch to see what's in there. Beware though, writing a pdf from scratch is a tough task and certainly isn't like writing in a programming language designed to be easy to edit with a text editor.

Let’s take a look at the sections of a PDF:

Sections of a PDF

There are always 4 sections to every pdf: first the header, then the body, then the cross reference table and finally the trailer. We’re going to go in a weird order: The body first, then the header, then the cross reference table, and finally the trailer. This is because the body is where the most interesting and important stuff is.

Body

The body of a PDF holds the parts that will actually be visible in the PDF, as opposed to boilerplate and metadata. The PDF body holds the text, pages, images, fonts, annotations and form controls. There’s a whole smorgasbord of other stuff you can include too.

Different elements are structured as a hierarchical tree of “PDF objects”. PDF objects are written in a syntax like this:

    1 0 obj
    <<
      /Type /Catalog
      /Pages 2 0 R
    >>
    endobj

Let’s have a look at what we’ve written: First, we have a couple of numbers. The first number is a unique identifier for the object (in this case 1), the second is the version of the object (in this case 0), which the PDF standard calls the generation number. The version number is used for change tracking. We’re going to leave the version as 0 for all objects in this article. After these numbers we have obj. This tells the PDF reader we’re about to write a PDF object definition.

The next section is wrapped in a pair of less-than << and greater-than >> signs. This denotes that we are about to write a dictionary of key/value pairs. This is a bit like a JSON object or a C# Dictionary or a Java HashMap. The order does not matter here, just keys and values.

In the above example we have two keys: Type and Pages. In this case the value of type is Catalog. This is the type we use for the root object of a PDF document, as explained in the PDF standard. The value of the pages key is a reference to another object. It points to another object just like the one we’ve written. The two numbers after the /Pages key, 2 and 0 will match the unique identifier and version respectively. Then a capital R confirms that this is referencing another object.
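
To make the reference syntax concrete, here’s a small Python sketch that pulls the two numbers out of a reference and rebuilds the header of the object it points to. The regular expression is my own simplification for this article, not a full PDF parser.

```python
import re

# Matches an indirect reference such as "2 0 R": object number, version, then R.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

catalog = b"<< /Type /Catalog /Pages 2 0 R >>"

match = REFERENCE.search(catalog)
object_number, generation = int(match.group(1)), int(match.group(2))

# The header of the object being pointed at is rebuilt from the same two numbers.
target_header = b"%d %d obj" % (object_number, generation)
print(target_header)
```

So a reader that hits 2 0 R goes looking for the object that begins with 2 0 obj.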

Does any of the whitespace matter? The newlines and spaces? No. They will make a difference to the cross reference table later but that’s getting ahead of ourselves.

We finally close off our dictionary with two greater-than signs then complete our object with endobj.

Pages

We’ve actually written a great start to a pdf document already. But that /Pages key is referencing a PDF object that doesn’t exist. So now let’s write a “pages” object for it to point to:

    2 0 obj
    <<
     /Type /Pages
     /Count 1
     /Kids [
      3 0 R
     ]
    >>
    endobj

Again, we start with the unique identifier, version and obj keyword. The identifier and version have to match the values we wrote in the pages key in the catalog object.

Just like the catalog object, we say it’s of /Type /Pages. Then we have a new concept: an array! A list of things! In this case the /Kids key holds a list of references to page objects. We also include a count to say how many pages there will be (because there’s no way software would be able to figure that out from the number of references 🤔).

We’re only going to make one page but for demonstration purposes if we wanted three pages then we’d have something like this:

    /Count 3
    /Kids [
      3 0 R
      4 0 R
      5 0 R
     ]

The array is actually the 4th data type we’ve seen. These are the data types we’ve seen so far:

name objects (anything written with a forward slash before it, like the keys and the /Catalog value)
dictionaries
numbers
arrays

The other pdf data types are:

booleans
strings
null
streams (this is where we’re going to put all our text and where you’d put your images)

These make up everything you can put into the body section of a pdf.
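
One way to get a feel for these data types is to map them onto Python values. The sketch below is a simplification I wrote for this article: the Name and Ref classes and the to_pdf function are made-up names, strings aren’t escaped, and streams are left out because they need their own stream and endstream keywords.

```python
class Name(str):
    """A PDF name object such as /Catalog (written without the slash here)."""


class Ref(tuple):
    """An indirect reference: (object number, generation)."""


def to_pdf(value) -> str:
    """Serialise a Python value into PDF syntax (simplified: no string escaping)."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # check bool before int: True is an int in Python
        return "true" if value else "false"
    if isinstance(value, Ref):           # check Ref before the generic list/tuple case
        return f"{value[0]} {value[1]} R"
    if isinstance(value, Name):          # check Name before the generic str case
        return f"/{value}"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return f"({value})"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, dict):
        pairs = " ".join(f"/{key} {to_pdf(val)}" for key, val in value.items())
        return f"<< {pairs} >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


pages = {"Type": Name("Pages"), "Count": 1, "Kids": [Ref((3, 0))]}
print(to_pdf(pages))
```

This prints the pages object from earlier on a single line, which is fine because the whitespace doesn’t matter to a PDF reader.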

Page object

This is what our page object will look like:

    3 0 obj
    <<
     /Type /Page
     /Parent 2 0 R
     /MediaBox [
      0
      0
      612
      792
     ]
      /Resources <<
        /Font <<
         /Helv <<
          /Type /Font
          /Subtype /Type1
          /BaseFont /Helvetica
         >>
        >>
       >>
     /Contents [
      4 0 R
     ]
    >>
    endobj

Again we start with the object ID and version, then inside the dictionary we give it a type. We also hold a reference to the parent (the pages object from earlier).

Now we actually have a definition of something visual: the size of the page with the /MediaBox array! First with the x and y coordinates of the lower left corner with 0, 0, then the coordinates of the upper right corner with 612, 792. These numbers are in points (1/72 of an inch). PDF measures size from left to right, which makes sense to me being a native English reader, however it also measures from bottom to top.
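
If you’re wondering where 612 and 792 come from, here’s the arithmetic as a quick Python sketch. It assumes PDF’s default unit of 1/72 of an inch; the constant names are mine.

```python
POINTS_PER_INCH = 72      # PDF's default user-space unit is 1/72 of an inch
MM_PER_INCH = 25.4

media_box = [0, 0, 612, 792]   # lower-left x, lower-left y, upper-right x, upper-right y

width_pt = media_box[2] - media_box[0]
height_pt = media_box[3] - media_box[1]

width_in = width_pt / POINTS_PER_INCH
height_in = height_pt / POINTS_PER_INCH

print(f"{width_in} x {height_in} inches")
print(f"{width_in * MM_PER_INCH:.1f} x {height_in * MM_PER_INCH:.1f} mm")
```

That works out to 8.5 by 11 inches, which is US Letter paper.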

We could stop here and we’d have a blank page (after adding the header, cross-reference table and trailer), but let’s put something on the page. To do this we’re first going to add a /Resources dictionary which will describe the Helvetica font as above.

Then we’ll continue down our chain of references, this time pointing to an object reference in the /Contents key. Let’s define this object now:

Contents object

Here’s what our next object looks like:

    4 0 obj
    <<
     /Length 53
    >>
    stream
    1 0 0 1 72 708 cm BT /Helv 12 Tf (Hello world!) Tj ET
    endstream
    endobj

It starts pretty normal, with the identifier and dictionary. Here we define a length. This is the length in bytes of the stream, which, if we keep it simple and use ASCII, will be one byte per character. This will count all the characters in between the newline character after the stream keyword and the newline character before the endstream keyword. Those two newline characters are not counted, but any spaces, tabs or other newline characters between them are.
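
Counting characters by hand is error-prone, so here’s a small Python sketch that does the counting for us and then wraps the stream in its object. The layout of the surrounding object matches the one above; the variable names are mine.

```python
stream = b"1 0 0 1 72 708 cm BT /Helv 12 Tf (Hello world!) Tj ET"

# /Length is the number of bytes between the newline after `stream`
# and the newline before `endstream`, so those two newlines are left out.
length = len(stream)

content_object = (
    b"4 0 obj\n<<\n /Length %d\n>>\nstream\n" % length
    + stream
    + b"\nendstream\nendobj\n"
)
print(length)
print(content_object.decode("ascii"))
```

If you change the text on the page, remember to update the length to match.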

As you can see, we throw the stream and endstream keywords after the dictionary. What is in this stream? First we write six numbers followed by cm, which together set up a transformation matrix. The first four numbers, 1 0 0 1, define the scale, rotation and skew, and these particular values tell it not to rotate, skew, or scale the text in any way. Have a play with these numbers if you want to see the effect each of them has on the text. I want normal text so I’m gonna leave it as 1 0 0 1.

Then we place the text with the last two numbers: 72 points from the left edge and 708 points up from the bottom of the page. The cm here is the name of the operator that applies the matrix, not centimetres; the units are points (1/72 of an inch), the same as in the /MediaBox. The BT keyword means we’re starting to define some text content. We specify the font and the font size, then use Tf to finish off the font declaration (this is reverse Polish notation, where you write your arguments first, then name the function; we’re doing the same thing with the object identifier syntax).
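
Here’s a rough Python sketch of what those six numbers do to a point on the page. It uses the standard formula for a PDF matrix written as a b c d e f; the apply_matrix function and the example matrices are names I made up for illustration.

```python
def apply_matrix(matrix, point):
    """Apply a PDF transformation matrix [a b c d e f] to an (x, y) point."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


plain_move = (1, 0, 0, 1, 72, 708)     # the six numbers before `cm` in our stream
double_size = (2, 0, 0, 2, 72, 708)    # same move, but everything drawn twice as large

print(apply_matrix(plain_move, (0, 0)))     # where the text origin ends up on the page
print(apply_matrix(plain_move, (10, 0)))    # a point 10 units along the baseline
print(apply_matrix(double_size, (10, 0)))   # the same point once the scale is doubled
```

With 1 0 0 1 the first four numbers leave the point alone and the last two just shift it, which is why the text starts 72 points in from the left and 708 points up from the bottom.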

Inside some parentheses we write the actual text we want on the page. Then writing Tj will pass this text to the Tj function, which will take the string and paint the glyphs. We write ET to end the text and we’re done!

This is the body of our pdf completed!

Header

The header of the pdf is really easy. It basically always looks like this:

    %PDF-1.7
    %âãÏÓ

First we define that it’s a pdf and the version of the spec. Then there’s a comment line (comments start with %) containing some gibberish letters. These are four bytes with values above 127, and they’re there so that file-transfer tools and editors treat the file as binary rather than text and don’t mangle it. As far as I have read, you don’t actually have to put this exact set of gibberish in there, but these characters are standard so it’s what we’ll do.
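
To see what that gibberish really is, here’s a short Python sketch that encodes the header as Latin-1 (one byte per character) and looks at the raw bytes. The characters are written as escapes so the example doesn’t depend on how your editor saves the file; the variable names are mine.

```python
# The second line is % followed by the characters U+00E2, U+00E3, U+00CF and U+00D3.
header_text = "%PDF-1.7\n%\u00e2\u00e3\u00cf\u00d3\n"
header = header_text.encode("latin-1")

comment_bytes = header.splitlines()[1][1:]          # drop the leading % of the second line
print(comment_bytes.hex(" "))                       # the four raw byte values
print(all(byte >= 128 for byte in comment_bytes))   # are they all outside the ASCII range?
print(len(header))                                  # header size in bytes as Latin-1
print(len(header_text.encode("utf-8")))             # header size in bytes as UTF-8
```

The last two lines matter for the next section: the header is a different size depending on the encoding, and every byte offset after it shifts accordingly.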

Cross-reference table

The cross-reference table is quite interesting. Now we’re really getting into something that looks like a binary file.

The cross reference table goes after the body, and is a table that software will use to hold the byte address of different objects inside the PDF. Assuming a PDF is all ASCII, it’s basically a count of how many characters (including newlines) before the start of each object.

This is how ours is going to look:

    xref
    0 5
    0000000000 65535 f 
    0000000015 00000 n 
    0000000068 00000 n 
    0000000133 00000 n 
    0000000374 00000 n 

Now we’re getting into some quite hardcore binary looking stuff.

First we start with the xref keyword to announce that we’re starting the cross-reference table. Then we have two numbers. The first is the object number of the first row in this section of the table, which is 0 for us. The second is the number of rows in our cross reference table. We only made 4 pdf objects but I wrote a 5. What gives?

We always add one extra placeholder row at the start of the xref table for object number 0. Object 0 is never a real object: it’s marked as free, has a generation number of 65535, and acts as the head of the list of free objects.

After declaring the first object number and the number of rows we start the table. Each row of the table has 3 columns:

The byte-offset from the start of the file to the object
The generation number of the object
Either an n to say the object is in use or an f to say the object is free and doesn’t actually matter. I’m not sure what the n actually stands for, but let’s pretend it stands for “now-in-use” or “not-retired”

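It’s important that you left-pad the numbers with zeroes to keep every row exactly the length shown above. The PDF standard requires each row to be exactly 20 bytes including its line ending, so when your lines end with a single newline character, each row needs one space after the n or f.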
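
The padding is easy to get wrong by hand, so here’s a Python sketch that formats the rows for us. It assumes lines end with a single newline, so each row gets a space before the newline to reach the 20 bytes the PDF standard asks for. The xref_entry function is a name I made up.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Format one cross-reference row: 10-digit offset, 5-digit generation, n or f."""
    flag = b"n" if in_use else b"f"
    # Trailing space + newline gives the two-byte line ending a 20-byte row needs.
    return b"%010d %05d %s \n" % (offset, generation, flag)


# The placeholder row for object 0, then one row per object offset from this article.
rows = [xref_entry(0, 65535, in_use=False)]
rows += [xref_entry(offset) for offset in (15, 68, 133, 374)]

table = b"xref\n0 %d\n" % len(rows) + b"".join(rows)
print(table.decode("ascii"))
```

Because every row is the same size, a PDF reader can jump straight to the row for any object number without reading the rows before it.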

Here’s our first placeholder object, which will always look like this:

    0000000000 65535 f

The second object points to the object in our pdf with an ID of 1:

    0000000015 00000 n

The start of the first pdf object (our catalog object, specifically at the point we gave it an ID of 1) is 15 bytes from the start of the file (9 for the %PDF-1.7 line and 6 for the gibberish line, counting the newlines), hence the first number in the xref table is 15. This is the byte-offset of the object. We aren’t bothering with generations, so the second number is 0. Finally, this object is in use, so we add an n to indicate it’s in-use. We need to do this with all the objects in the pdf.

Getting the byte offset of a PDF section

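You can use grep to get the byte offset of an object:

    grep --byte-offset --text '^1 0 obj' mypdf.pdf

Where you replace the number “1” in the code above with the ID of the object you want the offset for. The number grep prints before the colon is the offset of the start of the matching line, and the ^ stops “1 0 obj” from also matching “11 0 obj”.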
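
If your grep doesn’t support --byte-offset, the same lookup can be sketched in Python. The object_offset function is a made-up helper, and like the grep version it’s naive: it would also match text that happens to look like an object header inside a stream.

```python
import re


def object_offset(pdf_bytes: bytes, object_number: int, generation: int = 0) -> int:
    """Return the byte offset of the line that starts `<number> <generation> obj`."""
    pattern = rb"^%d %d obj\b" % (object_number, generation)
    match = re.search(pattern, pdf_bytes, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"object {object_number} {generation} not found")
    return match.start()


# A tiny stand-in for a real file, just to show the lookup working.
sample = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n11 0 obj\n<< >>\nendobj\n"
print(object_offset(sample, 1))
print(object_offset(sample, 11))
```

To use it on a real file, read the file in binary mode (open with "rb") so Python counts bytes rather than characters.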

Trailer

We’re almost there! We’re at the last section of our PDF file.

This is what our trailer will look like:

    trailer
    <<
     /Root 1 0 R
     /Size 5
    >>
    startxref
    478
    %%EOF

We announce it’s a trailer, then we enter another dictionary with a pair of less-than signs.

This dictionary holds 2 keys: /Root and /Size. The root is a reference to the catalog object. The value of size is the number of rows in the cross-reference table.

Finally, after we’ve closed off the dictionary, we write the startxref keyword followed by the byte-offset of the start of the cross-reference table. This is the offset of the “xref” keyword itself, not of the first row of the table. If your file matches the listing in this article byte for byte, that should work out to 478 (the last object starts at byte 374 and is 104 bytes long). Using the grep script from before to grab byte-offsets we can use:

    grep --byte-offset --text '^xref' mypdf.pdf

Then, finally to confirm we’re done we add an end of file marker with a couple of percent signs:

    %%EOF
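
This is also how a PDF reader finds its way around: it starts at the end of the file, reads the number after startxref, and jumps to that offset expecting to find the xref keyword. Here’s a Python sketch of that first step. The read_startxref function is a made-up helper, and the sample bytes are a fake stand-in rather than a valid PDF.

```python
def read_startxref(pdf_bytes: bytes) -> int:
    """Return the number written between the last `startxref` and `%%EOF`."""
    marker = pdf_bytes.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found")
    tail = pdf_bytes[marker + len(b"startxref"):]
    return int(tail.split()[0])


# A tiny stand-in for the end of a file; the body is faked with a comment line.
fake_body = b"%PDF-1.7\n% pretend the objects are here\n"
sample = (
    fake_body
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Root 1 0 R /Size 1 >>\n"
    + b"startxref\n%d\n%%%%EOF\n" % len(fake_body)
)

offset = read_startxref(sample)
print(offset, sample[offset:offset + 4])
```

If the four bytes at that offset aren’t xref, the startxref number is wrong. Many readers will quietly rebuild the table in that case, but a validator will complain.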

The whole file

This is what the entire pdf should look like. Save this in a text file as a pdf, and if nothing’s gone weird with encoding, you should have a fully functional pdf:

    %PDF-1.7
    %âãÏÓ
    1 0 obj
    <<
      /Type /Catalog
      /Pages 2 0 R
    >>
    endobj
    2 0 obj
    <<
     /Type /Pages
     /Count 1
     /Kids [
      3 0 R
     ]
    >>
    endobj
    3 0 obj
    <<
     /Type /Page
     /Parent 2 0 R
     /MediaBox [
      0
      0
      612
      792
     ]
      /Resources <<
        /Font <<
         /Helv <<
          /Type /Font
          /Subtype /Type1
          /BaseFont /Helvetica
         >>
        >>
       >>
     /Contents [
      4 0 R
     ]
    
    >>
    endobj
    4 0 obj
    <<
     /Length 53
    >>
    stream
    1 0 0 1 72 708 cm BT /Helv 12 Tf (Hello world!) Tj ET
    endstream
    endobj
    xref
    0 5
    0000000000 65535 f 
    0000000015 00000 n 
    0000000068 00000 n 
    0000000133 00000 n 
    0000000374 00000 n 
    trailer
    <<
     /Root 1 0 R
     /Size 5
    >>
    startxref
    478
    %%EOF

I don’t know about other editors, but in vim you can force it to treat the text as Latin-1, so that every character (including the gibberish in the header) is saved as a single byte and the byte offsets come out right, by using these settings in vim:

    set encoding=latin1
    set isprint=
    set display+=uhex

So after you set these settings in vim, you can paste the above file then save, then you should have a working PDF file. Hooray!
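
If you’d rather not fight your editor over encodings, here’s a Python sketch that assembles the same document and works out every byte offset itself. It’s my own take rather than a reference implementation: the build_pdf function and the hello.pdf filename are made up, and the objects are written on fewer lines than the listing above, so the offsets it produces will differ from the hand-written ones.

```python
def build_pdf(objects: list[bytes]) -> bytes:
    """Assemble header, numbered objects, xref table and trailer, computing every offset."""
    out = bytearray(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                      # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)                            # offset of the xref keyword
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                   # placeholder row for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset           # 20 bytes per row

    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


stream = b"1 0 0 1 72 708 cm BT /Helv 12 Tf (Hello world!) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Count 1 /Kids [3 0 R] >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]"
    b" /Resources << /Font << /Helv << /Type /Font /Subtype /Type1"
    b" /BaseFont /Helvetica >> >> >> /Contents [4 0 R] >>",
    b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
]

pdf = build_pdf(objects)
with open("hello.pdf", "wb") as handle:               # binary mode: write the bytes untouched
    handle.write(pdf)
```

Working with bytes the whole way through means the offsets are counted in bytes by construction, which is the thing that’s easy to get wrong in a text editor.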

A good way to absolutely verify things are correct is to use pdfcpu, a PDF processor written in Go, to verify your PDF with the “pdfcpu validate mypdf.pdf” command.

Where to go from here

Daniel Warren wrote a similar blog post to this back in 2010 (14 years ago 👴). You should check it out.
You can get the full PDF standard from Adobe.
