PDF/A validation and inconsistent glyph width information

2018-07-03 | Martin Hoppenheit | 10 min read

Inconsistent glyph width information is a common cause of PDF/A validation errors, but the details are not easy to understand. This text provides the necessary background knowledge and dissects an example file.

Problem

Here’s a quick and easy way to create an invalid PDF/A file:

Open Microsoft Word 2013 or 2016.
Choose font Verdana.
Type a lowercase “i”.
Save as PDF/A.

These steps produce a file similar to this example (which has some irrelevant additional content, but never mind). Validation with veraPDF raises an error because the file violates section 6.3.6 of ISO 19005-1:2005 (PDF/A-1):

For every font embedded in a conforming file and used for rendering, the glyph width information in the font dictionary and in the embedded font program shall be consistent.

What does that mean, and how can it be fixed? Let’s explore!

Background

The error message mentions embedded font programs, glyph widths and something called a font dictionary. Following the analysis in the next sections requires a basic understanding of these things.

Conceptually, a font like Arial or Times New Roman is a collection of glyphs that describe what characters should look like when rendered. So, although the technical details are a little more involved, glyphs can be imagined as little pictures of characters. But there’s more: In many fonts different glyphs have different widths, for example an “i” may use less horizontal space than an “M”. (That’s called a proportional font, in contrast to a monospaced, typewriter-like font where all glyphs have the same width.) So to make a line of text flow smoothly, not only a glyph’s appearance but also its width must be known. A file that contains all this information in machine-readable form is called a font program; a font program must be embedded in a PDF/A file for each font used in the file to ensure correct rendering of the document independently of the fonts that may or may not be installed on the system used for rendering.

In addition to the embedded font program, a PDF file caches some redundant font information in another, simpler data structure – the font dictionary – to simplify access to this information by viewer software. “Storing this information in the font dictionary, although redundant, enables a conforming reader to determine glyph positioning without having to look inside the font program.” [ISO 32000-1:2008, page 241] Glyph width information can be found in a list called the widths array inside the font dictionary.

Of course, information in the font dictionary should be consistent with information in the embedded font program because otherwise the glyph width would be ambiguous, leading to arbitrary layout. This is what the validation error is about.

Analysis

A closer look at the error message reveals some essential details to kick off the analysis:

root/document[0]/pages[0](3 0 obj PDPage)/
    contentStream[0](4 0 obj PDContentStream)/
    operators[72]/usedGlyphs[7](ABCDEE+Verdana 105 0 1058634310 0)

root/document[0]/pages[0](3 0 obj PDPage)/
    contentStream[0](4 0 obj PDContentStream)/
    operators[72]/usedGlyphs[10](ABCDEE+Verdana 108 0 1058634310 0)

OK, it may be a little hard to read. But amidst the paths of the offending glyphs in the PDF object tree and other background noise it indicates the problematic font and characters, namely ABCDEE+Verdana 105 and ABCDEE+Verdana 108. That means that to trace the problem three steps have to be performed:

Get the glyph width values for “i” (ASCII code 105) and “l” (ASCII code 108) from the font dictionary of the font ABCDEE+Verdana.
Get the corresponding glyph width values from the associated embedded font program.
Compare the values and draw conclusions.
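
The first step, reading font name and character code out of the error context, can be sketched in a few lines of Python. This is only an illustration based on the two context strings quoted above: the function name offending_glyph is made up here, the remaining fields in the parentheses are deliberately ignored, and converting the code with chr() assumes a character in the ASCII range.

```python
import re

# Matches the glyph part of a context path as quoted above. Only the font
# name and the first number (the character code) are read; the other
# fields are ignored because their meaning is not discussed in this text.
GLYPH_RE = re.compile(r"usedGlyphs\[\d+\]\((?P<font>\S+) (?P<code>\d+)")


def offending_glyph(context):
    """Return (font name, character code, character), or None if no match."""
    match = GLYPH_RE.search(context)
    if match is None:
        return None
    code = int(match.group("code"))
    # Assumption: the code is in the ASCII range, so chr() gives the character.
    return match.group("font"), code, chr(code)


contexts = [
    "operators[72]/usedGlyphs[7](ABCDEE+Verdana 105 0 1058634310 0)",
    "operators[72]/usedGlyphs[10](ABCDEE+Verdana 108 0 1058634310 0)",
]
for context in contexts:
    print(offending_glyph(context))
```

For the two contexts above this reports the characters “i” and “l” in the font ABCDEE+Verdana.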

Getting glyph widths from the font dictionary

The Verdana font dictionary and its widths array can be found with the PDF analysis tool iText RUPS (see screenshot below) or even in a text editor, where the relevant parts look like this:

5 0 obj
<<
  /Type /Font
  /Subtype /TrueType
  /Name /F1
  /BaseFont /ABCDEE+Verdana
  /Encoding /WinAnsiEncoding
  /FontDescriptor 6 0 R
  /FirstChar 32
  /LastChar 122
  /Widths 16 0 R
>>
endobj

16 0 obj
[ 352 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 636 636 636 636 636 636 636 636 636 636
  0 0 0 0 0 0 0 684 686 698 771 632 575 775 751 421 455 693 557 843 748 787
  603 787 695 684 616 732 684 989 685 615 685 0 0 0 0 0 0 601 623 521 623
  596 352 623 633 272 344 592 272 973 633 607 623 623 427 521 394 633 592
  818 592 592 525
]
endobj

The section beginning with 5 0 obj is the font dictionary, and the section beginning with 16 0 obj is its widths array, with “each element being the glyph width for the character code that equals FirstChar plus the array index” [ISO 32000-1:2008, page 255]. According to this definition, the index of the glyph width for “i” can be calculated from the FirstChar entry of the font dictionary (32) and the ASCII code of “i” (105) as 105 - 32 = 73. With some counting this yields a glyph width value of 272.
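
To avoid the counting, the lookup can be written down as a small Python sketch. The values are copied from the widths array of the example file; the names WIDTHS, FIRST_CHAR, LAST_CHAR and dictionary_width are chosen for this illustration only.

```python
FIRST_CHAR = 32   # /FirstChar entry of the font dictionary
LAST_CHAR = 122   # /LastChar entry of the font dictionary

# The widths array of the example file, grouped by character range.
WIDTHS = (
    [352] + [0] * 15   # space, then codes 33-47 (all zero in the example)
    + [636] * 10       # digits 0-9
    + [0] * 7          # codes 58-64
    + [684, 686, 698, 771, 632, 575, 775, 751, 421, 455, 693, 557, 843,
       748, 787, 603, 787, 695, 684, 616, 732, 684, 989, 685, 615, 685]  # A-Z
    + [0] * 6          # codes 91-96
    + [601, 623, 521, 623, 596, 352, 623, 633, 272, 344, 592, 272, 973,
       633, 607, 623, 623, 427, 521, 394, 633, 592, 818, 592, 592, 525]  # a-z
)


def dictionary_width(char):
    """Return the font dictionary glyph width for a single character."""
    code = ord(char)
    if not FIRST_CHAR <= code <= LAST_CHAR:
        raise ValueError(f"character code {code} is outside the widths array")
    # The array index is the character code minus FirstChar.
    return WIDTHS[code - FIRST_CHAR]


print(dictionary_width("i"))  # index 105 - 32 = 73
print(dictionary_width("l"))  # index 108 - 32 = 76
```

Both lookups give 272, matching the value found by counting.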

Getting glyph widths from the embedded font program

Embedded fonts can be viewed very conveniently with a font editor like FontForge that extracts them directly from a PDF file. In FontForge, glyph widths are found in the metrics window of a font; the value for “i” in the embedded Verdana font is 562.

Comparing glyph widths

The two preceding sections have shown that the character “i” has a glyph width of 272 in the font dictionary, and a glyph width of 562 in the embedded font program. Obviously, the values differ, seemingly explaining the validation error. But wait! On closer inspection, the glyph width values for “a” in the font dictionary (601) and in the embedded font program (1230) differ as well, without raising a validation error. But if consistency of glyph width information does not mean equal width values, what does it mean then?

Here is one more bit of background information: To make them scalable, typographic measurements like glyph widths are often specified relative to the current point size which is referred to as a unit called “em”. This is done in the font dictionary where “glyph widths shall be measured in units in which 1000 units correspond to 1 unit in text space” [ISO 32000-1:2008, page 255], meaning an “a” glyph should be displayed with a width of 601/1000 em, or 60.1 % of the current point size. This is also done in the embedded font program – but with another base value which can be found in the “Em size” field of the font information window in FontForge: 2048 instead of 1000. Consequently, a comparison of glyph widths has to take these different base values into account, leading to the following consistency condition:

Let \(w_d\) be a glyph width value in the font dictionary, and let \(w_e\) be the corresponding glyph width value in the embedded font program. Then the two glyph widths are consistent if \(\frac{w_d}{1000} = \frac{w_e}{2048}\).

And lo and behold, with adequate rounding applied, this condition holds for “a”: \(\frac{601}{1000} = 0.601 ≈ 0.600585938 = \frac{1230}{2048}\)

However, this immediately raises another question: To what extent may rounding errors be tolerated? Adobe’s PDF architect Leonard Rosenthol explains in a mailing list post that “with all this conversion math going on, there is ALWAYS a chance for rounding errors and the like which can impact the PDF/A requirement of ‘match’. That is why in PDF/A-2 and the second corrigenda for PDF/A-1, we’ve made it clear how to handle the ‘floating’ …” Indeed, section 6.2.11.5 of ISO 19005-2:2011 (PDF/A-2) clarifies this issue: “For ISO 19005, consistent is defined to be a difference of no more than 1/1000 unit.” This leads to the following refined consistency condition (the bars denote absolute value):

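Let \(w_d\) and \(w_e\) be defined as above. Then the two glyph widths are consistent if \(|\frac{w_d}{1000} - \frac{w_e}{2048}| ≤ 0.001\).

This condition still holds for “a”, but not for “i”: \(|\frac{272}{1000} - \frac{562}{2048}| = 0.002414063 > 0.001\)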
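
The refined condition with its tolerance of 1/1000 unit is easy to check mechanically. The following Python sketch uses exact fractions so that floating point noise cannot influence the result; the function name is_consistent and its parameters are assumptions made for this illustration, with the em size of the example font (2048) as the default.

```python
from fractions import Fraction


def is_consistent(w_d, w_e, units_per_em=2048, tolerance=Fraction(1, 1000)):
    """Check whether a font dictionary width matches a font program width.

    w_d -- glyph width from the font dictionary (1000 units per em)
    w_e -- glyph width from the embedded font program (units_per_em per em)
    """
    difference = abs(Fraction(w_d, 1000) - Fraction(w_e, units_per_em))
    return difference <= tolerance


print(is_consistent(601, 1230))  # values for "a"
print(is_consistent(272, 562))   # values for "i"
```

The first call reports a consistent pair, the second an inconsistent one, in line with the validation result.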

Down to the gory details, this is what is meant by inconsistent glyph width information: Two numbers, related to their respective base values, deviate too much from each other.

Solution

Now that the validation error has been traced and understood, the question is whether it can be fixed. Luckily, the correct font dictionary value can easily be obtained by solving the equation in the above (first) consistency condition for \(w_d\) and rounding to the nearest integer: \(w_d = \frac{w_e}{2.048}\). For “i” this results in 562/2.048 = 274.4140625 ≈ 274 instead of 272.
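
As a minimal sketch, the same calculation in Python (the function name corrected_width is made up for this illustration, and the em size defaults to the 2048 of the example font):

```python
def corrected_width(w_e, units_per_em=2048):
    """Scale a font program width to the 1000 units per em of the font dictionary.

    Note that Python's round() rounds exact halves to the even neighbour,
    which is still one of the two nearest integers.
    """
    return round(w_e * 1000 / units_per_em)


print(corrected_width(562))   # "i": 274.4140625 is rounded to 274
print(corrected_width(1230))  # "a": 600.5859375 is rounded to 601
```

For “a” the result equals the value already present in the widths array, so only “i” needs to be changed.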

After having thus calculated the correct glyph width value, fixing the error is just a matter of replacing 272 by 274 at the position of “i” in the widths array (and likewise for “l”). This can be accomplished manually in a text editor, even though it is usually a bad idea to mess with a PDF file in this way (because it’s not a strictly text based format).
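
Finding the entries that need such a replacement can be automated once both sets of widths are available. The sketch below works on plain Python data rather than on a PDF file: how the widths array and the font program widths are read from the file is left open, and the name find_corrections and the shape of its arguments are assumptions for this illustration.

```python
from fractions import Fraction


def find_corrections(widths, first_char, font_widths, units_per_em=2048):
    """Return {character code: (old w_d, new w_d)} for inconsistent entries.

    widths      -- the widths array from the font dictionary
    first_char  -- the FirstChar entry of the font dictionary
    font_widths -- {character code: w_e} taken from the embedded font program
    """
    corrections = {}
    for code, w_e in font_widths.items():
        w_d = widths[code - first_char]
        # Exact font program width, scaled to 1000 units per em.
        target = Fraction(w_e * 1000, units_per_em)
        # A difference of more than 1 here is a difference of more than
        # 1/1000 unit, which is the tolerance quoted from PDF/A-2.
        if abs(w_d - target) > 1:
            corrections[code] = (w_d, round(target))
    return corrections


# Excerpt of the example widths array, covering "a" (97) to "l" (108).
excerpt = [601, 623, 521, 623, 596, 352, 623, 633, 272, 344, 592, 272]
# Font program widths for "a" and "i" as read from FontForge above.
print(find_corrections(excerpt, 97, {97: 1230, 105: 562}))
```

Only the entry for “i” (code 105) is reported, with 272 as the old and 274 as the new value.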

Here is a fixed variant of the example file given above that indeed validates without complaints.

Final questions

Why not adapt the embedded font program instead of the font dictionary?

That would work as well. However, there are two reasons to adapt the font dictionary. Firstly, the font dictionary acts as a cache, so the embedded font program should be considered the leading data structure. Secondly, the glyph widths in the embedded font program agree with the widths in the Verdana font shipped with Windows (C:\Windows\Fonts\verdana.ttf). Since Verdana was made by Microsoft, this is a strong argument for trusting the embedded font program.

Why fix the error manually?

To thoroughly comprehend the error, and to keep changes in the file to a minimum. Of course, the error could also be eliminated by bluntly converting the invalid PDF/A file (not the original Word file) to PDF/A again with some PDF authoring/conversion software like Adobe Acrobat Professional. (This is sometimes referred to as “refrying”.) However, when I tried this with several tools, the visual appearance remained untouched but the file’s internals were heavily changed, making it impossible to judge what had happened. In fact, such conversions create a whole new file, which is OK because that’s what a conversion is expected to do. From the digital preservation perspective, however, the “minimally invasive” manual approach, modifying only the offending parts of a file while keeping the rest in its original state, may be more desirable. (And of course, when lots of files have to be repaired, the “manual” method could still be automated.)

Is this a bug in Microsoft Word?

I think so. But who knows, without a public issue tracker. I asked them, but sadly I did not get past first level support.

About the rounding again …

In the previous section the font dictionary value \(w_d\) was calculated by solving the equation in the consistency condition for \(w_d\) and rounding to the nearest integer. The avid reader may wonder whether this rounding step doesn’t introduce imprecisions that cause problems later on when comparing glyph widths. It does not; the range of tolerance allowed by the “difference of no more than 1/1000 unit” phrase eliminates just the right amount of imprecision. If you really need to know, here’s a proof:

The refined consistency condition \(|\frac{w_d}{1000} - \frac{w_e}{2048}| ≤ 0.001\) can be simplified to \(|w_d - \frac{w_e}{2.048}| ≤ 1\), which has the same meaning as \(-1 ≤ w_d - \frac{w_e}{2.048} ≤ 1\) and as \(\frac{w_e}{2.048} - 1 ≤ w_d ≤ \frac{w_e}{2.048} + 1\). This condition should still hold when \(w_d\) is rounded: \(\frac{w_e}{2.048} - 1 ≤ round(w_d) ≤ \frac{w_e}{2.048} + 1\). With \(w_d = \frac{w_e}{2.048}\) (the exact, unrounded value) this becomes \(w_d - 1 ≤ round(w_d) ≤ w_d + 1\). When rounding up, \(round(w_d)\) stays within the range \(w_d ≤ round(w_d) < w_d + 1\), and when rounding down, it stays within the range \(w_d - 1 < round(w_d) ≤ w_d\). So the condition will hold no matter which rounding method (up, down, or nearest integer) is applied.
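
For readers who prefer an experiment to a proof, the claim can also be checked by brute force. The Python sketch below tries every font program width from 0 to 4096 (an arbitrary upper bound picked for this illustration) with an em size of 2048 and all three rounding methods, again using exact fractions; the function names are made up here.

```python
import math
from fractions import Fraction


def within_tolerance(w_d, w_e, units_per_em=2048):
    """The refined consistency condition with a tolerance of 1/1000 unit."""
    difference = abs(Fraction(w_d, 1000) - Fraction(w_e, units_per_em))
    return difference <= Fraction(1, 1000)


def rounding_is_safe(max_w_e=4096, units_per_em=2048):
    """Check that rounding down, up or to nearest never breaks consistency."""
    for w_e in range(max_w_e + 1):
        exact = Fraction(w_e * 1000, units_per_em)  # unrounded w_d
        for rounded in (math.floor(exact), math.ceil(exact), round(exact)):
            if not within_tolerance(rounded, w_e, units_per_em):
                return False
    return True


print(rounding_is_safe())
```

The check succeeds for every width in the tested range, as the proof predicts.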