PDF Days Europe 2017

Last week I attended the “PDF Days Europe 2017” conference in Berlin. While a whole conference dedicated to one file format may sound funny, it was in fact quite interesting. Here’s my personal summary of the most important topic, the soon-to-be-published PDF 2.0 ISO standard. Most of the facts (but not necessarily the opinions) stated below are taken from presentations given by Adam Spencer from Accessibil-IT, Leonard Rosenthol from Adobe, Matt Kuznicki from Datalogics, Roman Toda from Normex and the ISO 32000 Project Leaders Duff Johnson and Peter Wyatt; if some fact is plain wrong, that’s probably my fault.

PDF 2.0, issues and solutions

There are two main issues with current PDF 1.x that will be addressed by the coming version of the ISO standard:

  1. With its static, non-responsive layout, “PDF is not designed for the modern era” (Rosenthol/Kuznicki) and is not well suited for mobile devices.
  2. PDF in general has no machine-readable semantic structure, which causes problems with accessibility, with automated processes and with the extraction of data from a PDF file. A common quote says that “PDF is where data goes to die”.

These problems led to the questioning of PDF’s “basic principle and biggest strength” (Toda), the precise layout, which significantly affected the standard’s development towards next generation PDF, also known as PDF 2.0 or ISO 32000-2.

PDF 2.0 will retain its precise imaging model and visual appearance, as well as its reliable document rendering, printing and exchange capabilities. It will remain a document-centric and self-contained format, not a “website in a can” (Toda). However, next generation PDF will not be completely static anymore but adaptable to the circumstances and the device used for rendering, in particular its screen size. It will be able to reflow its content to suit the user’s needs and viewing preferences. This is a bit of a paradigm shift from having the same layout everywhere to a responsive design approach, but “the world of one single representation is over” (Toda).

Both the adaptable layout and the improved accessibility and content reusability will be based on a revision, enhancement and formalization of tagged PDF. Tagged PDF embeds roughly HTML-like tags into a PDF file to identify elements of its semantic structure like headings, paragraphs and lists. It has been available since PDF 1.4 (see ISO 32000-1:2008, section 14.8) and is required by several PDF sub-standards like PDF/UA, PDF/A-1a and PDF/A-2a. In PDF 2.0, tagged PDF will see some changes like new tags, deprecated tags and namespaces for tags (cf. namespaces in XML) that allow integration of custom elements in the tagging model.

Now, if the whole responsive design thing reminds you of something, you are probably right. Rendering of next generation PDF will in fact be done by something called “derivation”, a standardized, predictable and interoperable model for representing PDF content as HTML: essentially an ad hoc transformation of PDF to HTML that utilizes a mapping of PDF tags to HTML elements. In this transformation, styling is derived from three sources: the PDF’s fonts and layout, additional information that may be provided by attributes added to the PDF tags, and CSS stylesheets in (embedded or external) associated files. Finally, styling may be supplemented by the viewer with the user’s preferred CSS.
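To make the idea of a tag-to-element mapping a bit more tangible, here is a minimal sketch in Python. It is not the derivation algorithm from the standard: the mapping table, the nested-tuple representation of the structure tree and the function name derive_html are all my own assumptions for illustration, and styling is left out entirely.

```python
from html import escape

# Illustrative mapping from a few tagged-PDF structure types to HTML elements.
# This table is an assumption for the sketch, not the normative mapping.
TAG_TO_HTML = {
    "Document": "article",
    "H1": "h1",
    "H2": "h2",
    "P": "p",
    "L": "ul",
    "LI": "li",
    "Table": "table",
    "TR": "tr",
    "TH": "th",
    "TD": "td",
    "Figure": "figure",
}


def derive_html(node):
    """Recursively turn a (tag, children) node into an HTML string.

    A node is either a plain string (text content) or a tuple of
    (structure type, list of child nodes). Hypothetical data model.
    """
    if isinstance(node, str):
        return escape(node)
    tag, children = node
    # Unknown (e.g. custom) tags fall back to a generic container.
    element = TAG_TO_HTML.get(tag, "div")
    inner = "".join(derive_html(child) for child in children)
    return f"<{element}>{inner}</{element}>"


# A tiny hand-written structure tree standing in for a tagged PDF.
structure_tree = ("Document", [
    ("H1", ["PDF Days Europe 2017"]),
    ("P", ["Tagged PDF & derivation"]),
    ("L", [("LI", ["reflow"]), ("LI", ["accessibility"])]),
])

html_output = derive_html(structure_tree)
print(html_output)
```

Once the content is HTML, a viewer can reflow it for the screen at hand and layer the user’s own CSS on top, which is the point of the whole exercise.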

Although it is certainly smart to leverage the responsive web experience of recent years, this approach provokes one question: Why bother with the PDF overhead at all instead of using HTML directly? Answers to this question mentioned PDF’s status as a well-established document format, its document- and page-centric focus and the belief that authoring environments for PDF are more comfortable from a user’s point of view. I guess some (web) developers will beg to differ here, but be that as it may. Anyway, HTML derivation is not only useful to create dynamic layouts but also for accessibility purposes: Most assistive technologies today don’t properly support PDF, but they do support HTML.

Another boost for accessibility and data reusability is expected from the stronger integration of additional resources in associated files. As an example, consider a pie chart in a PDF file. In a regular PDF 1.x file, this is usually just an image. A PDF 2.0 file, however, may also contain the “hidden”, actual plain data (e.g., a table of numbers or an XML document) that is represented by the pie chart, together with a relation between the chart and the associated data (i.e., the data is connected to the image object in a machine-readable way). Thanks to this construct, a regular PDF viewer could just show the pie chart, or it could also allow the user to copy the data that hides behind the chart. A PDF screen reader could take the data and offer it to its user in another, more convenient way. And some automated process could use the PDF file as a data container, extract the data and use it for whatever purpose.
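The pie chart example can be sketched as a small data model. This is only a toy illustration in Python, not how a PDF library exposes associated files: the classes AssociatedFile and Figure, the field names, the "Data" relationship label and the sample numbers are all made up for this sketch.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class AssociatedFile:
    """A file attached to a PDF object (hypothetical, simplified model)."""
    name: str
    media_type: str
    relationship: str  # illustrative label saying how the file relates to the object
    content: bytes


@dataclass
class Figure:
    """An image object, e.g. a pie chart, with optional associated files."""
    alt_text: str
    associated_files: list = field(default_factory=list)


def extract_table(figure):
    """Return the rows of the first CSV data file linked to the figure, or None."""
    for af in figure.associated_files:
        if af.relationship == "Data" and af.media_type == "text/csv":
            text = af.content.decode("utf-8")
            return list(csv.reader(io.StringIO(text)))
    return None


# Invented sample values, only there to have something to extract.
chart = Figure(
    alt_text="Pie chart with two slices",
    associated_files=[
        AssociatedFile("chart-data.csv", "text/csv", "Data",
                       b"slice,share\nA,60\nB,40\n"),
    ],
)

rows = extract_table(chart)
print(rows)
```

A viewer would use the same link to offer a “copy data” action, and a screen reader could read out the rows instead of describing an image.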

Other things I thought sounded interesting are XMP metadata as a first-class requirement, improved encryption and digital signature technology, geospatial features that enable use cases for maps, and the fact that the ISO standard will describe both interactive and non-interactive processing.

The digital preservation perspective

The conference was mainly attended by industry professionals. I counted only four guys from the digital preservation community, which was a bit of a surprise considering the significance of PDF and PDF/A for archives and libraries. So what impact will next generation PDF have on our field?

There will be a new PDF/A version based on PDF 2.0, but this seems to be far from ready yet.

Regarding the new features, better accessibility and semantic structure are certainly great since they also offer better searchability and fancy stuff we cannot even imagine yet.

The responsive layout part is in fact something of a paradigm shift compared to the idea of PDF as digital paper. Paper documents do not dynamically change their layout when the page size changes (e.g., when copying a page to a smaller paper size), so so much for the significant property of immutable visual appearance … But is this really bad? Obviously, PDF will comply with the immediate paper analogy less than we were used to, which may be disconcerting for a profession that has been relying on paper documents for centuries. On the other hand, the same arguments that hold for “regular” PDF concerning adaptability also hold for PDF/A: a static layout on a tiny screen is just useless. Besides that, offering different layouts on different devices does not mean a permanent loss of information: although some layouts may possibly even hide some data (we can see similar effects in responsive websites), no information will actually disappear from the PDF file. To some extent, we are trading in the idea of exactly one canonical visual representation for a greatly improved semantic structure and accessibility. Sounds fair to me. So there will probably be lots of use cases where these new features will be useful, and there will probably also be some use cases where they will make a digital archive consider an alternative file format for long-term preservation. Anyway, it will definitely be interesting to see which features make it into the new version of PDF/A.

Further reading

Presentations and videos are available on the website of the PDF Association.