PDF/A-3u as an archival format for Accessible mathematics

Section 1 Introduction

PDF/A is being adopted by publishers and Government agencies for the long-term preservation of important documents in electronic form. There are a few variants, which pay more or less regard to Accessibility considerations; i.e., `a' for accessible, `b' for basic, `u' for (presence of) Unicode mappings for all font characters. Later versions [pdfA2, pdfA3] of this ISO standard [pdfA] allow for other file attachments in various data formats. In particular, the PDF/A-3u variant allows the inclusion of embedded files of arbitrary types, to convey supplementary descriptions of technical portions of a document's contents.

`Accessibility' is more relevant for reports and text-books than for research outputs. In fact, in some countries it is a legal requirement that when a visually-impaired student enrols in a unit of study for which a text-book is mandated as `Required', a fully accessible version of the contents of that book must be made available. Anecdotally, visually-impaired students of mathematics and related fields much prefer mathematical material to be made available as LaTeX source, rather than in any other format. With a Braille reader, this is text-based and sufficiently compact that expressions can be read and re-read with ease, until a full understanding has been achieved. This is often preferable to having an audio version [raman1, raman2], which is less easy to navigate. Of course, having both a well-structured audio version and the textual source is even more useful. The PDF example [ExPDFUA] accompanying this paper in fact has both, though here we concentrate on how the latter is achievable within PDF documents.

Again anecdotally, the cost of reverse-engineering all the mathematical expressions within a complete textbook is typically of the order of £10,000 or AUD 30,000 or CAD 10,000. This cost would have been dramatically reduced if the PDF had originally been created to include a LaTeX or MathML description of each expression, attached or embedded for recovery by the PDF reader or other assistive technology. (This is distinct from including the complete LaTeX source of the whole document. There are many reasons why an author, and hence the publisher, might not wish to share the manuscript, perhaps because of extra information commented out throughout the source that is not intended for general consumption.) How to do this in PDF is the purpose of this paper.

The method of Associated Files, which is already part of the PDF/A-3 standard [pdfA3], is set to also become part of the ISO 32000-2 (PDF 2.0) standard [PDF20], which should appear some time in 2014 or 2015. In Sect. 3.1 this mechanism is discussed in more detail, showing firstly how to include the relevant information as attachments, which can be extracted using tools in the PDF browser. The second aspect is to relate the attachments to the portion of content as seen onscreen, or within an extractable text-stream. This can be specified conveniently in two different ways. One way requires structure tagging to be present (i.e., a `Tagged PDF' document), while the other uses direct tagging with an /AF key within the content stream. In either case a PDF reader needs to be aware of the significance of this /AF key and its associated embedded files.
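As a rough illustration of the two ways of relating an attachment to content, the following Python sketch prints the kind of PDF dictionaries involved. It is not the coding used in the example document: the helper names, the object numbers (`12 0 R` and so on), the file name and the resource name `AF1` are all hypothetical, and only the keys relevant to Associated Files are shown.

```python
# Sketch only: object numbers, file names, and helper names are illustrative.
AF_RELATIONSHIPS = {"Source", "Data", "Alternative", "Supplement", "Unspecified"}


def filespec(filename, embedded_stream_ref, relationship="Supplement"):
    """Return a file specification dictionary for one attachment.

    Assumes `filename` is plain ASCII without parentheses or backslashes,
    so that it needs no escaping inside a PDF literal string.
    """
    if relationship not in AF_RELATIONSHIPS:
        raise ValueError(f"unknown /AFRelationship value: {relationship}")
    return (
        f"<< /Type /Filespec /F ({filename}) /UF ({filename}) "
        f"/EF << /F {embedded_stream_ref} >> "
        f"/AFRelationship /{relationship} >>"
    )


def formula_struct_elem(filespec_refs):
    """Way 1: attach the files to a /Formula structure element (Tagged PDF).

    Only the keys relevant here are shown; a real structure element also
    carries its parent and content entries.
    """
    refs = " ".join(filespec_refs)
    return f"<< /Type /StructElem /S /Formula /AF [ {refs} ] >>"


def direct_af_content(resource_name, content_ops):
    """Way 2: mark the content directly with an /AF tag in the content stream.

    `resource_name` is assumed to name an entry in the page's /Properties
    resources that lists the associated file specifications.
    """
    return f"/AF /{resource_name} BDC\n{content_ops}\nEMC"


if __name__ == "__main__":
    print(filespec("formula-1.tex", "12 0 R", "Source"))
    print(formula_struct_elem(["11 0 R", "13 0 R"]))
    print(direct_af_content("AF1", "BT /F1 10 Tf (x) Tj ET"))
```

In both cases the attachment itself is the same embedded file; what differs is only where the /AF reference lives, in the structure tree or in the content stream.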

With careful use of the /ActualText attribute of tagged content, LaTeX (or other) source coding of mathematical expressions can be included within a PDF document, virtually invisibly, yet extractable using normal Select/Copy/Paste actions. A mechanism, using very small space characters inserted before and after each mathematical expression, is discussed in Sect. 4. This is applicable with any PDF file, not necessarily PDF/A. It is important that these spaces not interfere with the high-quality layout of the visual content in the document, so we refer to them as `fake spaces'.
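To make the `fake space' idea concrete, here is a minimal Python sketch that emits a marked-content span whose visible content is a single tiny space, and whose /ActualText carries the LaTeX source. The font resource name `/F1`, the font size `0.1`, and both function names are assumptions made for illustration; the actual coding used in the example document is the subject of Sect. 4.

```python
def pdf_text_string(text):
    """Encode text as a PDF hex string: UTF-16BE with a leading byte-order mark.

    A hex string avoids having to escape backslashes and parentheses,
    both of which are common in LaTeX source.
    """
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def fake_space_span(latex_source, font="/F1", size=0.1):
    """Return content-stream operators for one hypothetical 'fake space'.

    The span shows a single space glyph at a very small size, while
    /ActualText supplies the replacement text used for Copy/Paste.
    """
    return "\n".join([
        f"/Span << /ActualText {pdf_text_string(latex_source)} >> BDC",
        f"  BT {font} {size} Tf ( ) Tj ET",
        "EMC",
    ])


if __name__ == "__main__":
    print(fake_space_span(r"\frac{a}{b}"))
```

Because the replacement text is attached to a space rather than to the visible glyphs of the formula, the typeset mathematics itself is left untouched.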

The various Figures in this paper illustrate the ideas and provide a look at the source coding of a PDF document that includes all the stated methods, thus including the LaTeX source of each piece of mathematical content. (Where explicit PDF coding is shown, the whitespace may have been massaged to conserve space within the pages of this paper.) Indeed the example document includes as many as 7 different representations of each piece of mathematical content:

the visual form, as typically found in a PDF document;
the LaTeX source, in two different ways; i.e., as an attachment associated with a /Formula structure tag and also associated directly to the (visual) content, and as the /ActualText replacement of a `fake space';
a MathML version as an attachment, also associated to the /Formula structure tag and also associated directly to the (visual) content;
a MathML representation through the structure tagging;
words for a phonetic audio rendering, to be spoken by `Read Out Loud';
the original LaTeX source of the complete document, as a file attachment associated with the document as a whole.

In practice not all these views need be included to satisfy `Accessibility' or other requirements. But with such an array of representations, it is up to the PDF reading software to choose those which it wants to support, or which to extract according to particular requirements of end-users. It is remarkable that a single document can be so enriched, yet still conform to a standard such as PDF/A-3u; see [pdfA3]. Indeed, with all content being fully tagged, this document would also validate for the stricter PDF/A-3a standard, apart from the lack of a way to specify the proper rôle of MathML structure tagging, so that tags and their attributes are preserved under the `Save As Other ... XML 1.0' export method when using Adobe's `Acrobat Pro' software. This deficiency will be addressed in PDF 2.0 [PDF20].

Methods used to achieve the structure tagging in the example document have been the subject of previous talks and papers [DML2009, CICM2013] by the author. It is not the intention here to promote those methods, but rather to present the possibilities for mathematical publishing and `Accessibility' that have been opened up by the PDF/A-3 and PDF/UA standards [pdfA3, PDF-UA1], and the `fake spaces' idea. The example document [ExPDFUA] is then just a `proof-of-concept' to illustrate these possibilities.

Since the PDF/A-3 standard [pdfA3] is so recent, and with PDF 2.0 [PDF20] yet to emerge, software is not yet available that best implements the `Associated Files' concept. The technical content of the Figures is thus intended to assist PDF software developers in building better tools in support of accessible mathematics. It details

exactly what kind of information needs to be included;
the kind of structures that need to be employed; and
how the information and structures relate to each other.

For those less familiar with PDF coding, the source snippets have been annotated with highlighting and extra words indicating the ideas and intentions captured within each PDF object. Lines are used to show relationships between objects within the same Figure, or `see Fig. Xx' is used where the relationship extends to parts of coding shown within a different Figure. Section 2.1 is supplied to give an overview of the PDF file structure and language features so that the full details in the Figures can be better understood and their rôle appreciated.
