PDF/A as OAIS SIP container? | Mad File Format Science
A proposal to use PDF/A as a Submission Information Package (SIP) under the Open Archival Information System (OAIS) model has generated a small stir on Twitter.
The aim of a SIP is to deliver a collection of documents in a form suitable for ingesting into an archive. It needs to have enough metadata to create a proper Archival Information Package (AIP). The model doesn’t specify what SIP format(s) an archive should accept. In practice, XML files following well-known archival schemas, such as METS for the overall package and PREMIS for preservation information, are popular.
The article “Beyond TIFF and JPEG2000: PDF/A as an OAIS Submission Information Package Container” proposes using PDF/A-2 or PDF/A-3 as a SIP container. It’s true that PDF/A-3 has the necessary capabilities, since it allows embedding of any kind of file within a PDF, without restriction, as well as XMP to store arbitrary metadata. An NDSA paper explores the idea of using PDF/A-3, not as a SIP, but for bundling related files within an archive. That’s a different scenario, in which the file is treated as a PDF document plus related materials. I won’t get into the pluses and minuses of that. The new proposal is to treat the “PDF-ness” of the file as purely incidental, and to use its embedding capabilities as a way of bundling submissions.
PDF is a horribly complex format designed for creating readable documents. The PDF/A standard imposes restrictions on this format for the sake of stability, without adding any new capabilities. You can embed anything in an ordinary PDF. That doesn’t make it a good idea. The NDSA paper notes: “The PDF/A-3 specification itself neither addresses use-cases for embedding non-PDF/A files within a PDF/A-3 file instance, nor motivates the addition of this capability.”
Software that creates XML files is easy to write. There are many tools for extracting machine-usable or human-readable information from XML. Reading PDF requires specialized software, and it’s much harder to write one-off tools for pulling information out of a PDF SIP. It’s also very hard to modify existing PDF files.
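To give a sense of how little code a one-off XML tool needs, here is a minimal sketch in Python using only the standard library. It lists the files named in a METS file section. The sample document, the file names in it, and the function name list_files are made up for illustration; the fragment is not a complete, schema-valid METS file (it has no structMap, for one thing), and a real SIP would carry far more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by METS and XLink.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, stripped-down METS fragment standing in for a real SIP manifest.
SAMPLE_SIP = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="F1" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/page001.tif"/>
      </mets:file>
      <mets:file ID="F2" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/page002.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml):
    """Return (ID, MIME type, location) for each mets:file in the document."""
    root = ET.fromstring(mets_xml)
    files = []
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        # Namespaced attributes are addressed as "{namespace-uri}name".
        href = None
        if flocat is not None:
            href = flocat.get("{%s}href" % NS["xlink"])
        files.append((file_el.get("ID"), file_el.get("MIMETYPE"), href))
    return files


if __name__ == "__main__":
    for file_id, mime_type, href in list_files(SAMPLE_SIP):
        print(file_id, mime_type, href)
```

Doing the equivalent with a PDF/A-3 container would mean pulling in a PDF library just to reach the embedded files, before any of the actual metadata work begins.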
But maybe I’m beating a horse that should be mercifully shot instead. The abstract conflates two completely unrelated issues: digitization of scanned documents and SIP containers. The title seems to say that PDF/A is a better SIP format than TIFF and JPEG2000. Granted, that’s true; it’s also a better SIP format than long-playing records, VHS tape, or a Tarot deck.
I’d have to pay a ridiculous $32 to read the article, so I’m going only by the abstract.