Keeping Up With the Joneses: The New Recommended Formats Statement
Submitted by Clyde on Thu, 23/07/2015 - 21:14
Issuing the Recommended Format Specifications
The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License v2.0.
https://pdfbox.apache.org
Features:
Extract Text
Extract Unicode text from PDF files.
Split & Merge
Split a single PDF into many files or merge multiple PDF files.
Digital audio and digitised film can also be compressed, but there are particular issues – and an interesting (well, for some) history – for video, so I will emphasise video. The general principles apply to any signal (including audio and scanned film), but not to files and digital data in general.
As a PREFORMA partner and an expert in TIFF, we would like to inform you about the launch of this initiative. We count on you to participate in the new TIFF/A specification: your expertise will be very useful! Please get involved at any of the 3 levels.
TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags.
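To make the word-boundary rule concrete, here is a minimal sketch in Java that reads the 8-byte TIFF header and tests whether the first IFD offset is even. The class and method names are hypothetical, chosen only for illustration. For brevity it looks at the IFD offset in the header rather than walking the directory entries, but the same evenness test is what a strict validator would apply to each value offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of any existing tool.
class TiffHeaderCheck {

    /** Returns the byte order declared by the first two header bytes ("II" or "MM"). */
    static ByteOrder byteOrder(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("A TIFF header is 8 bytes long");
        }
        if (header[0] == 'I' && header[1] == 'I') {
            return ByteOrder.LITTLE_ENDIAN;
        }
        if (header[0] == 'M' && header[1] == 'M') {
            return ByteOrder.BIG_ENDIAN;
        }
        throw new IllegalArgumentException("Missing TIFF byte-order mark");
    }

    /** Returns the offset of the first IFD, read as an unsigned 32-bit value. */
    static long firstIfdOffset(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(byteOrder(header));
        int magic = buf.getShort(2) & 0xFFFF;   // bytes 2-3 must hold the number 42
        if (magic != 42) {
            throw new IllegalArgumentException("Not a TIFF file: magic number is " + magic);
        }
        return buf.getInt(4) & 0xFFFFFFFFL;     // bytes 4-7 hold the first IFD offset
    }

    /** A "word" in TIFF 6.0 is two bytes, so an aligned offset is simply an even one. */
    static boolean isWordAligned(long offset) {
        return (offset & 1L) == 0;
    }

    public static void main(String[] args) {
        // Little-endian header whose first IFD sits at the odd offset 9.
        byte[] header = {'I', 'I', 42, 0, 9, 0, 0, 0};
        long offset = firstIfdOffset(header);
        System.out.println("First IFD offset " + offset
                + (isWordAligned(offset) ? " is" : " is not") + " word-aligned");
    }
}
```

A lenient reader would accept the odd offset and carry on, while a strict validator would flag it. That is the gap between the written specification and common practice described above.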
Why PDF/A validation matters, even if you don’t have PDF/A
As most people who read this blog know, the development of PDF didn’t end with the ISO 32000 (aka PDF 1.7) specification. Adobe has published three extensions to the specification. These aren’t called PDF 1.8, but they amount to a post-ISO version.
The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west.
The name of the NLNZ (National Library of New Zealand) Metadata Extraction Tool suggests getting metadata more than identifying files, but FITS uses it as part of its set of format identification tools. It employs a set of adapters to access the following file formats: BMP, GIF, JPEG, TIFF, MS Word, Word Perfect, Open Office, MS Works, MS Excel, MS PowerPoint, PDF, WAV, MP3, BWF, FLAC, HTML, XML, and ARC. It also has a generic adapter to report basic file system information about other files. It’s available as open source on SourceForge under the Apache Public License.
Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. The Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.