Formats

Apache PDFBox | A Java PDF Library

The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License v2.0.
https://pdfbox.apache.org

Features:
Extract Text

Extract Unicode text from PDF files.

Split & Merge

Split a single PDF into many files or merge multiple PDF files.

A Compressed View of Video Compression | PrestoCentre

Digital audio and digitised film can also be compressed, but there are particular issues – and an interesting (well, for some) history – for video, so I will emphasise video. The general principles apply to any signal (including audio and scanned film), but not to files and digital data in general.

TIFF/A Standard Initiative launched! | Digital meets Culture

As PREFORMA partner and expert in TIFF we would like to inform you about the launch of this initiative. We count on you to participate in the new TIFF/A specification: your expertise will be very useful! Please get involved in any of the 3 levels.

TIFF/A | Mad File Format Science

TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags.
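
To make the word-boundary point concrete, here is a minimal sketch in Java that lists the tags in a TIFF file's first IFD whose value offset is odd, i.e. not on a 2-byte word boundary. The class and method names (TiffOffsetCheck, oddOffsetTags) are hypothetical and chosen only for illustration. The sketch assumes a classic TIFF 6.0 layout (not BigTIFF), looks at the first IFD only, and skips entries whose field type is outside the twelve types defined in TIFF 6.0.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class TiffOffsetCheck {

    // Size in bytes of TIFF 6.0 field types 1..12 (index 0 is unused):
    // BYTE, ASCII, SHORT, LONG, RATIONAL, SBYTE, UNDEFINED,
    // SSHORT, SLONG, SRATIONAL, FLOAT, DOUBLE.
    private static final int[] TYPE_SIZE = {0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8};

    /** Returns the tags in the first IFD whose value offset is odd. */
    public static List<Integer> oddOffsetTags(byte[] tiff) {
        ByteBuffer buf = ByteBuffer.wrap(tiff);

        // Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Not a TIFF byte-order mark");
        }

        // Bytes 2-3 hold the magic number 42.
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Not a classic TIFF file");
        }

        // Bytes 4-7 hold the offset of the first IFD.
        int ifd = buf.getInt(4);
        int entryCount = buf.getShort(ifd) & 0xFFFF;

        List<Integer> odd = new ArrayList<>();
        for (int i = 0; i < entryCount; i++) {
            // Each entry is 12 bytes: tag, type, count, value-or-offset.
            int entry = ifd + 2 + 12 * i;
            int tag = buf.getShort(entry) & 0xFFFF;
            int type = buf.getShort(entry + 2) & 0xFFFF;
            long count = buf.getInt(entry + 4) & 0xFFFFFFFFL;

            if (type < 1 || type > 12) {
                continue; // unknown field type: size cannot be determined
            }
            if (count * TYPE_SIZE[type] <= 4) {
                continue; // value is stored inline, so there is no offset
            }

            long offset = buf.getInt(entry + 8) & 0xFFFFFFFFL;
            if (offset % 2 != 0) {
                odd.add(tag);
            }
        }
        return odd;
    }
}
```

Calling TiffOffsetCheck.oddOffsetTags(Files.readAllBytes(path)) on a file returns an empty list when every out-of-line value in the first IFD starts on a word boundary. A non-empty list does not mean the file is unreadable; as the excerpt notes, most software ignores this requirement.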

File identification tools, part 9: JHOVE2 | Mad File Format Science

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west.

File identification tools, part 8: NLNZ Metadata Extraction Tool | Mad File Format Science

The name of the NLNZ (National Library of New Zealand) Metadata Extraction Tool suggests getting metadata more than identifying files, but FITS uses it as part of its set of format identification tools. It employs a set of adapters to access the following file formats: BMP, GIF, JPEG, TIFF, MS Word, Word Perfect, Open Office, MS Works, MS Excel, MS PowerPoint, PDF, WAV, MP3, BWF, FLAC, HTML, XML, and ARC. It also has a generic adapter to report basic file system information about other files. It’s available as open source on SourceForge under the Apache Public License.

File identification tools, part 7: Apache Tika | File Formats Blog

Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.
