A file extension is a good start, but it must be confirmed that the file is correctly
named, and also which variation, version or sub-type has been used within the
structure of the file. Microsoft Word files were famously incompatible between
versions and yet used the same file extension (.doc).⁶
OPF maintains a tool, FIDO, that performs file identification, and The National
Archives of the UK maintains an equivalent tool called DROID. The principal way
these tools work is to look inside the file for signature strings of data. For
example, a PDF file begins with the tell-tale string "%PDF-1.2" or similar,
where the number indicates the version of the PDF specification.
The main challenge in identifying files using this technique is maintaining the list
of signature strings. Most of the tools, and indeed virtually every organisation, rely
upon the authoritative database of file signatures called PRONOM, which is
maintained by The National Archives of the UK.
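The signature approach described above can be sketched in a few lines of Python. This is a minimal illustration only, not how FIDO or DROID are implemented: the `SIGNATURES` table and the `identify` function are hypothetical names, the table is hand-written rather than loaded from PRONOM, and real signatures are considerably richer than a fixed prefix at the start of the file.

```python
from typing import Optional

# Hypothetical, hand-written signature table for illustration only.
# Real tools take their signatures from PRONOM rather than hard-coding them.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format name based on the leading bytes, or None if unknown."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            if name == "PDF":
                # The three characters after "%PDF-" carry the version, e.g. "1.2".
                version = data[5:8].decode("ascii", errors="replace")
                return f"PDF {version}" if version else "PDF"
            return name
    return None
```

Calling `identify(b"%PDF-1.2\n...")` returns `"PDF 1.2"` regardless of what the file happens to be called, which is the point of looking inside the file rather than trusting the extension.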
In software engineering it is generally held that the standard case is relatively
easy: clearly identifiable common cases work well (the so-called 'happy path'). It is
the unusual, different, malformed or erroneous cases that cause most of the
work. This is very much the case in file identification too.
One significant complication is the idea of a container. A container is a file type
that contains other files, the most common being a zip file.
Identifying the container may be straightforward, but it is not sufficient, since
much or all of the information is in the files wrapped up inside. The container must
be expanded and the individual items it contains identified and made available for
further analysis. A simple zip may hold one or more files of one or more types, and a
directory or folder structure that may itself provide crucial context and metadata.
(Filenames may be duplicated across different branches of a directory inside a zip
file.)
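As a rough sketch of expanding a container, the following uses Python's standard `zipfile` module to list each member with its full path and leading bytes, so that the folder context is kept and each item can be passed on for identification. The function names and the sample archive contents are assumptions made for illustration.

```python
import io
import zipfile
from typing import Dict, List


def list_members(zip_bytes: bytes) -> List[Dict[str, object]]:
    """Return one record per file in the zip, keeping the full internal path."""
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # folders carry context via the paths of their files
            with archive.open(info) as member:
                head = member.read(8)  # enough bytes for a simple signature check
            records.append(
                {"path": info.filename, "size": info.file_size, "leading_bytes": head}
            )
    return records


def build_sample_zip() -> bytes:
    """Build a small in-memory zip with the same filename in two folders."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("2019/report.pdf", b"%PDF-1.4 sample")
        archive.writestr("2020/report.pdf", b"%PDF-1.7 sample")
    return buffer.getvalue()
```

Both members are called `report.pdf`, so only the full path (`2019/report.pdf` versus `2020/report.pdf`) distinguishes them. That is why the directory structure needs to be recorded alongside the items themselves.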
Probably one of the more complex containers is a complete disk image. The image
itself may be the item to be saved, but all of the sub-items within
that image have to be identified before validation and characterisation of each
sub-item can be carried out. Clearly some of the items contained in a disk image could
be containers - perhaps even more disk images - so the process must be recursive.
Happily, the tool BitCurator has been developed specifically to assist with disk images,
as this is a complex form of 'exploding the container'.
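The recursive nature of 'exploding the container' can be illustrated with nested zip files standing in for disk images. This is a simplified sketch and says nothing about how BitCurator itself works; `explode`, `make_zip` and the `max_depth` guard are hypothetical names introduced here.

```python
import io
import zipfile
from typing import Dict, Iterator, Tuple


def explode(data: bytes, path: str, max_depth: int = 10) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for every non-container item, descending into nested zips."""
    if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Recurse: a member may itself be a container.
                yield from explode(
                    archive.read(info), f"{path}/{info.filename}", max_depth - 1
                )
    else:
        yield path, data


def make_zip(members: Dict[str, bytes]) -> bytes:
    """Helper for the example: build an in-memory zip from a name-to-bytes mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

For example, with `inner = make_zip({"a.txt": b"alpha"})` and `outer = make_zip({"docs/inner.zip": inner, "b.txt": b"beta"})`, `list(explode(outer, "outer.zip"))` yields the two leaf files with paths that record every container they passed through. The depth limit is a precaution against deliberately or accidentally deep nesting.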
The process of identification will provide an amount of information about a 'thing'.
It may be as simple as a statement of the file type, variety and version, or it may be as
much as a complete map of a complex container. All of this information must be
recorded as metadata and added to the collection of metadata associated with the
digital thing.
If the file is found to be broken at this stage, or indeed to be of a file type that is
undesirable, the process can be extended to transform the format of the thing.
Usually this would involve making a copy of the original, perhaps in a different
format. Once that has been done, the original and the transformed copy should
be compared to demonstrate that any differences are deliberate and are recorded with
the metadata. OPF looks after a tool (xCorrSound) that does this for certain audio
formats.
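The comparison step can be illustrated with a deliberately simple case: a text file migrated from Latin-1 to UTF-8. This is an analogy only and is unrelated to how xCorrSound compares audio; the function names and the report fields are assumptions. The bytes are expected to differ (that is the deliberate change), while the decoded content should not.

```python
import hashlib
from typing import Dict


def migrate_latin1_to_utf8(original: bytes) -> bytes:
    """Transform a Latin-1 encoded text file into a UTF-8 encoded copy."""
    return original.decode("latin-1").encode("utf-8")


def compare_migration(original: bytes, migrated: bytes) -> Dict[str, object]:
    """Compare original and copy, and return a record suitable for the metadata."""
    same_content = original.decode("latin-1") == migrated.decode("utf-8")
    return {
        "original_sha256": hashlib.sha256(original).hexdigest(),
        "migrated_sha256": hashlib.sha256(migrated).hexdigest(),
        "bytes_identical": original == migrated,   # expected to be False
        "content_identical": same_content,         # expected to be True
    }
```

A report in which `content_identical` is true and `bytes_identical` is false shows that the only difference is the intended change of encoding, and the two checksums give a record that can be kept with the metadata.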
Once the 'thing' has been identified, we move on to check whether the thing is well-formed
and complies with the rules for that format (Validation), and also to find out more
details such as author, creation time, tools used to create the thing and more
(Characterisation).
Validation and Characterisation are logically separate processes that both require the
structure of the thing to be opened up and examined, or parsed. That opening up
requires an understanding of the low-level structure used in the thing. However, the
two tasks look at very different aspects. The key tool used for this is OPF's
JHOVE. Whilst the principle applies across all file types, each file type requires a
specific process. Validating a text-based format such as PDF⁷ and validating an image
format such as JPEG⁸ (JPG) require very different logic and coding.
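To show how one parse can serve both purposes, here is a much-reduced sketch for JPEG. It is not JHOVE's method and checks only a tiny part of the format: that the file starts and ends with the expected markers and that the segments before the image data are laid out consistently (validation), while reading width, height and bit depth from the frame header (characterisation). The function names and the hand-built sample, which is a structural fragment rather than a decodable image, are assumptions.

```python
import struct
from typing import Dict


def parse_jpeg(data: bytes) -> Dict[str, object]:
    """Walk the JPEG segment structure, collecting errors and basic properties."""
    errors = []
    properties = {}
    if not data.startswith(b"\xff\xd8"):
        errors.append("missing SOI marker at start of file")
    if not data.endswith(b"\xff\xd9"):
        errors.append("missing EOI marker at end of file")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            errors.append(f"expected a marker at offset {pos}")
            break
        marker = data[pos + 1]
        if marker in (0xD9, 0xDA):  # end of image, or start of scan data
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            errors.append(f"bad segment length at offset {pos}")
            break
        if marker in (0xC0, 0xC2) and length >= 8:  # baseline or progressive frame
            precision, height, width, components = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10]
            )
            properties = {
                "width": width,
                "height": height,
                "bits_per_sample": precision,
                "components": components,
            }
        pos += 2 + length
    return {"well_formed": not errors, "errors": errors, "properties": properties}


def build_sample_jpeg() -> bytes:
    """Hand-built fragment: SOI, one frame header (3 wide, 2 high), EOI."""
    frame = b"\xff\xc0" + struct.pack(">HBHHB", 11, 8, 2, 3, 1) + b"\x01\x11\x00"
    return b"\xff\xd8" + frame + b"\xff\xd9"
```

If the final two bytes are cut off the sample, `well_formed` becomes false while the width and height are still reported. The two questions draw on the same parse but give independent answers.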
Validation is confirming that the file conforms to the current understanding of
the format. Typically that means checking every aspect of the format and the file's
compliance with that standard. OPF recently took part in the PREFORMA project to
develop a new PDF/A validator, veraPDF. Working in association with the PDF
Association and adhering tightly to the standards for PDF/A, the project produced
not only a validation tool that looked into more corners of the specification than
ever before, but also a comprehensive body of test data (test corpus).
Modern file formats have developed over a number of years and many are
remarkably complex in themselves. While this makes those formats very functional,
it does add complexity to validation, since few implementations of any given
specification will be identical. Again, files that comply are straightforward;
the complexity arises when there is a real or technical non-compliance. Any given
non-compliance may not necessarily be severe enough to prevent a future format
reader making sense of the document - but can we be sure of that? Does a 'format
non-compliance' mean the same for all types of content? For example, the definition
of a colour palette may be incomplete in a document: no problem for a black and
white text page, but a massive issue for a high-definition picture of fine art. It may be that
today's rendering tool automatically compensates for the colour palette, or the local
language settings, or any one of a range of contexts set in the file.
One format that most of us assume to be safest - PDF - is defined by a particularly
complex specification. Files commonly contain errors, when
compared to the specifications, that today's renderers ignore or pass over but that in
theory could cause major issues in the future. It comes back to the librarian or
archivist to determine the criticalities for the type of content that they are
martin wrigley, becky mcguinness, carl wilson the open preservation foundation
reference toolset
6 The DOC format may use one of four different formats on a PC, from different versions of Word: Word for DOS,
Word for Windows, Word 6 and Word 97. Since Word 2007 the .DOCX format is used.
7 Portable Document Format (PDF) is a file format developed in the 1990s to present documents, including text
formatting and images, in a manner independent of application software, hardware, and operating systems.
Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat
document, including the text, fonts, vector graphics, raster images and other information needed to display
it. Since 2008 PDF has been standardized as a royalty-free open format, ISO 32000 (and others).
8 JPEG: Joint Photographic Experts Group, an ISO standard format for images.