preserving, and as such the organisation needs a set of policies to determine which
validation errors can be ignored and which must be corrected prior to storage.
At the end of the validation process, the metadata that accompanies the digital
thing will be updated to carry details of the validation, the policy choices made and
the results of the checking. Both positive and negative metadata may be included;
once again, this is a choice of policy.
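As an illustration, the sketch below shows the kind of validation details that could be appended to an object's metadata; the field names and values are hypothetical and would in practice follow the organisation's chosen metadata schema and policy profiles.

```python
# A minimal sketch (hypothetical field names and values) of the validation
# details that might be appended to a digital object's metadata.
validation_record = {
    "validated_at": "2023-04-12T09:30:00Z",      # when the check was run
    "tool": "veraPDF",                           # validator used (see Table 1)
    "policy": "ignore-minor-font-warnings",      # which policy profile was applied
    "outcome": "accepted-with-warnings",         # positive and negative results both kept
    "errors_ignored": ["font not embedded (warning)"],
    "errors_corrected": [],
}
```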
Characterisation involves taking information about the thing and the nature of its
content to build a richer context beyond the purely technical, structural details.
Many formats hold a wealth of information concerning the content - even before
looking at the content itself.
For example, a Word document usually records the author, the time of creation, the
editing time, the word count and more. Sometimes a summary of the document is
included as well. Similarly, a JPEG image file may carry information about the camera
used to take it, the date and time, and often even geolocation information pinpointing
where it was taken.
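Such embedded details can be read programmatically, for instance with a library such as Pillow; the sketch below assumes that library and a hypothetical file name, and any of these tags may be missing from a given file.

```python
# Minimal sketch: reading embedded EXIF metadata from a JPEG using Pillow.
# Any of these tags may be absent; production code should handle that.
from PIL import Image
from PIL.ExifTags import TAGS

def read_exif(path):
    with Image.open(path) as img:
        exif = img.getexif()
    # Map numeric EXIF tag IDs to readable names such as "Model" or "DateTime".
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

info = read_exif("photo.jpg")  # hypothetical file name
print(info.get("Model"), info.get("DateTime"))
```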
However, the librarian or archivist needs to make a series of decisions about such
data. The instinctive response might be to extract everything, going as far, in a JPEG
for example, as the colour profile or the raw image sensor information. Such
information, it could be argued, is invaluable for checking quality should the file
format require migration.
The right level of detail depends on the organisation's intentions and the processes it
has in place. Once sufficient information has been captured to ensure that the
original file can be adequately indexed and examined in future, any further
descriptive data can be extracted at a later date. However, the presence of such data
in the file, and the policy governing which data is extracted and duplicated outside
the file, is itself a key piece of metadata to be stored alongside the digital thing.
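A minimal sketch of how such a policy might be applied: only a whitelisted subset of the embedded fields is duplicated outside the file, and the policy itself is recorded alongside the extracted values (all names here are hypothetical).

```python
# Sketch: apply an extraction policy and record it next to the extracted values.
EXTRACTION_POLICY = {
    "name": "minimal-descriptive-v1",   # hypothetical policy identifier
    "fields": ["Model", "DateTime"],    # embedded fields duplicated outside the file
    "note": "Remaining fields stay in the original file for later extraction.",
}

def apply_policy(embedded_metadata, policy=EXTRACTION_POLICY):
    extracted = {k: embedded_metadata[k] for k in policy["fields"] if k in embedded_metadata}
    # Store both the extracted values and the policy that produced them.
    return {"extracted": extracted, "extraction_policy": policy}
```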
Package and cross-check is a process not yet explicitly covered by OPF tools; it is
usually carried out through each organisation's own workflows, processes and report
formatting. This is where the gathered metadata is prepared for storage in the
database: in effect, one of the final stages of forming and wrapping the thing to be
stored into an appropriate archival information package.
There is a range of metadata formats and standards, such as PREMIS9 and METS10,
that prescribe a structure for metadata. Librarians and archivists need to determine
the most suitable form for their purposes and which of its elements to populate.
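As an illustration, the sketch below places a few gathered values into an XML fragment shaped loosely on a PREMIS object entry; the element names and namespace are taken from PREMIS version 3, but the fragment is illustrative rather than a complete, schema-valid record.

```python
# Sketch: wrap gathered metadata in a simple PREMIS-style XML fragment.
# Element choices are illustrative, not a schema-valid PREMIS record.
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"  # PREMIS version 3 namespace
ET.register_namespace("premis", PREMIS_NS)

def premis_object(identifier, format_name):
    obj = ET.Element(f"{{{PREMIS_NS}}}object")
    oid = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectIdentifier")
    ET.SubElement(oid, f"{{{PREMIS_NS}}}objectIdentifierType").text = "local"
    ET.SubElement(oid, f"{{{PREMIS_NS}}}objectIdentifierValue").text = identifier
    fmt = ET.SubElement(obj, f"{{{PREMIS_NS}}}format")
    ET.SubElement(fmt, f"{{{PREMIS_NS}}}formatName").text = format_name
    return ET.tostring(obj, encoding="unicode")

print(premis_object("example-object-001", "PDF/A-1b"))
```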
Having chosen the format and the required elements, the metadata gathered so far
must be placed into that format in readiness to archive the digital object. However,
there is one other aspect to consider here. There will be occasions where the found
metadata conflicts with the information provided; for example, the stated author may
differ from the author's name as discovered through characterisation of the digital
thing. Cross-checking at this stage prevents future ambiguity.
However, such a cross-check implies yet further decisions: which data is to be
trusted? Or should both be stored? Once again, the librarian or archivist has to
make a choice depending on their purposes. Without choosing the right level of
metadata to store, the volume of metadata can easily exceed that of the original item.
Items are usually selected based on a clear preservation need, and a fine balance is
needed between minimalist data and keeping everything.
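A minimal sketch of such a cross-check: provided and discovered values are compared field by field, agreements are confirmed, and conflicts are flagged with both values retained, so the decision of which to trust is recorded rather than made silently (all names hypothetical).

```python
# Sketch: cross-check provided metadata against values discovered during
# characterisation, keeping both values wherever they disagree.
def cross_check(provided, discovered):
    report = {}
    for field in set(provided) | set(discovered):
        p, d = provided.get(field), discovered.get(field)
        if p == d:
            report[field] = {"value": p, "status": "agreed"}
        else:
            report[field] = {"provided": p, "discovered": d, "status": "conflict"}
    return report

report = cross_check({"author": "J. Smith"}, {"author": "Jane Smith"})
# {'author': {'provided': 'J. Smith', 'discovered': 'Jane Smith', 'status': 'conflict'}}
```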
OPF Reference toolset and the digital preservation process map
Practical tool usage, scalability and quality control
OPF monitors the number of downloads of the tools that it supports, and is in
discussion with the users to design and implement a more sophisticated monitoring
framework. Naturally, many of the operations in which the tools are used are
sensitive, and OPF is mindful of preserving that confidentiality.
However, building a set of usage data will enable OPF and tool users to compile
statistics on the types of formats processed and the format errors found. This will help
eliminate systematic errors in generated documents and will quantify the
importance of various formats amongst the tool users.
The monitoring framework will also enable OPF to respond to any issues found with
the toolset itself. Today OPF runs a comprehensive testing process for each release of
a tool, and holds a set of test data for each format to ensure reliability and preserve
backwards compatibility wherever possible or practical.
9 PREMIS Data Dictionary: a comprehensive, practical resource for implementing preservation metadata
in digital archiving systems, with an accompanying report (providing context, data model and assumptions),
special topics, a glossary, usage examples, and a set of XML schemas developed to support use of the Data
Dictionary.
10 Metadata Encoding and Transmission Standard (METS) is a metadata standard for encoding descriptive,
administrative, and structural metadata regarding objects within a digital library, expressed using the XML
schema language. The standard is maintained as part of the MARC standards of the Library of Congress, and is being
developed as an initiative of the Digital Library Federation (DLF).
Tool Name | Part of the process | File types handled
Fido | Identification | Based on list of signatures in PRONOM
Format Sniff | Identification | Based on list of signatures in PRONOM
VeraPDF | Validation, Characterisation | PDF/A
Jpylyzer | Validation, Characterisation | JPEG 2000
Jhove | Validation, Characterisation | PDF, JPEG, TIFF, WAV, PNG, WARC, AIFF, XML, HTML, GZIP, ASCII text, UTF-8 text, MP3, GIF, JPEG 2000
xcorrsound | Quality Check | Sound files
DPF Manager | Validation, Characterisation | TIFF
PDF Test suite | Test Corpus | PDF/A
Table 1. OPF Tools and how they map
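As an illustration of how tools from the table might be chained in a simple workflow, the sketch below shells out to fido for identification and to JHOVE for validation and characterisation; the command names and arguments are assumptions (both tools are assumed to be installed and on the PATH), so consult each tool's documentation for the exact invocation.

```python
# Sketch: chaining identification and validation tools from Table 1.
# Command names and arguments are assumptions; check each tool's documentation.
import subprocess

def run(cmd):
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout

def identify_then_validate(path):
    # Identification with fido (assumed to accept a file path argument).
    _, fido_output = run(["fido", path])
    # Validation and characterisation with JHOVE (assumed to accept a file path).
    _, jhove_output = run(["jhove", path])
    return {"identification": fido_output, "validation": jhove_output}
```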