A simple file extension is a good start, but it must be confirmed that the file is correctly named, and which variation, version or sub-type has been used within the structure of the file. Microsoft Word files were famously incompatible between versions and yet used the same file extension (.doc) [6]. OPF maintains a tool, FIDO, that performs file identification, and The National Archives of the UK maintains an equivalent tool called DROID. The principal way these tools work is to look inside the file for signature strings of data. For example, a PDF file begins with the tell-tale string "%PDF-1.2" or similar, where the number indicates the version of the PDF specification. The main challenge in identifying files with this technique is maintaining the list of signature strings. Most of the tools, and indeed virtually every organisation, rely upon the authoritative database of file signatures called PRONOM, which is maintained by The National Archives of the UK.
In software engineering it is generally held that the standard case is relatively easy: clearly identifiable common cases work well (the so-called 'happy path'). It is when things are unusual, different, malformed or erroneous that most of the work arises. This is very much the case in file identification too.
One significant complication is the idea of a container. A container is a file type that contains other files, the most common being a zip file. Identifying the container may be straightforward, but it is not sufficient, since much or all of the information is in the files wrapped up inside. The container must be expanded, and the individual items it contains identified and made available for further analysis. A simple zip may hold one or more files of one or more types, and a directory or folder structure that may itself provide crucial context and metadata. (Filenames may be duplicated across different branches of a directory inside a zip file.)
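
As a rough illustration of signature-based identification and of 'exploding' a container, the following Python sketch checks the leading bytes of a file against a tiny signature table and, when it finds a zip, identifies each item inside it recursively. The table, the labels and the function names (`identify`, `identify_tree`) are assumptions made for this example; they do not reflect how FIDO, DROID or PRONOM are implemented, and real signature matching is considerably more detailed.

```python
import io
import zipfile

# Illustrative signature table only. Real tools draw on PRONOM, which
# holds far more signatures and more precise matching rules.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(data: bytes) -> str:
    """Return a format label based on the leading bytes of the data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            if label == "PDF":
                # The header continues with the version, e.g. %PDF-1.4
                version = data[5:8].decode("ascii", errors="replace")
                return f"PDF {version}"
            return label
    return "unknown"


def identify_tree(data: bytes, path: str = "") -> dict:
    """Identify the data and, if it is a zip, every item inside it."""
    results = {path or "<root>": identify(data)}
    if data.startswith(b"PK\x03\x04"):
        try:
            archive = zipfile.ZipFile(io.BytesIO(data))
        except zipfile.BadZipFile:
            # Looks like a zip but cannot be opened: report what we know.
            return results
        with archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Keep the full internal path: the same filename may
                # appear in several folders of one archive.
                inner = f"{path}/{info.filename}" if path else info.filename
                results.update(identify_tree(archive.read(info), inner))
    return results
```

Calling `identify_tree` on the bytes of a zip returns a dictionary that maps each internal path to a format label, so two files named `report.pdf` in different folders remain distinct entries.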
Probably one of the more complex containers is a complete disk image. The image itself may be the item to be saved, but all of the sub-items within that image have to be identified before each can be validated and characterised. Clearly some of the items on a disk image could themselves be containers, perhaps even further disk images, so the process must be recursive. Happily, the tool BitCurator has been developed specifically to assist with disk images, as this is a complex form of 'exploding the container'.
The process of identification provides a certain amount of information about a 'thing'. It may be as simple as a statement of the file type, variety and version, or as much as a complete map of a complex container. All of this information must be recorded as metadata and added to the collection of metadata associated with the digital thing.
If the file is found to be broken at this stage, or to be of an undesirable file type, the process can be extended to transform the format of the thing. Usually this involves making a copy of the original, perhaps in a different format. Once that has been done, the original and the transformed copy should be compared to demonstrate that any differences are deliberate, and those differences recorded with the metadata. OPF looks after a tool (xCorrSound) that does this for certain audio formats.
Once the 'thing' has been identified, we move on to check whether it is well-formed and complies with the rules for that format (Validation), and to find out more details such as author, creation time, the tools used to create it and more (Characterisation). Validation and Characterisation are logically separate processes that both require the structure of the thing to be opened up and examined, or parsed. That opening up requires an understanding of the low-level structure used in the thing. However, the two tasks look at very different aspects.
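
The split between the two tasks can be sketched in a few lines of Python: one parsing step opens the file up, and validation and characterisation then ask different questions of the same parsed result. This is a deliberately minimal sketch that looks only at the PDF header and the end-of-file marker. The names `ParsedPdf`, `parse`, `validate` and `characterise` are hypothetical, and a real validator or characteriser inspects far more of the file.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedPdf:
    """The few low-level facts this sketch extracts from a PDF byte stream."""
    version: Optional[str]
    has_eof_marker: bool
    size_bytes: int


def parse(data: bytes) -> ParsedPdf:
    """Open up the file once; both later steps reuse this result."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    return ParsedPdf(
        version=header.group(1).decode("ascii") if header else None,
        # A PDF is expected to carry an %%EOF marker at the end of the file.
        has_eof_marker=b"%%EOF" in data[-1024:],
        size_bytes=len(data),
    )


def validate(parsed: ParsedPdf) -> list:
    """Validation: does the file follow the rules? Returns problems found."""
    problems = []
    if parsed.version is None:
        problems.append("missing %PDF-x.y header")
    if not parsed.has_eof_marker:
        problems.append("missing %%EOF marker")
    return problems


def characterise(parsed: ParsedPdf) -> dict:
    """Characterisation: what can be said about the file? Returns metadata."""
    return {
        "format": "PDF",
        "version": parsed.version,
        "size_bytes": parsed.size_bytes,
    }
```

An empty list from `validate` means only that these two checks passed, not that the file is a valid PDF. The dictionary from `characterise` is the kind of information that would be added to the metadata for the item.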
The key tool used for validation and characterisation is OPF's JHOVE. Whilst the principle applies across all file types, each file type requires a specific process. Validating a text-based format such as PDF [7] requires very different logic and coding from validating an image format such as JPEG [8] (JPG). Validation is confirming that the file conforms to the current understanding of the format. Typically that means checking every aspect of the format and the file's compliance with that standard. OPF recently took part in the PREFORMA project to develop a new PDF/A validator, veraPDF. Working in association with the PDF Association and adhering tightly to the standards for PDF/A, the project produced not only a validation tool that looked into more corners of the specification than ever before, but also a comprehensive body of test data (a test corpus).
Modern file formats have developed over a number of years, and many are remarkably complex. While this makes those formats very functional, it adds complexity to validation, since few implementations of any given specification will be identical. Again, files that comply are straightforward; the complexity arises when there is a real or technical non-compliance. Any given non-compliance may not be severe enough to prevent a future format reader making sense of the document, but can we be sure of that? Does a 'format non-compliance' mean the same for all types of content? For example, the definition of a colour palette may be incomplete in a document: no problem for a black and white text page, but a massive issue for a high-definition picture of fine art. It may be that today's rendering tool automatically compensates for the colour palette, or the local language settings, or any one of a range of contexts set in the file. One format that most of us assume to be safest, PDF, is defined by a particularly complex specification.
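
The point that the same non-compliance can matter very differently depending on the content can be pictured as a small lookup, shown here in Python. Everything in it is hypothetical: the issue and content-type names, the severity labels and the table itself are invented for illustration and are not drawn from any tool or standard. In practice such decisions rest with the librarian or archivist.

```python
# Hypothetical policy table: (issue, content type) -> severity.
# The entries are examples for illustration, not recommendations.
POLICY = {
    ("incomplete_colour_palette", "text_page"): "ignore",
    ("incomplete_colour_palette", "fine_art_image"): "critical",
}


def assess(issue: str, content_type: str) -> str:
    """Look up how serious an issue is for this kind of content.

    Combinations nobody has decided on yet fall back to "review",
    so that a person makes the call rather than the tool.
    """
    return POLICY.get((issue, content_type), "review")
```

Falling back to "review" for unknown combinations is a design choice in this sketch: it avoids silently treating an unassessed problem as harmless.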
There are numerous common errors in files, when compared to the format specifications, that today's renderers ignore or pass over but that in theory could cause major issues in the future. It comes back to the librarian or archivist to determine the criticalities for the type of content that they are
Chapter 1. Martin Wrigley, Becky McGuinness, Carl Wilson: The Open Preservation Foundation Reference Toolset
[6] The DOC format may be one of four different formats on a PC, from different versions of Word: Word for DOS, Word for Windows, Word 6 and Word 97. Since Word 2007 the .docx format has been used.
[7] Portable Document Format (PDF) is a file format developed in the 1990s to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it. Since 2008 PDF has been standardised as a royalty-free open format, ISO 32000 (and others).
[8] JPEG (Joint Photographic Experts Group): an ISO format for images.


Jaarboeken Stichting Archiefpublicaties, 2018, page 26