For all media types, literal or nearly literal computerized search is solved. Genres come second, well before object similarities. For visual and audio material, object and subject similarity lags behind because of the huge variety possible in the appearance of an object. Semantic similarity is currently the hardest, but context may provide some clues here.

Discussion

At the end of this journey through the landscape of multimedia information analysis, we summarize the main issues. The prime motivation for introducing automation in the generation of metadata is that an all-digital recording process and post-process will enable faster re-use. Automatic analysis is an essential factor in meeting present requirements. Our view is that new technology is always first accepted in the old idiom. Computer-aided systems should not strive for a completely automatic imitation of the current manual process, nor should they strive towards a system designed in splendid isolation, since both these approaches will yield unworkable methods. We put forward the importance of understanding some of the peculiarities of the current methods as well as the importance of current machine performance in designing a reasonable process.

Whereas humans make an instant and precise semantic assessment of a scene, machines cannot and will not be able to do so in the foreseeable future, neither for visual information nor for audio information. Text information may stand some chance of automatic annotation, provided it has been acquired as text and not derived from visual or audio information. It will be a long time before machine annotation achieves precision or perfection. And, as we have argued above, since machines lack insight into context, it is essential that the computer analysis of multimedia is broad. Hence, machine analysis may be sloppy on individual items while its identification of the target may still be precise.
This is a radical move away from the current practice, in which sloppy indices are a nuisance. There are enough signs that computer-aided handling of video will bring annotation and search much closer to each other than they are in current practice. Whereas annotation is now in the hands of the experts and search in the hands of the users, annotation is likely to differentiate into levels of accuracy, from instant annotation by users, supplemented by sloppy probabilistic annotation by machines, to precision annotation by experts. Interactive search may involve ad-hoc annotation and ad-hoc machine learning. In the new archive, a mark of quality for each annotated item is an important asset.

A long-term goal in querying is a system that can reconstruct the information needs of the user by building up experience with users, by semantic understanding of the content of the archive, and by generating the most informative question to enable the machine to learn from the user. Another long-term goal is to present the information with high density in a natural and bilateral dialogue with the user. Research is being done on almost all of these topics but, as yet, in isolation. There will be room for improvement in video handling systems for many years to come, and developers in several IT domains will be keen to collaborate with media archives.

We shall discuss two highly promising areas, topic clustering and video retrieval, both organized around international benchmark events, from which interesting results can already be obtained. As mentioned in section 5, topic clustering is an information access task that organizes news items in clusters corresponding to the topics discussed. The result can be regarded as a partition of the corpus in which each news item is assigned to a 'dossier' representing a topic.
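The idea of a partition into dossiers can be illustrated with a minimal sketch. It assumes a very simple single-pass scheme based on word counts and cosine similarity; the function names, the similarity threshold, and the sample news items are all hypothetical, and the systems evaluated at the benchmark are considerably more sophisticated.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_into_dossiers(items, threshold=0.3):
    """Assign every item to exactly one dossier (a partition of the corpus).

    Assumed single-pass scheme: an item joins the most similar existing
    dossier, or opens a new one if no dossier reaches the threshold.
    """
    dossiers = []  # each dossier: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(items):
        vector = bag_of_words(text)
        best, best_score = None, threshold
        for dossier in dossiers:
            score = cosine(vector, dossier["centroid"])
            if score >= best_score:
                best, best_score = dossier, score
        if best is None:
            dossiers.append({"centroid": Counter(vector), "members": [index]})
        else:
            best["centroid"].update(vector)  # summed counts act as the centroid
            best["members"].append(index)
    return [d["members"] for d in dossiers]


# Hypothetical news items, invented for illustration only.
news = [
    "Volcano eruption forces evacuation of villages near the crater",
    "Parliament votes on the new budget proposal",
    "Ash cloud from the volcano eruption disrupts flights",
    "Budget proposal passes parliament after long debate",
]
print(cluster_into_dossiers(news))  # prints [[0, 2], [1, 3]]
```

Every item index appears in exactly one list, which is what makes the result a partition. The threshold controls granularity: a higher value yields more, narrower dossiers.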
The state of the art is demonstrated at the annual Topic Detection and Tracking meeting, a benchmark event organized by the National Institute of Standards and Technology (NIST) [Wayne 2000]. In combination with automatic classification, topic clustering can help to organize large archives and to build tools that allow users to browse through information dossiers containing items in a variety of formats, for example all newspaper articles, TV news items, and radio broadcasts on the eruption of a particular volcano. The technology is applicable to textual archives and dynamic news streams, but also to transcribed speech. A technique recently taken up in the Topic Detection and Tracking evaluation program is hierarchical topic clustering. The aim is to organize a collection of unstructured news data in a structure that reflects the topics discussed, ranging from rather coarse category-like nodes to fine singular events. With this technique, browsing can be supported at levels of granularity that can be tuned to user needs [Trieschnigg 2005].

The state of the art in video retrieval is best represented by the video benchmark TRECVID, also organized by NIST. This benchmark evaluates various components required for retrieval of video shots from an archive of 184 hours of news video. Tasks range from shot segmentation to story segmentation, concept detection, interactive search, and automatic search. Teams from around the world submit their detection and retrieval results. These are then manually judged by a set of experts, providing the underlying facts against which the individual systems and approaches can be compared. In typical modern systems competing in TRECVID, several methodologies are employed to build basic detectors.
Natural language processing is used to read in the text stream and Video OCR to read overlay text, and these are coupled with automatic speech recognition, identification of a very limited number of speakers, style recognition, face detection (but no face recognition, as it performs very poorly as yet), shot length, camera distance, weak segmentation using invariant color descriptors, and other techniques [Snoek 2004]. These basic detectors are used in turn to derive higher-level concept detectors such as boat/ship, Bill Clinton, Madeleine Albright, people walking or running, and physical violence. The reliability of the various basic detectors ranges from poor to high. In spite of their sometimes weak performance, they are all of help in searching a digital video archive. Recent additions to the basic and high-level detectors include the detection of concepts by machine learning from large data sets, and a set of detectors ordered in an ontology of visual key elements (in addition to the established ontologies for text).

CATALOGUS 112 ARNOLD W.M SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING MULTIMEDIA INFORMATION TECHNOLOGY AND THE ANNOTATION OF VIDEO 113
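How unreliable basic detectors can still help in finding a higher-level concept may be illustrated with a small sketch. It assumes the simplest possible combination, a weighted average of detector confidences followed by ranking; this is not the method of any particular TRECVID system, and the detector names, weights, and scores below are invented for illustration.

```python
def fuse_detectors(basic_scores, weights):
    """Weighted average of basic detector confidences, each in [0, 1].

    Detectors missing from basic_scores are treated as confidence 0.0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(weights[name] * basic_scores.get(name, 0.0) for name in weights)
    return weighted / total_weight


def rank_shots(shots, weights):
    """Return shot ids ordered from most to least likely to show the concept."""
    scored = {shot_id: fuse_detectors(scores, weights) for shot_id, scores in shots.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical weights for a "boat/ship" concept detector (assumed values).
boat_weights = {
    "speech_mentions_boat": 0.5,   # from automatic speech recognition
    "overlay_mentions_boat": 0.3,  # from Video OCR on overlay text
    "water_colored_region": 0.2,   # from weak color segmentation
}

# Hypothetical confidences produced by the basic detectors for three shots.
shots = {
    "shot_1": {"speech_mentions_boat": 0.9, "overlay_mentions_boat": 0.0, "water_colored_region": 0.8},
    "shot_2": {"speech_mentions_boat": 0.1, "overlay_mentions_boat": 0.0, "water_colored_region": 0.9},
    "shot_3": {"speech_mentions_boat": 0.2, "overlay_mentions_boat": 0.1, "water_colored_region": 0.1},
}

print(rank_shots(shots, boat_weights))  # prints ['shot_1', 'shot_2', 'shot_3']
```

No single detector decides the outcome: a shot rises in the ranking only when several weak cues agree. In practice the weights would be learned from annotated examples rather than set by hand.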


Jaarboeken Stichting Archiefpublicaties | 2005 | page 58