In computer vision, large volumes of data have only recently become an issue. Until the mid-nineties, computer vision programs were tested on fewer than 100 images, as opposed to the thousands in use today. As a by-product of this growth, test data are no longer perfect, and as a consequence computer programs have become more robust and better able to cope with many sources of variation. Nevertheless, for video archives still larger test collections are needed, since archives typically contain millions of single frames.

Computer vision starts with good features: features capable of describing the semantics of the scene and the object, while ignoring the irrelevant circumstances of the recording. An object comes in a million different appearances. This is known as the sensory gap, which comes on top of the semantic gap discussed before. Good features are invariant to the accidental conditions of the recording, while they accurately record the semantically relevant differences between objects [Smeulders 2000, Schmid 2004]. A first sketch of such an invariant feature is given at the end of this passage.

Language is the most direct carrier of semantic content. Hence, for the generation of meta-data there is always a strong interest in the deployment of linguistic material, such as text and speech accompanying media content. The role of speech recognition is the focus of the next section. Here we describe the potential contribution of the field of natural language processing (NLP) to the processing of textual elements in media archives.

There are various ways in which video archiving can benefit from natural language processing. In order to describe the various roles, we should distinguish between textual material that is part of the broadcast item proper (such as subtitle files for productions in a foreign language), manually generated transcripts and the like, collateral texts (such as reviews, scripts, and other production files), and related sources such as newspaper articles.

The role of natural language processing in the handling of subtitle files and transcripts is straightforward. In the current state of affairs, it may contribute to comprehension of the content of the text. Since textual elements are linked to the temporal structure of the video, they can be used to generate a time-coded index that allows for the searching of video fragments; a second sketch below illustrates such an index. As is common practice in natural language processing these days, reducing words to their stem, stop-word removal, and disambiguation are techniques that enhance the generation of indices; how much they improve the result depends on the nature of the text. Cross-language retrieval, i.e., searching in language A for information in language B, can be offered when translation functionality is built in [Dejong 2000]. These are examples of language processing facilities that have been proved effective by information retrieval research.

Examples of other text processing techniques that can be employed for more advanced access to the content of media archives are automatic topic classification, automatic topic segmentation, automatic clustering of documents, automatic summarization, named entity recognition, and information extraction. Many of these techniques rely heavily on statistical language models; a third sketch below shows such a model at work in topic classification. The recent application of domain models to search tasks, such as ontologies and thesauri, is expected to be of importance in the media domain as well, and not just for mere conceptual search. The use of domain models is also important for enabling cross-media search, interest in which is increasing now that archives and collections that have functioned in isolation for decades are being linked; a final sketch below illustrates this use of a thesaurus.
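To make the notion of an invariant feature concrete, here is a minimal sketch (in Python, with invented toy images) of a colour descriptor that is invariant to the overall intensity of the illumination: a histogram of normalised rgb chromaticities. The function name and the example data are illustrative assumptions, not part of any system discussed in the text.

```python
# A minimal sketch of an illumination-invariant colour feature.
# Normalised rgb chromaticity discards overall intensity, so a scene
# recorded in the dark and the same scene recorded brightly lit map
# to (nearly) the same descriptor.
import numpy as np

def chromaticity_histogram(image, bins=8):
    """Histogram of normalised r,g chromaticities (b is redundant)."""
    rgb = image.reshape(-1, 3).astype(float)
    intensity = rgb.sum(axis=1, keepdims=True)
    intensity[intensity == 0] = 1.0           # avoid division by zero
    chroma = rgb / intensity                  # invariant to intensity scaling
    hist, _, _ = np.histogram2d(chroma[:, 0], chroma[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()

# The same object under bright light and in the dark (30% intensity):
bright = np.random.randint(50, 200, size=(64, 64, 3))
dark = (bright * 0.3).astype(int)

h1 = chromaticity_histogram(bright)
h2 = chromaticity_histogram(dark)
print(np.abs(h1 - h2).max())  # close to 0: the descriptor (nearly) ignores the lighting change
```

The design choice is the one named in the text: the descriptor rules out an unwanted variation (recording in the dark) while pixel colours themselves still discriminate among truly different objects.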
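The time-coded index itself can be sketched as follows, assuming subtitles in the common SRT format. The stop-word list and the deliberately naive suffix-stripping "stemmer" are toy stand-ins for the real resources (e.g., a Porter stemmer) a production system would use.

```python
# A minimal sketch of a time-coded index built from an SRT subtitle file,
# with stop-word removal and naive stemming, so that queries can jump
# straight to video fragments.
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}
TIMECODE = re.compile(r"(\d\d):(\d\d):(\d\d)[,.](\d{3})")

def crude_stem(word):
    # Toy stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(srt_text):
    """Map each stemmed content word to the start times (in seconds)
    of the subtitle blocks in which it occurs."""
    index = defaultdict(list)
    for block in srt_text.strip().split("\n\n"):
        match = TIMECODE.search(block)        # first timecode = start time
        if match is None:
            continue
        h, m, s, ms = (int(g) for g in match.groups())
        start = 3600 * h + 60 * m + s + ms / 1000.0
        text = " ".join(l for l in block.splitlines()
                        if "-->" not in l and not l.strip().isdigit())
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                index[crude_stem(word)].append(start)
    return index

srt = """1
00:00:01,000 --> 00:00:04,000
The minister is opening the archives.
"""
print(build_index(srt))  # roughly {'minister': [1.0], 'open': [1.0], 'archiv': [1.0]}
```

Because "opening" and "archives" are reduced to their stems, a later query for "archive openings" would retrieve the same fragment, which is exactly the gain the text attributes to stemming and stop-word removal.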
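As an illustration of the reliance on statistical language models, the following sketch classifies a text by topic using one unigram model per topic with add-one smoothing. The two training snippets and topic labels are invented for the example.

```python
# A minimal sketch of statistical-language-model topic classification:
# one unigram model per topic; pick the topic whose model assigns the
# new text the highest (smoothed) likelihood.
import math
from collections import Counter

training = {
    "politics": "the minister debates the new law in parliament",
    "sports":   "the team scores a late goal and wins the match",
}

models = {topic: Counter(text.split()) for topic, text in training.items()}
vocab = set(w for counts in models.values() for w in counts)

def log_likelihood(words, counts):
    total = sum(counts.values())
    # Add-one smoothing so unseen words do not zero out the likelihood.
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

def classify(text):
    words = text.lower().split()
    return max(models, key=lambda topic: log_likelihood(words, models[topic]))

print(classify("parliament debates the law"))  # -> politics
```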
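Finally, a minimal sketch of how a thesaurus or ontology supports conceptual and cross-media search: a query term is expanded with its narrower terms, and the expanded set can then be matched against the indices of several collections at once. The toy thesaurus is an invented stand-in for a real domain model.

```python
# A minimal sketch of thesaurus-driven query expansion: a search for
# "water transport" also retrieves items indexed under "barge" or "ferry".
THESAURUS = {
    "water transport": ["ship", "barge", "ferry"],
    "ship": ["steamship", "sailing ship"],
}

def expand(term, depth=2):
    """Collect the term plus its narrower terms, up to `depth` levels down."""
    terms = {term}
    if depth > 0:
        for narrower in THESAURUS.get(term, []):
            terms |= expand(narrower, depth - 1)
    return terms

print(expand("water transport"))
# {'water transport', 'ship', 'barge', 'ferry', 'steamship', 'sailing ship'}
```

Because the expansion happens at the level of concepts rather than of any single collection's vocabulary, the same expanded query can be run against a spoken-word archive, a photo collection, and a newspaper archive, which is what makes domain models useful for linking collections that have functioned in isolation.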
Audio processing to support automated audiovisual access to content has been a topic of active study since the early nineties. Contrary to what is often assumed, speech recognition is not a (nearly) solved problem. The task can be viewed as the conversion of recorded speech into a textual transcription. The confusion about the difficulty of speech processing arises because many very different tasks of varying complexity are all labelled speech recognition. The performance and functionality of speech technologies that have been in existence for some time, e.g., spoken dialogue systems and dictation technology, are of little use in automated video annotation. Dialogue systems typically operate online, but in a narrow domain. Dictation requires training by the speaker.

Figure 5. The sensory gap in computer vision: different versions of the appearance of the single object in (a) are easily recognized by humans, whether they are recorded in the dark (b), in blue light (c), in occlusion (d), or under a different viewing angle (e). Good, invariant features describing the object should be capable of ruling out the unwanted variations in the scene while retaining the ability to discriminate among truly different objects.

