In computer vision, large volumes of data have only recently become an issue. Until the mid-nineties, computer vision programs were tested on fewer than 100 images, as opposed to the thousands being used today. As a byproduct, test data are no longer perfect; as a result, computer vision programs have become more robust and better able to cope with many sources of variation. Nevertheless, still larger test collections are needed for video archives, since archives typically contain millions of single frames.
Computer vision starts with good features, capable of describing the semantics of the scene and the object, and of ignoring the irrelevant circumstances of the recording. An object comes in a million different appearances. This is known as the sensory gap, which comes on top of the semantic gap discussed before. Good features are invariant to the accidental conditions of the recording, while accurately recording the semantically relevant differences between objects [Smeulders 2000, Schmid 2004].
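As a minimal illustration of such invariance (a textbook sketch, not the specific features proposed in the cited work), normalized-rgb chromaticity discards the overall intensity of the illumination, so the same surface recorded under bright and dim light maps to the same feature value:

```python
# Sketch: normalized-rgb chromaticity as a simple illumination-invariant
# feature. Scaling all channels by a constant (a brighter or darker
# recording of the same scene) leaves the chromaticity unchanged.
# Illustrative only; not the features of [Smeulders 2000, Schmid 2004].

def chromaticity(r, g, b):
    """Map an RGB value to intensity-invariant (r, g) chromaticity."""
    total = r + g + b
    if total == 0:
        return (1 / 3, 1 / 3)  # convention for pure black
    return (r / total, g / total)

bright = chromaticity(120, 60, 30)
dark = chromaticity(40, 20, 10)  # same surface, one third of the light
assert all(abs(a - b) < 1e-9 for a, b in zip(bright, dark))
```

Such a feature rules out variation in recording conditions (here, light intensity) while still discriminating between surfaces of genuinely different color, which is exactly the trade-off good invariant features must strike.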
Language is the most direct carrier of semantic content. Hence, for the generation of meta-data, there is always a strong interest in the deployment of linguistic material, such as text and speech, accompanying media content. The role of speech recognition is the focus of the next section. Here we describe the potential contribution of the field of natural language processing (NLP) to the processing of textual elements in media archives.
There are various ways in which video archiving can benefit from natural language processing. In order to describe the various roles, we should distinguish between textual material that is part of the broadcast item proper (such as subtitle files for productions in a foreign language), manually generated transcripts and the like, collateral texts (such as reviews, scripts, and other production files), and related sources such as newspaper articles.
The role of natural language processing in handling subtitle files and transcripts is straightforward. In the current state of affairs, it may contribute to comprehension of the content of the text. Since textual elements are linked to the temporal structure of the video, they can be used to generate a time-coded index that allows for the searching of video fragments. As is common practice in natural language processing these days, stemming (reducing words to their stem), stop-word removal, and word-sense disambiguation are techniques that enhance the generation of indices, usually improving the result depending on the nature of the text.
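A minimal sketch of such a time-coded index might look as follows; the stop-word list and the crude suffix-stripping "stemmer" are illustrative placeholders for the proper stemming and disambiguation techniques discussed above:

```python
# Sketch of a time-coded index built from subtitle lines.
# The stop-word list and the suffix-stripping "stemmer" are toy
# stand-ins for real NLP components, for illustration only.

STOP_WORDS = {"the", "a", "an", "is", "of", "in", "on", "and", "to"}

def stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(subtitles):
    """Map each stemmed content word to the time codes where it occurs.

    subtitles: list of (time_code_in_seconds, text) pairs.
    """
    index = {}
    for time_code, text in subtitles:
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word and word not in STOP_WORDS:
                index.setdefault(stem(word), []).append(time_code)
    return index

index = build_index([(12.0, "The reporter is interviewing the minister"),
                     (47.5, "The minister announced new reforms")])
assert index["minister"] == [12.0, 47.5]
```

A query for a content word then returns the time codes of the video fragments in which it occurs, which is what makes searching within a broadcast, rather than merely retrieving whole broadcasts, possible.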
Cross-language retrieval, i.e., searching in language A for information in language B, can be offered when translation functionality is built in [Dejong 2000]. These are examples of language processing facilities that have proved effective in information retrieval research.
Examples of other text processing techniques that can be employed for
more advanced access to the content of media archives are: automatic topic
classification, automatic topic segmentation, automatic clustering of
documents, automatic summarization, named entity recognition, and
information extraction. Many of these techniques rely heavily on statistical
language models.
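To make the role of statistical language models concrete, the following sketch applies a smoothed unigram model to one of the tasks listed above, topic classification; the tiny training texts and topic labels are invented for illustration:

```python
# Minimal sketch of a unigram statistical language model applied to
# topic classification: each topic is modeled by add-one-smoothed word
# frequencies, and a new document is assigned to the topic under which
# it is most probable. Training data are invented for illustration.
import math
from collections import Counter

def train_unigram(docs):
    """Estimate add-one-smoothed log-probabilities of words for one topic."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen words
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

def classify(text, models):
    """Pick the topic whose unigram model assigns the text highest probability."""
    words = text.lower().split()
    return max(models, key=lambda topic: sum(models[topic](w) for w in words))

models = {
    "sports": train_unigram(["the team won the match", "goal in the game"]),
    "politics": train_unigram(["the minister won the vote", "debate in parliament"]),
}
assert classify("the team scored a goal", models) == "sports"
```

The same modeling idea, estimating word probabilities per class or per segment, underlies several of the other techniques listed, such as topic segmentation and document clustering.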
The recent application of domain models, such as ontologies and thesauri, to search tasks is expected to be of importance in the media domain as well. This goes beyond mere conceptual search: domain models are also important for enabling cross-media search, an interest that is increasing now that archives and collections that have functioned in isolation for decades are being linked.
Audio processing to support automated audiovisual access to content has been a topic of active study since the early nineties. Contrary to what is often assumed, speech recognition is not a (nearly) solved problem. The task can be viewed as the conversion of recorded speech into a textual transcription. The confusion about the difficulty of speech processing arises because many very different tasks of varying complexity are all labeled as speech recognition. The performance and functionality of speech technologies that have been in existence for some time, e.g., spoken dialogue systems and dictation technology, are of little use in automated video annotation. Dialogue systems typically operate
online but in a narrow domain. Dictation requires training of speaker
ARNOLD W.M. SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING, MULTIMEDIA INFORMATION TECHNOLOGY AND THE ANNOTATION OF VIDEO
Figure 5. The sensory gap in computer vision: Different versions of the appearance of the
single object in (a) are easily recognized by humans whether they are recorded in the dark (b),
in blue light (c), in occlusion (d), or under a different viewing angle (e). Good, invariant
features describing the object should be capable of ruling out the unwanted variations in
the scene while retaining the ability to discriminate among truly different objects.