For all media types, literal or nearly literal computerized search is solved. Genres
come second, well before object similarities. For visual and audio material, object
and subject similarity lag behind because of the huge variety possible in the
appearance of an object. Semantic similarity is currently the hardest,
but context may provide some clues here.
Discussion
At the end of this journey through the landscape of multimedia information
analysis, we summarize the main issues.
The prime motivation for introducing automation in the generation of metadata
is that an all-digital recording process and post-process will enable faster re-use.
Automatic analysis is an essential factor in meeting present requirements.
Our view is that new technology is always first accepted in the old idiom.
Computer-aided systems should not strive for a completely automatic imitation
of the current manual process, nor should they strive towards a system designed
in splendid isolation, since both these approaches will yield unworkable
methods. We put forward the importance of understanding some of the
peculiarities of the current methods as well as the importance of current
machine performance in designing a reasonable process.
Whereas humans make an instant and precise semantic assessment of a scene,
machines cannot and will not be able to do so in the foreseeable future - neither
for visual information nor for audio information. Text information may stand
some chance of automatic annotation provided it has been acquired as text and
not from visual or audio information. It will be a long time before machine
annotation achieves precision or perfection. And, as we have argued above, since
machines lack insight into context, it is essential that the computer analysis of
multimedia is broad. Hence, their analysis may be sloppy on individual items
while their identification of the target may still be precise. This is a radical move
away from the current practice where sloppy indices are a nuisance.
There are enough signs that computer-aided handling of video will bring
annotation and search much closer to each other than the current practice.
Whereas annotation is now in the hands of the experts and search in the hands
of the users, annotation is likely to differentiate in levels of accuracy, from
instant annotation by users supplemented by sloppy probabilistic annotation by
machines, to precision annotation by experts. Interactive search may involve
ad-hoc annotation and ad-hoc machine learning. In the new archive, a mark of
quality for each annotated item is an important asset.
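As an illustration only, the mark of quality can be pictured as a field stored with every annotation. In the sketch below, the class name `Annotation`, the three source labels, and the numeric quality scale are all assumptions made for the example; they do not describe an existing archive system.

```python
from dataclasses import dataclass

# Hypothetical source labels, ordered from least to most trusted.
SOURCES = ("machine", "user", "expert")


@dataclass(frozen=True)
class Annotation:
    item_id: str    # identifier of the archived item (assumed format)
    label: str      # the annotation itself, e.g. a keyword
    source: str     # one of SOURCES
    quality: float  # assumed scale: 0.0 (guess) to 1.0 (verified)

    def __post_init__(self):
        # Validate only; the record itself stays immutable.
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie in [0, 1]")


def filter_by_quality(annotations, minimum):
    """Keep only annotations whose quality mark reaches the minimum."""
    return [a for a in annotations if a.quality >= minimum]
```

With such a field, a search can include sloppy machine annotations when recall matters and restrict itself to expert annotations when precision matters.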
A long-term goal in querying is a system that can reconstruct the information
needs of the user by building up experience with users, by semantic
understanding of the content of the archive, and by generating the most
informative question to enable the machine to learn from the user. Another
long-term goal is to present the information with high density in a natural and
bilateral dialogue with the user. Research is being done on almost all topics, but,
as yet, in isolation. There will be room for improvement in video handling
systems for many years to come, and developers in several IT domains will be
keen to collaborate with media archives. We shall discuss two highly promising
areas, topic clustering and video retrieval, both organized around international
benchmark events, from which interesting results can already be obtained.
As mentioned in section 5, topic clustering is an information access task that
organizes news items in clusters corresponding to the topics discussed. The
result can be regarded as a partition of the corpus in which each news item is
assigned to a 'dossier' representing a topic. The state-of-the-art is demonstrated
at the annual Topic Detection and Tracking meeting, a benchmark event
organized by the National Institute of Standards and Technology (NIST) [Wayne
2000].
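The idea of a partition into dossiers can be sketched in a few lines. The example below is not the method used by any benchmark participant; it assumes a simple single-pass scheme with bag-of-words vectors and cosine similarity, and the function names and the threshold of 0.3 are choices made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a deliberately crude text representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_topics(items, threshold=0.3):
    """Assign each item to exactly one dossier, so the result is a partition.

    An item joins the most similar existing dossier if the similarity reaches
    the threshold (an assumed value); otherwise it starts a new dossier.
    """
    dossiers = []  # each dossier: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(items):
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for dossier in dossiers:
            sim = cosine(vec, dossier["centroid"])
            if sim > best_sim:
                best, best_sim = dossier, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # summed counts act as the centroid
            best["members"].append(index)
        else:
            dossiers.append({"centroid": Counter(vec), "members": [index]})
    return [d["members"] for d in dossiers]
```

Because items are processed in order of arrival, the same scheme applies in principle to a dynamic news stream, although the outcome then depends on that order.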
In combination with automatic classification, topic clustering can help to
organize large archives, and to build tools that allow users to browse through
information dossiers containing items in a variety of formats, for example all
newspaper articles, TV news items, and radio broadcasts on the eruption of a
particular volcano.
The technology is applicable to textual archives and dynamic news streams, but
also to transcribed speech. A technique recently taken up in the Topic Detection
and Tracking evaluation program is hierarchical topic clustering. The aim is to
organize a collection of unstructured news data in a structure that reflects the
topics discussed, ranging from rather coarse category-like nodes to fine singular
events. With this technique, browsing can be supported at levels of granularity
that can be tuned to user needs [Trieschnigg 2005].
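How one hierarchy can serve several levels of granularity may be easier to see in a small example. The sketch below assumes bottom-up (agglomerative) clustering with average linkage and word-set overlap as the similarity measure; these are illustrative choices, not the techniques evaluated in the benchmark, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def hierarchical_clusters(items, levels):
    """Merge the two most similar clusters until one remains.

    Returns the partition at each requested number of clusters, so the same
    collection can be browsed coarsely (few clusters) or finely (many).
    """
    clusters = [[i] for i in range(len(items))]
    snapshots = {}
    while clusters:
        if len(clusters) in levels:
            snapshots[len(clusters)] = sorted(sorted(c) for c in clusters)
        if len(clusters) == 1:
            break
        # Find the pair of clusters with the highest average similarity.
        best_pair, best_sim = None, -1.0
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sims = [jaccard(items[i], items[j])
                        for i in clusters[x] for j in clusters[y]]
                avg = sum(sims) / len(sims)
                if avg > best_sim:
                    best_pair, best_sim = (x, y), avg
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return snapshots


news = [
    "iceland volcano eruption",
    "iceland volcano eruption grounds flights",
    "indonesia volcano eruption",
    "election results france",
]
views = hierarchical_clusters(news, levels={2, 3})
# views[3] separates the two eruptions; views[2] groups all volcano items.
```

The coarse view behaves like a category, while the finer view isolates single events.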
The state of the art in video retrieval is best represented by the video benchmark
TRECVID, also organized by the National Institute of Standards and Technology. This benchmark
evaluates various components required for retrieval of video shots from an
archive of 184 hours of news video. Tasks range from shot segmentation to story
segmentation, concept detection, interactive search, and automatic search.
Teams from around the world submit their detection and retrieval results. These
are then manually judged by a panel of experts, whose judgments provide the ground
truth against which the individual systems and approaches can be compared.
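To make the comparison against expert judgments concrete, the sketch below scores a ranked list of shots with average precision, a widely used measure for ranked retrieval. The text above does not name the measure, so its use here is an assumption, as are the function names and the shot identifiers.

```python
def average_precision(ranked_shots, relevant_shots):
    """Average precision of one ranked list against judged-relevant shots.

    Assumes the ranked list contains no duplicate shots.
    """
    relevant = set(relevant_shots)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of average precision over all judged topics for one system."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical submission for one search topic and its expert judgments.
run = {"topic_1": ["s3", "s1", "s7", "s2"]}
truth = {"topic_1": {"s1", "s2"}}
score = mean_average_precision(run, truth)
```

A measure of this kind rewards systems that place relevant shots near the top of the list, which matches how a user inspects search results.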
In typical modern systems competing in TRECVID, several methodologies are
employed to build basic detectors. Natural language processing is used to read in
the text stream and Video OCR to read overlay text, and these are coupled with
automatic speech recognition, identification of a very limited number of
speakers, style recognition, face detection (but no face recognition as it performs
very poorly as yet), shot length, camera distance, weak segmentation using
invariant color descriptors, and other techniques [Snoek 2004]. These basic detectors
are used in turn to derive higher-level concept detectors such as boat/ship, Bill Clinton,
Madeleine Albright, people walking or running, and physical violence.
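The step from basic detectors to a higher-level concept detector can be illustrated with a simple score combination. The text does not say how the derivation is done; the sketch below assumes a weighted sum passed through a logistic function, and the detector names, weights, and bias are invented for the example.

```python
import math


def fuse(detector_scores, weights, bias=0.0):
    """Combine basic detector scores (each in [0, 1]) into one concept score.

    The weights would normally be learned from annotated examples; here they
    are hand-set. Detectors without a weight are ignored.
    """
    total = bias + sum(weights.get(name, 0.0) * score
                       for name, score in detector_scores.items())
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical detector outputs for one shot and hand-set weights.
shot = {"face_detection": 0.9, "long_camera_distance": 0.2, "overlay_text": 0.1}
weights = {"face_detection": 2.0, "long_camera_distance": -1.0, "overlay_text": 0.5}
concept_score = fuse(shot, weights, bias=-1.0)
```

Because every detector contributes only a weighted share, a single unreliable detector shifts the concept score without deciding it on its own.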
The reliability of the various basic detectors ranges from poor to high. In
spite of their sometimes weak performance, they are all of help in searching a
digital video archive. Recent additions to the basic and high-level detectors
include the detection of concepts by machine learning from large data sets, and
a set of detectors ordered in an ontology of visual key elements (in addition to
the established ontologies for text).
CATALOGUS
ARNOLD W.M SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING MULTIMEDIA INFORMATION
TECHNOLOGY AND THE ANNOTATION OF VIDEO