Knowledge can be acquired by formalization, but more success has been achieved
by learning rules from large datasets. In effect, a general rule of machine
learning is: the more specific, the larger, and the more reliable the datasets, the
better the result. More importantly, when learning from realistic datasets, the
result is also more robust, being able to cope with non-ideal circumstances.
Modern natural language processing frequently uses techniques from the area of
information retrieval to capture the content of a message. Likewise, modern
computer vision frequently uses machine learning techniques and statistical
pattern recognition to understand the content of a scene.
When designing real systems, a few aspects of the state of the art in system
technology need to be considered.
Proper choice of formats guarantees added value in the ease of exchange as well
as in proper storage. Formats cast a long shadow into the future as new systems
have to adapt to the old formats to be useful. Therefore, a new format has to be
selected with care, although even then it is hard to predict which formats will
become popular. Databases are useful for preserving information while
delivering optimal handling speed. Truly multimedia databases with
integrated formal knowledge descriptors of multimedia are a hot topic of
research.
Computer-aided video archives demand enormous computing and storage
capacity to handle a stream of video data. A text stream is relatively condensed in
its semantic content, but learning facts from text streams requires large datasets,
which in turn require large computing power. Analysis of the audio signal
requires more power, but real-time or near-real-time processing of the visual
component is the most demanding. Computing power will continue to be an
important consideration in practical video analysis for some time to come. The
solution to the storage and computing capacity needed for archiving and
learning lies in grid computing, that is, Internet-based distributed processing power.
Interaction is the key to the user and hence to the system. Interaction is still
poorly developed. Interrogation encompasses solicitation of the search by
specification, browsing, analogy, or question and answer. Any interaction
requires carefully designed presentation of the result, which, in the case of video,
requires various kinds of summarization since the screen offers only limited
space. The interactive component of a system will be useful only when it is
able to remember the preferred behavior as well as the preferred
presentation, as learned by the system from previous
sessions. On the threshold of high-speed wireless technology, there is ample
opportunity to insert the meta-data at the production site.
Interacting with video archives
Interaction is an essential ingredient in any video archival system. It can serve
both the video archivist annotating the wealth of information and the user
accessing the archive. In the future these functions will merge, since a
digital archive will eventually learn from the pattern of interaction of the users,
as well as from user annotations of the data.
To assist the archivist, the aim is to limit the time needed for the annotation
work. The major assumption underlying tools for this purpose is that similar
video content is likely to have the same annotation. Hence, after the archivist
has provided some initial annotations, the system can provide collections of
similar items that have a high probability of having the same annotation. By
manually filtering out the small percentage of incorrectly labeled items, the
archivist can completely annotate collections of items. This strategy for limiting
annotation time is particularly suited to simple bulk annotations. More elaborate
annotation is better performed by an expert, one item at a time.
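The bulk-annotation strategy described above can be illustrated with a minimal sketch. It assumes that each video item has already been reduced to a numeric feature vector and that cosine similarity is an acceptable measure of "similar content"; both are assumptions made for illustration, and the names `suggest_annotations`, `labeled`, `unlabeled`, and the threshold value are hypothetical. The system only proposes labels; the archivist still filters out the incorrect ones.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_annotations(labeled, unlabeled, threshold=0.9):
    """Propose a label for each unlabeled item that closely resembles a labeled one.

    labeled:   dict item_id -> (feature_vector, label), provided by the archivist
    unlabeled: dict item_id -> feature_vector
    Returns dict item_id -> (proposed_label, similarity) for items whose best
    match reaches the (assumed) similarity threshold.
    """
    suggestions = {}
    for item_id, vec in unlabeled.items():
        best_label, best_sim = None, 0.0
        for ref_vec, label in labeled.values():
            sim = cosine(vec, ref_vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None and best_sim >= threshold:
            suggestions[item_id] = (best_label, best_sim)
    return suggestions


# Hypothetical three-dimensional descriptors for four shots.
labeled = {
    "shot1": ([0.9, 0.1, 0.0], "studio"),
    "shot2": ([0.0, 0.2, 0.9], "outdoor"),
}
unlabeled = {
    "shot3": [0.8, 0.2, 0.0],  # close to shot1
    "shot4": [0.5, 0.5, 0.5],  # ambiguous, left for manual annotation
}
suggestions = suggest_annotations(labeled, unlabeled)
```

In this toy run only the shot that closely resembles an annotated one receives a proposed label; the ambiguous shot is left to the archivist.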
We turn to the information needs of the user. There are various types of
exchange of information, leading to various types of query:
Query from a controlled vocabulary
In this query mode, the user inputs query terms from the controlled vocabulary
used by the archivist for the annotation of the data. In this case, specification of
the query should be aided by a visual representation of the meta-data model used
in annotation. When multimedia analysis tools are employed to automatically
index the video with a set of controlled terms from the meta-data model, this
approach can still be followed, with the essential difference that, in the
interaction, both the system and the user should be aware that annotations have an
associated probability of correctness.
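As a sketch of what this could look like in practice, the fragment below stores a confidence with every annotation and lets the query filter and rank on it. The vocabulary terms, the `Annotation` class, the `query` function, and the default cut-off are all hypothetical; a manual annotation is assumed here to carry confidence 1.0 and an automatic one a lower value.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    term: str          # term from the controlled vocabulary
    confidence: float  # assumed: 1.0 if manual, below 1.0 if automatic


# Hypothetical controlled vocabulary used for annotation.
VOCABULARY = {"anchor", "interview", "sports", "weather"}


def query(index, term, min_confidence=0.5):
    """Return (item_id, confidence) pairs for a vocabulary term, best first.

    index: dict item_id -> list of Annotation
    Terms outside the controlled vocabulary are rejected, and annotations
    below min_confidence are not shown to the user.
    """
    if term not in VOCABULARY:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    hits = []
    for item_id, annotations in index.items():
        for ann in annotations:
            if ann.term == term and ann.confidence >= min_confidence:
                hits.append((item_id, ann.confidence))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


index = {
    "clip_a": [Annotation("sports", 1.0)],                             # manual
    "clip_b": [Annotation("sports", 0.7), Annotation("weather", 0.2)],
    "clip_c": [Annotation("sports", 0.3)],                             # unreliable
}
results = query(index, "sports")
```

Returning the confidence together with each hit keeps the uncertainty visible to the user rather than hiding it inside the system.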
Query by keywords or descriptors
It is impossible to foresee all possible annotations on which a user might query
the archive. Hence the user should also have the possibility to query on the
content of the archive directly. For text this is a simple comparison of the word
the user has provided with the words in the document. This is still a feasible
approach when the text in the archive is the result of speech recognition from
the audio channel, but fuzzy matching techniques have to be used, since errors
are frequently found in the speech recognition result. For audio and video data it
is clear that one will not query for a specific set of sample or pixel values, as these
are not meaningful to the user. Descriptors of the data are required, which
summarize and emphasize specific characteristics. It is difficult to decide what
these should be if the purpose is not known beforehand. Hence, query by
descriptors is often limited to rather general descriptors such as pitch value or
average volume for audio, and color, texture, and motion distributions for video.
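The fuzzy matching of a query word against an error-prone speech recognition transcript, mentioned above, can be sketched as follows. This is only one possible realization: it uses the similarity ratio of `difflib.SequenceMatcher` from the Python standard library, and the function name `fuzzy_find`, the cut-off of 0.8, and the sample transcript are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(query_word, transcript_words, cutoff=0.8):
    """Return (position, word, score) for transcript words resembling the query.

    The score is SequenceMatcher's similarity ratio in [0, 1]; the cutoff is
    an assumed value that would need tuning on real recognition output.
    """
    query_word = query_word.lower()
    matches = []
    for pos, word in enumerate(transcript_words):
        score = SequenceMatcher(None, query_word, word.lower()).ratio()
        if score >= cutoff:
            matches.append((pos, word, score))
    return matches


# Hypothetical transcript in which the recognizer misspelled "parliament".
transcript = ["the", "parlament", "voted", "yesterday"]
matches = fuzzy_find("parliament", transcript)
```

An exact comparison would miss the misrecognized word, whereas the approximate comparison still retrieves it along with its position in the transcript.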
Query by full text, full audio, or full visual examples
Keywords or descriptors entered by the user provide the system with only limited
information. Only in context can such queries lead to the desired information.
The computer does not understand the context by itself, nor does it have
experience unless programmed, nor does it have a good feel for purpose.
Therefore, computer search profits from more information in the query. One
way to achieve this is by giving examples of similar items. So, when the query is
an item of full text, computer retrieval has a better chance to be on target.
Similarly, several pictures should be presented in a query rather than just one.
It is also best in computer search to include counterexamples, as they help to
convey the intentions of the user much better than just positive examples.
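The effect of positive and negative examples can be made concrete with a small sketch in the spirit of Rocchio-style relevance feedback, a technique substituted here for illustration and not one named in the text. Items and examples are assumed to be represented by feature vectors; the function names, the weight `gamma`, and the toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_query(positives, negatives, gamma=0.5):
    """Combine examples into one query vector: mean(positives) - gamma * mean(negatives)."""
    pos = mean_vector(positives)
    if not negatives:
        return pos
    neg = mean_vector(negatives)
    return [p - gamma * n for p, n in zip(pos, neg)]


def rank(archive, query_vec):
    """Order item ids by dot product with the query vector, best first."""
    def score(item_id):
        return sum(q * v for q, v in zip(query_vec, archive[item_id]))
    return sorted(archive, key=score, reverse=True)


# Hypothetical features: [beach, sunset, city].
archive = {
    "a": [0.8, 0.0, 0.1],  # beach in daylight
    "b": [1.0, 0.9, 0.0],  # beach at sunset
    "c": [0.0, 0.1, 1.0],  # city scene
}
positives = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]  # the user wants daylight beaches
negatives = [[0.8, 1.0, 0.0]]                   # counterexample: beach at sunset

without_counterexamples = rank(archive, build_query(positives, []))
with_counterexamples = rank(archive, build_query(positives, negatives))
```

With positive examples alone, the sunset scene ranks first because it is strongly "beach"; adding the counterexample moves the daylight beach to the top, which is the behavior the paragraph argues for.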
CATALOGUS
ARNOLD W.M SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING MULTIMEDIA INFORMATION
TECHNOLOGY AND THE ANNOTATION OF VIDEO