Recognition
CATALOGUS
characteristics, and would therefore be applicable for rapid subtitling of news
broadcasts, but not for general video speech understanding.
In the context of audio access, the main technology of interest is speech
transcription. In principle, transcription technology detects which words were
spoken in what order and at what point in time. Because of the time
information, transcripts are the basis for generating a time-coded index, and
therefore provide a good basis for spoken document retrieval: the search of audio
or video fragments on the basis of the spoken content [Renals, 2005].
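The idea of a time-coded index can be illustrated with a minimal sketch. It assumes the transcription step has already produced (word, start time) pairs; the function names, the toy transcript, and the 5-second window are hypothetical choices made for illustration, not part of any particular transcription system.

```python
from collections import defaultdict

def build_time_index(transcript):
    """Map each lowercased word to the sorted times (in seconds) at which it was spoken."""
    index = defaultdict(list)
    for word, start_time in transcript:
        index[word.lower()].append(start_time)
    for times in index.values():
        times.sort()
    return dict(index)

def find_fragments(index, query, window=5.0):
    """Return (start, end) fragments where every query word occurs
    within `window` seconds of an occurrence of the first query word."""
    words = [w.lower() for w in query.split()]
    if not words or any(w not in index for w in words):
        return []
    fragments = []
    for t in index[words[0]]:
        hits = [t]
        for w in words[1:]:
            near = [u for u in index[w] if abs(u - t) <= window]
            if not near:
                break  # this occurrence has no nearby match for w
            hits.append(min(near, key=lambda u: abs(u - t)))
        else:
            # reached only when no query word was missing near t
            fragments.append((min(hits), max(hits)))
    return fragments

# Toy transcript: (word, start time in seconds); values are invented.
transcript = [("the", 0.0), ("minister", 0.4), ("visited", 1.0),
              ("Amsterdam", 1.6), ("today", 2.3), ("weather", 30.0),
              ("in", 30.5), ("Amsterdam", 30.8)]
index = build_time_index(transcript)
fragments = find_fragments(index, "minister amsterdam")
```

A query returns time spans rather than whole documents, so a player can jump straight to the fragment in which the words were spoken.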
Figure 6. Sketch of the flow in querying by audio example.
The models applied in speech transcription have to capture various aspects:
recurring variations in the acoustics of speech, the set of sounds for a specific
language, the combinations of sounds (syllables, words), and the possible
combinations of words. The latter requires large amounts of textual training data
and, as a consequence, the volume of the available sets determines the success of
the statistical language models. The more variation the model absorbs, the
better it can sieve the proper word combinations out of all the candidate
word combinations suggested by the acoustic models.
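How a statistical language model sieves candidate word combinations can be sketched with a small bigram model. This is an illustration under stated assumptions: the three-sentence corpus and the two candidates are invented, add-one smoothing is chosen for simplicity, and a real transcription system would combine the language model score with the acoustic score instead of using it alone.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left contexts over a whitespace-tokenized corpus."""
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])              # every token that can be predicted
    return contexts, bigrams, len(vocab)

def log_prob(model, sentence):
    """Log probability of a sentence under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total

def best_candidate(model, candidates):
    """Pick the candidate word sequence that the language model finds most likely."""
    return max(candidates, key=lambda c: log_prob(model, c))

# Invented training text; more text means more word combinations are covered.
corpus = ["it is hard to recognize speech",
          "we recognize speech with models",
          "the beach was nice"]
model = train_bigram_model(corpus)

# Two hypothetical candidates of equal length, as acoustic models might suggest.
candidates = ["recognize speech with models", "wreck a nice beach"]
chosen = best_candidate(model, candidates)
```

With so little training text almost every unseen bigram gets the same smoothed probability, which is one way to see why the volume of textual training data matters.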
The current focus in the development of transcription technology is on tuning the
existing methods to more difficult domains and conditions, such as
spontaneous speech, non-native speakers, and spoken content that is less dense
than news.
Another ingredient for content-based search is machine learning and its
impromptu version: interaction. Interaction has absorbed user relevance
feedback, interactive visualization of the results of a query, and adaptable
ARNOLD W.M SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING MULTIMEDIA INFORMATION
TECHNOLOGY AND THE ANNOTATION OF VIDEO
similarity measures [Worring, 2001], yet a major advance in tools and machine
power is required to benefit fully from the interaction.
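User relevance feedback can be sketched as follows. The example uses a Rocchio-style query update with cosine similarity, a standard technique swapped in here for illustration; the text does not say which method the cited work uses. The two-dimensional feature vectors, the item names, and the weights alpha, beta and gamma are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(vectors, size):
    """Component-wise mean; a zero vector when no examples were marked."""
    if not vectors:
        return [0.0] * size
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards items marked relevant and away from non-relevant ones."""
    rel = mean_vector(relevant, len(query))
    non = mean_vector(non_relevant, len(query))
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]

def rank(query, collection):
    """Item names ordered by decreasing similarity to the query."""
    return sorted(collection, key=lambda name: cosine(query, collection[name]),
                  reverse=True)

# Hypothetical feature vectors for three items in a collection.
collection = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
query = [1.0, 0.2]
first_ranking = rank(query, collection)

# The user marks item "c" as relevant and item "a" as not relevant.
updated = rocchio_update(query, [collection["c"]], [collection["a"]])
second_ranking = rank(updated, collection)
```

After one round of feedback the ranking changes from a, b, c to b, c, a: the query has moved towards what the user indicated, without any retraining.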
The application of machine learning techniques overcomes the incidental
variations within a concept. A successful line of machine learning research is
to combine many weakly performing classifiers into stronger ones. All of these
approaches have brought a substantial improvement in the capabilities of
machine learners to recognize concepts.
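Combining weak classifiers can be illustrated with a minimal majority-vote sketch. It is a simplified stand-in: boosting methods such as AdaBoost learn a weight per classifier rather than counting equal votes. The concept ("outdoor shot"), the three features, the thresholds, and the six samples are hypothetical, and the data is constructed so that the weak classifiers err on different samples.

```python
def make_stump(feature_index, threshold):
    """A weak classifier: predicts 1 when a single feature exceeds a threshold."""
    def stump(features):
        return 1 if features[feature_index] > threshold else 0
    return stump

def majority_vote(classifiers, features):
    """Predict 1 when more than half of the classifiers predict 1."""
    votes = sum(clf(features) for clf in classifiers)
    return 1 if 2 * votes > len(classifiers) else 0

def accuracy(predict, samples):
    """Fraction of samples for which the prediction equals the label."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)

# Invented samples: (brightness, blue fraction, greenery), label 1 = outdoor.
samples = [
    ((0.3, 0.8, 0.7), 1),
    ((0.9, 0.2, 0.6), 1),
    ((0.8, 0.7, 0.1), 1),
    ((0.7, 0.1, 0.2), 0),
    ((0.2, 0.6, 0.3), 0),
    ((0.1, 0.3, 0.9), 0),
]

# One stump per feature, each with an assumed threshold of 0.5.
stumps = [make_stump(i, 0.5) for i in range(3)]
stump_accuracies = [accuracy(s, samples) for s in stumps]
combined_accuracy = accuracy(lambda f: majority_vote(stumps, f), samples)
```

Each stump alone classifies four of the six samples correctly, while the vote classifies all six, because no two stumps are wrong on the same sample. On real data the errors overlap, so the gain is smaller.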
The situation is improving all the time in all the above respects, except in terms
of the amount of data. More data demands more effort in annotation, until the
point at which the data set gets so big that annotation is no longer feasible.
Annotating thousands and eventually hundreds of thousands of pictures is hard
to achieve. While the machine power for an increasing number of computations
is available, the manpower for annotation will become the bottleneck.
In this paper, we make a distinction between visual information, audio
information, and textual information. In this section we discuss recognition,
defined as the unambiguous, context-free denotation of signs. In all practical
circumstances, the visual representation A refers to the first letter in the
alphabet, so A is recognized rather than interpreted.
Figure 7. [Diagram labels: bit stream; visual information, visual stream, print file, signs; text stream; audio information, audio stream, speech file, speech; OCR, sign spotting, speech spotting; feedback.]
Textual information may take a visual form when it is printed on paper or held
in a PDF file. It requires a computer function known under the generic name of
optical character recognition, OCR, to convert the printed version of a text to a
stream of characters. OCR is in wide use and is built into many search
programs, with the result that paper scans and texts in computer files are now
easily accessible. Depending on the quality of the scan data, the quality of the
method of the OCR program, and its ability to recognize the font of the text,
OCR can deliver near-perfect results. However, a guarantee that all information