is understood correctly is hard to give, not at the level of single characters, not at the level of words, and - most of all - not at the level of the proper interpretation of the text block sequence. For example, only a slight misinterpretation is needed to miss a footnote and its proper position in the text. OCR programs rely heavily on built-in knowledge of the structure of texts, the conventions behind letters, and the structure of books. Print in languages in which individual characters are frequently annotated with accents (as in Turkish), in which characters change form as part of a word (as in Arabic), or in which there are many characters and compound characters (as in Chinese) is much harder to decode. Whereas the conversion of facsimiles of texts to character codes is nearly perfect for standard text in mainstream languages, there is considerable ground to cover for roughly scanned text, or when the font, script, or language is non-standard.
Much harder than recognizing text is spotting its presence in a photograph or a video stream. Whereas it is hard for humans to overlook text in an image, a computer must recognize the distinct pattern of stripes that signals language. Text can come from different sources: it can be added to the picture in the later stages of production (for example, captions and headers). Such texts are relatively easy to detect, as they will appear in one style and font, usually at a standard position on the screen. Video edits indicating the topic usually appear somewhere in the lowest part of the screen, but not at the very bottom. A basic strategy for text spotting is to do a trial run with an OCR program and to see whether it detects some readable text with some degree of reliability. In the more general case, where text is an integral part of the picture, it is much harder to detect, as there is no information available on the language, script, font, or expected depicted size, nor on the distortion of the font due to the arbitrary viewpoint of the camera. Arbitrary camera positions depict characters in the scene in a skewed view, ruling out the use of standard OCR to read the script. The text on a billboard, the print on a t-shirt, or a banner at a demonstration often carries most of the message of a photograph, but it remains invisible to a computer interpretation of the picture.
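The trial-run strategy mentioned above can be sketched in a few lines. The `ocr_words` input, the confidence cutoff, and the minimum word count below are illustrative assumptions, not part of any particular OCR engine; a real system would feed in the (word, confidence) pairs an actual recognizer returns.

```python
def looks_like_text(ocr_words, min_conf=0.8, min_words=3):
    """Decide from a trial OCR run whether an image likely contains text.

    ocr_words: list of (word, confidence) pairs returned by the OCR engine.
    The thresholds are illustrative: text is assumed present when at least
    min_words alphabetic words were read with confidence >= min_conf.
    """
    readable = [w for w, c in ocr_words
                if c >= min_conf and w.isalpha() and len(w) > 1]
    return len(readable) >= min_words
```

A scene image with no text typically yields only low-confidence fragments, so the same test rejects it.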
For a better understanding of speech recognition it is crucial to distinguish
between the various processing steps. Audio detection is relatively easy. The next
step is audio segmentation to identify the audio segments where speech
recognition is to be applied. Assuming that the language is known, spoken audio
segments can then be input to a transcription module.
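The pipeline of detection and segmentation before transcription can be sketched with a simple energy-based segmenter. The per-frame energies and the threshold are illustrative assumptions; real systems use far more elaborate speech/non-speech classifiers.

```python
def segment_speech(frames, threshold=0.1):
    """Group consecutive above-threshold energy frames into segments.

    frames: list of per-frame energy values (floats).
    Returns a list of (start, end) frame-index pairs, end exclusive,
    marking the audio segments to pass on to the transcription module.
    """
    segments = []
    start = None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i                       # a segment opens here
        elif energy < threshold and start is not None:
            segments.append((start, i))     # the segment closes
            start = None
    if start is not None:                   # segment runs to the end
        segments.append((start, len(frames)))
    return segments
```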
State-of-the-art performance in broadcast news transcription is around a 20% word error rate in international benchmarks. The word error rate depends on the speaker and speaking style, ranging from 1-2% to over 50%. Recognition error rates for content words are lower than for function words. With current word error figures, estimated retrieval performance reaches an average precision above 50%, which is sufficient for audio fragment retrieval. Comparable results have been reported for major languages (English, French, Mandarin, German, Italian, Spanish), but for several languages the development of this technology lags behind and will continue to do so.
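Word error rate is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between transcripts, normalized by
    reference length. A 20% WER means one word in five is wrong."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```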
An alternative to the 'full transcription' approach to spoken document retrieval
is word spotting: searching on the basis of the sound pattern of terms. This
approach is feasible only for a limited number of search terms.
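Word spotting can be sketched as approximate matching of a term's sound pattern against a recognized phone sequence. The phone symbols and the mismatch tolerance below are invented for illustration; real spotters score against acoustic models rather than discrete symbols.

```python
def spot_term(phones, term_phones, max_mismatch=1):
    """Return start indices where term_phones occurs in phones,
    allowing up to max_mismatch substituted phones (same-length match)."""
    n, m = len(phones), len(term_phones)
    hits = []
    for i in range(n - m + 1):
        mismatches = sum(1 for a, b in zip(phones[i:i + m], term_phones)
                         if a != b)
        if mismatches <= max_mismatch:
            hits.append(i)
    return hits
```

Because every term requires its own scan over the audio, the approach scales poorly beyond a small set of search terms, which is the limitation noted above.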
Since signs are well-defined symbols that do not depend on context, given a data
set that is representative of the quality of the data and is large enough, it is
possible to obtain task-independent performance figures on the recognition of
signs. For text spotting and text recognition, a modern recognizer will generate a
figure indicating the certainty of detection. An example of the specification of
such a certainty figure could be: for a detection rate of 95% the recognizer will
falsely detect 10% of all signs. With sophisticated recognizers, alternative
interpretations of each detected character are presented together with their
certainty. This presentation of the certainty of recognition in combination with alternatives is an essential component of robust recognizers, even though it occasionally introduces confusion.
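How certainty and alternatives combine can be sketched as follows: among all readings built from the per-character alternatives, prefer the most certain one that forms a word in a lexicon. The candidate characters, certainties, and lexicon here are invented for illustration.

```python
from itertools import product

def best_reading(char_alternatives, lexicon):
    """Pick the most certain reading that is a valid word.

    char_alternatives: one list of (char, certainty) pairs per character
    position, as a robust recognizer would output them.
    """
    best, best_score = None, 0.0
    for combo in product(*char_alternatives):
        word = "".join(c for c, _ in combo)
        score = 1.0
        for _, certainty in combo:
            score *= certainty          # joint certainty of this reading
        if word in lexicon and score > best_score:
            best, best_score = word, score
    return best
```

A first character read as 'c' with certainty 0.6 but 'e' with certainty 0.4 is resolved by whichever completion yields a more certain lexicon word.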
So far in this paper, the recognition of visual and audio data and conversion to
text has been conceived as a straightforward process of interpretation, that is, without feedback and interaction. But such a system can provide only the
skeleton for the definitive system because feedback from the interpretation is
probably essential to recognition. When the conversion to text yields nonsense,
are we spotting text at all? Is the OCR properly tuned to detect the peculiarities
of the script? Are we analyzing on the basis of the right language; is it Japanese
rather than Chinese? From the examples it is clear that feedback from
interpretation is important in human recognition, and so it is with machines,
especially as the demands on quality rise. Hence, the availability of certainty
and alternatives is important in recognition for visual and audio signals alike.
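One form such feedback could take is sketched below: run recognition under several assumed languages (stubbed here as pre-computed token lists) and keep the setting whose output best validates against that language's vocabulary. The vocabularies and outputs are invented for illustration.

```python
def pick_language(outputs_by_language, vocab_by_language):
    """Feedback from interpretation: choose the assumed language whose
    recognition output has the highest in-vocabulary rate.

    outputs_by_language: {language: [recognized tokens]} from trial runs.
    vocab_by_language:   {language: set of valid tokens}.
    """
    def in_vocab_rate(lang):
        tokens = outputs_by_language[lang]
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in vocab_by_language[lang])
        return hits / len(tokens)
    return max(outputs_by_language, key=in_vocab_rate)
```

When the output under the "Japanese" setting is mostly nonsense while the "Chinese" setting yields valid words, the feedback settles the question posed above.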
Interpretation
In this section we discuss the possibilities and problems of interpretation. We
focus on key concepts determining the performance of automatic interpretation:
the semantic gap, narrow versus broad domains, the keyword funnel, and
similarity.
An unavoidable bottleneck in automatic interpretation is the semantic gap. As
discussed earlier, this is the discrepancy between the digital encoding and its
semantic interpretation. What is immediate and practically flawless for humans
is very hard for machines to decide. How can the purpose of an object be derived
from its appearance? To what class does a visual object or subject belong? And,
what part of the picture makes up one entity in reality? A machine has no means
of telling, and no experience of, what part of the image corresponds to one
object in the real world. There is simply no general rule telling it how objects
appear. One can only discriminate objects in a scene by learning them one by
one in the course of one's life, by bumping into them, and later by identifying
them as moving coherently on the retina. Hampered also by the sensory gap referred to earlier, computer vision will not solve this problem without learning to recognize objects one by one. And that will take a while.
At the current state of the art, it is important to grasp the difference between
broad and narrow search domains. In a narrow domain, the data set has well-
defined proportions, whereas a broad domain can be described only in general,
associative terms. The broadest domain around is the set of all information
ARNOLD W.M. SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING, MULTIMEDIA INFORMATION TECHNOLOGY AND THE ANNOTATION OF VIDEO