is understood correctly is hard to give, not at the level of single characters, not at the level of words, and, most of all, not at the level of the proper interpretation of the sequence of text blocks. For example, only a slight misinterpretation is needed to miss a footnote and its proper position in the text. OCR programs rely heavily on built-in knowledge of the structure of texts, the conventions behind letters, and the structure of books. Print in languages in which individual characters are frequently annotated with accents (as in Turkish), in which characters change form as part of a word (as in Arabic), or in which there are many characters and compound characters (as in Chinese) is much harder to decode. Whereas the conversion of facsimiles of texts to character codes is nearly perfect for standard text in mainstream languages, there is still considerable ground to cover for roughly scanned text, or when the font, script, or language is non-standard.

Much harder than recognizing text is spotting the presence of text in a photograph or a video stream. Whereas it is hard for humans to overlook text in an image, a computer must first recognize the distinctive pattern of stripes that signals language. Text can come from different sources: it can be added to the picture in the later stages of production (for example, captions and headers). Such texts are relatively easy to detect, as they appear in one style and font, usually at a standard position on the screen. Text added during editing to indicate the topic usually appears somewhere in the lowest part of the screen, but not at the very bottom. A basic strategy for text spotting is therefore to do a trial run with an OCR program and to check whether it has detected readable text with some degree of reliability.
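The trial-run strategy can be sketched in a few lines. This is a minimal illustration, not part of any particular OCR system: the function name, the token format, and both thresholds are assumptions chosen for the example, and any real engine would supply its own confidence scale.

```python
def text_spotted(ocr_tokens, min_conf=0.8, min_words=3):
    """Decide whether an OCR trial run has found readable text.

    ocr_tokens: list of (word, confidence) pairs as any OCR engine
    might emit them. A region counts as text when enough words are
    recognized with sufficient confidence; both thresholds here are
    illustrative, not calibrated values.
    """
    reliable = [word for word, conf in ocr_tokens
                if conf >= min_conf and word.isalpha()]
    return len(reliable) >= min_words
```

A caption such as "BREAKING NEWS today" read with high confidence would pass the test, whereas the low-confidence fragments an OCR program produces on a textureless image region would not.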
In the more general case, where text is an integral part of the picture, it is much harder to detect: no information is available on the language, script, font, or expected size of the characters, nor on the distortion of the font due to the arbitrary viewpoint of the camera. Arbitrary camera positions depict characters in the scene in a skewed view, ruling out the use of standard OCR to read the script. The text on a billboard, the print on a t-shirt, or a banner at a demonstration often carries most of the message of a photograph, yet it remains invisible to a computer interpretation of the picture.

For a better understanding of speech recognition it is crucial to distinguish the various processing steps. Audio detection is relatively easy. The next step is audio segmentation, which identifies the audio segments to which speech recognition is to be applied. Assuming that the language is known, spoken audio segments can then be fed to a transcription module. State-of-the-art performance in broadcast news transcription is around a 20% word error rate in international benchmarks. The word error rate depends on speaker and speaking style, ranging from 1-2% to over 50%. Recognition error rates for content words are lower than for function words. With current word error figures, estimated retrieval performance reaches an average precision above 50%, which is sufficient for audio fragment retrieval. Comparable results have been reported for major languages (English, French, Mandarin, German, Italian, Spanish), but for several other languages the development of this technology lags behind and will continue to do so. An alternative to the 'full transcription' approach to spoken document retrieval is word spotting: searching on the basis of the sound pattern of terms. This approach is feasible only for limited numbers of search terms.
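The word error rate cited above is the standard benchmark measure: the number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the length of the reference. A minimal sketch of its computation, using word-level edit distance:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as the edit distance between the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

One substituted word in a six-word reference thus yields a word error rate of about 17%; note that the measure can exceed 100% when the recognizer inserts many spurious words.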
Since signs are well-defined symbols that do not depend on context, it is possible, given a data set that is representative of the quality of the data and large enough, to obtain task-independent performance figures for the recognition of signs. For text spotting and text recognition, a modern recognizer will generate a figure indicating the certainty of detection. An example of the specification of such a certainty figure could be: at a detection rate of 95%, the recognizer will falsely detect 10% of all signs. Sophisticated recognizers present alternative interpretations of each detected character together with their certainty. This presentation of certainty of recognition in combination with alternatives is an essential component of robust recognizers, even though it occasionally introduces confusion. So far in this paper, the recognition of visual and audio data and their conversion to text has been conceived as a straightforward process of interpretation, that is, one without feedback and interaction. But such a system can provide only the skeleton of the definitive system, because feedback from the interpretation is probably essential to recognition. When the conversion to text yields nonsense, are we spotting text at all? Is the OCR properly tuned to the peculiarities of the script? Are we analyzing on the basis of the right language; is it Japanese rather than Chinese? From these examples it is clear that feedback from interpretation is important in human recognition, and so it is with machines, especially as the demands on quality rise. Hence, the availability of certainty figures and alternatives is important in the recognition of visual and audio signals alike.

Interpretation

In this section we discuss the possibilities and problems of interpretation. We focus on the key concepts determining the performance of automatic interpretation: the semantic gap, narrow versus broad domains, the keyword funnel, and similarity.
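The feedback loop described above, in which a recognizer's certainty figures and alternatives are checked against an interpretation, can be illustrated with a small sketch. Here the "interpretation" is simply a lexicon; the function name, data format, and scoring by multiplied confidences are assumptions made for the example, not a description of any actual recognizer.

```python
from itertools import product

def best_reading(char_alternatives, lexicon):
    """Pick the most certain word from per-character alternatives.

    char_alternatives: one list of (character, confidence) pairs per
    position, as a robust recognizer would emit. Prefer the
    highest-scoring candidate that the lexicon confirms (feedback from
    interpretation); otherwise fall back to the greedy best guess.
    """
    candidates = []
    for combo in product(*char_alternatives):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, conf in combo:
            score *= conf  # naive independence assumption
        candidates.append((score, word))
    candidates.sort(reverse=True)
    for score, word in candidates:
        if word in lexicon:
            return word
    return candidates[0][1]  # greedy best guess, unverified
```

The point of the sketch is that a lower-ranked alternative can win once the interpretation is consulted: if the greedy reading is nonsense but a less certain combination forms a known word, the feedback overrules the raw certainty figures.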
An unavoidable bottleneck in automatic interpretation is the semantic gap. As discussed earlier, this is the discrepancy between the digital encoding and its semantic interpretation. What is immediate and practically flawless for humans is very hard for machines to decide. How can the purpose of an object be derived from its appearance? To what class does a visual object or subject belong? And what part of the picture makes up one entity in reality? A machine has no means of telling, and no experience of, which part of the image corresponds to one object in the real world. There is simply no general rule telling it how objects appear. One can only discriminate objects in a scene by learning them one by one in the course of one's life, by bumping into them, and later by identifying them as moving coherently on the retina. Hampered also by the sensory gap referred to earlier, computer vision will not solve that problem without learning to recognize objects one by one. And that will take a while. At the current state of the art, it is important to grasp the difference between broad and narrow search domains. In a narrow domain, the data set has well-defined proportions, whereas a broad domain can be described only in general, associative terms. The broadest domain around is the set of all information

ARNOLD W.M. SMEULDERS, FRANCISKA DE JONG AND MARCEL WORRING, 'MULTIMEDIA INFORMATION TECHNOLOGY AND THE ANNOTATION OF VIDEO'


Jaarboeken Stichting Archiefpublicaties | 2005 | pagina 56